KyotoEBMT v0.4 Documentation

Welcome to the official documentation for KyotoEBMT v0.4 ‘Bourbon’.

The KyotoEBMT translation framework takes a tree-to-tree approach to Machine Translation, employing syntactic dependency analysis for both source and target languages in an attempt to preserve non-local structure. The effectiveness of our system is maximized with online example matching and a flexible decoder. Evaluation demonstrates BLEU scores competitive with state-of-the-art SMT systems such as Moses.

This document explains the usage of the KyotoEBMT system (Richardson et al., 2014). See the project homepage for further details and information on how to obtain KyotoEBMT.

Please send any questions or comments to John Richardson (contact details can be found on the project homepage).

Contributions

We gratefully welcome your contributions to KyotoEBMT.

  • Bug reports: Please give a detailed explanation of anything you feel is not working correctly.

  • Fixes: Please send a patch (diff) with a short explanation of your changes. We use the Google C++ Style Guide with some minor variations.

  • New features: Anything is welcome, but please check first that your feature is not currently being worked on by someone else.

All patches/bug reports should be sent by email to John (see above).

Installation

The installation of KyotoEBMT is performed as follows:

./configure
make
make sample
  • Use ./configure --with-boost=XXX to specify a non-default location for Boost.

  • Use ‘make debug’ to enable debugging.

  • On Mac OS X, we recommend installing ICU from source. If you use a MacPorts/Homebrew ICU package, you may need to add include/linker paths manually to ./libs/build_deps.sh to compile KenLM.

Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but are not tested.

  • GCC 4.8.1 (https://gcc.gnu.org)

  • Boost 1.55.0 (http://www.boost.org)

  • Python 2.7.3 (https://www.python.org)

  • ICU 52.1 (http://site.icu-project.org)

  • Moses 3.0 (http://www.statmt.org/moses)

  • (Included) Google Logging Library (http://code.google.com/p/google-glog)

  • (Included) KenLM (http://kheafield.com/code/kenlm)

  • (Included) Urwid (http://urwid.org)

  • (Included) UTF8-CPP (http://utfcpp.sourceforge.net)

  • (Optional, PRO tuning) MegaM (http://www.umiacs.umd.edu/~hal/megam)

  • (Optional, parallel training) GXP (http://www.logos.ic.i.u-tokyo.ac.jp/gxp)

  • (Optional, RNNLM reranking) RNNLM (http://www.fit.vutbr.cz/~imikolov/rnnlm)

Non-included libraries can be obtained from binary repositories or compiled from source. Included libraries are built automatically during KyotoEBMT installation, but can also be installed manually using the ./libs/build_deps.sh script.

Test Run

To confirm the correct installation of KyotoEBMT, the following commands can be run. Note that KyotoEBMT must be run from the directory in which it is located. Note also that you must run ‘make sample’ from the top directory after installation in order to run KyotoEBMT with the commands below.

cd src
./KyotoEBMT

The output should be something like this:

0 ||| He also read I が bought new book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -165.786 -12.4542 5.8861 ... 7.33333 7.5 8.5  ||| -62.3375
0 ||| He also read I が bought new a book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -184.207 -14.41 5.8861 ... 7.33333 8 10  ||| -63.4433
0 ||| He also read I が bought is new book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -184.207 -12.6471 5.8861 ... 7.66667 7.5 8.5  ||| -63.7638
...
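
Each output line appears to contain four ‘|||’-separated fields: the sentence ID, the translation hypothesis, the feature vector (prefixed with ‘f=’) and the total score. As an illustration, a minimal Python sketch for parsing this n-best format; the field layout is inferred from the sample above, not from a formal specification.

# Parse one line of KyotoEBMT n-best output.
# The field layout (id ||| hypothesis ||| f= features ||| score) is
# inferred from the sample output above; treat it as an assumption.
def parse_nbest_line(line):
    sent_id, hypothesis, features, score = [f.strip() for f in line.split('|||')]
    feature_values = [float(x) for x in features.split('=', 1)[1].split()]
    return int(sent_id), hypothesis, feature_values, float(score)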

Preparing training data

The first step for using KyotoEBMT is the preparation of a training corpus. The system requires aligned dependency tree input, so it is necessary first to parse and align your input corpus. Any combination of alignment and parsing tools can be used, so long as the output is formatted as XML (see example below).

KyotoEBMT comes with a sample corpus of 10 parallel Japanese-English sentences, which can be found under ./sample/. Please use this as a reference when preparing your own training data.

<para_sentence id="1">
    <i_sentence></i_sentence>
    <j_sentence></j_sentence>
    <i_data>
        <phrase dpnd="1" cat="NONE">
            <word lem="私/わたし" pos="名詞:普通名詞" content_p="1">私</word>
        </phrase>
        <phrase dpnd="4" cat="体言">
            <word lem="は/は" pos="助詞:副助詞" content_p="0">は</word>
        </phrase>
        <phrase dpnd="3" cat="NONE">
            <word lem="本/ほん" pos="名詞:普通名詞" content_p="1">本</word>
        </phrase>
        <phrase dpnd="4" cat="体言">
            <word lem="を/を" pos="助詞:格助詞" content_p="0">を</word>
        </phrase>
        <phrase dpnd="-1" cat="用言:動">
            <word lem="読む/よむ" pos="動詞" content_p="1">読んだ</word>
        </phrase>
        <phrase dpnd="4" pnum="5" cat="NONE">
            <word lem="。/。" pos="特殊:句点" content_p="0">。</word>
        </phrase>
    </i_data>
    <j_data>
        <phrase dpnd="1" cat="NP">
            <word lem="i" pos="PRP" content_p="1">I</word>
        </phrase>
        <phrase dpnd="-1" cat="S1">
            <word lem="read" pos="VBP" content_p="1">read</word>
        </phrase>
        <phrase dpnd="3" cat="DT">
            <word lem="a" pos="DT" content_p="0">a</word>
        </phrase>
        <phrase dpnd="1" cat="NP">
            <word lem="book" pos="NN" content_p="1">book</word>
        </phrase>
        <phrase dpnd="1" cat=".">
            <word lem="." pos="." content_p="1">.</word>
        </phrase>
    </j_data>
    <match i_p="0" j_p="0">
        <word i_w="0" j_w="0"></word>
    </match>
    <match i_p="2" j_p="2 3">
        <word i_w="2" j_w="2 3"></word>
    </match>
    <match i_p="4" j_p="1">
        <word i_w="4" j_w="1"></word>
    </match>
    <match i_p="5" j_p="4">
        <word i_w="5" j_w="4"></word>
    </match>
</para_sentence>

Currently we require that each word be given by a <word> tag enclosed inside a <phrase> tag. A <phrase> can contain only one <word>. The information required for each phrase/word pair is the word itself (e.g. <word>cat</word>) and the following additional attributes. Of these, ‘dpnd’ must be set correctly; the others are used only for language-dependent processing and can be set as you wish (or left blank).

<phrase>  dpnd       ID of parent in dependency tree (-1 = root)
<phrase>  cat        phrase category
<word>    lem        lemmatized form
<word>    pos        part of speech
<word>    content_p  content word flag (1 = yes, 0 = no)

Alignment information is stored as <match> tags giving phrase-level alignments (the i_p/j_p attributes), each containing <word> tags giving the corresponding word-level alignments (i_w/j_w); see the example above.
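
Since mistakes in the ‘dpnd’ values are easy to make, it may help to validate the XML before building the example database. Below is a minimal Python sketch that checks the constraints described above (one <word> per <phrase>, in-range ‘dpnd’ values, exactly one root per tree); the <para_sentence> layout follows the sample corpus.

import sys
import xml.etree.ElementTree as ET

# Check one <i_data> or <j_data> dependency tree against the constraints
# described above: exactly one <word> per <phrase>, in-range 'dpnd',
# exactly one root.
def check_tree(data):
    phrases = data.findall('phrase')
    n = len(phrases)
    roots = 0
    for i, phrase in enumerate(phrases):
        if len(phrase.findall('word')) != 1:
            return 'phrase %d must contain exactly one <word>' % i
        dpnd = int(phrase.get('dpnd'))
        if dpnd == -1:
            roots += 1
        elif not (0 <= dpnd < n) or dpnd == i:
            return 'phrase %d has invalid dpnd %d' % (i, dpnd)
    if roots != 1:
        return 'tree must have exactly one root (dpnd="-1")'
    return None

for sent in ET.parse(sys.argv[1]).getroot().iter('para_sentence'):
    for side in ('i_data', 'j_data'):
        data = sent.find(side)
        if data is None:
            continue
        error = check_tree(data)
        if error:
            print('sentence %s (%s): %s' % (sent.get('id'), side, error))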

Example Database

The next step is to extract examples from the training corpus in order to build the example database. This is performed with the tool TR++ (Cromières and Kurohashi, 2011). The commands below will build a database under database-path from the input file xml-file.

cd TR++
make
./createData++ database-path xml-file

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. For convenience, we have included a tool to build a compatible language model. This requires KenLM, which is included and built automatically during installation (see the Dependencies section).

./utils/build_lm.sh input-file output-file
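
The resulting model can be sanity-checked from Python, assuming the KenLM Python bindings are installed (e.g. via pip install kenlm); ‘output-file’ below stands for whatever path was passed to the script.

import kenlm

# Load the model produced by ./utils/build_lm.sh ('output-file' is the
# path passed to the script above).
model = kenlm.Model('output-file')
print(model.order)                      # should report 5 for a 5-gram model
print(model.score('he read a book .'))  # log10 probability of a test sentence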

(Optional) Other Data

There are various other sources of information that can be used by KyotoEBMT as features for decoding. Currently supported are the following:

Dictionary

A tab-separated file containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database.

バンジョー  banjo
ギター  guitar
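
For reference, such a file can be written with a few lines of Python; the two entries below are the sample pairs above, and ‘dictionary.txt’ is a placeholder path.

import io

# Write a tab-separated source-target dictionary in the format expected
# by KyotoEBMT (one word pair per line, UTF-8 encoded).
pairs = [(u'バンジョー', u'banjo'), (u'ギター', u'guitar')]
with io.open('dictionary.txt', 'w', encoding='utf-8') as f:
    for source, target in pairs:
        f.write(u'%s\t%s\n' % (source, target))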

Default dependency information

When it is unclear whether an ‘additional’ word should be attached as a left or right child, we try first to use the rules (left or right) in the ‘default dependency’ file. This can be generated automatically from the training corpus XML as follows. The script requires the XML Perl module, available from CPAN.

./utils/build_default_dpnd.pl --trans_dir ja-en --print_all output-prefix < input-file

Lexical translation probability

The lexical translation probabilities obtained from alignment can be used as a feature for decoding. These can be generated from GIZA++ output with the build_lex_trans_prob.pl script as follows.

./utils/build_lex_trans_prob.pl corpus/fr.vcb corpus/en.vcb giza.fr-en/fr-en.n3.final giza.fr-en/fr-en.t3.final lex_i2j > output-file && \
./utils/build_lex_trans_prob.pl corpus/en.vcb corpus/fr.vcb giza.en-fr/en-fr.n3.final giza.en-fr/en-fr.t3.final lex_j2i >> output-file

Translation

The decoder is run with the following command. Note that it is currently necessary to run the decoder from the directory in which it is located.

cd src
./KyotoEBMT [options] input-file

The required input format is XML, similar to that for the training corpus but with only one language and no alignment information. An example is given below:

<?xml version="1.0" encoding="utf-8"?>
<article>
  <sentence id="0">
    <i_data>
      <phrase dpnd="1" cat="NONE">
        <word lem="彼/かれ" pos="名詞:普通名詞" content_p="1">彼</word>
      </phrase>
      <phrase dpnd="8" cat="体言">
        <word lem="も/も" pos="助詞:副助詞" content_p="0">も</word>
      </phrase>
      <phrase dpnd="3" cat="NONE">
        <word lem="私/わたし" pos="名詞:普通名詞" content_p="1">私</word>
      </phrase>
      <phrase dpnd="4" cat="体言">
        <word lem="が/が" pos="助詞:格助詞" content_p="0">が</word>
      </phrase>
      <phrase dpnd="6" cat="NONE">
        <word lem="買う/かう" pos="動詞" content_p="1">買った</word>
      </phrase>
      <phrase dpnd="6" cat="用言:形">
        <word lem="新しい/あたらしい" pos="形容詞" content_p="1">新しい</word>
      </phrase>
      <phrase dpnd="7" cat="NONE">
        <word lem="本/ほん" pos="名詞:普通名詞" content_p="1">本</word>
      </phrase>
      <phrase dpnd="8" cat="体言">
        <word lem="を/を" pos="助詞:格助詞" content_p="0">を</word>
      </phrase>
      <phrase dpnd="-1" cat="用言:動">
        <word lem="読む/よむ" pos="動詞" content_p="1">読んだ</word>
      </phrase>
      <phrase dpnd="8" cat="NONE">
        <word lem="。/。" pos="特殊:句点" content_p="0">。</word>
      </phrase>
    </i_data>
  </sentence>
</article>
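
If your parser produces plain token lists rather than XML, a conversion sketch along the following lines may help. The per-token tuple layout here is an assumption made for illustration; the element and attribute names match the example above.

import xml.etree.ElementTree as ET

# Build decoder input XML from parsed sentences. Each token is assumed
# to be a (surface, lemma, pos, content_p, dpnd) tuple; this layout is
# for illustration only. The 'cat' attribute is left as NONE, since it
# is only needed for language-dependent processing.
def build_input_xml(sentences):
    article = ET.Element('article')
    for sent_id, tokens in enumerate(sentences):
        sentence = ET.SubElement(article, 'sentence', id=str(sent_id))
        i_data = ET.SubElement(sentence, 'i_data')
        for surface, lemma, pos, content_p, dpnd in tokens:
            phrase = ET.SubElement(i_data, 'phrase', dpnd=str(dpnd), cat='NONE')
            word = ET.SubElement(phrase, 'word', lem=lemma, pos=pos,
                                 content_p=str(content_p))
            word.text = surface
    return ET.tostring(article, encoding='utf-8')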

A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.

--config Configuration file.
--nb_threads Number of threads to use.
--tr_db_filenames Example database.
--tr_language Example DB translation direction (1 or 2).
--input_mode Input mode (e.g. xml/hypothesis).
--default_dpnd Default dependency file.
--weight Feature weights file.
--lm LM file.
--beam_width Beam width.
--source_language Source language.
--target_language Target language.
--dic_words_filename Standard dictionary.
--dic_punc_filename Punctuation dictionary.
--lexical_translation_prob_filename Lexical translation probability table.
--output_mode Output mode (e.g. tuning/eval/hypothesis).
--n_best_length Length of n-best list.
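
As an illustration, a decoding run can also be driven from a Python script; all file paths below are placeholders for your own data, and only options from the list above are used.

import subprocess

# Invoke the decoder with a handful of the common options above.
# The decoder must be run from its own directory, hence cwd='src'.
subprocess.check_call([
    './KyotoEBMT',
    '--config', 'config.ini',
    '--nb_threads', '4',
    '--lm', 'lm.bin',
    '--weight', 'weights.dat',
    '--n_best_length', '100',
    'input.xml',
], cwd='src')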

For full details, run the command:

./KyotoEBMT -h

Tuning

tuner.py

The feature weights for KyotoEBMT can be tuned using the scripts in the ./tuning directory, in particular with the command-line user interface tuner.py (see screenshot below). The tuning algorithm currently supported is PRO (Hopkins and May, 2011).

[Screenshot: the tuner.py command-line tuning interface]

From the main menu, the following options are available:

  • ‘Select Data’: Specify train/dev/test data. Default settings are loaded from ./tuning/defaults.ini and these can be selected/modified with the tuning interface. See below for more details.

  • ‘Change Settings’: Set a variety of settings for the decoder/tuning. The default settings are loaded from ./tuning/settings.ini, and like the data settings can be modified temporarily using the tuning interface.

  • ‘Serialize Hypotheses’: Retrieves examples for the given input sentences (dev/test set) and stores the generated initial hypotheses to file. Hypothesis serialization is currently required before tuning.

  • ‘Run Tuning’: Tunes feature weights used in the decoder. Tuning is run on serialized hypothesis files. The n-best lists, weights and various other data for each iteration are stored in the output folder specified.

  • ‘Generate Reranking Data’: Currently not supported.

  • ‘Evaluate’: Performs evaluation (BLEU) on the test set using the tuned weights found in the specified output folder. Make sure the output folder is properly set before evaluation. Evaluation also currently requires serialized hypotheses.

  • ‘Build Web Output’: Outputs translations for the dev set in a format designed for converting to HTML. For details see the section ‘Web Output’.

  • ‘Save Results’: Packages experimental results into a subfolder of the output folder for easier maintenance. Currently not supported.

  • ‘Quit’: Close the tuner.

Warning: The progress bars and various settings are not yet fully implemented.

Note: All commands run by the script are logged to ./tuning/tuner.log, which is useful for debugging.

Settings

The settings for the decoder can be specified using the ‘Select Data’ and ‘Change Settings’ options accessible from the main menu of tuner.py. The default settings are loaded from defaults.ini and settings.ini in the same folder as the tuning script. These can be modified directly to make permanent changes, and settings can be temporarily changed (they are not saved) using the UI.

The settings mainly concern locations of external tools and experimental data, and options for the decoder (such as beam width). No guarantees are made for the use of the decoder with non-default settings.

A particularly important file is weights.ini, which contains the initial features and weights to be used for tuning. Features whose names are prefixed with an asterisk (*) are ignored during tuning.
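
For illustration, weights from such a file could be loaded as follows; this sketch assumes one feature name and weight per line, which is an assumption about the layout rather than a specification.

# Load weights.ini, skipping features whose names start with an asterisk
# (these are ignored during tuning). The one-pair-per-line layout is an
# assumption, not a documented format.
def load_weights(path):
    weights = {}
    for line in open(path):
        fields = line.split()
        if not fields:
            continue
        name, value = fields[0], fields[1]
        if name.startswith('*'):
            continue
        weights[name] = float(value)
    return weights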

Hypothesis Serialization

In order to perform tuning efficiently, the examples (initial hypotheses) for the development and test sets must first be retrieved and stored. Please follow the steps below:

  • Run tuner (python tuner.py).

  • Choose ‘Serialize Hypotheses (Dev)’.

  • When complete, return to main menu.

  • Choose ‘Serialize Hypotheses (Test)’.

Tuner Output

The main tuning script tuner.py outputs various files into the specified output folder. Please note that the previous contents of the output folder will be deleted each time the tuning script is run.

The files produced include (for iteration $i):

run$i.pro.dat       Tuning output data (PRO).
run$i.megam.dat     Tuning output data (PRO, optimized with MegaM).
run$i.init.dat      Initial weights for iteration $i.
run$i.weights.dat   Tuned weights file; can be passed to KyotoEBMT via the --weight option.
run$i.features.dat  Feature values for the n-best list.
run$i.scores.dat    Scores for the n-best list.
run$i.n_best.dat    n-best output from KyotoEBMT.
run$i.dev.dat       1-best output extracted from the n-best output (dev).
run$i.test.dat      1-best output (test).

Reranking (Currently Unsupported)

  • Run ‘Generate Reranking Data’ in the tuner (make sure to set the paths in the data/settings options).

  • Copy {tuning output directory}/reranker* to the ./tuning/reranker/input directory.

  • Using the feature generators in ./tuning/reranker/feature_*.sh, output feature lists to ./tuning/reranker/features. You only need to generate the features you want to use in reranking.

  • Modify the first line of code in ./tuning/reranker/reranker.py to list the names of all the features you want to include in reranking.

  • Run reranker.py.

  • Happy reranking!

License

KyotoEBMT Example-Based Machine Translation System
Copyright (C) 2014 John Richardson, Fabien Cromieres, Toshiaki Nakazawa and Sadao Kurohashi

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

References

Fabien Cromieres and Sadao Kurohashi. Efficient Retrieval of Tree Translation Examples for Syntax-Based Machine Translation. In Proceedings of EMNLP (2011).

Mark Hopkins and Jonathan May. Tuning as Ranking. In Proceedings of EMNLP (2011).

John Richardson, Fabien Cromieres, Toshiaki Nakazawa and Sadao Kurohashi. KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework (System Demonstration). In Proceedings of ACL (2014).