KyotoEBMT v0.4 Documentation

Welcome to the official documentation for KyotoEBMT v0.4 'Bourbon'. The KyotoEBMT translation framework takes a tree-to-tree approach to Machine Translation, employing syntactic dependency analysis for both source and target languages in order to preserve non-local structure. The system combines online example matching with a flexible decoder, and evaluation demonstrates BLEU scores competitive with state-of-the-art SMT systems such as Moses. This document explains the usage of the KyotoEBMT system (Richardson et al., 2014). See the project homepage for further details and information on how to obtain KyotoEBMT. Please send any questions or comments to:

Contributions

We gratefully welcome your contributions to KyotoEBMT.
All patches and bug reports should be sent by email to John (see above).

Installation

The installation of KyotoEBMT is performed as follows:

    ./configure
    make
    make sample
Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but have not been tested.
Non-included libraries can be obtained from binary repositories or compiled from source. Included libraries are built automatically during KyotoEBMT installation, but can also be installed manually using the ./libs/build_deps.sh script.

Test Run

To confirm the correct installation of KyotoEBMT, the following commands can be run. Note that KyotoEBMT must be run from the directory in which it is located, and that you must run 'make sample' from the top directory after installation before running the commands below.

    cd src
    ./KyotoEBMT

The output should be something like this:

    0 ||| He also read I が bought new book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -165.786 -12.4542 5.8861 ... 7.33333 7.5 8.5 ||| -62.3375
    0 ||| He also read I が bought new a book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -184.207 -14.41 5.8861 ... 7.33333 8 10 ||| -63.4433
    0 ||| He also read I が bought is new book . ||| f= 0 1 0 1 0 0 0 8 -184.207 -184.207 -12.6471 5.8861 ... 7.66667 7.5 8.5 ||| -63.7638
    ...

Preparing training data

The first step in using KyotoEBMT is the preparation of a training corpus. The system requires aligned dependency tree input, so it is necessary first to parse and align your input corpus. Any combination of alignment and parsing tools can be used, so long as the output is formatted as XML (see the example below). KyotoEBMT comes with a sample corpus of 10 parallel Japanese-English sentences, which can be found under ./sample/. Please use this as a reference when preparing your own training data.
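The n-best lines shown in the Test Run section above follow a Moses-style format: sentence id, translation, feature vector (prefixed with "f="), and total score, separated by "|||". As a minimal sketch (not part of KyotoEBMT), such a line can be parsed like this; the example line itself is illustrative, not real decoder output:

```python
def parse_nbest_line(line):
    """Split one n-best line into (sentence id, tokens, features, score)."""
    fields = [f.strip() for f in line.split("|||")]
    sent_id = int(fields[0])
    tokens = fields[1].split()
    # Drop the leading "f=" marker and convert the feature values to floats.
    feats = [float(x) for x in fields[2].split()[1:]]
    score = float(fields[3])
    return sent_id, tokens, feats, score

# Illustrative line in the same shape as the sample output above.
line = "0 ||| He also read a book . ||| f= 0 1 0 1 -184.207 ||| -62.3375"
sent_id, tokens, feats, score = parse_nbest_line(line)
```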
    <para_sentence id="1">
      <i_sentence></i_sentence>
      <j_sentence></j_sentence>
      <i_data>
        <phrase dpnd="1" cat="NONE">
          <word lem="私/わたし" pos="名詞:普通名詞" content_p="1">私</word>
        </phrase>
        <phrase dpnd="4" cat="体言">
          <word lem="は/は" pos="助詞:副助詞" content_p="0">は</word>
        </phrase>
        <phrase dpnd="3" cat="NONE">
          <word lem="本/ほん" pos="名詞:普通名詞" content_p="1">本</word>
        </phrase>
        <phrase dpnd="4" cat="体言">
          <word lem="を/を" pos="助詞:格助詞" content_p="0">を</word>
        </phrase>
        <phrase dpnd="-1" cat="用言:動">
          <word lem="読む/よむ" pos="動詞" content_p="1">読んだ</word>
        </phrase>
        <phrase dpnd="4" pnum="5" cat="NONE">
          <word lem="。/。" pos="特殊:句点" content_p="0">。</word>
        </phrase>
      </i_data>
      <j_data>
        <phrase dpnd="1" cat="NP">
          <word lem="i" pos="PRP" content_p="1">I</word>
        </phrase>
        <phrase dpnd="-1" cat="S1">
          <word lem="read" pos="VBP" content_p="1">read</word>
        </phrase>
        <phrase dpnd="3" cat="DT">
          <word lem="a" pos="DT" content_p="0">a</word>
        </phrase>
        <phrase dpnd="1" cat="NP">
          <word lem="book" pos="NN" content_p="1">book</word>
        </phrase>
        <phrase dpnd="1" cat=".">
          <word lem="." pos="." content_p="1">.</word>
        </phrase>
      </j_data>
      <match i_p="0" j_p="0">
        <word i_w="0" j_w="0"></word>
      </match>
      <match i_p="2" j_p="2 3">
        <word i_w="2" j_w="2 3"></word>
      </match>
      <match i_p="4" j_p="1">
        <word i_w="4" j_w="1"></word>
      </match>
      <match i_p="5" j_p="4">
        <word i_w="5" j_w="4"></word>
      </match>
    </para_sentence>

Currently we require that each word is given by a <word> tag enclosed inside a <phrase> tag. A <phrase> can only contain one <word>. The information required for each phrase/word pair is the word itself (e.g. <word>cat</word>) and the following additional tags. Of these, 'dpnd' must be set correctly; the others are used only for language-dependent processing and can be set as you wish (or left blank).
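When preparing your own training data it is easy to violate these constraints. As a minimal sketch (not an official tool), the following checks one dependency tree: every <phrase> holds exactly one <word> and carries a 'dpnd' attribute that is either -1 (the single root) or a valid phrase index:

```python
import xml.etree.ElementTree as ET

def check_tree(data_elem):
    """Verify the phrase/word structure of one <i_data> or <j_data> tree."""
    phrases = data_elem.findall("phrase")
    roots = 0
    for p in phrases:
        assert len(p.findall("word")) == 1, "a <phrase> must hold one <word>"
        dpnd = int(p.get("dpnd"))
        assert dpnd == -1 or 0 <= dpnd < len(phrases), "dpnd out of range"
        roots += (dpnd == -1)
    assert roots == 1, "each tree needs exactly one root (dpnd=-1)"
    return len(phrases)

# A tiny two-phrase tree in the format shown above.
xml = """<para_sentence id="1">
  <i_data>
    <phrase dpnd="1" cat="NONE"><word lem="本/ほん">本</word></phrase>
    <phrase dpnd="-1" cat="用言:動"><word lem="読む/よむ">読んだ</word></phrase>
  </i_data>
</para_sentence>"""
n = check_tree(ET.fromstring(xml).find("i_data"))
```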
Alignment information is stored in the form of <match> tags containing <word> tags with identical alignment information (see the example above).

Example Database

The next step is to extract examples from the training corpus in order to build the example database. This is performed with the tool TR++ (Cromières and Kurohashi, 2011). The commands below will build a database under database-path from the input file xml-file.

    cd TR++
    make
    ./createData++ database-path xml-file

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. For convenience, we have included a tool to build a compatible language model. This requires KenLM to be installed (see here for instructions).

    ./utils/build_lm.sh input-file output-file

(Optional) Other Data

There are various other sources of information that can be used by KyotoEBMT as features for decoding. Currently supported are the following:

Dictionary

A tab-separated file containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database.

    バンジョー	banjo
    ギター	guitar

Default dependency information

When it is unclear whether an 'additional' word should be attached as a left or right child, we first try to use the rules (left or right) in the 'default dependency' file. This file can be generated automatically from the training corpus XML as follows. The script requires the XML Perl module, available from CPAN.

    ./utils/build_default_dpnd.pl --trans_dir ja-en --print_all output-prefix < input-file

Lexical translation probability

The lexical translation probabilities obtained from alignment can be used as a feature for decoding. These can be generated from GIZA++ output with the build_lex_trans_prob.pl script as follows.
    ./utils/build_lex_trans_prob.pl corpus/fr.vcb corpus/en.vcb giza.fr-en/fr-en.n3.final giza.fr-en/fr-en.t3.final lex_i2j > output-file && \
    ./utils/build_lex_trans_prob.pl corpus/en.vcb corpus/fr.vcb giza.en-fr/en-fr.n3.final giza.en-fr/en-fr.t3.final lex_j2i >> output-file

Translation

The decoder is run with the following command. Note that it is currently necessary to run the decoder from the directory in which it is located.

    cd src
    ./KyotoEBMT [options] input-file

The required input format is XML, similar to that for the training corpus but with only one language and no alignment information. An example is given below:

    <?xml version="1.0" encoding="utf-8"?>
    <article>
      <sentence id="0">
        <i_data>
          <phrase dpnd="1" cat="NONE">
            <word lem="彼/かれ" pos="名詞:普通名詞" content_p="1">彼</word>
          </phrase>
          <phrase dpnd="8" cat="体言">
            <word lem="も/も" pos="助詞:副助詞" content_p="0">も</word>
          </phrase>
          <phrase dpnd="3" cat="NONE">
            <word lem="私/わたし" pos="名詞:普通名詞" content_p="1">私</word>
          </phrase>
          <phrase dpnd="4" cat="体言">
            <word lem="が/が" pos="助詞:格助詞" content_p="0">が</word>
          </phrase>
          <phrase dpnd="6" cat="NONE">
            <word lem="買う/かう" pos="動詞" content_p="1">買った</word>
          </phrase>
          <phrase dpnd="6" cat="用言:形">
            <word lem="新しい/あたらしい" pos="形容詞" content_p="1">新しい</word>
          </phrase>
          <phrase dpnd="7" cat="NONE">
            <word lem="本/ほん" pos="名詞:普通名詞" content_p="1">本</word>
          </phrase>
          <phrase dpnd="8" cat="体言">
            <word lem="を/を" pos="助詞:格助詞" content_p="0">を</word>
          </phrase>
          <phrase dpnd="-1" cat="用言:動">
            <word lem="読む/よむ" pos="動詞" content_p="1">読んだ</word>
          </phrase>
          <phrase dpnd="8" cat="NONE">
            <word lem="。/。" pos="特殊:句点" content_p="0">。</word>
          </phrase>
        </i_data>
      </sentence>
    </article>

A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.
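If your parser does not emit this format directly, the decoder input XML above can be generated programmatically. The following is a minimal sketch (the token tuple layout is an assumption for illustration, matching the fields in the sample):

```python
import xml.etree.ElementTree as ET

def build_input(sent_id, tokens):
    """Build decoder input XML from (surface, lemma, pos, dpnd, content_p) tuples."""
    article = ET.Element("article")
    sentence = ET.SubElement(article, "sentence", id=str(sent_id))
    i_data = ET.SubElement(sentence, "i_data")
    for surface, lem, pos, dpnd, content_p in tokens:
        # 'cat' is left as NONE here; fill in real categories if available.
        phrase = ET.SubElement(i_data, "phrase", dpnd=str(dpnd), cat="NONE")
        word = ET.SubElement(phrase, "word", lem=lem, pos=pos,
                             content_p=str(content_p))
        word.text = surface
    return ET.tostring(article, encoding="unicode")

# Toy two-token sentence: the first token depends on the second (the root).
tokens = [("彼", "彼/かれ", "名詞:普通名詞", 1, 1),
          ("読んだ", "読む/よむ", "動詞", -1, 1)]
xml = build_input(0, tokens)
```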
For full details, run the command:

    ./KyotoEBMT -h

Tuning

tuner.py

The feature weights for KyotoEBMT can be tuned using the scripts in the ./tuning directory, in particular with the command-line user interface tuner.py (see screenshot below). The tuning algorithm currently supported is PRO (Hopkins and May, 2011). From the main menu, the following options are available:
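As background, PRO (Hopkins and May, 2011) casts tuning as ranking: pairs of hypotheses are sampled from the n-best lists, pairs with a large metric-score gap are kept, and the feature-vector differences form a binary classification problem whose learned weights become the new decoder weights. The following is a rough sketch of the sampling step only, not the actual tuner.py implementation; the parameter values are illustrative:

```python
import random

def sample_pro_pairs(nbest, n_samples=100, n_keep=50, min_gap=0.05):
    """nbest: list of (feature_vector, metric_score). Returns (diff, label) pairs."""
    pairs = []
    for _ in range(n_samples):
        (f1, g1), (f2, g2) = random.sample(nbest, 2)
        if abs(g1 - g2) >= min_gap:          # keep only clearly ranked pairs
            pairs.append((abs(g1 - g2), f1, g1, f2, g2))
    pairs.sort(key=lambda p: -p[0])          # prefer the largest-gap pairs
    examples = []
    for _, f1, g1, f2, g2 in pairs[:n_keep]:
        diff = [a - b for a, b in zip(f1, f2)]
        examples.append((diff, 1 if g1 > g2 else 0))
    return examples

# Toy n-best list: two features per hypothesis, with sentence-level scores.
nbest = [([1.0, 0.0], 0.30), ([0.0, 1.0], 0.10), ([0.5, 0.5], 0.22)]
examples = sample_pro_pairs(nbest)
```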
Warning: The progress bars and various settings are not yet fully implemented.

Note: All commands run by the script are logged to ./tuning/tuner.log. This is useful for debugging.

Settings

The settings for the decoder can be specified using the 'Select Data' and 'Change Settings' options accessible from the main menu of tuner.py. The default settings are loaded from defaults.ini and settings.ini in the same folder as the tuning script. These files can be modified directly to make permanent changes, and settings can also be changed temporarily (they are not saved) through the UI. The settings mainly concern the locations of external tools and experimental data, and options for the decoder (such as beam width). No guarantees are made for the use of the decoder with non-default settings. A particularly important file is weights.ini, which contains the initial features and weights to be used for tuning. Features preceded by an asterisk (*) will be ignored during tuning.

Hypothesis Serialization

In order to perform tuning efficiently, it is necessary to retrieve and store the necessary examples (initial hypotheses) for the development and test sets. Please follow the steps below:
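The asterisk convention for weights.ini can be handled as in the following sketch. The exact weights.ini syntax is not shown in this document, so the "name value" line format assumed here is hypothetical; only the asterisk rule is taken from the description above:

```python
def load_weights(lines):
    """Split features into tunable and fixed; '*'-prefixed names are fixed."""
    tunable, fixed = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, value = line.split()       # assumed format: "name value"
        if name.startswith("*"):
            fixed[name.lstrip("*")] = float(value)   # ignored during tuning
        else:
            tunable[name] = float(value)
    return tunable, fixed

tunable, fixed = load_weights(["lm 0.5", "*length -0.1"])
```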
Tuner Output

The main tuning script tuner.py outputs various files into the specified output folder. Please note that the previous contents of the output folder will be deleted each time the tuning script is run. The files produced include (for iteration $i):
Reranking (Currently Unsupported)
License

KyotoEBMT Example-Based Machine Translation System

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

References

Fabien Cromières and Sadao Kurohashi. Efficient Retrieval of Tree Translation Examples for Syntax-Based Machine Translation. In Proceedings of EMNLP (2011).

Mark Hopkins and Jonathan May. Tuning as Ranking. In Proceedings of EMNLP (2011).

John Richardson, Fabien Cromières, Toshiaki Nakazawa and Sadao Kurohashi. KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework (System Demonstration). In Proceedings of ACL (2014).