KyotoEBMT v0.6 Example-Based Machine Translation System

Welcome to the official documentation for KyotoEBMT v0.6 'Whisky'. The KyotoEBMT translation framework takes a tree-to-tree approach to Machine Translation, employing syntactic dependency analysis for both source and target languages in an attempt to preserve non-local structure. The effectiveness of our system is maximized with online example matching and a flexible decoder. Evaluation demonstrates BLEU scores competitive with state-of-the-art SMT systems such as Moses. This document explains the usage of the KyotoEBMT system (Richardson et al., 2014). See the project homepage for further details. The source can be downloaded here. Please send any questions or comments to:

Contributions

We gratefully welcome your contributions to KyotoEBMT.
All patches/bug reports should be sent by email to John (see above).

Version History
See here for old documentation (v0.4 'Bourbon', v0.5 'Sake').

Evaluation

Evaluation results (BLEU) for the ASPEC data set (see here).
Installation

The installation of KyotoEBMT is performed as follows:

    autoreconf -i
    ./configure
    make
    make sample
Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but are not tested.
Non-included libraries can be obtained from binary repositories or compiled from source. Included libraries are built automatically during KyotoEBMT installation, but can also be installed manually using the libs/build_deps.sh script.

Test Run

To confirm the correct installation of KyotoEBMT, the following commands can be run. Note that KyotoEBMT must be run from the directory in which it is located. Note also that you must run 'make sample' from the top directory after installation in order to run KyotoEBMT with the commands below.

    bin/KyotoEBMT

The output should be something like this:

    0 ||| He also read a book bought the i new . ||| f= 0 0 0 7 0 1 ... 8.26565 0 7 0 8 8 10 ||| -174.543
    0 ||| He also read a book bought the my new . ||| f= 0 0 0 7 0 1 ... 8.26565 0 7 0 8 7.2 9.2 ||| -176.913
    0 ||| He also read the i bought a book new . ||| f= 0 0 0 7 0 1 ... 8.26565 0 7 0 8 8 10 ||| -177.234

Preparing Training Data

The first step required to use KyotoEBMT is the preparation of a training corpus. The system requires aligned dependency trees as input, so it is necessary first to parse and align your input corpus. Any combination of alignment and parsing tools can be used, as long as the output is converted to the formats below. KyotoEBMT is distributed with a sample corpus of 10 parallel Japanese-English sentences, which can be found under sample/corpus/. Please use this as a reference when preparing your own training data. In this version (v0.6) we unfortunately do not support automatic parsing and alignment using the Experiment Management System (EMS); this will be added in the near future. In the meantime, please generate your own parses and alignments in the following formats.

Parse Format

The parse format is similar to the well-known CoNLL format. Each sentence is made up of a list of tokens, one token per line, with a tab-separated list giving the attributes for each token. Sentences are delimited by empty lines.
For each sentence a metadata line must be defined. This consists of the sentence ID (anything alphanumeric is fine) and the sentence parse score (used as a feature; it can always be set to zero). Token attributes are as follows. They are used as features inside the decoder and may be set as you wish; however, the dependency field (DPND) must be set correctly for any kind of translation.

    ID DPND CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER
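As an illustration of the layout, a minimal reader for this format can be sketched as follows. This helper is not part of KyotoEBMT; the field names simply follow the attribute list above.

```python
def read_parses(lines):
    """Yield (metadata, tokens) pairs from a KyotoEBMT-style parse file.

    Each sentence starts with a '# ID=... SCORE=...' metadata line,
    followed by one tab-separated token per line; sentences are
    delimited by empty lines.
    """
    fields = ["ID", "DPND", "CONTENT", "LEMMA", "POS",
              "CONTENT_P", "CAT", "TYPE", "OTHER"]
    meta, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # empty line ends a sentence
            if meta is not None:
                yield meta, tokens
            meta, tokens = None, []
        elif line.startswith("#"):        # metadata line
            meta = dict(kv.split("=", 1) for kv in line[1:].split())
        else:                             # token line
            tokens.append(dict(zip(fields, line.split("\t"))))
    if meta is not None:                  # last sentence without trailing blank
        yield meta, tokens
```

For the first sample sentence below, token 3 ('book') would carry DPND=1, i.e. it depends on token 1 ('read').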
Parse format example:

    # ID=SAMPLE_TRAIN_00001 SCORE=-15.1648
    0   1   i          i          PRP  1   NP  _  NP:PRP/I/0
    1   -1  read       read       VBP  1   S1  _  S1:S:VP:VBP/read/1
    2   3   a          a          DT   0   DT  _  DT/a/2
    3   1   book       book       NN   1   NP  _  NP:NN/book/3
    4   1   .          .          .    -1  .   _  ././4

    # ID=SAMPLE_TRAIN_00002 SCORE=-15.4925
    0   1   i          i          PRP  1   NP  _  NP:PRP/I/0
    1   -1  read       read       VBP  1   S1  _  S1:S:VP:VBP/read/1
    2   3   a          a          DT   0   DT  _  DT/a/2
    3   1   newspaper  newspaper  NN   1   NP  _  NP:NN/newspaper/3
    4   1   .          .          .    -1  .   _  ././4
    ...

Alignment Format

The alignment format is similar to the Pharaoh format used by Moses. Odd lines consist of the sentence ID preceded by a hash mark. Even lines consist of space-delimited alignments of the form s_1,...,s_m-t_1,...,t_n, where source IDs s_1, ..., s_m are aligned to target IDs t_1, ..., t_n.

Alignment format example:

    # SAMPLE_TRAIN_00001
    0-0 4-1 1,3-2 2-3 5-4
    # SAMPLE_TRAIN_00002
    0-0 4-1 1,3-2 2-3 5-4
    ...

Dictionary

A dictionary file for words and punctuation can be specified (dic/words.SRC-TRG and dic/punc.SRC-TRG in the EMS experiment directory, or with the --dic_words_filename/--dic_punc_filename command-line options). Dictionaries are tab-separated files containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database. Example:

    バンジョー  banjo
    ギター      guitar

Experiment Management System (EMS)

The KyotoEBMT Experiment Management System (EMS) can be used to run experiments end-to-end. Please note that we do not include parsers and aligners with this toolkit; however, you may use your own tools (e.g. GIZA++). The EMS works like a Makefile, in that it defines a number of targets ('stages') and dependencies between them. When the user requests a target stage, the EMS will calculate all incomplete required dependencies and execute the required stages. To run the EMS, use the following command (from inside the ems/ directory). After confirming the stages to be run, the EMS will conduct an experiment and return the results.
    cd ems
    python ems.py var

Here var is the path to your configuration file. A sample configuration file can be found at ems/var. Experiments can be cloned for future use, saving disk space, with the ems/init.py tool. The directory searched for experiments to clone is defined in the ems/clone_dir file.

Configuration

The settings for the EMS are specified in the var file within the experiment directory. The settings mainly concern the locations of external tools and experimental data, and options for the decoder (such as beam width). Please use non-default settings at your own risk!

Stages and Files

The available stages are as follows:
Each stage has corresponding data, which is stored under the experiment root directory ROOT_DIR in a subdirectory with the same name as the stage (e.g. ROOT_DIR/corpus, ROOT_DIR/rerank).

Manual Operation

This section describes how to use each component of the translation system without the EMS.

Example Database

This step extracts examples from the training corpus in order to build the example database. Before the database can be built, the training corpus must be parsed and aligned. The example database is built with the tool TR++ (Cromières and Kurohashi, 2011). The command below builds a database database from the source parse file source, target parse file target and alignment file align.

    bin/createData++ database "source|target|align"

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. For convenience, we have included a tool to build a compatible language model. This requires KenLM to be installed (see here for instructions).

    utils/build_lm.sh input-file output-file

Lexical Translation Probabilities

The lexical translation probabilities obtained from alignment are used as a feature for decoding. These can be generated from GIZA++ output with the build_lex_trans_prob.pl script as follows.

    utils/build_lex_trans_prob.pl corpus/fr.vcb corpus/en.vcb giza.fr-en/fr-en.n3.final giza.fr-en/fr-en.t3.final lex_i2j > ltp && \
    utils/build_lex_trans_prob.pl corpus/en.vcb corpus/fr.vcb giza.en-fr/en-fr.n3.final giza.en-fr/en-fr.t3.final lex_j2i >> ltp

Hypothesis Serialization

In order to perform tuning efficiently, it is necessary to retrieve and store the relevant examples (initial hypotheses) in an intermediate file known as a 'hypothesis file'. To generate hypotheses, use the -m hypothesis command-line option with KyotoEBMT.
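As background for the lexical translation probability feature described above: such probabilities are typically relative-frequency estimates over aligned word pairs. The sketch below illustrates that estimate only; it is not a reimplementation of build_lex_trans_prob.pl, and the function name is our own.

```python
from collections import Counter

def lexical_translation_probs(aligned_pairs):
    """Relative-frequency estimate p(t|s) = count(s, t) / count(s),
    computed over a list of aligned (source_word, target_word) pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(s for s, _ in aligned_pairs)
    return {(s, t): c / source_counts[s]
            for (s, t), c in pair_counts.items()}
```

For example, if "chat" is aligned to "cat" twice and to "kitty" once across the corpus, p(cat|chat) = 2/3 and p(kitty|chat) = 1/3. The script above computes these estimates in both directions (lex_i2j and lex_j2i), one direction per invocation.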
Translation

The decoder is run with the following command:

    bin/KyotoEBMT [options] input-file

The input can be either a parsed sentence (the format is the same as for training) or an intermediate hypothesis file (add --input_mode hypothesis). A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.
For full details, see:

    bin/KyotoEBMT -h

Tuning and Reranking

We currently support k-best batch MIRA (Cherry and Foster, 2012) (the default) and PRO (Hopkins and May, 2011). Tuning is only available when using the EMS. We also support reranking using recurrent neural network language models (RNNLMs) (Mikolov et al., 2010). This is enabled by specifying STAGE_RERANK in the EMS. The files produced during tuning include (for iteration i):
License

KyotoEBMT Example-Based Machine Translation System

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

References

Colin Cherry and George Foster. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of HLT-NAACL (2012).
Fabien Cromières and Sadao Kurohashi. Efficient Retrieval of Tree Translation Examples for Syntax-Based Machine Translation. In Proceedings of EMNLP (2011).
Mark Hopkins and Jonathan May. Tuning as Ranking. In Proceedings of EMNLP (2011).
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky and Sanjeev Khudanpur. Recurrent Neural Network Based Language Model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (2010).
John Richardson, Fabien Cromières, Toshiaki Nakazawa and Sadao Kurohashi. KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework (System Demonstration). In Proceedings of ACL (2014).