KyotoEBMT v0.6 Example-Based Machine Translation System

Welcome to the official documentation for KyotoEBMT v0.6 ‘Whisky’.

The KyotoEBMT translation framework takes a tree-to-tree approach to Machine Translation, employing syntactic dependency analysis for both source and target languages in an attempt to preserve non-local structure. The system combines online example matching with a flexible decoder. Evaluation demonstrates BLEU scores competitive with state-of-the-art SMT systems such as Moses.

This document explains the usage of the KyotoEBMT system (Richardson et al., 2014). See the project homepage for further details.

The source can be downloaded here.

Please send any questions or comments to:

[email address image]

Contributions

We gratefully welcome your contributions to KyotoEBMT.

  • Bug reports: Please give a detailed explanation of anything you feel is not working correctly.

  • Fixes: Please send a patch (diff) with a short explanation of your changes. We use the Google C++ Style Guide with some minor variations.

  • New features: Anything is welcome, but please check first that your feature is not currently being worked on by someone else.

All patches/bug reports should be sent by email to John (see above).

Version History

  • 2015/03/31 v0.6 ‘Whisky’

    • Experiment Management System (EMS)

    • Example similarity, alignment score, word similarity features

    • Decoding memory and speed optimisations

  • 2014/10/04 v0.5 ‘Sake’ (@WAT2014)

    • Forest/n-best input (currently not supported)

    • Reranking with RNNLM (Mikolov et al., 2010)

  • 2014/06/18 v0.4 ‘Bourbon’ (@ACL2014)

    • Lexical translation probability feature

    • Tuner improvements

    • First public release

  • 2014/02/04 v0.3 ‘Champagne’

    • Many new example features

    • Lattice decoding

    • UI for tuning/serialisation

  • 2013/8/23 v0.2 ‘Mojito’

    • Improved decoding algorithm (also experimental ‘lazy’ algorithm)

    • Siblings supported

    • Language-dependent modules and post-processing

    • Hypothesis serialisation

    • Dependency tree output

  • 2013/5/13 v0.1 ‘Raki’

    • Perl version rewritten in C++

    • Tuning scripts, web output, logging

See here for old documentation (v0.4 ‘Bourbon’, v0.5 ‘Sake’).

Evaluation

Evaluation results (BLEU) for ASPEC data set (see here).

System         JA-EN  EN-JA  JA-ZH  ZH-JA
Moses          18.48  27.33  27.25  33.94
KyotoEBMT 0.4  18.47  26.13  26.54  30.98
KyotoEBMT 0.5  20.05  29.64  26.36  33.22
KyotoEBMT 0.6  21.00  29.13  27.46  34.08

Installation

The installation of KyotoEBMT is performed as follows:

autoreconf -i
./configure
make
make sample

  • Use ./configure --with-boost=XXX to specify a non-default location for Boost.

  • Use ‘make debug’ to enable debugging.

  • On Mac OS X, we recommend installing ICU from source. If you use a MacPorts/Homebrew ICU package, you may need to add include/linker paths manually to libs/build_deps.sh to compile KenLM.

Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but are not tested.

  • GCC 4.8.1 (https://gcc.gnu.org)

  • Boost 1.55.0 (http://www.boost.org)

  • Python 2.7.3 (https://www.python.org)

  • ICU 52.1 (http://site.icu-project.org)

  • Moses 3.0 (http://www.statmt.org/moses)

  • (Included) Google Logging Library (http://code.google.com/p/google-glog)

  • (Included) KenLM (http://kheafield.com/code/kenlm)

  • (Included) Urwid (http://urwid.org)

  • (Included) UTF8-CPP (http://utfcpp.sourceforge.net)

  • (Optional, PRO tuning) MegaM (http://www.umiacs.umd.edu/~hal/megam)

  • (Optional, parallel training) GXP (http://www.logos.ic.i.u-tokyo.ac.jp/gxp)

  • (Optional, RNNLM reranking) RNNLM (http://www.fit.vutbr.cz/~imikolov/rnnlm)

Non-included libraries can be obtained from binary repositories or compiled from source. Included libraries are built automatically during KyotoEBMT installation, but can also be installed manually using the libs/build_deps.sh script.

Test Run

To confirm the correct installation of KyotoEBMT, run the following command. Note that KyotoEBMT must be run from the directory in which it is located. Note also that you must run ‘make sample’ from the top directory after installation in order to run KyotoEBMT with the command below.

bin/KyotoEBMT

The output should be something like this:

0 ||| He also read a book bought the i new .  ||| f= 0 0 0 7 0 1 ... 8.26565 0 7 0 8 8 10  ||| -174.543
0 ||| He also read a book bought the my new .  ||| f= 0 0 0 7 0 1 .. 8.26565 0 7 0 8 7.2 9.2  ||| -176.913
0 ||| He also read the i bought a book new .  ||| f= 0 0 0 7 0 1 .. 8.26565 0 7 0 8 8 10  ||| -177.234
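
As the example shows, each line of the n-best output consists of four ‘|||’-separated fields: the sentence ID, the translation, the feature values (prefixed with ‘f=’) and the total score. The Python sketch below is our own illustration (not part of the toolkit) of how such a line can be split into its fields.

# Illustrative sketch only: split one n-best line of the form
# "id ||| translation ||| f= f1 f2 ... ||| score" into its fields.
def parse_nbest_line(line):
    sent_id, translation, features, score = [part.strip() for part in line.split("|||")]
    return {
        "id": sent_id,
        "translation": translation,
        "features": [float(x) for x in features.split()[1:]],  # drop the leading "f="
        "score": float(score),
    }

line = "0 ||| He also read a book bought the i new . ||| f= 0 0 0 7 0 1 ||| -174.543"
print(parse_nbest_line(line)["score"])  # -174.543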

Preparing Training Data

The first step required to use KyotoEBMT is the preparation of a training corpus. The system requires aligned dependency trees as input, so it is necessary first to parse and align your input corpus. Any combination of alignment and parsing tools can be used, so long as the output is converted to the format below.

KyotoEBMT is distributed with a sample corpus of 10 parallel Japanese-English sentences, which can be found under sample/corpus/. Please use this as a reference when preparing your own training data.

In this version (v0.6) we unfortunately do not yet support automatic parsing and alignment using the Experiment Management System (EMS); however, this will be added in the near future. In the meantime, please generate your own parses and alignments in the following formats.

Parse Format

The parse format is similar to the well-known CoNLL format. Each sentence is made up of a list of tokens, one token per line, with a tab-separated list giving the attributes for each token. Sentences are delimited by empty lines.

For each sentence a metadata line must be defined. This consists of the sentence ID (anything alphanumeric is fine) and the sentence parse score (used as a feature, can always be set to zero).

Token attributes are as follows. They are used as features inside the decoder and may be set as you wish; however, the dependencies must be set correctly for any kind of translation.

ID DPND CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER

ID         token ID (0, 1, …, n-1 for n tokens)
DPND       ID of parent in dependency tree (-1 = root)
CONTENT    surface form of token
LEMMA      lemmatized form
POS        part of speech
CONTENT_P  is this a content word? (1 = yes, 0 = no, -1 = punctuation)
CAT        phrase category
TYPE       phrase type
OTHER      any other information

Parse format example:

# ID=SAMPLE_TRAIN_00001 SCORE=-15.1648
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	book	book	NN	1	NP	_	NP:NN/book/3
4	1	.	.	.	-1	.	_	././4

# ID=SAMPLE_TRAIN_00002 SCORE=-15.4925
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	newspaper	newspaper	NN	1	NP	_	NP:NN/newspaper/3
4	1	.	.	.	-1	.	_	././4

...
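
For reference, the following Python sketch (an illustration we provide here, not part of the toolkit) reads files in this format into per-sentence lists of token attribute dictionaries, using the attribute names from the table above.

import io

# Illustrative sketch only: read the parse format into (metadata, tokens) pairs.
FIELDS = ["ID", "DPND", "CONTENT", "LEMMA", "POS", "CONTENT_P", "CAT", "TYPE", "OTHER"]

def read_parses(path):
    sentences, meta, tokens = [], None, []
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line ends a sentence
                if tokens:
                    sentences.append((meta, tokens))
                meta, tokens = None, []
            elif line.startswith("#"):   # metadata line: "# ID=... SCORE=..."
                meta = dict(item.split("=", 1) for item in line[1:].split())
            else:                        # token line: tab-separated attributes
                tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append((meta, tokens))
    return sentences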

Alignment Format

The alignment format is similar to the Pharaoh format used by Moses. Odd lines consist of the sentence ID beginning with a hash mark. Even lines consist of space-delimited alignments, which are of the format s_1,…,s_m-t_1,…,t_n where source IDs s_1, …, s_m are aligned to target IDs t_1, …, t_n.

Alignment format example:

# SAMPLE_TRAIN_00001
0-0 4-1 1,3-2 2-3 5-4
# SAMPLE_TRAIN_00002
0-0 4-1 1,3-2 2-3 5-4
...
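
As an illustration (our own code, not part of the toolkit), the following Python snippet expands one alignment line into individual (source ID, target ID) pairs:

# Illustrative sketch only: expand "0-0 4-1 1,3-2 2-3 5-4" into (source, target) pairs.
def expand_alignments(line):
    pairs = []
    for group in line.split():
        src, tgt = group.split("-")            # e.g. "1,3-2" -> "1,3" and "2"
        for s in src.split(","):
            for t in tgt.split(","):
                pairs.append((int(s), int(t)))
    return pairs

print(expand_alignments("0-0 4-1 1,3-2 2-3 5-4"))
# [(0, 0), (4, 1), (1, 2), (3, 2), (2, 3), (5, 4)]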

Dictionary

A dictionary file for words and punctuation can be specified (dic/words.SRC-TRG and dic/punc.SRC-TRG in the EMS experiment directory, or with the --dic_words_filename/--dic_punc_filename command-line options). Dictionaries are tab-separated files containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database.

Example:

バンジョー  banjo
ギター  guitar
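
Such a file can be loaded with a few lines of Python. The sketch below is only an illustration (it is not part of the toolkit) and assumes one tab-separated source-target pair per line.

import io

# Illustrative sketch only: load a tab-separated dictionary into a dict of lists.
def load_dictionary(path):
    entries = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            source, target = line.split("\t")
            entries.setdefault(source, []).append(target)
    return entries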

Experiment Management System (EMS)

The KyotoEBMT Experiment Management System (EMS) can be used to run experiments with KyotoEBMT from end to end. Please note that parsers and aligners are not included with this toolkit; however, you may use your own tools (e.g. GIZA++).

The EMS works like a Makefile, in that it defines a number of targets (‘stages’) and dependencies between them. When the user requests a target stage, the EMS will calculate all incomplete required dependencies and execute the required stages.

To run the EMS, use the following command (from inside the ems/ directory). After confirming the stages to be run, the EMS will conduct an experiment and return the results.

cd ems
python ems.py var

var is the path to your configuration file. A sample configuration file can be found at ems/var.

Experiments can be cloned for future use with the ems/init.py tool, which saves disk space. The directory searched for experiments to clone is defined in the ems/clone_dir file.

Configuration

The settings for the EMS are specified in the var file within the experiment directory. The settings mainly concern locations of external tools and experimental data, and options for the decoder (such as beam width). Please use non-default settings at your own risk!

Stages and Files

The available stages are as follows:

Stage          Description                          Dependencies
STAGE_CORPUS   Prepare training corpus.             None
STAGE_DIC      Prepare dictionary.                  None
STAGE_PARSE    Parse training corpus.               STAGE_CORPUS
STAGE_SEGMENT  Segment parsed corpus.               STAGE_PARSE
STAGE_ALIGN    Align training corpus.               STAGE_SEGMENT
STAGE_LM       Train language model.                STAGE_SEGMENT
STAGE_EDB      Build example database.              STAGE_PARSE, STAGE_ALIGN
STAGE_HYP      Build initial hypotheses.            STAGE_EDB, STAGE_DIC
STAGE_TUNING   Tune on development set.             STAGE_HYP, STAGE_LM
STAGE_EVAL     Translate and evaluate on test set.  STAGE_TUNING
STAGE_RNNLM    Build RNNLM model.                   STAGE_EVAL
STAGE_RERANK   Rerank with RNNLM.                   STAGE_EVAL
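
To illustrate the Makefile-like behaviour, the Python sketch below (our own illustration, not code taken from ems.py) encodes the dependency table above and computes which stages must run, in order, to reach a requested target; stages already completed are skipped.

# Illustrative sketch only: resolve the stages needed for a target, dependencies first.
DEPENDENCIES = {
    "STAGE_CORPUS":  [],
    "STAGE_DIC":     [],
    "STAGE_PARSE":   ["STAGE_CORPUS"],
    "STAGE_SEGMENT": ["STAGE_PARSE"],
    "STAGE_ALIGN":   ["STAGE_SEGMENT"],
    "STAGE_LM":      ["STAGE_SEGMENT"],
    "STAGE_EDB":     ["STAGE_PARSE", "STAGE_ALIGN"],
    "STAGE_HYP":     ["STAGE_EDB", "STAGE_DIC"],
    "STAGE_TUNING":  ["STAGE_HYP", "STAGE_LM"],
    "STAGE_EVAL":    ["STAGE_TUNING"],
    "STAGE_RNNLM":   ["STAGE_EVAL"],
    "STAGE_RERANK":  ["STAGE_EVAL"],
}

def stages_to_run(target, completed=frozenset()):
    order = []
    def visit(stage):
        if stage in completed or stage in order:
            return
        for dep in DEPENDENCIES[stage]:
            visit(dep)
        order.append(stage)
    visit(target)
    return order

print(stages_to_run("STAGE_EVAL", completed={"STAGE_CORPUS", "STAGE_DIC"}))
# ['STAGE_PARSE', 'STAGE_SEGMENT', 'STAGE_ALIGN', 'STAGE_EDB', 'STAGE_HYP',
#  'STAGE_LM', 'STAGE_TUNING', 'STAGE_EVAL']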

Each stage has corresponding data, which is stored under the experiment root directory ROOT_DIR in a subdirectory with the same name as the stage (e.g. ROOT_DIR/corpus, ROOT_DIR/rerank).

Manual Operation

This section describes how to use each component of the translation system without the EMS.

Example Database

This step extracts examples from the training corpus in order to build the example database. Before the database can be built, the training corpus must be parsed and aligned. The example database is built with the tool TR++ (Cromières and Kurohashi, 2011). The command below builds a database file database from the source parse file source, the target parse file target and the alignment file align (substitute your own file names).

bin/createData++ database "source|target|align"

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include them. For convenience, we have included a tool to build a compatible language model. This requires KenLM to be installed (see here for instructions).

utils/build_lm.sh input-file output-file

Lexical Translation Probabilities

The lexical translation probabilities obtained from alignment are used as a feature for decoding. These can be generated from GIZA++ output with the build_lex_trans_prob.pl script as follows.

utils/build_lex_trans_prob.pl corpus/fr.vcb corpus/en.vcb giza.fr-en/fr-en.n3.final giza.fr-en/fr-en.t3.final lex_i2j > ltp && \
utils/build_lex_trans_prob.pl corpus/en.vcb corpus/fr.vcb giza.en-fr/en-fr.n3.final giza.en-fr/en-fr.t3.final lex_j2i >> ltp

Hypothesis Serialization

In order to perform tuning efficiently, the required examples (initial hypotheses) must first be retrieved and stored in an intermediate file known as a ‘hypothesis file’. To generate hypotheses, use the -m hypothesis command-line option with KyotoEBMT.

Translation

The decoder is run with the following command:

bin/KyotoEBMT [options] input-file

The input can either be a parsed sentence (the format is the same for training) or an intermediate hypothesis file (add --input_mode hypothesis).

A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.

--config                              Configuration file.
--nb_threads                          Number of threads to use.
--tr_db_filenames                     Example database.
--tr_language                         Example DB translation direction (1 or 2).
--weight                              Feature weights file.
--lm                                  LM file.
--beam_width                          Beam width.
--source_language                     Source language (en/ja/zh).
--target_language                     Target language (en/ja/zh).
--dic_words_filename                  Standard dictionary.
--dic_punc_filename                   Punctuation dictionary.
--lexical_translation_prob_filename   Lexical translation probability table.
--output_mode                         Output mode (e.g. tuning/eval/hypothesis).
--n_best_length                       Length of n-best list.

For full details, see:

bin/KyotoEBMT -h

Tuning and Reranking

We currently support k-best batch MIRA (Cherry and Foster, 2012) (default) and PRO (Hopkins and May, 2011). Tuning is only available when using the EMS.

We also support reranking using recurrent neural network language models (RNNLMs) (Mikolov et al., 2010). This is achieved by specifying STAGE_RERANK in the EMS.

The files produced during tuning include (for iteration i):

tuning/weights.i                Weights file for the i-th iteration.
tuning/weights.opt              Best weights file for all iterations.
tuning/features.i               Feature values for n-best list.
tuning/scores.i                 Scores for n-best list.
tuning/n_best.i                 n-best output from KyotoEBMT.
tuning/output.i                 1-best output extracted from n-best list.
tuning/{init,mira,pro,megam}.i  Intermediate data for MIRA/PRO.

License

KyotoEBMT Example-Based Machine Translation System
Copyright (C) 2014 John Richardson, Fabien Cromieres, Toshiaki Nakazawa and Sadao Kurohashi

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

References

Colin Cherry and George Foster. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of HLT-NAACL (2012).

Fabien Cromieres and Sadao Kurohashi. Efficient Retrieval of Tree Translation Examples for Syntax-Based Machine Translation. In Proceedings of EMNLP (2011).

Mark Hopkins and Jonathan May. Tuning as Ranking. In Proceedings of EMNLP (2011).

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky and Sanjeev Khudanpur. Recurrent Neural Network Based Language Model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (2010).

John Richardson, Fabien Cromieres, Toshiaki Nakazawa and Sadao Kurohashi. KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework (System Demonstration). In Proceedings of ACL (2014).