KyotoEBMT Tutorial

Part 0: Installation

First of all, let's install KyotoEBMT.

Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but have not been tested.

The following are useful tools for parsing/alignment/tuning, and are required to run the tutorial.

The following optional dependencies can be installed to enable more advanced features:

  • (Optional, parallel training) GXP

  • (Optional, RNNLM reranking) RNNLM

  • (Optional, bilingual RNNLM reranking) GroundHog

Note that on Mac OS X, we recommend installing ICU from source. If you use a MacPorts/Homebrew ICU package, you may need to add include/linker paths manually to libs/build_deps.sh to compile KenLM.
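As an alternative to editing libs/build_deps.sh directly, you can try exporting the compiler's search-path environment variables before building. The prefix below is the usual MacPorts location and is an assumption; adjust it for your setup (e.g. the Homebrew icu4c prefix). Whether these are picked up depends on the KenLM build, so editing build_deps.sh remains the reliable option.

```shell
# Assumed MacPorts prefix (/opt/local); for Homebrew, use the icu4c prefix.
export CPATH=/opt/local/include:$CPATH
export LIBRARY_PATH=/opt/local/lib:$LIBRARY_PATH
```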

Installation

Install KyotoEBMT as follows:

autoreconf -i
./configure
make
  • Use ./configure --with-boost=XXX to specify a non-default location for Boost.

  • Use 'make debug' to enable debugging.
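For example, if Boost is installed under a non-default prefix (the path below is purely illustrative; substitute your own install location):

```shell
# Hypothetical Boost prefix -- replace with wherever Boost is installed.
./configure --with-boost=$HOME/local/boost
make
```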

Part 1: Sample translation system

Before running KyotoEBMT we need to generate a small example database and language model for testing. We provide a tiny sample corpus, from which an end-to-end system can be built and run with make sample. The translation model will be built in the sample directory.

The sample (Japanese–English) is designed to use Stanford CoreNLP for English preprocessing and JUMAN/KNP for Japanese preprocessing. Alignment is performed with MGIZA. Please install these prerequisite tools and set their paths at the top of ems/tools/parse_{stanford,knp}.sh (for the parsers) and in sample/var (for everything else).

# Edit paths
vim ems/tools/parse_stanford.sh
vim ems/tools/parse_knp.sh
vim sample/var
# Build model
make sample

To check the final translations, see sample/eval/output. The BLEU score is displayed by the EMS; however, for the tiny sample data set, expect it to be zero.

Congratulations, you have successfully performed an end-to-end experiment with KyotoEBMT!

Just running the decoder

Once you have built the sample data, you can run the decoder on its own as follows:

bin/KyotoEBMT

The output should be something like this:

0 ||| He also read a cheap fantastic new book . ||| f= 0 1 0 ... 7 7 9  ||| -135.045
0 ||| He also read a book cheap fantastic new . ||| f= 0 0 0 ... 7 7 9  ||| -135.598
0 ||| He also read cheap fantastic a new book . ||| f= 0 1 0 ... 7 7 9  ||| -137.7

Each line shows the sentence ID, followed by the translation, the feature values and the model score.
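Because the fields are separated by ' ||| ', the n-best output is easy to post-process with standard tools. For instance, to print only the translation of the highest-scoring hypothesis for each sentence ID (the filename nbest.txt is an assumption, and hypotheses are assumed to be listed best-first, as above):

```shell
# Print the translation field of the first (best) hypothesis for each sentence ID.
awk -F ' \\|\\|\\| ' '!seen[$1]++ { print $2 }' nbest.txt
```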

Part 2: Your own experiment with the EMS

Experimental Management System (EMS)

The KyotoEBMT Experimental Management System (EMS) can be used to run experiments end-to-end using KyotoEBMT.

The EMS works like a Makefile, in that it defines a number of targets ('stages') and dependencies between them. When the user requests a target stage, the EMS determines which required dependencies are incomplete and executes the corresponding stages.

First we should prepare the necessary input for the EMS. All files used in a translation experiment are stored under an experiment directory. We will call ours ~/my_experiment.

# Make experiment directory
mkdir ~/my_experiment
# Copy some initial files from the sample
cp -r sample/corpus sample/dic sample/weights sample/var ~/my_experiment

Setting Up Your Corpus

The corpus files are named corpus/{train,dev,test}/{E,F} for language pair E/F (e.g. ja/en). Each pair of files should contain aligned sentences, one sentence per line. Each sentence must be preceded by a line containing its sentence ID, prefixed by a hash sign. Please edit the sample files you copied into the corpus directory.

# ID1
sentence1
# ID2
sentence2
...
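If your corpus is plain one-sentence-per-line text without IDs, the format above can be generated with a one-liner. The filenames are illustrative, and line numbers serve as IDs here; run the same command on both the E and F sides so the IDs match.

```shell
# Insert a "# <line number>" ID line before each sentence.
awk '{ printf("# %d\n%s\n", NR, $0) }' plain.E > corpus/train/E
```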

Setting up Parser and Aligner

In the earlier example we used MGIZA for alignment, KNP for Japanese parsing and Stanford CoreNLP for English parsing. It is of course possible to use your own tools instead. Any combination of alignment and parsing tools can be used, so long as the output is converted to the KyotoEBMT format.

If you want to use your own parsing/alignment tools, please put the converted output of your tools into align and parse directories using the same filenames as created in sample. When you run the EMS, it will detect these files and continue from the next relevant stage. See the advanced usage page for instructions on preparing your own input files.
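As a sketch, assuming your converted output follows the same layout and filenames that make sample produces (the converted/ directory below is hypothetical):

```shell
# Copy pre-converted alignments and parses into the experiment directory.
# The EMS will detect them and skip the corresponding stages.
EXP=~/my_experiment
mkdir -p "$EXP/align" "$EXP/parse"
cp converted/align/* "$EXP/align/"
cp converted/parse/* "$EXP/parse/"
```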

Running the Experiment

We are now ready to run the EMS.

cd ems
python ems.py ~/my_experiment

The EMS will be run using the configuration file ~/my_experiment/var and all output will be stored under ~/my_experiment.

Experiments can be cloned for future use (saving disk space) with the ems/init.py tool. The directory searched for experiments to clone is defined in the ems/clone_dir file. For more details on the configuration file and more advanced options, please see the advanced documentation page.