KyotoEBMT Tutorial

Part 0: Installation

First of all, let's install KyotoEBMT.

Dependencies

KyotoEBMT has the following dependencies. Older versions may also work, but are not tested.

The following are useful tools for parsing/alignment/tuning, and are required to run the tutorial.

The following optional dependencies can be installed to enable more advanced features:

Note that on Mac OS X, we recommend installing ICU from source. If you use a MacPorts/Homebrew ICU package, you may need to add include/linker paths manually to libs/build_deps.sh to compile KenLM.

Installation

The installation of KyotoEBMT is performed as follows:

    autoreconf -i
    ./configure
    make

Part 1: Sample translation system

Before running KyotoEBMT we need to generate a small example database and language model for testing. We provide a tiny sample corpus, from which an end-to-end system can be built and run with make sample. The translation model will be built in the sample directory.

The sample (Japanese–English) is designed to run using Stanford CoreNLP for English preprocessing and JUMAN/KNP for Japanese preprocessing. Alignment is performed with MGIZA. Please install the prerequisite tools and set their paths at the top of ems/scripts/parse_{stanford,knp}.sh (parsers) and sample/var (the rest).

    # Edit paths
    vim ems/tools/parse_stanford.sh
    vim ems/tools/parse_knp.sh
    vim sample/var
    # Build model
    make sample

To check the final translations, see sample/eval/output. The BLEU score will be displayed in the EMS; however, for the tiny sample data set, expect it to be zero.

Congratulations, you have successfully performed an end-to-end experiment with KyotoEBMT!

Just running the decoder

Once you have built the sample data, you can run the decoder on its own as follows:

    bin/KyotoEBMT

The output should be something like this:

    0 ||| He also read a cheap fantastic new book . ||| f= 0 1 0 ... 7 7 9 ||| -135.045
    0 ||| He also read a book cheap fantastic new . ||| f= 0 0 0 ... 7 7 9 ||| -135.598
    0 ||| He also read cheap fantastic a new book . ||| f= 0 1 0 ... 7 7 9 ||| -137.7

Each line shows the sentence ID, followed by the translation, the feature values and the model score, separated by |||.

Part 2: Your own experiment with the EMS

Experimental Management System (EMS)

The KyotoEBMT Experimental Management System (EMS) can be used to run experiments end-to-end using KyotoEBMT. The EMS works like a Makefile, in that it defines a number of targets ('stages') and dependencies between them. When the user requests a target stage, the EMS will determine which of its required dependencies are still incomplete and execute those stages, followed by the target itself.

First we should prepare the necessary input for the EMS.

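Before moving on, the Makefile analogy above can be made concrete with a minimal Python sketch. This is not the EMS implementation: the stage names in STAGES and the plan helper are hypothetical, chosen only to illustrate how a requested target might be expanded into the incomplete stages it depends on. A stage that is already complete is skipped together with everything beneath it, as Make would do for an up-to-date target.

```python
# Hypothetical stage graph: the names and dependencies are illustrative only,
# not the actual stages defined by the KyotoEBMT EMS.
STAGES = {
    "parse": [],
    "align": ["parse"],
    "build_model": ["align"],
    "tune": ["build_model"],
    "translate": ["tune"],
}


def plan(target, stages, done):
    """Return the incomplete stages needed for `target`, dependencies first."""
    order = []
    visiting = set()

    def visit(stage):
        # Completed or already scheduled stages need no further work.
        if stage in done or stage in order:
            return
        if stage in visiting:
            raise ValueError(f"dependency cycle at stage {stage!r}")
        visiting.add(stage)
        for dep in stages[stage]:
            visit(dep)
        visiting.discard(stage)
        order.append(stage)

    visit(target)
    return order


if __name__ == "__main__":
    # Suppose parsing and alignment output already exists on disk.
    print(plan("translate", STAGES, done={"parse", "align"}))
```

With parsing and alignment marked as done, only the remaining stages up to the requested target are scheduled, in dependency order.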
All files used in a translation experiment are stored under an experiment directory. We will call ours ~/my_experiment.

    # Make experiment directory
    mkdir ~/my_experiment
    # Copy some initial files from the sample
    cp -r sample/corpus sample/dic sample/weights sample/var ~/my_experiment

Setting Up Your Corpus

The corpus files are named corpus/{train,dev,test}/{E,F} for language pair E/F (e.g. ja/en). Each pair of files should contain aligned sentences, one sentence per line. Each sentence must be preceded by a line containing its sentence ID, prefixed by a hash sign. Please edit the sample files you copied to the corpus directory.

    # ID1
    sentence1
    # ID2
    sentence2
    ...

Setting Up Parser and Aligner

In the earlier example we used MGIZA for alignment, KNP for Japanese parsing and Stanford CoreNLP for English parsing. It is of course possible to use your own tools instead. Any combination of alignment and parsing tools can be used, so long as the output is converted to the KyotoEBMT format.

If you want to use your own parsing/alignment tools, please put the converted output of your tools into the align and parse directories, using the same filenames as those created in sample. When you run the EMS, it will detect these files and continue from the next relevant stage. See the advanced usage page for instructions on preparing your own input files.

Running the Experiment

We are now ready to run the EMS.

    cd ems
    python ems.py ~/my_experiment

The EMS will be run using the configuration file ~/my_experiment/var and all output will be stored under ~/my_experiment.

Experiments can be cloned for future use (also saving disk space) by using the ems/init.py tool. The directory searched for experiments to clone is defined in the ems/clone_dir file.

For more details of the configuration file and more advanced options, please see the advanced documentation page.
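As a supplement to the corpus format described under Setting Up Your Corpus, the following Python sketch reads the alternating "# ID" / sentence layout and checks that two sides of a corpus list the same IDs in the same order. The helper names read_corpus and check_aligned are hypothetical and not part of KyotoEBMT. The assumption that both files carry identical IDs is ours, inferred from the requirement that the files contain aligned sentences.

```python
def read_corpus(lines):
    """Parse alternating '# ID' / sentence lines into (id, sentence) pairs."""
    pairs = []
    current_id = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if current_id is None:
            # An ID line is expected here, prefixed by a hash sign.
            if not line.startswith("#"):
                raise ValueError(f"line {number}: expected '# ID', got {line!r}")
            current_id = line[1:].strip()
        else:
            # The line directly after an ID line is the sentence itself.
            pairs.append((current_id, line))
            current_id = None
    if current_id is not None:
        raise ValueError(f"ID {current_id!r} has no sentence after it")
    return pairs


def check_aligned(source_pairs, target_pairs):
    """Return True if both sides list the same sentence IDs in the same order."""
    # Assumption: aligned files share IDs line for line.
    return [i for i, _ in source_pairs] == [i for i, _ in target_pairs]


if __name__ == "__main__":
    ja = ["# ID1\n", "彼は本を読んだ\n", "# ID2\n", "ありがとう\n"]
    en = ["# ID1\n", "He read a book\n", "# ID2\n", "Thank you\n"]
    print(check_aligned(read_corpus(ja), read_corpus(en)))
```

In practice the lists would come from open(path, encoding="utf-8") on files such as corpus/train/ja and corpus/train/en. Running a check like this before starting the EMS can catch a missing ID line early.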