# KyotoEBMT Advanced Usage

## Building the Example Database

The example database is a core component of the KyotoEBMT translation system. It is trained from parallel sentences that have been dependency parsed and aligned. The tool that creates the database is TR++/createData++, which is called automatically by the EMS in STAGE_EDB. It can also be run manually with the following command, given a source parse file source, a target parse file target and an alignment file align:

```
# Usage
bin/createData++ [database output path] [source language] [target language] ["source|target|align"]

# Example
bin/createData++ edb ja en "parse/train/ja|parse/train/en|align/ja-en"
```

KyotoEBMT is distributed with a sample corpus of parallel Japanese-English sentences under sample, from which parses, alignments and other relevant files are generated by the make sample command. Please use this data as a reference when preparing your own.

## Parse Format

The parse format is used for parsed training data and can also be used as tree input for the decoder (at tuning/test time). To use tree input with the decoder, specify --input_mode tree.

The parse format is similar to the well-known CoNLL format. Each sentence is made up of a list of tokens, one token per line, with a tab-separated list giving the attributes of each token. Sentences are delimited by empty lines. Each sentence must begin with a metadata line, consisting of the sentence ID (anything alphanumeric is fine) and the sentence parse score (used as a feature; it can always be set to zero).

Token attributes are as follows. They are used as features inside the decoder and may be set as you wish; however, the dependencies (DPND) must be set correctly for any kind of translation.

```
ID DPND CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER
```

Parse format example:

```
# ID=SAMPLE_TRAIN_00001 SCORE=-15.1648
0 1 i i PRP 1 NP _ NP:PRP/I/0
1 -1 read read VBP 1 S1 _ S1:S:VP:VBP/read/1
2 3 a a DT 0 DT _ DT/a/2
3 1 book book NN 1 NP _ NP:NN/book/3
4 1 . . . -1 . _ ././4

# ID=SAMPLE_TRAIN_00002 SCORE=-15.4925
0 1 i i PRP 1 NP _ NP:PRP/I/0
1 -1 read read VBP 1 S1 _ S1:S:VP:VBP/read/1
2 3 a a DT 0 DT _ DT/a/2
3 1 newspaper newspaper NN 1 NP _ NP:NN/newspaper/3
4 1 . . . -1 . _ ././4

...
```
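To make the format concrete, here is a minimal Python sketch of a reader for such files. The Token and read_parse_file names are illustrative, not part of the KyotoEBMT distribution:

```python
# Minimal sketch of a reader for the parse format described above.
# Assumes the nine attribute columns listed in this section.
from typing import Iterator, List, NamedTuple, Tuple

class Token(NamedTuple):
    id: int        # token index within the sentence
    dpnd: int      # index of the dependency head (-1 for the root)
    content: str   # surface form
    lemma: str
    pos: str
    content_p: str # content-word flag, kept as a raw string here
    cat: str
    type: str
    other: str

def read_parse_file(path: str) -> Iterator[Tuple[str, float, List[Token]]]:
    """Yield (sentence_id, parse_score, tokens) triples."""
    sent_id, score = None, 0.0
    tokens: List[Token] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):          # metadata: "# ID=... SCORE=..."
                if tokens:                    # flush if no blank line preceded
                    yield sent_id, score, tokens
                    tokens = []
                fields = dict(kv.split("=", 1) for kv in line[1:].split())
                sent_id, score = fields["ID"], float(fields["SCORE"])
            elif line.strip() == "":          # blank line ends the sentence
                if tokens:
                    yield sent_id, score, tokens
                tokens = []
            else:
                # Real files are tab-separated; split() also tolerates the
                # space-separated rendering shown in the example above.
                cols = line.split()
                tokens.append(Token(int(cols[0]), int(cols[1]), *cols[2:9]))
    if tokens:                                # file may not end with a blank line
        yield sent_id, score, tokens
```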
## Alignment Format

The alignment format is similar to the Pharaoh format used by Moses. Odd-numbered lines consist of the sentence ID preceded by a hash mark. Even-numbered lines consist of space-delimited alignments of the form s_1,…,s_m-t_1,…,t_n, meaning that source token IDs s_1, …, s_m are aligned to target token IDs t_1, …, t_n. Please note that we only support 'rectangular' alignments: if we have 0-0, 0-1 and 1-0, we also require 1-1.

Alignment format example:

```
# SAMPLE_TRAIN_00001
0-0 4-1 1,3-2 2-3 5-4
# SAMPLE_TRAIN_00002
0-0 4-1 1,3-2 2-3 5-4
...
```
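The following Python sketch reads this format and checks the rectangular constraint. The function names are illustrative, not part of KyotoEBMT:

```python
# Minimal sketch of an alignment-format reader plus a check for the
# 'rectangular' constraint described above.
from itertools import product
from typing import Dict, Set, Tuple

def read_alignments(path: str) -> Dict[str, Set[Tuple[int, int]]]:
    """Return {sentence_id: set of (source_id, target_id) links}."""
    result = {}
    with open(path, encoding="utf-8") as f:
        lines = [l.rstrip("\n") for l in f if l.strip()]
    for id_line, align_line in zip(lines[0::2], lines[1::2]):
        sent_id = id_line.lstrip("# ").strip()
        links = set()
        for group in align_line.split():      # e.g. "1,3-2"
            src, tgt = group.split("-")
            for s, t in product(src.split(","), tgt.split(",")):
                links.add((int(s), int(t)))
        result[sent_id] = links
    return result

def is_rectangular(links: Set[Tuple[int, int]]) -> bool:
    """If s-t, s-t' and s'-t' are all linked, require s'-t as well."""
    for (s1, t1) in links:
        for (s2, t2) in links:
            if (s1, t2) in links and (s2, t1) not in links:
                return False
    return True
```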
## Hypothesis Serialization

Hypothesis files represent the content of the initial translation hypotheses used for translation. They are used in particular for tuning to save time, as we do not need to retrieve the translation examples for each iteration. The EMS generates them during STAGE_HYP.

Hypothesis files can be output by passing --output_mode hyp to the decoder. This does not perform translation and only outputs the intermediate file. Translation can then be resumed by using --input_mode hyp and specifying a hypothesis file to translate. Modifying hypothesis files directly is an easy way to add features and make other simple adjustments without modifying the decoder's source code.

The serialization format is designed to be human-readable but reasonably efficient. The general outline of a generated file is:

```
input_sentence_id
<input_tree>
START
list of feature names
Pattern1
<list_of_hypothesis>
Pattern2
<list_of_hypothesis>
...
END
```

A pattern is represented by one line:

```
P|pattern|root|parent
```

where pattern is the space-separated list of input positions (or spans, for input forests), root is the input position at the root of the pattern, and parent is the position of the parent of the root.

A hypothesis is represented by one line:

```
score|t_string|dpnd|parent_bond_relationship|additionals_list|root_pos|bond_pos_map|features_values|tm_id|i_t_alignment|pos_list|cat_list|type_list|s_bond|t_bond|tm_counter
```

The most important fields are score, t_string and dpnd; the other fields are mainly experimental. Please note that the loading code is not flexible at all: even a trailing space is enough to cause a parsing error. The following escaping rules are used:

```
space        -> \_
\            -> \\
|            -> \p
EOL          -> \n
empty string -> \0
```
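As an illustration of these escaping rules, here is a small Python sketch of matching escape/unescape helpers. These are hypothetical helpers, not the actual loading code:

```python
# Hypothetical helpers implementing the five escaping rules stated above.

def escape_field(s: str) -> str:
    if s == "":
        return r"\0"
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ch == " ":
            out.append(r"\_")
        elif ch == "|":
            out.append(r"\p")
        elif ch == "\n":
            out.append(r"\n")
        else:
            out.append(ch)
    return "".join(out)

def unescape_field(s: str) -> str:
    if s == r"\0":
        return ""
    mapping = {"\\": "\\", "_": " ", "p": "|", "n": "\n"}
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(mapping[s[i + 1]])   # raises KeyError on unknown escapes
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

assert unescape_field(escape_field("a b|c\\d")) == "a b|c\\d"
```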
## Translation/Decoding

### Basic Usage

The decoder is run with the following command:

```
bin/KyotoEBMT [options] input-file
```

The input can be either a parsed sentence (--input_mode tree for tree input, --input_mode forest for forest input) or an intermediate hypothesis file (--input_mode hyp). A range of options can be specified, corresponding to paths and settings for the decoder. For the full list of options, run the command below, but be warned that some of the listed options are still experimental or unsupported.

```
bin/KyotoEBMT -h
```
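As a usage illustration, the two-step workflow from the Hypothesis Serialization section can be scripted as follows. This sketch uses only the --input_mode/--output_mode values quoted in this document; writing the decoder's standard output to a file is an assumption, and all paths are placeholders:

```python
# Sketch of the decode-via-hypothesis-file workflow described in the
# Hypothesis Serialization section.
import subprocess

def decode_via_hyp(decoder="bin/KyotoEBMT",
                   tree_input="input.tree",   # placeholder path
                   hyp_file="input.hyp",      # placeholder path
                   output="output.txt"):      # placeholder path
    # Step 1: retrieve translation examples and dump the intermediate
    # hypothesis file (no translation is performed in this mode).
    with open(hyp_file, "w") as out:
        subprocess.run([decoder, "--input_mode", "tree",
                        "--output_mode", "hyp", tree_input],
                       stdout=out, check=True)
    # The hypothesis file could be edited here, e.g. to add features.
    # Step 2: resume translation from the (possibly modified) hypotheses.
    with open(output, "w") as out:
        subprocess.run([decoder, "--input_mode", "hyp", hyp_file],
                       stdout=out, check=True)
```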
### Forest Input

Dependency forests can be used as an alternative input format for the decoder by specifying --input_mode forest. Currently we do not support using forests for model training.

The format is similar to the tree parse format, with a number of differences. The forest format consists of a metadata line similar to the one in the tree format, followed by three parts (a list of nodes, a list of hyperedges and a list of edge scores), separated by blank lines. The format of each part is as follows.

List of nodes: one line for each forest node. The same word may appear on several lines. Node attributes are as follows:

```
SPAN_ID SPAN WORD_NUMBER CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER
```

List of hyperedges: one line for each hyperedge in the forest. A hyperedge is made up of the dependencies between one (parent) node and all of its children. Hyperedge attributes are as follows:

```
PARENT CHILDREN SCORE
```
List of edge scores: one line for each dependency (between one parent node and one child node) in the forest. Edge attributes are as follows:

```
PARENT CHILD SCORE
```
Example forest:

```
# ID=SENT00001
0 0,1 0 this this DT 0 NP _ NP:DT/This/0
1 0,1 0 this this RB 1 RB _ RB/This/0
2 0,1 0 this this DT 0 DT _ DT/This/0
3 0,4 0 this this DT 0 S1 _ S1:SBAR:DT/This/0
4 0,4 1 is be AUX 1 S1 _ S1:S:VP:AUX/is/1
5 0,4 1 is be AUX 1 S1 _ S1:SBAR:S:VP:AUX/is/1
6 1,4 1 is be AUX 1 S _ S:VP:AUX/is/1
7 2,3 2 a a DT 0 DT _ DT/a/2
8 2,4 3 pen pen NN 1 NP _ NP:NN/pen/3

3 6 0.0335287500
4 0,8 0.1571000000
4 2,8 0.1345562500
4 1,8 0.0972187500
5 0,8 0.0066725000
6 8 0.0335287500
8 7 0.1340837500

4 0 0.0598812500
5 0 0.0033362500
4 1 0.0000000000
4 2 0.0373375000
3 6 0.0335287500
8 7 0.1340837500
4 8 0.0972187500
5 8 0.0033362500
6 8 0.0335287500
```
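For illustration, here is a minimal Python sketch of a reader for this three-part format. The Forest class and read_forest function are illustrative, not part of the KyotoEBMT distribution:

```python
# Minimal sketch of a reader for the forest format described above.
from typing import Dict, List, Tuple

class Forest:
    def __init__(self, metadata: str):
        self.metadata = metadata                    # e.g. "ID=SENT00001"
        self.nodes: List[List[str]] = []            # raw node attribute rows
        self.hyperedges: List[Tuple[int, Tuple[int, ...], float]] = []
        self.edge_scores: Dict[Tuple[int, int], float] = {}

def read_forest(path: str) -> Forest:
    """Read one forest: a metadata line plus three blank-line-separated parts."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    forest = Forest(lines[0].lstrip("# ").strip())
    parts, current = [], []
    for line in lines[1:]:
        if line.strip():
            current.append(line.split())
        elif current:                               # blank line closes a part
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    node_rows, hyper_rows, edge_rows = parts        # exactly three parts expected
    forest.nodes = node_rows                        # SPAN_ID SPAN WORD_NUMBER ...
    for parent, children, score in hyper_rows:      # PARENT CHILDREN SCORE
        forest.hyperedges.append(
            (int(parent), tuple(int(c) for c in children.split(",")), float(score)))
    for parent, child, score in edge_rows:          # PARENT CHILD SCORE
        forest.edge_scores[(int(parent), int(child))] = float(score)
    return forest
```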
## Experimental Management System (EMS)

### Configuration

The settings for the EMS are specified in the var file within the experiment directory. They mainly concern the locations of external tools and experimental data, and options for the decoder (such as the beam width). Please use non-default settings at your own risk! Important settings include the experiment root directory ROOT_DIR and the target stage TARGET_STAGE (see below).

### Stages and Files

Partial experiments can be performed by specifying the TARGET_STAGE option in the EMS configuration file. The available stages include STAGE_EDB (building the example database), STAGE_HYP (generating the hypothesis files) and STAGE_RERANK (reranking), described in the corresponding sections of this document.
Each stage has corresponding data, stored under the experiment root directory ROOT_DIR in a subdirectory with the same name as the stage (e.g. ROOT_DIR/corpus, ROOT_DIR/rerank).

### Tuning

We currently support k-best batch MIRA (Cherry and Foster, 2012), the default, and PRO (Hopkins and May, 2011). Tuning is only available when using the EMS. Several intermediate files are produced during each tuning iteration i.
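To give a feel for how PRO-style tuning works, here is a generic, self-contained Python sketch of pairwise ranking optimization over a k-best list. It illustrates the general method of Hopkins and May (2011), not the EMS implementation; the sampling constants and learning rate are arbitrary:

```python
# Generic sketch of PRO: tuning is reduced to binary classification over
# pairs of k-best hypotheses, trained on their feature-value differences.
import math
import random
from typing import Dict, List, Tuple

Hyp = Tuple[Dict[str, float], float]   # (feature values, metric score, e.g. BLEU)

def sample_pairs(kbest: List[Hyp], n_samples=5000, n_keep=50):
    """Sample hypothesis pairs and keep those with the largest metric gap."""
    pairs = []
    for _ in range(n_samples):
        a, b = random.sample(kbest, 2)
        if a[1] != b[1]:
            pairs.append((abs(a[1] - b[1]), a, b))
    pairs.sort(key=lambda p: -p[0])
    return [(a, b) for _, a, b in pairs[:n_keep]]

def pro_update(weights: Dict[str, float], kbest: List[Hyp],
               lr=0.1, epochs=10) -> Dict[str, float]:
    """Logistic regression on feature differences of sampled pairs."""
    pairs = sample_pairs(kbest)
    for _ in range(epochs):
        for a, b in pairs:
            better, worse = (a, b) if a[1] > b[1] else (b, a)
            diff = {k: better[0].get(k, 0.0) - worse[0].get(k, 0.0)
                    for k in set(better[0]) | set(worse[0])}
            margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
            grad = 1.0 / (1.0 + math.exp(margin))   # gradient of log-sigmoid loss
            for k, v in diff.items():
                weights[k] = weights.get(k, 0.0) + lr * grad * v
    return weights
```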
### Reranking

We also support reranking using recurrent neural network language models (RNNLMs) (Mikolov et al., 2010). This is achieved by specifying STAGE_RERANK in the EMS.

## Features

### Language Model

KyotoEBMT is designed to be used with a 5-gram language model with modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. The EMS supports automatic LM generation using KenLM.

### Lexical Translation Probabilities

The lexical translation probabilities obtained from alignment are used as a feature for decoding. They are generated automatically from GIZA++ output by the EMS.

### Default Dependency Feature

When it is unclear whether an 'additional' word should be attached as a left or right child, we first try the rules (left or right) in the 'default dependency' file. This file is generated automatically from the training corpus by the EMS.

### Optional Features

While KyotoEBMT is designed to work with minimal features, you may wish to add some of the optional features below to improve translation quality.

#### Dictionary

A dictionary file for words and punctuation can be specified (dic/words.SRC-TRG and dic/punc.SRC-TRG in the EMS experiment directory, or with the --dic_words_filename/--dic_punc_filename command-line options). Dictionaries are tab-separated files containing a list of source-target word pairs that are added to the list of translation hypotheses retrieved from the example database. Example:

```
バンジョー	banjo
ギター	guitar
```

For most experiments a dictionary is not necessary, and an empty file can be used.
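As a small illustration of this file format, here is a hypothetical Python helper that loads such a dictionary. The function name is illustrative; KyotoEBMT itself reads these files internally:

```python
# Load a tab-separated dictionary file (one source-target pair per line),
# as described above. An empty file simply yields an empty dictionary.
from collections import defaultdict
from typing import Dict, List

def load_dictionary(path: str) -> Dict[str, List[str]]:
    """Map each source word to all of its listed target translations."""
    entries: Dict[str, List[str]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            source, target = line.split("\t")
            entries[source].append(target)
    return dict(entries)

# e.g. load_dictionary("dic/words.ja-en") might return {"バンジョー": ["banjo"], ...}
```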
#### Word Similarity

Currently unsupported.

#### Example Reliability

Currently unsupported.

#### Kanji Dictionary

Currently unsupported.

### Full List of Features

Below is the full list of features.