KyotoEBMT Advanced Usage

Building the Example Database

The example database is a core component of the KyotoEBMT translation system. It can be built from parallel sentences that have been dependency parsed and aligned. The tool that creates the database is TR++/createData++, which is called automatically by the EMS in STAGE_EDB. It can also be run manually with the following command, given a source parse file source, a target parse file target and an alignment file align.

# Usage
bin/createData++ [database output path] [source language] [target language] ["source|target|align"]
# Example
bin/createData++ edb ja en "parse/train/ja|parse/train/en|align/ja-en"

KyotoEBMT is distributed with a sample corpus of parallel Japanese-English sentences under sample, from which parses, alignments and other relevant files will be generated using the make sample command. Please use this data as a reference when preparing your own.

Parse Format

The parse format is used for parsed training data and can also be used as tree input for the decoder (at tuning/test time). In order to use tree input with the decoder, specify --input_mode tree.

The parse format is similar to the well-known CoNLL format. Each sentence is made up of a list of tokens, one token per line, with a tab-separated list giving the attributes for each token. Sentences are delimited by empty lines.

For each sentence a metadata line must be defined. This consists of the sentence ID (anything alphanumeric is fine) and the sentence parse score (used as a feature, can always be set to zero).

Token attributes are as follows. They are used as features inside the decoder and may be set as you wish; however, the dependencies must be set correctly for any kind of translation to work.

ID DPND CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER
ID token ID (0, 1, …, n-1 for n tokens)
DPND ID of parent in dependency tree (-1=root)
CONTENT surface form of token
LEMMA lemmatized form
POS part of speech
CONTENT_P is this a content word? (1=yes, 0=no, -1=punctuation)
CAT phrase category
TYPE phrase type
OTHER any other information

Parse format example:

# ID=SAMPLE_TRAIN_00001 SCORE=-15.1648
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	book	book	NN	1	NP	_	NP:NN/book/3
4	1	.	.	.	-1	.	_	././4

# ID=SAMPLE_TRAIN_00002 SCORE=-15.4925
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	newspaper	newspaper	NN	1	NP	_	NP:NN/newspaper/3
4	1	.	.	.	-1	.	_	././4

...
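
To make the format concrete, here is a minimal Python sketch (not part of the toolkit; the names are illustrative) that reads a file in this parse format into a simple in-memory structure, following the attribute list above.

# Illustrative sketch only: read sentences in the tree parse format described above.
# Each sentence is a "# ID=... SCORE=..." metadata line followed by one
# tab-separated token per line; sentences are separated by blank lines.

from typing import Dict, Iterator

FIELDS = ["id", "dpnd", "content", "lemma", "pos", "content_p", "cat", "type", "other"]

def read_parses(path: str) -> Iterator[Dict]:
    sentence = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():                      # blank line ends a sentence
                if sentence and sentence["tokens"]:
                    yield sentence
                sentence = None
            elif line.startswith("#"):                # metadata line: "# ID=... SCORE=..."
                meta = dict(kv.split("=", 1) for kv in line[1:].split())
                sentence = {"id": meta.get("ID"),
                            "score": float(meta.get("SCORE", 0.0)),
                            "tokens": []}
            else:                                     # token line: tab-separated attributes
                token = dict(zip(FIELDS, line.split("\t")))
                token["id"] = int(token["id"])
                token["dpnd"] = int(token["dpnd"])    # -1 marks the root
                sentence["tokens"].append(token)
    if sentence and sentence["tokens"]:
        yield sentence

if __name__ == "__main__":
    import sys
    for sent in read_parses(sys.argv[1]):
        root = next(t for t in sent["tokens"] if t["dpnd"] == -1)
        print(sent["id"], "root:", root["content"])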

Alignment Format

The alignment format is similar to the Pharaoh format used by Moses. Odd-numbered lines consist of the sentence ID preceded by a hash mark. Even-numbered lines consist of space-delimited alignments of the form s_1,…,s_m-t_1,…,t_n, where source IDs s_1, …, s_m are aligned to target IDs t_1, …, t_n. Please note that we only support 'rectangular' alignments: if we have 0-0, 0-1 and 1-0, we also require 1-1.

Alignment format example:

# SAMPLE_TRAIN_00001
0-0 4-1 1,3-2 2-3 5-4
# SAMPLE_TRAIN_00002
0-0 4-1 1,3-2 2-3 5-4
...
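
The following Python sketch (an illustration only; the helper names are not part of KyotoEBMT) parses one alignment line into individual source-target links and checks the 'rectangular' constraint described above.

# Illustrative sketch: parse an alignment line such as "0-0 4-1 1,3-2 2-3 5-4"
# and check the 'rectangular' property.

from itertools import product

def parse_alignment_line(line):
    """Return the set of (source_id, target_id) links encoded by one alignment line."""
    links = set()
    for group in line.split():
        src_part, tgt_part = group.rsplit("-", 1)          # e.g. "1,3-2"
        sources = [int(i) for i in src_part.split(",")]
        targets = [int(i) for i in tgt_part.split(",")]
        links.update(product(sources, targets))
    return links

def is_rectangular(links):
    """True if, whenever s-t, s-t' and s'-t are present, s'-t' is present too."""
    for (s, t), (s2, t2) in product(links, repeat=2):
        if (s, t2) in links and (s2, t) not in links:
            return False
    return True

print(sorted(parse_alignment_line("0-0 4-1 1,3-2 2-3 5-4")))
print(is_rectangular({(0, 0), (0, 1), (1, 0), (1, 1)}))    # True
print(is_rectangular({(0, 0), (0, 1), (1, 0)}))            # False: 1-1 is missing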

Hypothesis Serialization

Hypothesis files store the content of the initial translation hypotheses used by the decoder. They are used in particular during tuning to save time, as the translation examples do not need to be retrieved again at each iteration. The EMS generates them during STAGE_HYP.

Hypothesis files can be output by specifying --output_mode hyp to the decoder. In this mode no translation is performed; only the intermediate file is written. Translation can then be resumed by specifying --input_mode hyp and providing a hypothesis file to translate.

Modifying hypothesis files directly is an easy way to add features and make other such simple adjustments without modifying the decoder's source code.

The serialization format is designed to be human-readable but reasonably efficient. The general outline of a generated file is:

input_sentence_id
<input_tree>
START
list of feature names
Pattern1
<list_of_hypothesis>
Pattern2
<list_of_hypothesis>
...
END

A pattern is represented by a line:

P|pattern|root|parent

where pattern is the space-separated list of input positions (or spans, for forest input), root is the input position at the root of the pattern, and parent is the position of the root's parent.

A hypothesis is represented by a single line:

score|t_string|dpnd|parent_bond_relationship|additionals_list|root_pos|bond_pos_map|features_values|tm_id|i_t_alignment|pos_list|cat_list|type_list|s_bond|t_bond|tm_counter

The most important fields are:

  • t_string (space-delimited list of target tokens, with [Xn] meaning non-terminal for input position n)

  • dpnd (space-delimited list of target-side dependencies)

  • additionals_list (space-delimited list of flexible position non-terminals, where each non-terminal is of the form input_position:parent_position:pre_insertion_positions_list:post_insertion_positions_list)

  • features_values (list of feature values)

  • tm_id (ID of translation example used to build the hypothesis)

  • i_t_alignment (list of input-target alignments, where each link is of the form N;x;y where x and y are lists of input/target positions respectively)

The other fields are mainly experimental. Please note that the loading code is not flexible at all; even a trailing space is enough to generate a parsing error.

We use the following escaping rules: space->\_, \->\\, |->\p, EOL->\n, empty_string->\0.
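
As an illustration of these rules, the short Python sketch below (not from the toolkit; the helper names are hypothetical) shows one way to unescape a field and to split a serialized line into its |-separated fields.

# Illustrative sketch of the escaping rules above. A line is first split on '|'
# (a literal '|' is escaped as \p), then each field is unescaped.

_UNESCAPE = {"_": " ", "\\": "\\", "p": "|", "n": "\n", "0": ""}

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append(_UNESCAPE.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

def split_fields(line: str):
    """Split a serialized line (pattern or hypothesis) into unescaped fields."""
    return [unescape(f) for f in line.rstrip("\n").split("|")]

print(unescape(r"a\_b\pc"))   # a field containing a space and a literal '|': "a b|c"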

Translation/Decoding

Basic Usage

The decoder is run with the following command:

bin/KyotoEBMT [options] input-file

The input can either be a parsed sentence (--input_mode tree for tree input, --input_mode forest for forest input) or an intermediate hypothesis file (--input_mode hypothesis).

A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.

--input Input parse filename.
--input_mode Input mode (tree/forest/hypothesis).
--output_mode Output mode (tuning/eval/hypothesis).
--source_language Source language (en/ja/zh).
--target_language Target language (en/ja/zh).
--tr_db_filenames Example database.
--weight Feature weights file.
--lm LM file.
--nb_threads Number of threads to use.
--fast_mode Fast but lower translation quality.
--beam_width Beam width.
--n_best_length Length of n-best list.

For full details, run the command below. Please be warned that some of the options given in the full list are still experimental or not supported.

bin/KyotoEBMT -h
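
As an illustration, the sketch below assembles a typical command line from the common options above and runs it with Python's subprocess module. All file paths and numeric values here are placeholders to be replaced with your own data and settings.

# Illustrative only: build and run a decoder command using the documented options.
import subprocess

cmd = [
    "bin/KyotoEBMT",
    "--input_mode", "tree",
    "--input", "input.parse",       # placeholder: parsed input file
    "--source_language", "ja",
    "--target_language", "en",
    "--tr_db_filenames", "edb",     # placeholder: example database
    "--weight", "weights.txt",      # placeholder: feature weights file
    "--lm", "lm.binary",            # placeholder: language model file
    "--nb_threads", "4",
    "--beam_width", "50",
    "--n_best_length", "100",
]
subprocess.run(cmd, check=True)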

Forest Input

Dependency forests can be used as an alternative input format for the decoder by specifying --input_mode forest. Currently we do not support using forests for model training.

The format is similar to the tree parse format; however, there are a number of differences. A forest consists of a metadata line, similar to the one in the tree format, followed by three parts (the list of nodes, the list of hyperedges and the list of edge scores), separated by blank lines. The format of each part is as follows:

List of nodes

One line for each forest node. The same word may appear on several lines. Node attributes are as follows:

SPAN_ID SPAN WORD_NUMBER CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER

  • SPAN_ID: Unique identifier for the forest node (0, 1, …, n-1)

  • SPAN: A pair of token positions “a,b” where a and b are respectively the lowest and highest position of tokens in the subtree under (and including) this node

  • WORD_NUMBER: Position of the corresponding word in the sentence (0, 1, …, n-1)

  • CONTENT, …, OTHER: Same as tree format

List of hyperedges

One line for each hyperedge in the forest. A hyperedge consists of the dependencies between one (parent) node and all of its children.

Hyperedge attributes are as follows:

PARENT CHILDREN SCORE

  • PARENT: SPAN_ID of the parent node

  • CHILDREN: List of the children SPAN_IDs (comma-separated)

  • SCORE: Score of the hyperedge

List of edge scores

One line for each dependency (between one parent and one child node) in the forest.

Edge attributes are as follows:

PARENT CHILD SCORE

  • PARENT: SPAN_ID of the parent node

  • CHILD: SPAN_ID of the child

  • SCORE: Score of the dependency edge

Example forest

# ID=SENT00001
0	0,1	0	this	this	DT	0	NP	_	NP:DT/This/0
1	0,1	0	this	this	RB	1	RB	_	RB/This/0
2	0,1	0	this	this	DT	0	DT	_	DT/This/0
3	0,4	0	this	this	DT	0	S1	_	S1:SBAR:DT/This/0
4	0,4	1	is	be	AUX	1	S1	_	S1:S:VP:AUX/is/1
5	0,4	1	is	be	AUX	1	S1	_	S1:SBAR:S:VP:AUX/is/1
6	1,4	1	is	be	AUX	1	S	_	S:VP:AUX/is/1
7	2,3	2	a	a	DT	0	DT	_	DT/a/2
8	2,4	3	pen	pen	NN	1	NP	_	NP:NN/pen/3

3	6	0.0335287500
4	0,8	0.1571000000
4	2,8	0.1345562500
4	1,8	0.0972187500
5	0,8	0.0066725000
6	8	0.0335287500
8	7	0.1340837500

4	0	0.0598812500
5	0	0.0033362500
4	1	0.0000000000
4	2	0.0373375000
3	6	0.0335287500
8	7	0.1340837500
4	8	0.0972187500
5	8	0.0033362500
6	8	0.0335287500
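
The following Python sketch (illustrative only, assuming tab-separated columns as in the tree format) reads one forest in this format, splitting the node, hyperedge and edge-score parts on blank lines.

# Illustrative sketch: read a single forest file in the format described above.
# A metadata line is followed by three blank-line-separated parts:
# nodes, hyperedges and edge scores.

def read_forest(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        header, body = f.read().strip().split("\n", 1)
    assert header.startswith("#"), "expected a metadata line"
    node_part, hyperedge_part, edge_part = body.strip().split("\n\n")[:3]

    nodes = {}
    for line in node_part.split("\n"):
        cols = line.split("\t")
        lo, hi = (int(x) for x in cols[1].split(","))
        nodes[int(cols[0])] = {"span": (lo, hi), "word_number": int(cols[2]),
                               "content": cols[3], "lemma": cols[4], "pos": cols[5],
                               "content_p": int(cols[6]), "cat": cols[7],
                               "type": cols[8], "other": cols[9]}

    hyperedges = []
    for line in hyperedge_part.split("\n"):
        parent, children, score = line.split("\t")
        hyperedges.append({"parent": int(parent),
                           "children": [int(c) for c in children.split(",")],
                           "score": float(score)})

    edges = []
    for line in edge_part.split("\n"):
        parent, child, score = line.split("\t")
        edges.append((int(parent), int(child), float(score)))

    return {"metadata": header, "nodes": nodes, "hyperedges": hyperedges, "edges": edges}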

Experimental Management System (EMS)

Configuration

The settings for the EMS are specified in the var file within the experiment directory. The settings mainly concern locations of external tools and experimental data, and options for the decoder (such as beam width). Please use non-default settings at your own risk!

Important settings include:

Option Description
TARGET_STAGE Final translation stage to reach.
FORCE_COMPLETE_STAGES Override dependency checks for specified stage.
DRY_RUN Print commands instead of running commands.
AUTOMATIC Say 'yes' automatically to all prompts.
N_BEST Length of n-best list for tuning.
RERANK_SIZE Length of n-best list for reranking.
BEAM_WIDTH Decoding beam width.
EBMT_OPTIONS Additional decoder options.
NB_THREADS Number of threads for decoding.
USE_GXP Use GXP tool for cluster-level training.
LANG_S Source language code (en/ja/zh).
LANG_T Target language code (en/ja/zh).
INPUT_MODE Set input mode (tree/forest).

Stages and Files

Partial experiments can be performed by specifying the TARGET_STAGE option in the EMS configuration file. The available stages are as follows:

Stage Description Dependencies
STAGE_CORPUS Prepare training corpus. None
STAGE_DIC Prepare dictionary. None
STAGE_PARSE Parse training corpus. STAGE_CORPUS
STAGE_SEGMENT Segment parsed corpus. STAGE_PARSE
STAGE_ALIGN Align training corpus. STAGE_SEGMENT
STAGE_LM Train language model. STAGE_SEGMENT
STAGE_EDB Build example database. STAGE_PARSE STAGE_ALIGN
STAGE_HYP Build initial hypotheses. STAGE_EDB STAGE_DIC
STAGE_TUNING Tune on development set. STAGE_HYP STAGE_LM
STAGE_EVAL Translate and evaluate on test set. STAGE_TUNING
STAGE_RNNLM Build RNNLM model. STAGE_EVAL
STAGE_RERANK Reranking with RNNLM. STAGE_EVAL

Each stage has corresponding data, which is stored under the experiment root directory ROOT_DIR in a subdirectory with the same name as the stage (e.g. ROOT_DIR/corpus, ROOT_DIR/rerank).

Tuning

We currently support k-best batch MIRA (Cherry and Foster, 2012) (default) and PRO (Hopkins and May, 2011). Tuning is only available when using the EMS.

The files produced during tuning include (for iteration i):

tuning/weights.i Weights file for ith iteration.
tuning/weights.opt Best weights file for all iterations.
tuning/features.i Feature values for n-best list.
tuning/scores.i Scores for n-best list.
tuning/n_best.i n-best output from KyotoEBMT.
tuning/output.i 1-best output extracted from n-best list.
tuning/{init,mira,pro,megam}.i Intermediate data for MIRA/PRO.

Reranking

We also support reranking using recurrent neural network language models (RNNLMs) (Mikolov et al., 2010). This is achieved by specifying STAGE_RERANK in the EMS.

Features

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. The EMS supports automatic LM generation using KenLM.

Lexical Translation Probabilities

The lexical translation probabilities obtained from alignment are used as a feature for decoding. These are generated automatically from GIZA++ output by the EMS.

Default Dependency Feature

When it is unclear whether an ‘additional’ word should be attached as a left or right child, we first try to use the rules (left or right) in the ‘default dependency’ file. This is generated automatically from the training corpus by the EMS.

Optional Features

While KyotoEBMT is designed to work with minimal features, you may wish to add some of the optional features to improve translation quality. These include the following:

Dictionary

A dictionary file for words and punctuation can be specified (dic/words.SRC-TRG and dic/punc.SRC-TRG in the EMS experiment directory, or with --dic_words_filename/--dic_punc_filename command-line options). Dictionaries are tab-separated files containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database.

Example:

バンジョー  banjo
ギター  guitar

For most experiments a dictionary is not necessary, and an empty file can be used.
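
For reference, such a dictionary file can be loaded with a few lines of Python, as in the purely illustrative sketch below.

# Illustrative sketch: load a tab-separated dictionary file into a source -> targets map.
from collections import defaultdict

def load_dictionary(path: str):
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:        # skip blank lines (an empty file is also fine)
                continue
            source, target = fields[0], fields[1]
            entries[source].append(target)
    return entries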

Word Similarity

Currently unsupported.

Example Reliability

Currently unsupported.

Kanji Dictionary

Currently unsupported.

Full List of Features

Below is the full list of features.

Feature Description
NULL_content Number of null-aligned content words in source side of example.
NULL_function Number of null-aligned function words in source side of example.
abnormal_child_bond Number of bonds with different source and target tree structures.
additional_in_approved_position Equal to one unless a subtree is inserted in an unspecified position.
alternative_word_penalty Number of times an optional word rewriting rule (e.g. 'is' -> 'are') is used.
both_ROS Do the example and input subtree both contain the sentence root?
child_bond_count Number of child bonds (references) available.
child_bond_similarity Similarity of the nodes in source and input for child bonds.
content_to_content_match +1 if a content word is aligned to a content word.
content_to_function_match +1 if a content word is aligned to a function word.
different_parent_cat Different parent bond category in input and source.
empty_hyp Does the example have no target words?
example_freq Number of TMs creating this hypothesis.
example_penalty +1 for each example.
example_reliability Reliability (alignment probability) of training sentence used.
function_to_content_match +1 if a function word is aligned to a content word.
function_to_function_match +1 if a function word is aligned to a function word.
inconsistent_child_bond_position +1 if the order of a child bond among siblings is different from that in the input sentence.
lex_trans_prob_s_t Lexical translation probability of a source word given target.
lex_trans_prob_t_s Lexical translation probability of a target word given source.
lm Total LM score of translation.
log_pattern_frequency Log of number of examples matching pattern.
nb_additionals Number of additionals.
nb_optionals Number of optionals in initial hypotheses.
nb_optionals_skipped Number of optionals removed during decoding.
no_redundant_child Are there no children in the input not covered by the source?
numeral_mismatch Number of mismatched numerals in example (e.g. 5 not 7).
one_side_ROS Does either (but not both) example or input subtree contain the sentence root?
oov Is the hypothesis OOV?
parent_bond_LD Language dependent score for parent bond (e.g. POS equality).
parent_bond_available Is it possible to use the parent bond?
parent_bond_similarity Similarity of the nodes in source and input for parent bonds.
parse_score Sum of the edge scores in the example.
pr_mismatch_add Same as pr_mismatch_ref, but for the case when the subtree is inserted in an additional position.
pr_mismatch_ref Number of times a subtree inserted in a reference position had an incompatible parent_bond_relationship.
ratio_optionals Number of optionals divided by length of example.
real_oov Number of OOV words in output sentence (for target LM).
real_target_size Length of output sentence (no optionals).
redundant_child_in_input Number of redundant children in input.
redundant_child_in_source Number of redundant children in source.
root_LD A combination of language dependent features for the example root.
same_parent_cat Same parent bond category in input and source.
same_root_content Are root words of input and source the same?
sibling Number of sibling examples.
size Pattern size squared divided by input length squared.
size_x_log_pattern_frequency Length of example multiplied by log_pattern_frequency.
source_sibling +1 if the pattern for the input is sibling.
t_string_frequency Proportion of examples for a given pattern having the same t_string.
target_sibling +1 if target side of the example is sibling.
trans_lm_prob Translation probability for optional words.
trans_prob Translation probability.
trans_prob_multiplied_by_size Translation probability multiplied by pattern size.