KyotoEBMT Advanced Usage

Building the Example Database

The example database is a core component of the KyotoEBMT translation system. It can be built from parallel sentences that have been dependency parsed and aligned. The tool that creates the database is TR++/createData++, which is called automatically by the EMS in STAGE_EDB. It can also be run manually with the following command, given a source parse file source, a target parse file target and an alignment file align.

# Usage
bin/createData++ [database output path] [source language] [target language] ["source|target|align"]
# Example
bin/createData++ edb ja en "parse/train/ja|parse/train/en|align/ja-en"

KyotoEBMT is distributed with a sample corpus of parallel Japanese-English sentences under sample, from which parses, alignments and other relevant files will be generated using the make sample command. Please use this data as a reference when preparing your own.

Parse Format

The parse format is used for parsed training data and can also be used as tree input for the decoder (at tuning/test time). In order to use tree input with the decoder, specify --input_mode tree.

The parse format is similar to the well-known CoNLL format. Each sentence is made up of a list of tokens, one token per line, with a tab-separated list giving the attributes for each token. Sentences are delimited by empty lines.

For each sentence a metadata line must be defined. This consists of the sentence ID (anything alphanumeric is fine) and the sentence parse score (used as a feature, can always be set to zero).

Token attributes are as follows. They are used as features inside the decoder and may be set as you wish; however, the dependencies must be set correctly for any kind of translation to work.

ID DPND CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER
ID token ID (0, 1, …, n-1 for n tokens)
DPND ID of parent in dependency tree (-1=root)
CONTENT surface form of token
LEMMA lemmatized form
POS part of speech
CONTENT_P is this a content word? (1=yes, 0=no, -1=punctuation)
CAT phrase category
TYPE phrase type
OTHER any other information

Parse format example:

# ID=SAMPLE_TRAIN_00001 SCORE=-15.1648
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	book	book	NN	1	NP	_	NP:NN/book/3
4	1	.	.	.	-1	.	_	././4

# ID=SAMPLE_TRAIN_00002 SCORE=-15.4925
0	1	i	i	PRP	1	NP	_	NP:PRP/I/0
1	-1	read	read	VBP	1	S1	_	S1:S:VP:VBP/read/1
2	3	a	a	DT	0	DT	_	DT/a/2
3	1	newspaper	newspaper	NN	1	NP	_	NP:NN/newspaper/3
4	1	.	.	.	-1	.	_	././4

...
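
To make the format concrete, here is a minimal Python sketch (not part of the toolkit; the names are illustrative) that reads a file in this parse format into a simple in-memory structure, following the attribute list above.

# Illustrative sketch only: read sentences in the tree parse format described above.
# Each sentence is a "# ID=... SCORE=..." metadata line followed by one
# tab-separated token per line; sentences are separated by blank lines.

from typing import Dict, Iterator

FIELDS = ["id", "dpnd", "content", "lemma", "pos", "content_p", "cat", "type", "other"]

def read_parses(path: str) -> Iterator[Dict]:
    sentence = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():                      # blank line ends a sentence
                if sentence and sentence["tokens"]:
                    yield sentence
                sentence = None
            elif line.startswith("#"):                # metadata line: "# ID=... SCORE=..."
                meta = dict(kv.split("=", 1) for kv in line[1:].split())
                sentence = {"id": meta.get("ID"),
                            "score": float(meta.get("SCORE", 0.0)),
                            "tokens": []}
            else:                                     # token line: tab-separated attributes
                token = dict(zip(FIELDS, line.split("\t")))
                token["id"] = int(token["id"])
                token["dpnd"] = int(token["dpnd"])    # -1 marks the root
                sentence["tokens"].append(token)
    if sentence and sentence["tokens"]:
        yield sentence

if __name__ == "__main__":
    import sys
    for sent in read_parses(sys.argv[1]):
        root = next(t for t in sent["tokens"] if t["dpnd"] == -1)
        print(sent["id"], "root:", root["content"])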

Alignment Format

The alignment format is similar to the Pharaoh format used by Moses. Odd-numbered lines consist of the sentence ID preceded by a hash mark. Even-numbered lines consist of space-delimited alignments of the form s_1,…,s_m-t_1,…,t_n, where source IDs s_1, …, s_m are aligned to target IDs t_1, …, t_n. Please note that we only support 'rectangular' alignments: if we have 0-0, 0-1 and 1-0, we also require 1-1.

Alignment format example:

# SAMPLE_TRAIN_00001
0-0 4-1 1,3-2 2-3 5-4
# SAMPLE_TRAIN_00002
0-0 4-1 1,3-2 2-3 5-4
...
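
The following Python sketch (an illustration only; the helper names are not part of KyotoEBMT) parses one alignment line into individual source-target links and checks the 'rectangular' constraint described above.

# Illustrative sketch: parse an alignment line such as "0-0 4-1 1,3-2 2-3 5-4"
# and check the 'rectangular' property.

from itertools import product

def parse_alignment_line(line):
    """Return the set of (source_id, target_id) links encoded by one alignment line."""
    links = set()
    for group in line.split():
        src_part, tgt_part = group.rsplit("-", 1)          # e.g. "1,3-2"
        sources = [int(i) for i in src_part.split(",")]
        targets = [int(i) for i in tgt_part.split(",")]
        links.update(product(sources, targets))
    return links

def is_rectangular(links):
    """True if, whenever s-t, s-t' and s'-t are present, s'-t' is present too."""
    for (s, t), (s2, t2) in product(links, repeat=2):
        if (s, t2) in links and (s2, t) not in links:
            return False
    return True

print(sorted(parse_alignment_line("0-0 4-1 1,3-2 2-3 5-4")))
print(is_rectangular({(0, 0), (0, 1), (1, 0), (1, 1)}))    # True
print(is_rectangular({(0, 0), (0, 1), (1, 0)}))            # False: 1-1 is missing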

Hypothesis Serialization

Hypothesis files store the content of the initial translation hypotheses used by the decoder. They are used in particular during tuning to save time, as the translation examples do not need to be retrieved again at each iteration. The EMS generates them during STAGE_HYP.

Hypothesis files can be output by specifying --output_mode hyp to the decoder. In this mode no translation is performed; only the intermediate file is written. Translation can then be resumed by specifying --input_mode hyp and providing a hypothesis file to translate.

Modifying hypothesis files directly is an easy way to add features and make other such simple adjustments without modifying the decoder's source code.

The serialization format is designed to be human-readable but reasonably efficient. The general outline of a generated file is:

input_sentence_id
<input_tree>
START
list of feature names
Pattern1
<list_of_hypothesis>
Pattern2
<list_of_hypothesis>
...
END

A pattern is represented by a line:

P|pattern|root|parent

where pattern is the space-separated list of input positions (or spans, for forest input), root is the input position at the root of the pattern, and parent is the position of the root's parent.

A hypothesis is represented by a single line:

score|t_string|dpnd|parent_bond_relationship|additionals_list|root_pos|bond_pos_map|features_values|tm_id|i_t_alignment|pos_list|cat_list|type_list|s_bond|t_bond|tm_counter

The most important fields are:

  • t_string (space-delimited list of target tokens, with [Xn] meaning non-terminal for input position n)

  • dpnd (space-delimited list of target-side dependencies)

  • additionals_list (space-delimited list of flexible position non-terminals, where each non-terminal is of the form input_position:parent_position:pre_insertion_positions_list:post_insertion_positions_list)

  • features_values (list of feature values)

  • tm_id (ID of translation example used to build the hypothesis)

  • i_t_alignment (list of input-target alignments, where each link is of the form N;x;y where x and y are lists of input/target positions respectively)

The other fields are mainly experimental. Please note that the loading code is not flexible at all; even a trailing space is enough to generate a parsing error.

We use the following escaping rules: space->\_, \->\\, |->\p, EOL->\n, empty_string->\0.
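
As an illustration of these rules, the short Python sketch below (not from the toolkit; the helper names are hypothetical) shows one way to unescape a field and to split a serialized line into its |-separated fields.

# Illustrative sketch of the escaping rules above. A line is first split on '|'
# (a literal '|' is escaped as \p), then each field is unescaped.

_UNESCAPE = {"_": " ", "\\": "\\", "p": "|", "n": "\n", "0": ""}

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append(_UNESCAPE.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

def split_fields(line: str):
    """Split a serialized line (pattern or hypothesis) into unescaped fields."""
    return [unescape(f) for f in line.rstrip("\n").split("|")]

print(unescape(r"a\_b\pc"))   # a field containing a space and a literal '|': "a b|c"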

Translation/Decoding

Basic Usage

The decoder is run with the following command:

bin/KyotoEBMT [options] input-file

The input can either be a parsed sentence (--input_mode tree for tree input, --input_mode forest for forest input) or an intermediate hypothesis file (--input_mode hypothesis).

A range of options can be specified, corresponding to paths and settings for the decoder. The most common options are given below.

--input Input parse filename.
--input_mode Input mode (tree/forest/hypothesis).
--output_mode Output mode (tuning/eval/hypothesis).
--source_language Source language (en/ja/zh).
--target_language Target language (en/ja/zh).
--tr_db_filenames Example database.
--weight Feature weights file.
--lm LM file.
--nb_threads Number of threads to use.
--fast_mode Fast but lower translation quality.
--beam_width Beam width.
--n_best_length Length of n-best list.

For full details, run the command below. Please be warned that some of the options given in the full list are still experimental or not supported.

bin/KyotoEBMT -h
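
As an illustration, the sketch below assembles a typical command line from the common options above and runs it with Python's subprocess module. All file paths and numeric values here are placeholders to be replaced with your own data and settings.

# Illustrative only: build and run a decoder command using the documented options.
import subprocess

cmd = [
    "bin/KyotoEBMT",
    "--input_mode", "tree",
    "--input", "input.parse",       # placeholder: parsed input file
    "--source_language", "ja",
    "--target_language", "en",
    "--tr_db_filenames", "edb",     # placeholder: example database
    "--weight", "weights.txt",      # placeholder: feature weights file
    "--lm", "lm.binary",            # placeholder: language model file
    "--nb_threads", "4",
    "--beam_width", "50",
    "--n_best_length", "100",
]
subprocess.run(cmd, check=True)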

Forest Input

Dependency forests can be used as an alternative input format for the decoder by specifying --input_mode forest. Currently we do not support using forests for model training.

The format is similar to the tree parse format; however, there are a number of differences. A forest consists of a metadata line, similar to the one in the tree format, followed by three parts (the list of nodes, the list of hyperedges and the list of edge scores), separated by blank lines. The format of each part is as follows:

List of nodes

One line for each forest node. The same word may appear on several lines. Node attributes are as follows:

SPAN_ID SPAN WORD_NUMBER CONTENT LEMMA POS CONTENT_P CAT TYPE OTHER

  • SPAN_ID: Unique identifier for the forest node (0, 1, …, n-1)

  • SPAN: A pair of token positions “a,b” where a and b are respectively the lowest and highest position of tokens in the subtree under (and including) this node

  • WORD_NUMBER: Position of the corresponding word in the sentence (0, 1, …, n-1)

  • CONTENT, …, OTHER: Same as tree format

List of hyperedges

One line for each hyperedge in the forest. A hyperedge consists of the dependencies between one (parent) node and all of its children.

Hyperedge attributes are as follows:

PARENT CHILDREN SCORE

  • PARENT: SPAN_ID of the parent node

  • CHILDREN: List of the children SPAN_IDs (comma-separated)

  • SCORE: Score of the hyperedge

List of edge scores

One line for each dependency (between one parent and one child node) in the forest.

Edge attributes are as follows:

PARENT CHILD SCORE

  • PARENT: SPAN_ID of the parent node

  • CHILD: SPAN_ID of the child

  • SCORE: Score of the dependency edge

Example forest

# ID=SENT00001
0	0,1	0	this	this	DT	0	NP	_	NP:DT/This/0
1	0,1	0	this	this	RB	1	RB	_	RB/This/0
2	0,1	0	this	this	DT	0	DT	_	DT/This/0
3	0,4	0	this	this	DT	0	S1	_	S1:SBAR:DT/This/0
4	0,4	1	is	be	AUX	1	S1	_	S1:S:VP:AUX/is/1
5	0,4	1	is	be	AUX	1	S1	_	S1:SBAR:S:VP:AUX/is/1
6	1,4	1	is	be	AUX	1	S	_	S:VP:AUX/is/1
7	2,3	2	a	a	DT	0	DT	_	DT/a/2
8	2,4	3	pen	pen	NN	1	NP	_	NP:NN/pen/3

3	6	0.0335287500
4	0,8	0.1571000000
4	2,8	0.1345562500
4	1,8	0.0972187500
5	0,8	0.0066725000
6	8	0.0335287500
8	7	0.1340837500

4	0	0.0598812500
5	0	0.0033362500
4	1	0.0000000000
4	2	0.0373375000
3	6	0.0335287500
8	7	0.1340837500
4	8	0.0972187500
5	8	0.0033362500
6	8	0.0335287500
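
The following Python sketch (illustrative only, assuming tab-separated columns as in the tree format) reads one forest in this format, splitting the node, hyperedge and edge-score parts on blank lines.

# Illustrative sketch: read a single forest file in the format described above.
# A metadata line is followed by three blank-line-separated parts:
# nodes, hyperedges and edge scores.

def read_forest(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        header, body = f.read().strip().split("\n", 1)
    assert header.startswith("#"), "expected a metadata line"
    node_part, hyperedge_part, edge_part = body.strip().split("\n\n")[:3]

    nodes = {}
    for line in node_part.split("\n"):
        cols = line.split("\t")
        lo, hi = (int(x) for x in cols[1].split(","))
        nodes[int(cols[0])] = {"span": (lo, hi), "word_number": int(cols[2]),
                               "content": cols[3], "lemma": cols[4], "pos": cols[5],
                               "content_p": int(cols[6]), "cat": cols[7],
                               "type": cols[8], "other": cols[9]}

    hyperedges = []
    for line in hyperedge_part.split("\n"):
        parent, children, score = line.split("\t")
        hyperedges.append({"parent": int(parent),
                           "children": [int(c) for c in children.split(",")],
                           "score": float(score)})

    edges = []
    for line in edge_part.split("\n"):
        parent, child, score = line.split("\t")
        edges.append((int(parent), int(child), float(score)))

    return {"metadata": header, "nodes": nodes, "hyperedges": hyperedges, "edges": edges}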

Experimental Management System (EMS)

Configuration

The settings for the EMS are specified in the var file within the experiment directory. The settings mainly concern locations of external tools and experimental data, and options for the decoder (such as beam width). Please use non-default settings at your own risk!

Important settings include:

Option Description
TARGET_STAGE Final translation stage to reach.
FORCE_COMPLETE_STAGES Override dependency checks for specified stage.
DRY_RUN Print commands instead of running commands.
AUTOMATIC Say 'yes' automatically to all prompts.
N_BEST Length of n-best list for tuning.
RERANK_SIZE Length of n-best list for reranking.
BEAM_WIDTH Decoding beam width.
EBMT_OPTIONS Additional decoder options.
NB_THREADS Number of threads for decoding.
USE_GXP Use GXP tool for cluster-level training.
LANG_S Source language code (en/ja/zh).
LANG_T Target language code (en/ja/zh).
INPUT_MODE Set input mode (tree/forest).

Stages and Files

Partial experiments can be performed by specifying the TARGET_STAGE option in the EMS configuration file. The available stages are as follows:

Stage Description Dependencies
STAGE_CORPUS Prepare training corpus. None
STAGE_DIC Prepare dictionary. None
STAGE_PARSE Parse training corpus. STAGE_CORPUS
STAGE_SEGMENT Segment parsed corpus. STAGE_PARSE
STAGE_ALIGN Align training corpus. STAGE_SEGMENT
STAGE_LM Train language model. STAGE_SEGMENT
STAGE_EDB Build example database. STAGE_PARSE STAGE_ALIGN
STAGE_HYP Build initial hypotheses. STAGE_EDB STAGE_DIC
STAGE_TUNING Tune on development set. STAGE_HYP STAGE_LM
STAGE_EVAL Translate and evaluate on test set. STAGE_TUNING
STAGE_RNNLM Build RNNLM model. STAGE_EVAL
STAGE_RERANK Reranking with RNNLM. STAGE_EVAL

Each stage has corresponding data, which is stored under the experiment root directory ROOT_DIR in a subdirectory with the same name as the stage (e.g. ROOT_DIR/corpus, ROOT_DIR/rerank).

Tuning

We currently support k-best batch MIRA (Cherry and Foster, 2012) (default) and PRO (Hopkins and May, 2011). Tuning is only available when using the EMS.

The files produced during tuning include (for iteration i):

tuning/weights.i Weights file for ith iteration.
tuning/weights.opt Best weights file for all iterations.
tuning/features.i Feature values for n-best list.
tuning/scores.i Scores for n-best list.
tuning/n_best.i n-best output from KyotoEBMT.
tuning/output.i 1-best output extracted from n-best list.
tuning/{init,mira,pro,megam}.i Intermediate data for MIRA/PRO.

Reranking

We also support reranking using recurrent neural network language models (RNNLMs) (Mikolov et al., 2010). This is achieved by specifying STAGE_RERANK in the EMS.

Features

Language Model

KyotoEBMT is designed to be used with a 5-gram language model using modified Kneser-Ney smoothing. The language model makes use of lower-order rest costs, so it must be built to include these. The EMS supports automatic LM generation using KenLM.

Lexical Translation Probabilities

The lexical translation probabilities obtained from alignment are used as a feature for decoding. These are generated automatically from GIZA++ output by the EMS.

Default Dependency Feature

When it is unclear whether an ‘additional’ word should be attached as a left or right child, we first try to use the rules (left or right) in the ‘default dependency’ file. This is generated automatically from the training corpus by the EMS.

Optional Features

While KyotoEBMT is designed to work with minimal features, you may wish to add some of the optional features to improve translation quality. These include the following:

Dictionary

A dictionary file for words and punctuation can be specified (dic/words.SRC-TRG and dic/punc.SRC-TRG in the EMS experiment directory, or with --dic_words_filename/--dic_punc_filename command-line options). Dictionaries are tab-separated files containing a list of source-target word pairs that will be added to the list of translation hypotheses retrieved from the example database.

Example:

バンジョー  banjo
ギター  guitar

For most experiments a dictionary is not necessary, and an empty file can be used.
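
For reference, such a dictionary file can be loaded with a few lines of Python, as in the purely illustrative sketch below.

# Illustrative sketch: load a tab-separated dictionary file into a source -> targets map.
from collections import defaultdict

def load_dictionary(path: str):
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:        # skip blank lines (an empty file is also fine)
                continue
            source, target = fields[0], fields[1]
            entries[source].append(target)
    return entries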

Word Similarity

Currently unsupported.

Example Reliability

Currently unsupported.

Kanji Dictionary

Currently unsupported.

Full List of Features

Below is the full list of features.

Feature Description
NULL_content Number of null-aligned content words in source side of example.
NULL_function Number of null-aligned function words in source side of example.
abnormal_child_bond Number of bonds with different source and target tree structures.
additional_in_approved_position Equal to one unless a subtree is inserted in an unspecified position.
alternative_word_penalty Number of times an optional word rewriting rule (e.g. 'is' -> 'are') is used.
both_ROS Do the example and input subtree both contain the sentence root?
child_bond_count Number of child bonds (references) available.
child_bond_similarity Similarity of the nodes in source and input for child bonds.
content_to_content_match +1 if a content word is aligned to a content word.
content_to_function_match +1 if a content word is aligned to a function word.
different_parent_cat Different parent bond category in input and source.
empty_hyp Does the example have no target words?
example_freq Number of TMs creating this hypothesis.
example_penalty +1 for each example.
example_reliability Reliability (alignment probability) of training sentence used.
function_to_content_match +1 if a function word is aligned to a content word.
function_to_function_match +1 if a function word is aligned to a function word.
inconsistent_child_bond_position +1 if the order of a child bond among siblings is different from that in the input sentence.
lex_trans_prob_s_t Lexical translation probability of a source word given target.
lex_trans_prob_t_s Lexical translation probability of a target word given source.
lm Total LM score of translation.
log_pattern_frequency Log of number of examples matching pattern.
nb_additionals Number of additionals.
nb_optionals Number of optionals in initial hypotheses.
nb_optionals_skipped Number of optionals removed during decoding.
no_redundant_child Are there no children in the input not covered by the source?
numeral_mismatch Number of mismatched numerals in example (e.g. 5 not 7).
one_side_ROS Does either (but not both) example or input subtree contain the sentence root?
oov Is the hypothesis OOV?
parent_bond_LD Language dependent score for parent bond (e.g. POS equality).
parent_bond_available Is it possible to use the parent bond?
parent_bond_similarity Similarity of the nodes in source and input for parent bonds.
parse_score Sum of the edge scores in the example.
pr_mismatch_add Same as pr_mismatch_ref, but for the case when the subtree is inserted in an additional position.
pr_mismatch_ref Number of times a subtree inserted in a reference position had an incompatible parent_bond_relationship.
ratio_optionals Number of optionals divided by length of example.
real_oov Number of OOV words in output sentence (for target LM).
real_target_size Length of output sentence (no optionals).
redundant_child_in_input Number of redundant children in input.
redundant_child_in_source Number of redundant children in source.
root_LD A combination of language dependent features for the example root.
same_parent_cat Same parent bond category in input and source.
same_root_content Are root words of input and source the same?
sibling Number of sibling examples.
size Pattern size squared divided by input length squared.
size_x_log_pattern_frequency Length of example multiplied by log_pattern_frequency.
source_sibling +1 if the pattern for the input is sibling.
t_string_frequency Proportion of examples for a given pattern having the same t_string.
target_sibling +1 if target side of the example is sibling.
trans_lm_prob Translation probability for optional words.
trans_prob Translation probability.
trans_prob_multiplied_by_size Translation probability multiplied by pattern size.