A Step-by-Step Manual of NiuTrans.Hierarchy
-
The NiuTrans system is a "data-driven" MT system, which requires data for training and/or tuning. Users must prepare the following data files before running the system.
a) Training data: bilingual sentence-pairs and word alignments.
b) Tuning data: source sentences with one or more reference translations.
c) Test data: some new sentences.
d) Evaluation data: reference translations of test sentences.
The NiuTrans package includes some sample files for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".
sample-submission-version/
-- TM-training-set/ # word-aligned bilingual corpus (100,000 sentence-pairs)
-- chinese.txt # source sentences
-- english.txt # target sentences (case-removed)
-- Alignment.txt # word alignments of the sentence-pairs
-- LM-training-set/
-- e.lm.txt # monolingual corpus for training language model (100K target sentences)
-- Dev-set/
-- Niu.dev.txt # development dataset for weight tuning (400 sentences)
-- Test-set/
-- Niu.test.txt # test dataset (1K sentences)
-- Reference-for-evaluation/
-- Niu.test.reference # references of the test sentences (1K sentences)
-- description-of-the-sample-data # a description of the sample data
-
Format: please unpack "NiuTrans/sample-data/sample.tar.gz" and refer to "description-of-the-sample-data" for more information about the data format.
-
In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g. how to train MT models, tune feature weights, and decode test sentences).
- Instructions (perl is required. Also, Cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir work/model.hierarchy/ -p
$> cd scripts/
$> perl NiuTrans-hierarchy-train-model.pl \
-src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
-aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
-out ../work/model.hierarchy/hierarchy.rule.table
"-out" specifies the generated hierarchy rule table.
"-src", "-tgt" and "-aln" specify the source sentences, the target sentences and the alignments between them (one sentence per line).
- Output: one file is generated and placed in "NiuTrans/work/model.hierarchy/":
- hierarchy.rule.table # hierarchy rule table
- Note: Please enter the "scripts/" directory before running the script "NiuTrans-hierarchy-train-model.pl".
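Because "-src", "-tgt" and "-aln" are expected to be parallel (one sentence per line), comparing line counts before training can catch a truncated or misaligned file early. The following is only a minimal sketch and not part of the NiuTrans package: the function name same_line_count is hypothetical, and it assumes every file ends with a newline, since "wc -l" counts newline characters.

```shell
# Hypothetical helper (not shipped with NiuTrans): succeeds only if all
# given files are readable and contain the same number of lines.
same_line_count() {
    expected=""
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 2
        fi
        # "wc -l < file" prints only the count; tr removes padding that
        # some wc implementations add.
        n=$(wc -l < "$f" | tr -d ' ')
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" != "$expected" ]; then
            echo "line count mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    return 0
}

# Example (run from NiuTrans/scripts/):
#   d=../sample-data/sample-submission-version/TM-training-set
#   same_line_count "$d/chinese.txt" "$d/english.txt" "$d/Alignment.txt"
```

The function returns 0 when all counts agree, 1 on a mismatch, and 2 when a file cannot be read, so it can be chained in front of the training command with "&&".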
- Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
-corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
-ngram 3 \
-vocab ../work/lm/lm.vocab \
-lmbin ../work/lm/lm.trie.data
"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.
"-vocab" specifies where the target-side vocabulary is generated.
"-lmbin" specifies where the language model file is generated.
- Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab # target-side vocabulary
- lm.trie.data # binary-encoded language model
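The later steps read the rule table and the language model files produced so far, so it may be worth confirming that they exist and are non-empty before continuing. This is a sketch under assumptions, not a NiuTrans tool: the function name require_nonempty is hypothetical, and a non-empty file is of course no guarantee that training finished correctly.

```shell
# Hypothetical helper (not shipped with NiuTrans): reports every file that
# is missing or empty and fails if there is at least one.
require_nonempty() {
    rc=0
    for f in "$@"; do
        # "-s" is true only for an existing file with size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (run from NiuTrans/scripts/):
#   require_nonempty ../work/model.hierarchy/hierarchy.rule.table \
#                    ../work/lm/lm.vocab \
#                    ../work/lm/lm.trie.data
```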
- Instructions
$> cd scripts/
$> perl NiuTrans-hierarchy-generate-mert-config.pl \
-rule ../work/model.hierarchy/hierarchy.rule.table \
-lmdir ../work/lm/ \
-nref 1 \
-ngram 3 \
-out ../work/NiuTrans.hierarchy.user.config
"-rule" specifies the hierarchy rule table.
"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.
"-nref" specifies how many reference translations per source-sentence are provided.
"-ngram" specifies the order of n-gram language model.
"-out" specifies the output (i.e. a config file).
- Output: a config file is generated and placed in "NiuTrans/work/":
- NiuTrans.hierarchy.user.config # configuration file for MERT and decoding
-
This step uses the development dataset and the test dataset to filter the rule table. If you are not interested in this step, jump to step 6 (weight tuning).
-
Instructions (perl is required)
$> cd ..
$> cat sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
> sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt
$> bin/NiuTrans.PhraseExtractor --FILPD \
-dev sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt \
-in work/model.hierarchy/hierarchy.rule.table \
-out work/model.hierarchy/hierarchy.rule.table.filterDevAndTest \
-maxlen 10 \
-rnum 1
$> vim work/NiuTrans.hierarchy.user.config
param="SCFG-Rule-Set" value="../work/model.hierarchy/hierarchy.rule.table"
modified to
param="SCFG-Rule-Set" value="../work/model.hierarchy/hierarchy.rule.table.filterDevAndTest"
"-dev" specifies the development dataset (or tuning set) for weight tuning; here the development dataset and the test dataset are merged.
"-in" specifies the input hierarchical rule table.
"-out" specifies the output (filtered) hierarchical rule table.
"-maxlen" specifies the maximum length of a rule in the hierarchical rule table.
"-rnum" specifies how many reference translations per source-sentence are provided.
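- Output: the filtered rule table "hierarchy.rule.table.filterDevAndTest" is generated in "NiuTrans/work/model.hierarchy/", and "NiuTrans/work/NiuTrans.hierarchy.user.config" is edited to use it.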
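The manual edit of the "SCFG-Rule-Set" line can also be scripted. The sketch below is not part of NiuTrans: the function name set_rule_table is hypothetical, and it assumes the config line has exactly the form shown above (param="SCFG-Rule-Set" followed by value="...") and that the new path contains no "|", "&" or backslash characters, which would be interpreted by sed.

```shell
# Hypothetical helper (not shipped with NiuTrans): rewrites the value of
# the "SCFG-Rule-Set" parameter in a config file. All other lines are
# copied unchanged.
set_rule_table() {
    config=$1
    new_value=$2
    tmp="$config.tmp.$$"
    # '|' is the sed delimiter because the value is a path containing '/'.
    # The result goes to a temporary file first, so the original config is
    # replaced only if sed succeeds.
    sed "s|\(param=\"SCFG-Rule-Set\"[[:space:]]*value=\"\)[^\"]*\"|\1$new_value\"|" \
        "$config" > "$tmp" && mv "$tmp" "$config"
}

# Example (run from NiuTrans/):
#   set_rule_table work/NiuTrans.hierarchy.user.config \
#       ../work/model.hierarchy/hierarchy.rule.table.filterDevAndTest
```

Writing to a temporary file and renaming it avoids relying on "sed -i", whose syntax differs between sed implementations.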
- Instructions (perl is required)
$> perl NiuTrans-hierarchy-mert-model.pl \
-config ../work/NiuTrans.hierarchy.user.config \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
-nref 1 \
-round 2 \
-log ../work/mert-model.log
"-config" specifies the configuration file generated in the previous steps.
"-dev" specifies the development dataset (or tuning set) for weight tuning.
"-nref" specifies how many reference translations per source-sentence are provided.
"-round" specifies how many rounds the MERT performs (by default, 1 round = 15 MERT iterations).
"-log" specifies the log file generated by MERT.
- Output: the optimized feature weights are recorded in the configuration file "NiuTrans/work/NiuTrans.hierarchy.user.config". They will then be used in decoding the test sentences.
- Instructions (perl is required)
$> perl NiuTrans-hierarchy-decoder-model.pl \
-config ../work/NiuTrans.hierarchy.user.config \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.txt \
-output 1best.out
"-config" specifies the configuration file.
"-test" specifies the test dataset (one sentence per line).
"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out # 1-best translation of the test sentences
- Instructions (perl is required)
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f 1best.out \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
"-1f" specifies the file of the 1-best translations of the test dataset.
"-tf" specifies the file of the source sentences and their reference translations of the test dataset.
"-rnum" specifies how many reference translations per source-sentence are provided.
"-r" specifies the file of the reference translations.
"-s" specifies the file of the source sentences.
"-t" specifies the file of (1-best) translations generated by the MT system.
-
Output: The IBM-version BLEU score is displayed. If everything goes well, you will obtain a score of about 0.2417 (0.2386) for the sample data set.
-
Note: the script mteval-v13a.pl relies on the Perl package XML::Parser. If XML::Parser is not installed on your system, use the following commands to install it.
$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make install
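To find out whether the installation is needed at all, one can ask perl to load the module. This is a minimal sketch using the standard "-M" and "-e" switches of perl; the function name have_perl_module is hypothetical and not part of NiuTrans.

```shell
# Hypothetical helper: succeeds if the perl found on PATH can load the
# named module. "-M<module>" loads the module and "-e 1" runs a trivial
# program, so the exit status reflects only whether loading worked.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Example:
#   have_perl_module XML::Parser || echo "XML::Parser is not installed"
```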