
A Step-by-Step Manual of NiuTrans.Hierarchy


1. Data Preparation

  • The NiuTrans system is a "data-driven" MT system which requires "data" for training and/or tuning. Users need to prepare the following data files before running the system.

    a). Training data: bilingual sentence-pairs and word alignments.

    b). Tuning data: source sentences, each with one or more reference translations.

    c). Test data: new source sentences to be translated.

    d). Evaluation data: reference translations of the test sentences.

In the NiuTrans package, some sample files are offered for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".

sample-submission-version/
  -- TM-training-set/                   # word-aligned bilingual corpus (100,000 sentence-pairs)
       -- chinese.txt                   # source sentences
       -- english.txt                   # target sentences (case-removed)
       -- Alignment.txt                 # word alignments of the sentence-pairs
  -- LM-training-set/
       -- e.lm.txt                      # monolingual corpus for training language model (100K target sentences)
  -- Dev-set/
       -- Niu.dev.txt                   # development dataset for weight tuning (400 sentences)
  -- Test-set/
       -- Niu.test.txt                  # test dataset (1K sentences)
  -- Reference-for-evaluation/
       -- Niu.test.reference            # references of the test sentences (1K sentences)
  -- description-of-the-sample-data     # a description of the sample data
  • Format: please unpack "NiuTrans/sample-data/sample.tar.gz" and refer to "description-of-the-sample-data" for more information about the data format.
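
The hypothetical fragment below only illustrates the layout: line i of "chinese.txt", "english.txt" and "Alignment.txt" together form one training example, and the alignment line is assumed here to use "sourceIndex-targetIndex" word pairs (the authoritative format is defined in "description-of-the-sample-data"):

chinese.txt:     我 喜欢 音乐
english.txt:     i like music
Alignment.txt:   0-0 1-1 2-2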

  • In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g. how to train MT models, tune feature weights, and decode test sentences).

2. Obtaining Hierarchy Rules

  • Instructions (perl is required. Also, Cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.hierarchy/
$> cd scripts/
$> perl NiuTrans-hierarchy-train-model.pl \
        -src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
        -tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
        -aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
        -out ../work/model.hierarchy/hierarchy.rule.table

"-out" specifies the generated hierarchy rule table.

"-src", "-tgt" and "-aln" specify the source sentences, the target sentences and the alignments between them (one sentence per line).

  • Output: one file is generated and placed in "NiuTrans/work/model.hierarchy/":
- hierarchy.rule.table                     # hierarchy rule table
  • Note: Please enter the "scripts/" directory before running the script "NiuTrans-hierarchy-train-model.pl".
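
As a rough illustration of what a rule-table entry may look like: the line below is a generic "source ||| target ||| feature scores" sketch, not a confirmed NiuTrans format. Hierarchical rules contain nonterminal gaps (X1, X2) that are rewritten by other rules, which is what allows reordering such as the classic 的-construction below; the feature values are placeholders.

X1 的 X2 ||| X2 of X1 ||| 0.30 0.25 0.41 0.38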

3. Training the N-gram Language Model

  • Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
        -corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
        -ngram  3 \
        -vocab  ../work/lm/lm.vocab \
        -lmbin  ../work/lm/lm.trie.data

"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.

"-vocab" specifies where the target-side vocabulary is generated.

"-lmbin" specifies where the language model file is generated.

  • Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab                           # target-side vocabulary
- lm.trie.data                       # binary-encoded language model
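
A quick sanity check that both files were produced (still from the "scripts/" directory):

$> ls ../work/lm/
lm.trie.data  lm.vocab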

4. Generating the Configuration File

  • Instructions
$> cd scripts/                            # if you are not already in "scripts/"
$> perl NiuTrans-hierarchy-generate-mert-config.pl \
        -rule  ../work/model.hierarchy/hierarchy.rule.table \
        -lmdir ../work/lm/ \
        -nref  1 \
        -ngram 3 \
        -out   ../work/NiuTrans.hierarchy.user.config

"-rule" specifies the hierarchy rule table.

"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.

"-ngram" specifies the order of n-gram language model.

"-out" specifies the output (i.e. a config file).

  • Output: a config file is generated and placed in "NiuTrans/work/":
- NiuTrans.hierarchy.user.config           # configuration file for MERT and decoding
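
The config file is a plain-text list of param/value pairs read by the MERT and decoding steps. For example, the rule table is referenced by the entry below (the same line edited in step 5); other entries, whose exact names are not reproduced here, record the LM location, the n-gram order and the feature weights:

param="SCFG-Rule-Set"          value="../work/model.hierarchy/hierarchy.rule.table"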

5. Rule Filtering

  • Filter the rule table with the development and test datasets. If you are not interested in this step, skip to step 6 (weight tuning).

  • Instructions (perl is required)

$> cd ..
$> cat sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
       sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
       > sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt
$> bin/NiuTrans.PhraseExtractor --FILPD \
        -dev     sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt \
        -in      work/model.hierarchy/hierarchy.rule.table \
        -out     work/model.hierarchy/hierarchy.rule.table.filterDevAndTest \
        -maxlen  10 \
        -rnum    1
$> vim work/NiuTrans.hierarchy.user.config
   change the line
   param="SCFG-Rule-Set"          value="../work/model.hierarchy/hierarchy.rule.table"
   to
   param="SCFG-Rule-Set"          value="../work/model.hierarchy/hierarchy.rule.table.filterDevAndTest"

"-dev" specifies the development dataset (or tuning set) used for filtering; here we merge the development and test datasets so the rule table is filtered against both.

"-in" specifies the input hierarchical rule table.

"-out" specifies the output (filtered) hierarchical rule table.

"-maxlen" specifies the maximum length of a hierarchical rule.

"-rnum" specifies how many reference translations per source sentence are provided.

  • Output: the filtered rule table "hierarchy.rule.table.filterDevAndTest" is generated in "NiuTrans/work/model.hierarchy/", and "NiuTrans/work/NiuTrans.hierarchy.user.config" is updated to point to it.

6. Weight Tuning

  • Instructions (perl is required)
$> cd scripts/
$> perl NiuTrans-hierarchy-mert-model.pl \
        -config ../work/NiuTrans.hierarchy.user.config \
        -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
        -nref   1 \
        -round  2 \
        -log    ../work/mert-model.log

"-config" specifies the configuration file generated in the previous steps.

"-dev" specifies the development dataset (or tuning set) for weight tuning.

"-nref" specifies how many reference translations per source-sentence are provided

"-round" specifies how many rounds the MERT performs (by default, 1 round = 15 MERT iterations).

"-log" specifies the log file generated by MERT.

  • Output: the optimized feature weights are recorded in the configuration file "NiuTrans/work/NiuTrans.hierarchy.user.config". They will then be used in decoding the test sentences.

7. Decoding Test Sentences

  • Instructions (perl is required)
$> perl NiuTrans-hierarchy-decoder-model.pl \
        -config ../work/NiuTrans.hierarchy.user.config \
        -test   ../sample-data/sample-submission-version/Test-set/Niu.test.txt \
        -output 1best.out

"-config" specifies the configuration file.

"-test" specifies the test dataset (one sentence per line).

"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).

  • Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out                          # 1-best translation of the test sentences
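
To take a quick look at the output, e.g. the translation of the first test sentence:

$> head -n 1 1best.out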

8. Evaluation

  • Instructions (perl is required)
$> perl NiuTrans-generate-xml-for-mteval.pl \
        -1f   1best.out \
        -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
        -rnum 1
$> perl mteval-v13a.pl \
        -r ref.xml \
        -s src.xml \
        -t tst.xml

"-1f" specifies the file of the 1-best translations of the test dataset.

"-tf" specifies the file of the source sentences and their reference translations of the test dataset.

"-r" specifies the file of the reference translations.

"-s" specifies the file of source sentence.

"-t" specifies the file of (1-best) translations generated by the MT system.

  • Output: The IBM-version BLEU score is displayed. If everything goes well, you will obtain a score of about 0.2417 (0.2386) for the sample data set.

  • Note: the script mteval-v13a.pl relies on the package XML::Parser. If XML::Parser is not installed on your system, install it with the following commands.

$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make
$> make install
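
Alternatively, if the cpan client is available on your system, XML::Parser can usually be installed in one step:

$> cpan XML::Parser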