Use Your Own Data, Linux Only (for Chinese-English and English-Chinese translation tasks)
Overview of data preprocessing for bilingual data, development data, test data and monolingual data (for language model training)
Here is a quick overview of the main steps in data preprocessing. Detailed descriptions of each step are presented in the following sections.
- Processing Bilingual Data
1: use NiuTrans-clear.illegal.char.pl to clean the data (see Step 2 for more details)
2: use the data pre-processing script to process the Chinese and English sentence files separately
(see Steps 4 and 5 for more details)
NOTE: for bilingual data, we suggest turning on the NE generalization option
(i.e., replacing number, time and date entities with symbols) and turning off the NE translation option;
in other words, set "-method" in NiuTrans-running-segmenter.pl to "01".
3: use word-alignment tools to generate word-to-word alignments for the bilingual data (see Step 7 for more details)
- Processing Development Data
1: Process the source language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "11" to generalize
and translate number, time and date NEs (see Steps 4 and 5 for more details).
2: Process the target language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "00" to disable
both NE generalization and NE translation (see Steps 4 and 5 for more details).
3: Generate the development data file (see Step 6)
- Processing Test Data
1: Process the source language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "11" to generalize and
translate number, time and date NEs (see Steps 4 and 5 for more details). This is in principle the same as the first step
used in "Processing Development Data".
- Processing Monolingual Data (for language model training)
1: Merge the target-side of the bilingual data and additional (large-scale) target-language data
2: Process these sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "01" to generalize number,
time and date NEs without translating them (see Steps 4 and 5 for more details).
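The "-method" values listed above can be collected in a small helper so that each kind of data is always preprocessed with the same setting. This is only a minimal sketch: the function name method_for and the role labels are hypothetical and are not part of the NiuTrans package.

```shell
#!/bin/sh
# Hypothetical helper (not part of NiuTrans): map a data role to the
# "-method" value recommended in the overview.
method_for() {
    case "$1" in
        bilingual|monolingual) echo 01 ;;  # generalize NEs, no NE translation
        dev-src|test-src)      echo 11 ;;  # generalize and translate NEs
        dev-tgt)               echo 00 ;;  # no generalization, no translation
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}

method_for bilingual
method_for dev-src
method_for dev-tgt
```

The result could then be passed to the segmenter, for example as -method "$(method_for dev-src)".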
- NiuTrans allows users to build a machine translation system from their own bilingual data (currently for Chinese-English and English-Chinese translation tasks only). Here we provide 1K sample sentence pairs to show how to prepare your own data for running NiuTrans. The sample data is located in "NiuTrans/sample-data/sample-submission-version" in the NiuTrans package.
sample-submission-version/
  -- Raw-data/                # Original bilingual dataset (1K sentence pairs)
       -- chinese.raw.txt     # source sentences
       -- english.raw.txt     # target sentences
- Format: unpack "NiuTrans/sample-data/sample.tar.gz" and see "description-of-the-sample-data" for more information about the data format.
- Function: remove meaningless markup and characters from Chinese-English sentence pairs
- Instructions (perl is required. Also, cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzvf sample.tar.gz # skip this step if sample.tar.gz has already been unpacked
$> cd ../scripts/
$> mkdir ../work/preprocessing -p
$> perl NiuTrans-clear.illegal.char.pl \
-src ../sample-data/sample-submission-version/Raw-data/chinese.raw.txt \
-tgt ../sample-data/sample-submission-version/Raw-data/english.raw.txt \
-outSrc ../work/preprocessing/chinese.clean.txt \
-outTgt ../work/preprocessing/english.clean.txt
For the NiuTrans-clear.illegal.char.pl script:
"-src" specifies Chinese-sentence file
"-tgt" specifies English-sentence file
"-outSrc" specifies cleaned Chinese-sentence file
"-outTgt" specifies cleaned English-sentence file
- Output: two files are generated in “/NiuTrans/work/preprocessing”
- chinese.clean.txt # Chinese
- english.clean.txt # English
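After cleaning, it can be useful to confirm that the two output files still line up sentence by sentence, since the later word-alignment step works on sentence pairs. The sketch below assumes that the cleaned files keep one sentence per line on both sides; the helper name same_line_count is hypothetical, and the demo uses throwaway files rather than the real sample data.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    [ "$src_lines" -eq "$tgt_lines" ]
}

# Demonstration on throwaway files standing in for the cleaned data.
demo_dir=$(mktemp -d)
printf '%s\n' 'zh sentence 1' 'zh sentence 2' > "$demo_dir/chinese.clean.txt"
printf '%s\n' 'en sentence 1' 'en sentence 2' > "$demo_dir/english.clean.txt"

if same_line_count "$demo_dir/chinese.clean.txt" "$demo_dir/english.clean.txt"; then
    echo "line counts match"
else
    echo "line counts differ" >&2
fi
```

On the real data, the two arguments would be ../work/preprocessing/chinese.clean.txt and ../work/preprocessing/english.clean.txt.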
- Function: remove meaningless markup and characters from Chinese or English monolingual data
- Instructions (perl is required. Also, cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzvf sample.tar.gz # skip this step if sample.tar.gz has already been unpacked
$> cd ../scripts/
$> mkdir ../work/preprocessing -p
$> perl NiuTrans-monolingual.clear.illegal.char.pl \
-tgt ../sample-data/sample-submission-version/Raw-data/chinese.raw.txt \
-outTgt ../work/preprocessing/chinese.mono.clean.txt \
-lang zh
For the NiuTrans-monolingual.clear.illegal.char.pl script:
"-tgt" specifies the input file (one sentence per line)
"-outTgt" specifies the output file
"-lang" specifies the language: "zh" = Chinese, "en" = English
- Output: a file is generated in "/NiuTrans/work/preprocessing"
- chinese.mono.clean.txt # the result file
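- Function: Chinese word segmentation, named entity (NE) recognition, named entity generalization (i.e., NEs are replaced with symbols) and named entity translation (for date, time, and number entities)
- Instructions (perl is required. Also, cygwin is required for Windows users)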
$> perl NiuTrans-running-segmenter.pl \
-lang ch \
-input ../work/preprocessing/chinese.clean.txt \
-output ../work/preprocessing/chinese.clean.txt.prepro \
-method 01
For the NiuTrans-running-segmenter.pl script:
"-lang" specifies the language. For Chinese, the value of "-lang" should be "ch"
"-input" specifies the input file (one sentence per line)
"-output" specifies the output file
"-method" specifies the options used in NE recognition and generalization. When "00" is set, no NEs are generalized as symbols and no NE translations are provided; when "01" is set, the number, time and date entities are generalized as symbols but no translations are provided for these entities; when "11" is set, all number, time and date entities are generalized as symbols and translations are provided.
- Output: a file is generated in "/NiuTrans/work/preprocessing"
- chinese.clean.txt.prepro # the result file
- Function: English tokenization, named entity (NE) recognition, named entity generalization (i.e., NEs are replaced with symbols) and named entity translation (for date, time, and number entities)
- Instructions (perl is required. Also, cygwin is required for Windows users)
$> # English preprocessing
$> perl NiuTrans-running-segmenter.pl \
-lang en \
-input ../work/preprocessing/english.clean.txt \
-output ../work/preprocessing/english.clean.txt.prepro \
-method 01
For the NiuTrans-running-segmenter.pl script:
"-lang" specifies the language. For English, the value of "-lang" should be "en"
"-input" specifies the input file (one sentence per line)
"-output" specifies the output file
"-method" specifies the options used in NE recognition and generalization. When "00" is set, no NEs are generalized as symbols and no NE translations are provided; when "01" is set, the number, time and date entities are generalized as symbols but no translations are provided for these entities; when "11" is set, all number, time and date entities are generalized as symbols and translations are provided.
- Output: a file is generated in "/NiuTrans/work/preprocessing"
- english.clean.txt.prepro # the result file
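The Chinese and English sides of the bilingual data go through the same script with the same "-method" value; only "-lang" and the file names change. The sketch below prints (but does not run) both command lines so that they can be reviewed first. The flags are the ones documented above; the function name segmenter_cmd and the choice of appending ".prepro" to the input name are illustrative conventions taken from the examples, not requirements of the tool.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the segmenter command for one file.
# Arguments: language code ("ch" or "en"), input file, "-method" value.
segmenter_cmd() {
    lang=$1; input=$2; method=$3
    printf 'perl NiuTrans-running-segmenter.pl -lang %s -input %s -output %s.prepro -method %s\n' \
        "$lang" "$input" "$input" "$method"
}

segmenter_cmd ch ../work/preprocessing/chinese.clean.txt 01
segmenter_cmd en ../work/preprocessing/english.clean.txt 01
```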
- Instructions (perl is required. Also, cygwin is required for Windows users)
$> cd NiuTrans/scripts/
$> perl NiuTrans-dev-merge.pl \
source-sentence file \
reference-translation file1 \
reference-translation file2 \
reference-translation file3 \
reference-translation file4 \
... \
> ../work/preprocessing/dev.txt
- Note: the development data file has the following format: one line containing a source sentence, followed by a blank line, followed by its reference translations (one per line), followed by the next source sentence, and so forth.
- Output: a file is generated in "/NiuTrans/work/preprocessing"
- dev.txt # the development data file
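To make the layout described in the note concrete, the following sketch builds a tiny file in that shape from one source file and two reference files. It only illustrates the format as worded in the note and is not how NiuTrans-dev-merge.pl is implemented; the file names and sentences are made up, and the real script should be used for actual data.

```shell
#!/bin/sh
# Illustration of the layout only; NiuTrans-dev-merge.pl is the supported tool.
demo_dir=$(mktemp -d)
printf '%s\n' 'source 1' 'source 2' > "$demo_dir/src.txt"
printf '%s\n' 'ref A1' 'ref A2'     > "$demo_dir/ref1.txt"
printf '%s\n' 'ref B1' 'ref B2'     > "$demo_dir/ref2.txt"

# paste takes one line from each file in turn and joins them with newlines;
# /dev/null contributes the empty line after each source sentence.
paste -d '\n' "$demo_dir/src.txt" /dev/null \
      "$demo_dir/ref1.txt" "$demo_dir/ref2.txt" > "$demo_dir/dev.txt"

cat "$demo_dir/dev.txt"
```

Each source sentence is followed by a blank line and then by its two references, after which the next source sentence begins.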
- Download: http://code.google.com/p/giza-pp/downloads/detail?name=giza-pp-v1.0.7.tar.gz and move "giza-pp-v1.0.7.tar.gz" to "/NiuTrans/tools"
- Instructions (perl is required)
$> cd NiuTrans/tools/
$> tar xzvf giza-pp-v1.0.7.tar.gz
$> cd giza-pp
$> make
$> # copy "GIZA++", "snt2cooc.out", "plain2snt.out" and "mkcls" to "/NiuTrans/bin"
$> cp GIZA++-v2/GIZA++ \
GIZA++-v2/snt2cooc.out \
GIZA++-v2/plain2snt.out \
mkcls-v2/mkcls \
../../bin
$> cd ../../scripts
$> mkdir ../work/wordalignment/ -p
$> nohup nice perl NiuTrans-running-GIZA++.pl \
-src ../work/preprocessing/chinese.clean.txt.prepro \
-tgt ../work/preprocessing/english.clean.txt.prepro \
-out ../work/wordalignment/alignment.txt \
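-tmpdir ../work/wordalignment/ &
For the NiuTrans-running-GIZA++.pl script:
"-src" specifies the source-sentence file
"-tgt" specifies the target-sentence file
"-out" specifies the output file for the word-alignment result
"-tmpdir" specifies the directory for temporary (intermediate) files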
- Output: a new file is generated in "/NiuTrans/work/wordalignment":
- alignment.txt # word-alignment file
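Because the alignment job runs in the background under nohup, a missing binary may only show up later in the log. A quick pre-flight check of the four tools copied to "/NiuTrans/bin" can catch this before the job is launched. The sketch below is an assumption-laden convenience, not part of NiuTrans: the function name missing_giza_tools is hypothetical, and the demo uses a throwaway directory in place of the real bin directory.

```shell
#!/bin/sh
# Hypothetical pre-flight check: list the GIZA++ tools that are not
# present as executable files in the given directory.
missing_giza_tools() {
    for tool in GIZA++ snt2cooc.out plain2snt.out mkcls; do
        [ -x "$1/$tool" ] || echo "$tool"
    done
}

# Demonstration with a throwaway directory standing in for NiuTrans/bin,
# in which only two of the four tools have been "installed".
demo_bin=$(mktemp -d)
for tool in GIZA++ mkcls; do
    : > "$demo_bin/$tool"
    chmod +x "$demo_bin/$tool"
done

missing_giza_tools "$demo_bin"
```

From "NiuTrans/scripts", the real check would be missing_giza_tools ../bin, and an empty result means all four tools are in place.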