Use Your Own Data Linux Only (for Chinese-English and English-Chinese translation tasks)

liyinqiao edited this page May 14, 2018 · 1 revision

Overview of data preprocessing for bilingual data, development data, test data and monolingual data (for language model training)

Here is a quick overview of the main steps in data preprocessing. Detailed descriptions of each step are presented in the following sections.

  • Processing Bilingual Data
	1: Use NiuTrans-clear.illegal.char.pl to clean the data (see Step 2 for more details).
	
	2: Use the data preprocessing script "NiuTrans-running-segmenter.pl" to process the Chinese and English sentence files separately
	(see Steps 4 and 5 for more details).

	NOTE: for bilingual data, it is suggested to turn on the NE generalization option
	(i.e., replace number, time and date entities with symbols) and turn off the NE translation option.
	In other words, set "-method" in NiuTrans-running-segmenter.pl to "01".

	3: Use a word-alignment tool to generate word-to-word alignments for the bilingual data (see Step 7 for more details).
  • Processing Development Data
	1: Process the source language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "11" to generalize 
	and translate number, time and date NEs (see Steps 4 and 5 for more details).

	2: Process the target language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "00" to disable 
	the NE symbol generalization and translation functions (see Steps 4 and 5 for more details).

	3: Generate the development data file (see Step 6)
  • Processing Test Data
	1: Process the source language sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "11" to generalize and 
	translate number, time and date NEs (see Steps 4 and 5 for more details). This is in principle the same as the first step 
	used in "Processing Development Data".
  • Processing Monolingual Data (for language model training)
	1: Merge the target-side of the bilingual data and additional (large-scale) target-language data

	2: Process these sentences with "NiuTrans-running-segmenter.pl", and set "-method" to "01" to generalize number, 
	time and date NEs without translating them (see Steps 4 and 5 for more details).
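
The merge in step 1 of "Processing Monolingual Data" can be sketched as a plain concatenation. This is a minimal sketch, not part of NiuTrans: the function name `merge_mono_data` and the file names in the usage comment are hypothetical, and it assumes every input file holds one sentence per line and ends with a newline.

```shell
# Hypothetical helper (not part of NiuTrans): concatenate the target side of
# the bilingual data with additional target-language files into one file.
# Usage: merge_mono_data OUTPUT INPUT...
merge_mono_data() {
    out=$1
    shift
    # Assumes one sentence per line and a trailing newline in every input.
    cat "$@" > "$out"
}

# Example (file names are placeholders):
#   merge_mono_data ../work/preprocessing/english.mono.txt \
#       ../work/preprocessing/english.clean.txt extra-english.txt
```

The merged file is then the input to the "-method 01" preprocessing run described in step 2.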

1.Sample Data

  • NiuTrans allows users to build a machine translation system from their own bilingual data (currently for Chinese-English and English-Chinese translation tasks only). Here we provide 1K sample sentence pairs to show how to prepare your own data for running NiuTrans. The sample data is located in "NiuTrans/sample-data/sample-submission-version" in the NiuTrans package.

    sample-submission-version/
      -- Raw-data/                 # Original bilingual dataset (1K sentence pairs)
         -- chinese.txt            # source sentences
         -- english.txt            # target sentences

  • Format: please unpack "NiuTrans/sample-data/sample.tar.gz" and check "description-of-the-sample-data" for more information about data format.

2.Cleaning Bilingual Training Data

  • Function: remove meaningless markup and characters from Chinese-English sentence pairs

  • Instructions (cygwin is required for Windows users)

$> cd NiuTrans/sample-data/
$> tar xzvf sample.tar.gz                      # skip this step if sample.tar.gz has already been unpacked
$> cd ../scripts/
$> mkdir -p ../work/preprocessing
$> perl NiuTrans-clear.illegal.char.pl \
        -src    ../sample-data/sample-submission-version/Raw-data/chinese.raw.txt \
        -tgt    ../sample-data/sample-submission-version/Raw-data/english.raw.txt \
        -outSrc ../work/preprocessing/chinese.clean.txt \
        -outTgt ../work/preprocessing/english.clean.txt

**For the NiuTrans-clear.illegal.char.pl script:**

"-src" specifies the Chinese sentence file (input)

"-tgt" specifies the English sentence file (input)

"-outSrc" specifies the cleaned Chinese sentence file (output)

"-outTgt" specifies the cleaned English sentence file (output)

  • Output: two files are generated in "/NiuTrans/work/preprocessing"
- chinese.clean.txt                            # Chinese
- english.clean.txt                            # English
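
Before moving on, it can be useful to confirm that the two cleaned files still line up. The following is a minimal sketch, not part of NiuTrans: the function name `check_parallel` is hypothetical, and it assumes the cleaned files are meant to stay line-aligned (line N of the Chinese file pairs with line N of the English file), which this page does not state explicitly.

```shell
# Hypothetical helper (not part of NiuTrans): succeed only if two files have
# the same number of lines, so that line N of one can pair with line N of the other.
# Usage: check_parallel SOURCE_FILE TARGET_FILE
check_parallel() {
    # $(( ... )) normalizes the whitespace some wc implementations print.
    src_lines=$(( $(wc -l < "$1") ))
    tgt_lines=$(( $(wc -l < "$2") ))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1 has $src_lines, $2 has $tgt_lines" >&2
        return 1
    fi
}

# Example, run from NiuTrans/scripts/:
#   check_parallel ../work/preprocessing/chinese.clean.txt \
#                  ../work/preprocessing/english.clean.txt
```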

3.Cleaning Monolingual Data

  • Function: remove meaningless markup and characters from Chinese or English monolingual data

  • Instructions (perl is required. Also, cygwin is required for Windows users)

$> cd NiuTrans/sample-data/
$> tar xzvf sample.tar.gz                      # skip this step if sample.tar.gz has already been unpacked
$> cd ../scripts/
$> mkdir -p ../work/preprocessing
$> perl NiuTrans-monolingual.clear.illegal.char.pl \
		-tgt    ../sample-data/sample-submission-version/Raw-data/chinese.raw.txt \
		-outTgt ../work/preprocessing/chinese.mono.clean.txt \
		-lang zh

**For the NiuTrans-monolingual.clear.illegal.char.pl script:**

"-tgt" specifies the input file (one sentence per line)

"-outTgt" specifies the output file

"-lang" specifies the language: "zh" = Chinese, "en" = English

  • Output: a file is generated in "/NiuTrans/work/preprocessing"
- chinese.mono.clean.txt                            # the result file

4.Data Preprocessing for Chinese

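  • Function: Chinese word segmentation, named entity (NE) recognition, named entity generalization (i.e., NEs are replaced with symbols) and named entity translation (for date, time, and number entities)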

  • Instructions (perl is required. Also, cygwin is required for Windows users)

	$> perl NiuTrans-running-segmenter.pl \
		-lang   ch \
		-input  ../work/preprocessing/chinese.clean.txt \
		-output ../work/preprocessing/chinese.clean.txt.prepro \
		-method 01

**For the NiuTrans-running-segmenter.pl script:**

"-lang" specifies the language. For Chinese, the value of "-lang" should be "ch".

"-input" specifies the input file (one sentence per line).

"-output" specifies the output file.

"-method" specifies the options used in NE recognition and generalization. When "00" is set, no NEs are generalized as symbols and no NE translations are provided; when "01" is set, number, time and date entities are generalized as symbols but no translations are provided for these entities; when "11" is set, number, time and date entities are generalized as symbols and translations are provided.

  • Output: a file is generated in "/NiuTrans/work/preprocessing"
- chinese.clean.txt.prepro                     # the result file
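
The "-method" values recommended in the overview for each kind of data can be collected in one small lookup. This is a minimal sketch, not part of NiuTrans: the function name `method_for_role` and the role labels (`bilingual`, `dev-source`, and so on) are hypothetical names chosen for illustration; only the values "00", "01" and "11" come from this page.

```shell
# Hypothetical helper (not part of NiuTrans): print the "-method" value that
# the overview recommends for a given kind of data.
# Usage: method_for_role ROLE
method_for_role() {
    case $1 in
        bilingual|monolingual)  echo 01 ;;  # generalize NEs, no translation
        dev-source|test-source) echo 11 ;;  # generalize and translate NEs
        dev-target)             echo 00 ;;  # no generalization, no translation
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Example:
#   perl NiuTrans-running-segmenter.pl -lang ch \
#       -input  ../work/preprocessing/chinese.clean.txt \
#       -output ../work/preprocessing/chinese.clean.txt.prepro \
#       -method "$(method_for_role bilingual)"
```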

5.Data Preprocessing for English

  • Function: English tokenization, named entity (NE) recognition, named entity generalization (i.e., NEs are replaced with symbols) and named entity translation (for date, time, and number entities)

  • Instructions (perl is required. Also, cygwin is required for Windows users)

	# English preprocessing
	$> perl NiuTrans-running-segmenter.pl \
		-lang   en \
		-input  ../work/preprocessing/english.clean.txt \
		-output ../work/preprocessing/english.clean.txt.prepro \
		-method 01

**For the NiuTrans-running-segmenter.pl script:**

"-lang" specifies the language. For English, the value of "-lang" should be "en".

"-input" specifies the input file (one sentence per line).

"-output" specifies the output file.

"-method" specifies the options used in NE recognition and generalization. When "00" is set, no NEs are generalized as symbols and no NE translations are provided; when "01" is set, number, time and date entities are generalized as symbols but no translations are provided for these entities; when "11" is set, number, time and date entities are generalized as symbols and translations are provided.

  • Output: a file is generated in "/NiuTrans/work/preprocessing"
- english.clean.txt.prepro                     # the result file

6.Generation of Development Data File

  • Instructions (perl is required. Also, cygwin is required for Windows users)

$> cd NiuTrans/scripts/
$> perl NiuTrans-dev-merge.pl \
		source-sentence-file \
		reference-translation-file1 \
		reference-translation-file2 \
		reference-translation-file3 \
		reference-translation-file4 \
		... \
		> ../work/preprocessing/dev.txt

  • Note: the development data file has the following format: one source sentence on a line, followed by a blank line, followed by its reference translations (one per line), then the next source sentence, and so forth.

  • Output: a file is generated in "/NiuTrans/work/preprocessing"

	- dev.txt                                          # the development data file
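
The layout described in the note above can be illustrated with standard tools. This is only a sketch of the file layout, not a reimplementation of NiuTrans-dev-merge.pl: the function name `make_dev_layout` is hypothetical, and it assumes every input file has one sentence per line and that all files have the same number of lines.

```shell
# Hypothetical helper (not part of NiuTrans): interleave a source file and
# its reference files into "source line, blank line, one line per reference".
# Usage: make_dev_layout SOURCE_FILE REFERENCE_FILE...
make_dev_layout() {
    src=$1
    shift
    # paste takes one line from each file in turn and joins them with the
    # newline delimiter; /dev/null contributes the empty line after the source.
    paste -d '\n' "$src" /dev/null "$@"
}

# Example (file names are placeholders):
#   make_dev_layout dev.source.txt dev.ref1.txt dev.ref2.txt
```

Comparing the output of this sketch with the real dev.txt on a few sentences is a quick way to check that the reference files were passed in the intended order.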

7.Word Alignment

$> cd NiuTrans/tools/
$> tar xzvf giza-pp-v1.0.7.tar.gz
$> cd giza-pp
$> make
# copy "GIZA++", "snt2cooc.out", "plain2snt.out" and "mkcls" to "/NiuTrans/bin"
$> cp GIZA++-v2/GIZA++ \
		GIZA++-v2/snt2cooc.out \
		GIZA++-v2/plain2snt.out \
		mkcls-v2/mkcls \
		../../bin
$> cd ../../scripts
$> mkdir -p ../work/wordalignment/
$> nohup nice perl NiuTrans-running-GIZA++.pl \
		-src    ../work/preprocessing/chinese.clean.txt.prepro \
		-tgt    ../work/preprocessing/english.clean.txt.prepro \
		-out    ../work/wordalignment/alignment.txt \
		-tmpdir ../work/wordalignment/ &

"-src" specifies the file of source sentences.

"-tgt" specifies the file of target sentences.

"-out" specifies the output file for the word-alignment result.

"-tmpdir" specifies the directory for temporary files.

  • Output: a new file is generated in "/NiuTrans/work/wordalignment":
- alignment.txt                                # word-alignment file
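
Because the alignment command above is started in the background with "nohup ... &", the output file is not complete when the prompt returns. The following is a minimal sketch for waiting on the job from the same shell session and checking that it produced output. It is not part of NiuTrans, and the function name `wait_for_output` is hypothetical.

```shell
# Hypothetical helper (not part of NiuTrans): wait for a background job
# started from this shell, then confirm that it left a non-empty output file.
# Usage: wait_for_output PID FILE
wait_for_output() {
    # wait returns the exit status of the background job.
    wait "$1" || { echo "job $1 exited with an error" >&2; return 1; }
    # -s is true only if the file exists and is not empty.
    [ -s "$2" ] || { echo "$2 is missing or empty" >&2; return 1; }
}

# Example, right after the "nohup nice perl ... &" command ($! holds its PID):
#   wait_for_output $! ../work/wordalignment/alignment.txt
```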