corpus/Makefile.wordalign includes the batch commands for running word alignment on OPUS corpora and extracting phrase translation tables from the aligned data. Start by creating a subdirectory wordalign in your corpus directory (OPUSHOME/corpus/<mycorpus>/wordalign) that contains a makefile with the following content:
# -*-makefile-*-
#
all:
	${MAKE} all-wordalign
include ../Makefile.def
include ../../Makefile.def
include ../../Makefile.wordalign
All commands must be run from inside the wordalign directory.
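The setup described above could be scripted roughly as follows. This is a minimal sketch, not part of the OPUS tools: the file name Makefile, the fallback value for OPUSHOME, and the CORPUS variable are assumptions made for illustration. The recipe line is written with printf so that it starts with a real tab character, which make requires.

```shell
#!/bin/sh
# Sketch: create the wordalign subdirectory and write its makefile.
# OPUSHOME and CORPUS are assumed names; adjust them to your installation.
OPUSHOME="${OPUSHOME:-$PWD/OPUS}"
CORPUS="${CORPUS:-mycorpus}"
dir="$OPUSHOME/corpus/$CORPUS/wordalign"

mkdir -p "$dir"

# Header and default target.
printf '%s\n' '# -*-makefile-*-' '#' 'all:' > "$dir/Makefile"
# Recipe line: must begin with a tab, hence the explicit \t.
printf '\t%s\n' '${MAKE} all-wordalign' >> "$dir/Makefile"
# Shared definitions and the word alignment targets.
printf '%s\n' \
    'include ../Makefile.def' \
    'include ../../Makefile.def' \
    'include ../../Makefile.wordalign' >> "$dir/Makefile"
```

After running it, change into the new directory before calling any of the make targets described here.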
Run all word alignments and phrase-table extractions for all language pairs:
make wordalign-all
make cleanup-all
Submit jobs for all language pairs, running alignment and phrase-table extraction as separate steps:
make submit-all
ATTENTION: This may create many jobs!
All data files must be converted to XML, sentence aligned and tokenized (in OPUSHOME/corpus/<mycorpus>/xml). Sentence alignment is stored in XCES align format and follows the naming conventions in OPUS.
The following commands start the complete alignment and extraction process for a selected language pair (German-English in the example below) or for all language pairs:
make SRC=de TRG=en wordalign
: align & extract phrase-table for "de-en"
make wordalign-all
: run wordalign for all language pairs
make cleanup-all
: cleanup for all language pairs
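To process a handful of language pairs rather than one or all of them, the single-pair command can be wrapped in a loop. The sketch below is an illustration only: the function name wordalign_pairs and the MAKE_CMD override (useful for previewing the calls with echo) are hypothetical, and it assumes that language codes themselves contain no hyphen.

```shell
#!/bin/sh
# Sketch: run "make SRC=.. TRG=.. wordalign" for each given pair (e.g. de-en).
# MAKE_CMD is a hypothetical override; it defaults to make.
wordalign_pairs() {
    for pair in "$@"; do
        src=${pair%%-*}   # text before the first hyphen
        trg=${pair#*-}    # text after the first hyphen
        ${MAKE_CMD:-make} SRC="$src" TRG="$trg" wordalign || return 1
    done
}

# Example (run inside the wordalign directory):
#   wordalign_pairs de-en fr-en
```

The loop stops at the first pair for which make fails, so a broken bitext does not go unnoticed.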
You can run individual steps for one bitext (set SRC and TRG like above). You must run the commands in the order given below!
make wordalign-prepare
: convert and preprocess bitext
make wordalign-align
: run word-aligner
make wordalign-pt
: extract phrase-table
make wordalign-filter
: significance filtering
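Because the order of the steps matters, it can help to run them through a small wrapper that stops as soon as one step fails. The following is a sketch under assumptions: the function name run_wordalign_steps and the MAKE_CMD override are hypothetical and not part of the OPUS makefiles; only the four target names come from the list above.

```shell
#!/bin/sh
# Sketch: run the four steps for one bitext in the documented order.
# MAKE_CMD is a hypothetical override; it defaults to make.
run_wordalign_steps() {
    src=$1
    trg=$2
    for step in prepare align pt filter; do
        ${MAKE_CMD:-make} SRC="$src" TRG="$trg" "wordalign-$step" || {
            # Report the failing step and do not continue with later ones.
            echo "wordalign-$step failed for $src-$trg" >&2
            return 1
        }
    done
}

# Example (run inside the wordalign directory):
#   run_wordalign_steps de en
```

Setting MAKE_CMD=echo prints the make arguments instead of running them, which gives a quick check of the order before starting a long alignment.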
Start jobs on an HPC cluster for a selected language pair (set the language pair with SRC and TRG):
make submit-wordalign
This triggers all of the sub-tasks below in sequence: each task submits the job for the next one when it finishes. Each sub-task may require different resources (in terms of memory or walltime).
make submit-wordalign-prepare
make submit-wordalign-align
make submit-wordalign-pt
make submit-wordalign-filter
Start jobs on an HPC cluster for all language pairs:
make submit-all
: run ACTION (default=wordalign) for all bitexts
make submit-all-prepare
: convert & preprocess all bitexts
make submit-all-align
: word-align all bitexts
make submit-all-pt
: extract all phrase tables
make submit-all-filter
: significance filtering for all bitexts