Below you will find the data and nmtpy configurations for LIUM's WMT17 News Translation systems (see the paper cited below):
```bibtex
@InProceedings{garciamartinez-EtAl:2017:WMT,
  author    = {Garc\'{i}a-Mart\'{i}nez, Mercedes and Caglayan, Ozan and Aransa, Walid and Bardet, Adrien and Bougares, Fethi and Barrault, Lo\"{i}c},
  title     = {LIUM Machine Translation Systems for WMT17 News Translation Task},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {288--295},
  url       = {http://www.aclweb.org/anthology/W17-4726.pdf}
}
```
(Note: the Turkish side of the corpora below is tokenized with a slightly modified version of the Moses tokenizer that handles apostrophes correctly for Turkish.)
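For illustration, here is a minimal sketch of the kind of rule such a modification involves; this is an assumption about its behavior, not the actual tokenizer code. In Turkish, an apostrophe attaches an inflectional suffix to a proper noun, so the split should keep the apostrophe with the suffix:

```python
import re

def split_turkish_apostrophe(text):
    """Keep the apostrophe attached to the following suffix:
    "Türkiye'nin" -> "Türkiye 'nin". A generic rule that splits after
    the apostrophe ("Türkiye' nin") would detach the suffix from its
    marker. Illustrative sketch only; handles nothing but apostrophes."""
    return re.sub(r"(\w)'(\w)", r"\1 '\2", text)

print(split_turkish_apostrophe("Türkiye'nin başkenti Ankara'dır."))
# -> Türkiye 'nin başkenti Ankara 'dır.
```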
- Download (13M) our normalized/tokenized/length-filtered version of the officially provided SETIMES2 corpus (~200K sentences); a sketch of the kind of length filtering involved follows this list.
- Download the joint BPE codes (16K merge operations) trained on the bitext; a usage sketch follows this list.
- The exact incremental subsamples (150K, 700K, 1M and 1.7M, the last being ~all of news2016) of the parallel back-translation corpora used in the paper. The target (TR) sides are sampled from the monolingual Turkish data news.2016.shuffled; the source sides were obtained by translating those sentences into EN with a single TR->EN NMT system (~14 BLEU on newstest2016). A subsampling sketch follows this list.
- Ready-to-use BPE-ized subsamples as used in the paper (cf. Table 3):
  - (System B0) BPE-ized, SETIMES2-200K only (~200K sentences total) (14M)
  - (System B1) BPE-ized, BT-1M only (~1M sentences total) (58M)
  - (System B2) BPE-ized, SETIMES2-200K + BT-150K (~350K sentences total) (21M)
  - (System B4) BPE-ized, SETIMES2-200K + BT-700K (~900K sentences total) (51M)
  - (System B6) BPE-ized, SETIMES2-200K + BT-1M (~1.2M sentences total) (72M)
  - (System B8) BPE-ized, SETIMES2-200K + BT-1.7M (~1.9M sentences total) (112M)
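For reference, a minimal sketch of the kind of length filtering mentioned for the SETIMES2 download, in the style of Moses' clean-corpus-n.perl; all thresholds here are illustrative assumptions, not the values used to produce the release:

```python
def keep_pair(src_line, tgt_line, min_len=1, max_len=80, max_ratio=3.0):
    """Drop sentence pairs that are empty, overly long, or badly
    length-mismatched. Thresholds are illustrative assumptions."""
    n_src = len(src_line.split())
    n_tgt = len(tgt_line.split())
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio
```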
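To segment your own (already tokenized) text with the downloaded joint BPE codes, here is a sketch using the subword-nmt package; the file name bpe.codes is an assumption, and the same joint codes should be applied to both the EN and TR sides:

```python
# pip install subword-nmt
from subword_nmt.apply_bpe import BPE

# 'bpe.codes' stands in for the joint 16K-merge codes file from above.
with open('bpe.codes', encoding='utf-8') as codes_file:
    bpe = BPE(codes_file)

# Input must already be tokenized (see the tokenization note above).
# Prints the sentence as '@@'-joined subword units; the actual splits
# depend on the learned merges.
print(bpe.process_line("Türkiye 'nin başkenti Ankara 'dır ."))
```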
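Finally, the back-translation subsamples above are described as incremental. A sketch of one way to produce nested subsamples, assuming incremental means each smaller sample is contained in the larger ones; the file names, seed, and shuffle-then-prefix strategy are also assumptions, not the exact procedure from the paper:

```python
import random

SIZES = (150_000, 700_000, 1_000_000, 1_700_000)  # sizes from the paper

random.seed(0)  # hypothetical seed
with open('news.2016.shuffled.tr', encoding='utf-8') as f:
    lines = f.readlines()
random.shuffle(lines)

# Nested prefixes of a single shuffled order guarantee that the 150K
# sample is a subset of the 700K sample, and so on.
for size in SIZES:
    with open(f'bt-sample.{size}.tr', 'w', encoding='utf-8') as out:
        out.writelines(lines[:size])
```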