Tatoeba Challenge - released test data

This directory contains the released test data sets prepared for the Tatoeba MT Challenge. Releases are more or less incremental: later releases also contain the examples from earlier ones.

The data files use the following naming convention:

test/<VERSION_NUMBER>/tatoeba-test-<VERSION_NUMBER>.<SRC>-<TRG>.txt

Each file is TAB-separated, with one column each for:

  • source language code
  • target language code
  • source language text
  • target language text
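
The four columns above could be read along these lines. This is a sketch rather than an official loader: the names `SentencePair` and `parse_pairs` are made up for illustration, and it assumes UTF-8 files with no header row and skips any line that does not have exactly four fields.

```python
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    """One line of a test file (hypothetical container, not part of the release)."""
    src_lang: str
    trg_lang: str
    src_text: str
    trg_text: str


def parse_pairs(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Yield one SentencePair per well-formed TAB-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # assumption: lines with another column count are skipped
        yield SentencePair(*fields)


# Usage, with the UTF-8 encoding being an assumption:
# with open("v2021-08-07/tatoeba-test-v2021-08-07.eng-hbs.txt", encoding="utf-8") as handle:
#     for pair in parse_pairs(handle):
#         print(pair.trg_lang, pair.trg_text)
```

Splitting on TAB directly, instead of using a CSV parser, keeps quotation marks inside the sentences untouched.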

Where available, the directories also contain subsets of the released language pairs that cover individual sub-languages or language variants. For example, the released test data set for eng-hbs includes the examples for individual languages such as eng-hrv and eng-bos_Latn, as well as the variants of Serbian in different scripts, eng-srp_Latn and eng-srp_Cyrl:

v2021-08-07/tatoeba-test-v2021-08-07.eng-hbs.txt
v2021-08-07/tatoeba-test-v2021-08-07.eng-bos_Latn.txt
v2021-08-07/tatoeba-test-v2021-08-07.eng-srp_Cyrl.txt
v2021-08-07/tatoeba-test-v2021-08-07.eng-srp_Latn.txt

Note that all sentence pairs from the last three files are also contained in the first one.
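
One way to check this containment is to compare the sets of sentence pairs. The sketch below uses hypothetical helper names and compares only the two text columns, on the assumption that the sentence text is identical in both files; whether the language-code columns also match between the files is not stated here.

```python
from typing import Iterable, Set, Tuple


def text_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (source text, target text) from TAB-separated lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 4:  # assumption: skip lines without four columns
            pairs.add((fields[2], fields[3]))
    return pairs


def missing_from_superset(subset_lines: Iterable[str],
                          superset_lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Return the pairs of the subset file that the superset file lacks."""
    return text_pairs(subset_lines) - text_pairs(superset_lines)


# Usage, with the UTF-8 encoding being an assumption:
# with open("v2021-08-07/tatoeba-test-v2021-08-07.eng-srp_Cyrl.txt", encoding="utf-8") as sub, \
#      open("v2021-08-07/tatoeba-test-v2021-08-07.eng-hbs.txt", encoding="utf-8") as sup:
#     print(len(missing_from_superset(sub, sup)))
```

An empty result means every pair in the subset file also appears in the larger file.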