Commit

corrections to the README file
jorgtied committed Oct 9, 2023
1 parent 8b14a10 commit 2b57ae8
Showing 5 changed files with 29 additions and 34 deletions.
7 changes: 4 additions & 3 deletions Makefile
@@ -372,7 +372,7 @@ release-tag:
@echo "* [Test data](${TATOEBA_DATAURL}-devtest/test-${VERSION}.tar) (${VERSION})" >> ${DATADIR}/Releases.md
@echo "* [Development data](${TATOEBA_DATAURL}-devtest/dev-${VERSION}.tar) (${VERSION})" >> ${DATADIR}/Releases.md
@echo "* [Bilingual training data](README-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
@echo "* [Extra bilingual training data](data/subsets/NoTestData-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
@echo "* [Extra bilingual training data](subsets/NoTestData-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
@echo "" >> ${DATADIR}/Releases.md
git add ${TESTDATADIR}/*/*.txt
git add ${DEVDATADIR}/*/*.txt
@@ -411,14 +411,15 @@ README.md: README.template ${TESTDATADIR}-${VERSION} ${DEVDATADIR}-${VERSION}
-e 's/%%TRAINSET_RELEASE%%/${TRAINSET_VERSION}/g' \
-e 's/%%EXTRATRAINSET_RELEASE%%/${EXTRATRAINSET_VERSION}/g' \
-e 's/%%MONO_RELEASE%%/${MONO_VERSION}/g' \
-e 's/%%NR_BITEXTS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | wc -l | numfmt --grouping}/' \
-e 's/%%NR_BITEXTS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | cut -f10 | grep . | wc -l | numfmt --grouping}/' \
-e 's/%%NR_LANGPAIRS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | wc -l | numfmt --grouping}/' \
-e 's/%%NR_LANGS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | cut -f1 | tr '-' ' ' | tr ' ' "\n" | sort -u | wc -l}/' \
-e 's/%%NR_TEST_LANGPAIRS%%/${shell cat ${RELEASEDIR}/released-bitexts-min200.txt | wc -l}/' \
-e 's/%%NR_TEST_LANGS%%/${shell cut -f1 ${RELEASEDIR}/released-bitexts-min200.txt | tr '-' ' ' | tr ' ' "\n" | sort -u | wc -l}/' \
-e 's/%%TRAIN_SIZE%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | awk "{SUM+=\$$10}END{print SUM}" | numfmt --to=si}/' \
< $< > $@
tail -n +2 ${RELEASEDIR}/released-bitexts.txt | awk "{SUM+=\$$10}END{print SUM}" | numfmt --to=si

cp $@ README-${VERSION}.md
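
The substitutions in this recipe derive the README counts from `released-bitexts.txt`. As a rough illustration of what each pipeline computes, here is a minimal Python sketch. It assumes the file is tab-separated with one header row, the language pair (e.g. `deu-eng`) in column 1 and the training-set size in column 10; that layout is inferred from the `cut -f1`, `cut -f10` and `$10` references in the recipe, not confirmed. The function name `release_stats` and the sample rows are hypothetical, and the `numfmt` formatting step is left out.

```python
def release_stats(lines):
    """Compute the README placeholder values from tab-separated rows.

    Assumed layout (not confirmed by the source): first line is a header,
    column 1 holds the language pair, column 10 holds the training size.
    """
    # Skip the header row, like `tail -n +2`.
    rows = [line.rstrip("\n").split("\t") for line in lines][1:]

    # Take column 10 of every row, like `cut -f10` (empty if missing).
    sizes = [row[9] if len(row) > 9 else "" for row in rows]

    # Split every language pair on '-' and collect unique language codes,
    # like `cut -f1 | tr '-' ' ' | tr ' ' "\n" | sort -u`.
    langs = set()
    for row in rows:
        langs.update(row[0].split("-"))

    return {
        # Every data row is one language pair.
        "NR_LANGPAIRS": len(rows),
        # Only rows with a non-empty column 10 count as bitexts (`grep .`).
        "NR_BITEXTS": sum(1 for s in sizes if s),
        "NR_LANGS": len(langs),
        # Sum of column 10, before any `numfmt --to=si` formatting.
        "TRAIN_SIZE": sum(int(s) for s in sizes if s),
    }


def make_row(pair, size):
    """Build a hypothetical 10-column row with only columns 1 and 10 filled."""
    return "\t".join([pair] + [""] * 8 + [size])


# Hypothetical sample: a header plus three language pairs, one without training data.
sample = [
    make_row("langpair", "train-size"),
    make_row("deu-eng", "1000"),
    make_row("eng-fin", "500"),
    make_row("fin-sme", ""),
]

print(release_stats(sample))
```

In this sample the pair without training data still counts as a language pair but not as a bitext, which is the distinction the commit introduces between `%%NR_LANGPAIRS%%` and `%%NR_BITEXTS%%`.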



18 changes: 8 additions & 10 deletions README-v2023-09-26.md
@@ -2,14 +2,13 @@

# The Tatoeba Translation Challenge (v2023-09-26)

This is a challenge set for machine translation that contains 32G translation units in 4,024 bitexts covering 487 languages. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages.
This is a challenge set for machine translation that contains 32G translation units in 2,539 bitexts. The whole data set covers 487 languages linked to each other in 4,024 language pairs. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).

* Benchmark for realistic low-resource scenarios
* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
* [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)

[![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)

@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains 32G translation un
* [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md), language-pair specific downloads
* [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
* [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
* [Release history](data/Releases.md)
* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
* [Automatically translated monolingual data](data/Backtranslations.md)
* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)

The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.

@@ -62,7 +60,7 @@ Please, cite the following paper if you use data and models from this distributi

## Data releases

The current release includes data for 4,024 language pairs covering 487 languages.
The current release includes data for 2,539 language pairs covering 487 languages.
The data sets are released per language pair with the following structure (using deu-eng as an example):

```
18 changes: 8 additions & 10 deletions README.md
@@ -2,14 +2,13 @@

# The Tatoeba Translation Challenge (v2023-09-26)

This is a challenge set for machine translation that contains 32G translation units in 4,024 bitexts covering 487 languages. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages.
This is a challenge set for machine translation that contains 32G translation units in 2,539 bitexts. The whole data set covers 487 languages linked to each other in 4,024 language pairs. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).

* Benchmark for realistic low-resource scenarios
* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
* [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)

[![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)

@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains 32G translation un
* [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md), language-pair specific downloads
* [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
* [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
* [Release history](data/Releases.md)
* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
* [Automatically translated monolingual data](data/Backtranslations.md)
* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)

The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.

@@ -62,7 +60,7 @@ Please, cite the following paper if you use data and models from this distributi

## Data releases

The current release includes data for 4,024 language pairs covering 487 languages.
The current release includes data for 2,539 language pairs covering 487 languages.
The data sets are released per language pair with the following structure (using deu-eng as an example):

```
16 changes: 7 additions & 9 deletions README.template
@@ -2,14 +2,13 @@

# The Tatoeba Translation Challenge (%%RELEASE%%)

This is a challenge set for machine translation that contains %%TRAIN_SIZE%% translation units in %%NR_BITEXTS%% bitexts covering %%NR_LANGS%% languages. The package includes a release of %%NR_TEST_LANGPAIRS%% test sets derived from [Tatoeba.org](https://tatoeba.org) that cover %%NR_TEST_LANGS%% languages.
This is a challenge set for machine translation that contains %%TRAIN_SIZE%% translation units in %%NR_BITEXTS%% bitexts. The whole data set covers %%NR_LANGS%% languages linked to each other in %%NR_LANGPAIRS%% language pairs. The package includes a release of %%NR_TEST_LANGPAIRS%% test sets derived from [Tatoeba.org](https://tatoeba.org) that cover %%NR_TEST_LANGS%% languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).

* Benchmark for realistic low-resource scenarios
* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
* [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)

[![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)

@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains %%TRAIN_SIZE%% tra
* [Extra bilingual training data](data/subsets/NoTestData-%%EXTRATRAINSET_RELEASE%%.md), language-pair specific downloads
* [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
* [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
* [Release history](data/Releases.md)
* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
* [Automatically translated monolingual data](data/Backtranslations.md)
* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)

The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.

4 changes: 2 additions & 2 deletions data/Releases.md
@@ -14,12 +14,12 @@
* [Test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test-v2021-08-07.tar) (v2021-08-07)
* [Development data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev-v2021-08-07.tar) (v2021-08-07)
* [Bilingual training data](README-v2021-08-07.md) (v2021-08-07), language-pair specific downloads
* [Extra bilingual training data](data/subsets/NoTestData-v2021-08-07.md) (v2021-08-07), language-pair specific downloads
* [Extra bilingual training data](subsets/NoTestData-v2021-08-07.md) (v2021-08-07), language-pair specific downloads

# Release v2023-09-26

* [Test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test-v2023-09-26.tar) (v2023-09-26)
* [Development data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev-v2023-09-26.tar) (v2023-09-26)
* [Bilingual training data](README-v2023-09-26.md) (v2023-09-26), language-pair specific downloads
* [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md) (v2023-09-26), language-pair specific downloads
* [Extra bilingual training data](subsets/NoTestData-v2023-09-26.md) (v2023-09-26), language-pair specific downloads
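
The link change in this file is presumably about relative-path resolution: a relative Markdown link is resolved against the directory of the file that contains it, so a `data/...` target written inside `data/Releases.md` would point one level too deep. The commit itself does not state this reason, so treat the following as a sketch under that assumption. The helper name `resolve_link` is hypothetical.

```python
import posixpath


def resolve_link(source_file, link):
    """Resolve a relative link against the directory of the file containing it."""
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


# Old link target as written in data/Releases.md: resolves one level too deep.
old_target = resolve_link("data/Releases.md", "data/subsets/NoTestData-v2023-09-26.md")

# New link target after the commit: resolves to the intended file.
new_target = resolve_link("data/Releases.md", "subsets/NoTestData-v2023-09-26.md")

# The same file linked from the top-level README.md keeps the data/ prefix.
readme_target = resolve_link("README.md", "data/subsets/NoTestData-v2023-09-26.md")

print(old_target)
print(new_target)
print(readme_target)
```

Under this reading, the top-level README files keep the `data/subsets/...` form while `data/Releases.md` needs the shorter `subsets/...` form, which matches what the diff shows.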
