corrections to the README file

Helsinki-NLP · Oct 9, 2023 · 2b57ae8
1 parent 8b14a10

Showing 5 changed files with 29 additions and 34 deletions.
diff --git a/Makefile b/Makefile
@@ -372,7 +372,7 @@ release-tag:
 	@echo "* [Test data](${TATOEBA_DATAURL}-devtest/test-${VERSION}.tar) (${VERSION})" >> ${DATADIR}/Releases.md
 	@echo "* [Development data](${TATOEBA_DATAURL}-devtest/dev-${VERSION}.tar) (${VERSION})" >> ${DATADIR}/Releases.md
 	@echo "* [Bilingual training data](README-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
-	@echo "* [Extra bilingual training data](data/subsets/NoTestData-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
+	@echo "* [Extra bilingual training data](subsets/NoTestData-${VERSION}.md) (${VERSION}), language-pair specific downloads" >> ${DATADIR}/Releases.md
 	@echo ""                             >> ${DATADIR}/Releases.md
 	git add ${TESTDATADIR}/*/*.txt
 	git add ${DEVDATADIR}/*/*.txt
@@ -411,14 +411,15 @@ README.md: README.template ${TESTDATADIR}-${VERSION} ${DEVDATADIR}-${VERSION}
 		-e 's/%%TRAINSET_RELEASE%%/${TRAINSET_VERSION}/g' \
 		-e 's/%%EXTRATRAINSET_RELEASE%%/${EXTRATRAINSET_VERSION}/g' \
 		-e 's/%%MONO_RELEASE%%/${MONO_VERSION}/g' \
-		-e 's/%%NR_BITEXTS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | wc -l | numfmt --grouping}/' \
+		-e 's/%%NR_BITEXTS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | cut -f10 | grep . | wc -l | numfmt --grouping}/' \
+		-e 's/%%NR_LANGPAIRS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | wc -l | numfmt --grouping}/' \
 		-e 's/%%NR_LANGS%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt | cut -f1 | tr '-' ' ' | tr ' ' "\n" | sort -u | wc -l}/' \
 		-e 's/%%NR_TEST_LANGPAIRS%%/${shell cat ${RELEASEDIR}/released-bitexts-min200.txt | wc -l}/' \
 		-e 's/%%NR_TEST_LANGS%%/${shell cut -f1 ${RELEASEDIR}/released-bitexts-min200.txt  | tr '-' ' ' | tr ' ' "\n" | sort -u | wc -l}/' \
 		-e 's/%%TRAIN_SIZE%%/${shell tail -n +2 ${RELEASEDIR}/released-bitexts.txt  | awk "{SUM+=\$$10}END{print SUM}" | numfmt --to=si}/' \
 	< $< > $@
 	tail -n +2 ${RELEASEDIR}/released-bitexts.txt  | awk "{SUM+=\$$10}END{print SUM}" | numfmt --to=si
-
+	cp $@ README-${VERSION}.md
 
 
 

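Note on the `%%NR_BITEXTS%%` / `%%NR_LANGPAIRS%%` change in the Makefile hunk above: the old substitution counted every row of `released-bitexts.txt`, while the new one counts only rows whose tenth column is non-empty, and the plain row count moves to `%%NR_LANGPAIRS%%`. The sketch below reproduces both pipelines (without `numfmt`) on a small made-up file. The sample rows, the header names, and the reading of column 10 as "training-set size, empty when there is no training data" are assumptions for illustration, not taken from the real release file.

```shell
#!/bin/sh
# Hypothetical stand-in for ${RELEASEDIR}/released-bitexts.txt: one header line,
# then one tab-separated row per language pair. Column 10 is ASSUMED to hold the
# training-set size and to be empty when a pair has no training data.
sample=$(mktemp)
printf 'langpair\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\ttrain\n'  > "$sample"
printf 'deu-eng\t\t\t\t\t\t\t\t\t1000\n'                   >> "$sample"
printf 'fin-swe\t\t\t\t\t\t\t\t\t250\n'                    >> "$sample"
printf 'abk-eng\t\t\t\t\t\t\t\t\t\n'                       >> "$sample"

# As in %%NR_LANGPAIRS%%: every row after the header counts.
# (tr strips the padding some wc implementations add.)
nr_langpairs=$(tail -n +2 "$sample" | wc -l | tr -d ' ')

# As in %%NR_BITEXTS%%: only rows with a non-empty column 10 count.
nr_bitexts=$(tail -n +2 "$sample" | cut -f10 | grep . | wc -l | tr -d ' ')

echo "language pairs: $nr_langpairs, bitexts: $nr_bitexts"
rm -f "$sample"
```

On this sample the script prints `language pairs: 3, bitexts: 2`, which mirrors how the regenerated README can report fewer bitexts (2,539) than language pairs (4,024).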
diff --git a/README-v2023-09-26.md b/README-v2023-09-26.md
@@ -2,14 +2,13 @@
 
 # The Tatoeba Translation Challenge (v2023-09-26)
 
-This is a challenge set for machine translation that contains 32G translation units in 4,024 bitexts covering 487 languages. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages.
+This is a challenge set for machine translation that contains 32G translation units in 2,539 bitexts. The whole data set covers 487 languages linked to each other in 4,024 language pairs. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).
 
-* Benchmark for realistic low-resource scenarios
+* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
 * [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar) 
-* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
-* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
-* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
-* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
+* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
+* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
+* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
 
 [![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)
 
@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains 32G translation un
 * [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md), language-pair specific downloads
 * [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
 * [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
-* [Release history](data/Releases.md)
-* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
-* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
+* [Automatically translated monolingual data](data/Backtranslations.md)
+* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
 
 The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.
 
@@ -62,7 +60,7 @@ Please, cite the following paper if you use data and models from this distributi
 
 ## Data releases
 
-The current release includes data for 4,024 language pairs covering 487 languages.
+The current release includes data for 2,539 language pairs covering 487 languages.
 The data sets are released per language pair with the following structure (using deu-eng as an example):
 
 ```

diff --git a/README.md b/README.md
@@ -2,14 +2,13 @@
 
 # The Tatoeba Translation Challenge (v2023-09-26)
 
-This is a challenge set for machine translation that contains 32G translation units in 4,024 bitexts covering 487 languages. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages.
+This is a challenge set for machine translation that contains 32G translation units in 2,539 bitexts. The whole data set covers 487 languages linked to each other in 4,024 language pairs. The package includes a release of 657 test sets derived from [Tatoeba.org](https://tatoeba.org) that cover 138 languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).
 
-* Benchmark for realistic low-resource scenarios
+* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
 * [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar) 
-* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
-* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
-* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
-* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
+* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
+* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
+* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
 
 [![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)
 
@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains 32G translation un
 * [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md), language-pair specific downloads
 * [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
 * [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
-* [Release history](data/Releases.md)
-* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
-* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
+* [Automatically translated monolingual data](data/Backtranslations.md)
+* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
 
 The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.
 
@@ -62,7 +60,7 @@ Please, cite the following paper if you use data and models from this distributi
 
 ## Data releases
 
-The current release includes data for 4,024 language pairs covering 487 languages.
+The current release includes data for 2,539 language pairs covering 487 languages.
 The data sets are released per language pair with the following structure (using deu-eng as an example):
 
 ```

diff --git a/README.template b/README.template
@@ -2,14 +2,13 @@
 
 # The Tatoeba Translation Challenge (%%RELEASE%%)
 
-This is a challenge set for machine translation that contains %%TRAIN_SIZE%% translation units in %%NR_BITEXTS%% bitexts covering %%NR_LANGS%% languages. The package includes a release of %%NR_TEST_LANGPAIRS%% test sets derived from [Tatoeba.org](https://tatoeba.org) that cover %%NR_TEST_LANGS%% languages.
+This is a challenge set for machine translation that contains %%TRAIN_SIZE%% translation units in %%NR_BITEXTS%% bitexts. The whole data set covers %%NR_LANGS%% languages linked to each other in %%NR_LANGPAIRS%% language pairs. The package includes a release of %%NR_TEST_LANGPAIRS%% test sets derived from [Tatoeba.org](https://tatoeba.org) that cover %%NR_TEST_LANGS%% languages. Training data is compiled from various sources collected within the [OPUS project](https://opus.nlpl.eu).
 
-* Benchmark for realistic low-resource scenarios
+* Benchmark for realistic low-resource scenarios ([Release history](data/Releases.md))
 * [Training](data/README.md), [development](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev.tar) and [test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar) 
-* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-all.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
-* [Ideal for multilingual models and transfer learning](results/tatoeba-results-langgroup.md)
-* New: [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
-* New: [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
+* [Baseline models](results/tatoeba-models-all.md) and [results](results/tatoeba-results-sorted-langpair.md) ([training procedures](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md))
+* [The OPUS-MT leaderboard](https://opus.nlpl.eu/dashboard/)
+* [The status of available NMT models on a map](https://opus.nlpl.eu/NMT-map/Tatoeba/all/src2trg/) (for release v2020-07-28)
 
 [![NMT map](images/NMT-map-small.png)](https://opus.nlpl.eu/NMT-map/Tatoeba-all/src2trg/)
 
@@ -29,9 +28,8 @@ This is a challenge set for machine translation that contains %%TRAIN_SIZE%% tra
 * [Extra bilingual training data](data/subsets/NoTestData-%%EXTRATRAINSET_RELEASE%%.md), language-pair specific downloads
 * [Monolingual data sets](data/MonolingualData.md), [with document boundaries](data/Wiki.md), [de-duplicated and shuffled](data/Wiki.md)
 * [Incrementally updated development and test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/devtest.tar), ([here for individual language pairs](data/devtest))
-* [Release history](data/Releases.md)
-* NEW: [Automatically translated monolingual data](data/Backtranslations.md)
-* NEW: [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
+* [Automatically translated monolingual data](data/Backtranslations.md)
+* [Pre-trained sentence piece models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/tatoeba/SentencePieceModels.md)
 
 The latest release also includes some parallel data sets in the same language in order to test paraphrase models. Note, however, that the support for paraphrasing is really limited in our data sets.
 

diff --git a/data/Releases.md b/data/Releases.md
@@ -14,12 +14,12 @@
 * [Test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test-v2021-08-07.tar) (v2021-08-07)
 * [Development data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev-v2021-08-07.tar) (v2021-08-07)
 * [Bilingual training data](README-v2021-08-07.md) (v2021-08-07), language-pair specific downloads
-* [Extra bilingual training data](data/subsets/NoTestData-v2021-08-07.md) (v2021-08-07), language-pair specific downloads
+* [Extra bilingual training data](subsets/NoTestData-v2021-08-07.md) (v2021-08-07), language-pair specific downloads
 
 # Release v2023-09-26
 
 * [Test data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test-v2023-09-26.tar) (v2023-09-26)
 * [Development data](https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/dev-v2023-09-26.tar) (v2023-09-26)
 * [Bilingual training data](README-v2023-09-26.md) (v2023-09-26), language-pair specific downloads
-* [Extra bilingual training data](data/subsets/NoTestData-v2023-09-26.md) (v2023-09-26), language-pair specific downloads
+* [Extra bilingual training data](subsets/NoTestData-v2023-09-26.md) (v2023-09-26), language-pair specific downloads