About the duplicate entry: it's somewhat intentional. https://opus.nlpl.eu/ELRC-wikipedia_health-v1.php combines all bitexts of COVID-19-related translations between English and other languages, and adds all language pairs pivoted through English. Including the English parts in this corpus again may not be the cleanest approach but, on the other hand, it nicely creates a multi-parallel corpus with all languages properly linked with each other. I am not sure whether I should change this or not.
I noticed this by chance: in all data formats for this particular dataset, lines seem to end with trailing spaces. The original source files, from https://www.elrc-share.eu/repository/browse/covid-19-health-wikipedia-dataset-bilingual-en-zh/c6236d148de811ea913100155d026706c2a9a16f8fc74d0487006e8379d322a0/, don't seem to have this issue.
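A quick way to confirm the trailing-space observation is to count the affected lines. This is only a minimal sketch: the function name `count_trailing_space_lines` and the sample lines are made up for illustration, not taken from the dataset or from any OPUS tooling.

```python
def count_trailing_space_lines(lines):
    """Return (number of lines ending in spaces/tabs, total number of lines)."""
    total = 0
    flagged = 0
    for line in lines:
        total += 1
        # Drop the line terminator first so only real trailing blanks count.
        body = line.rstrip("\r\n")
        if body != body.rstrip(" \t"):
            flagged += 1
    return flagged, total


# Illustrative lines only; the first and third end with spaces.
sample = ["Wash your hands. \n", "请洗手。\n", "Stay at home.  \n"]
flagged, total = count_trailing_space_lines(sample)
print(f"{flagged} of {total} lines have trailing whitespace")
```

Since the function accepts any iterable of strings, an open file handle (for example one opened with `encoding="utf-8"`) can be passed directly to check a downloaded file.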
Also, these might be duplicates. The samples are different, but the en-zh TMX is exactly the same except for the creation header:
I haven't checked all the other imported ELRC datasets, but another en-zh dataset didn't seem to have this issue.
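For the possible duplicates mentioned above, two TMX files can be compared while ignoring the header, so that a differing creation date does not mask identical content. This is a rough sketch using only the Python standard library. The helper names (`translation_units`, `same_body`) and the tiny inline TMX documents are hypothetical, and it assumes standard TMX elements (`tu`, `tuv`, `seg`) with `xml:lang` attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def translation_units(tmx_text):
    """Return the translation units as a list of ((lang, text), ...) tuples."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            text = "".join(seg.itertext()) if seg is not None else ""
            unit.append((tuv.get(XML_LANG, ""), text))
        units.append(tuple(unit))
    return units


def same_body(tmx_a, tmx_b):
    """True if both documents hold the same units; the header is ignored."""
    return translation_units(tmx_a) == translation_units(tmx_b)


# Hypothetical miniature TMX documents that differ only in the header.
TEMPLATE = """<tmx version="1.4">
<header creationdate="{date}" srclang="en"/>
<body>
<tu>
<tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
<tuv xml:lang="zh"><seg>请洗手。</seg></tuv>
</tu>
</body>
</tmx>"""

first = TEMPLATE.format(date="20200501T000000Z")
second = TEMPLATE.format(date="20200601T000000Z")
print(same_body(first, second))
```

Comparing parsed units rather than raw bytes also ignores whitespace between elements, which may or may not be what is wanted when the trailing spaces themselves are under investigation.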