All sentences in ELRC-{3056-,}wikipedia_health zh-en end with spaces, possibly duplicates #4

Open
jelmervdl opened this issue Mar 22, 2023 · 1 comment

@jelmervdl

I noticed by chance that, in every data format offered for this dataset, the lines seem to end with trailing spaces. The original source files, from https://www.elrc-share.eu/repository/browse/covid-19-health-wikipedia-dataset-bilingual-en-zh/c6236d148de811ea913100155d026706c2a9a16f8fc74d0487006e8379d322a0/, don't seem to have this issue.
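A quick way to confirm the trailing spaces is to scan a plain-text export line by line. This is only a minimal sketch: the function name `trailing_space_line_numbers` is made up for illustration, and it assumes the file has already been read as UTF-8 text, one sentence per line.

```python
def trailing_space_line_numbers(lines):
    """Return the 1-based numbers of lines that end with a space.

    `lines` is any iterable of strings, e.g. an open text file.
    The line terminator is removed first so that only spaces
    *before* the newline are counted.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\r\n")  # drop the newline only
        if content.endswith(" "):
            hits.append(number)
    return hits


# Hypothetical sample data, not taken from the dataset.
sample = ["A sentence with a trailing space. \n", "A clean sentence.\n"]
print(trailing_space_line_numbers(sample))
```

If every line number of a file shows up in the result, the file has the problem described above.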

Also, ELRC-3056-wikipedia_health and ELRC-wikipedia_health might be duplicates. The samples differ, but the en-zh TMX files are exactly the same except for the creation header.

I haven't checked all the other ELRC-imported datasets, but another en-zh dataset didn't seem to have this issue.
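For the duplicate question, two TMX files can be compared while ignoring the header by looking only at their translation units. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper names `tmx_units` and `same_body` are hypothetical, and it assumes the usual TMX layout of `<tu>` elements containing `<tuv xml:lang="...">` with a `<seg>` inside.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_units(tmx_text):
    """Return the translation units of a TMX document as a list of
    tuples of (language, segment text) pairs. The <header> is ignored."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        pairs = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text inside inline markup, if any.
            text = "".join(seg.itertext()) if seg is not None else ""
            pairs.append((tuv.get(XML_LANG), text))
        units.append(tuple(pairs))
    return units


def same_body(tmx_a, tmx_b):
    """True if both documents contain the same units in the same order."""
    return tmx_units(tmx_a) == tmx_units(tmx_b)
```

For files on disk, `ET.parse(path).getroot()` can take the place of `ET.fromstring`. Because segment text is compared exactly, a trailing space in one file but not the other counts as a difference.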

@jorgtied
Member

About the duplicate entry: it's kind of intentional, as https://opus.nlpl.eu/ELRC-wikipedia_health-v1.php combines all bitexts of COVID-19-related translations between English and other languages, adding all language pairs pivoted through English. It may not be the cleanest approach to include the English parts in this corpus again but, on the other hand, it nicely creates a multi-parallel corpus with all languages properly linked to each other. I am not sure whether I should change this or not.
