Where are the missing language pairs? #20

Open
icaswell opened this issue Mar 16, 2021 · 1 comment

icaswell commented Mar 16, 2021

There seem to be 417 language varieties represented in https://opus.nlpl.eu/JW300.php. This would imply 417C2 = 86,736 undirected language pairs. However, I only count 54,376 of them, and the paper confirms this number. Do you know where the missing 32,360 language pairs are, and would you be willing to provide them?
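For anyone who wants to check the arithmetic, here is a quick sketch. The two input numbers are the ones quoted above; the variable names are just for illustration.

```python
from math import comb

n_languages = 417          # language varieties listed on the JW300 page
observed_pairs = 54_376    # pairs counted in the corpus (and reported in the paper)

# Number of undirected pairs among n languages: n * (n - 1) / 2
possible_pairs = comb(n_languages, 2)
missing_pairs = possible_pairs - observed_pairs

print(possible_pairs)  # 86736
print(missing_pairs)   # 32360
```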

I notice that the adjacency matrix seems to have a single connected component. For example, although ady has no parallel data with en, it has parallel data with "jw_rmv", which in turn has parallel data with en. So it seems likely that ady and en can be aligned by pivoting through jw_rmv. Just to demonstrate that it's conceptually possible, I found these two pairs in the respective corpora:

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
en: What will we consider in the following article ?

Implication: the following is a sentence pair between English and Adyghe:

ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?
en: What will we consider in the following article ?
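A minimal sketch of how such pivoting could be done, assuming each bitext is available as a list of (pivot sentence, other sentence) tuples and that pivot-side sentences match exactly as strings. The function name `pivot_align` is hypothetical and not part of any JW300 or OPUS tooling; real data would likely need normalization or document-level matching first.

```python
from collections import defaultdict

def pivot_align(pivot_to_a, pivot_to_b):
    """Join two bitexts on identical pivot-side sentences.

    Each argument is a list of (pivot_sentence, other_sentence) tuples.
    Returns a list of (a_sentence, b_sentence) tuples.
    """
    by_pivot = defaultdict(list)
    for pivot, a_sentence in pivot_to_a:
        by_pivot[pivot].append(a_sentence)
    # For every pivot sentence seen on both sides, emit the implied pair(s).
    return [
        (a_sentence, b_sentence)
        for pivot, b_sentence in pivot_to_b
        for a_sentence in by_pivot.get(pivot, [])
    ]

# The two sentence pairs quoted above, with jw_rmv as the pivot side.
rmv_ady = [("Пала со амэ подаса дума андэ авэр статья ?",
            "Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?")]
rmv_en = [("Пала со амэ подаса дума андэ авэр статья ?",
           "What will we consider in the following article ?")]

ady_en = pivot_align(rmv_ady, rmv_en)
for ady, en in ady_en:
    print("ady:", ady)
    print("en: ", en)
```

Exact string matching is the simplest possible join; it will miss pivot sentences that differ only in tokenization or whitespace between the two corpora.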

(Interestingly, jw_rmv, which actually seems to be Vlax Romany in Cyrillic script, is the language aligned with the most other languages -- more than English!)
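Here is a rough sketch of how one might look for a pivot chain between two languages and find the best-connected language, given a list of available language pairs. The pair list below is a toy containing only the two pairs mentioned above, not the real JW300 pair list, and the helper names are made up for illustration.

```python
from collections import defaultdict, deque

def build_graph(pairs):
    """Undirected adjacency sets from a list of (lang_a, lang_b) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def pivot_path(graph, src, tgt):
    """Shortest chain of languages from src to tgt (BFS), or None."""
    if src not in graph or tgt not in graph:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        lang = queue.popleft()
        if lang == tgt:
            path = []
            while lang is not None:
                path.append(lang)
                lang = parent[lang]
            return path[::-1]
        for neighbour in sorted(graph[lang]):
            if neighbour not in parent:
                parent[neighbour] = lang
                queue.append(neighbour)
    return None

# Toy input: only the two pairs discussed in this issue.
pairs = [("ady", "jw_rmv"), ("jw_rmv", "en")]
graph = build_graph(pairs)

print(pivot_path(graph, "ady", "en"))   # ['ady', 'jw_rmv', 'en']

# Language aligned with the most other languages (highest degree).
most_connected = max(graph, key=lambda lang: len(graph[lang]))
print(most_connected)                   # jw_rmv
```

If the graph really has a single connected component, `pivot_path` should return a chain for every pair of languages, though longer chains will presumably yield fewer and noisier sentence pairs.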

@icaswell
Author

Useful theorem: every language in JW300 is parallel with English, Assamese, or both.
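A claim like this can be checked mechanically from the pair list. Below is a small sketch, assuming the codes "en" and "as" for English and Assamese; the pairs and the codes "xx1" and "xx2" are hypothetical placeholders, not JW300 data, and the function name is invented for illustration.

```python
def languages_without_hub(pairs, hubs):
    """Return languages that share no pair with any of the hub languages.

    Hub languages themselves are treated as trivially covered.
    """
    hubs = set(hubs)
    languages = {lang for pair in pairs for lang in pair}
    covered = set(hubs)
    for a, b in pairs:
        if a in hubs:
            covered.add(b)
        if b in hubs:
            covered.add(a)
    return sorted(languages - covered)

# Hypothetical toy pair list: xx2 is only paired with xx1, so it is not covered.
toy_pairs = [("en", "xx1"), ("as", "xx1"), ("xx1", "xx2")]
print(languages_without_hub(toy_pairs, {"en", "as"}))  # ['xx2']
```

On the real pair list, the statement above would correspond to this function returning an empty list.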
