Where are the missing language pairs? #20

icaswell · 2021-03-16T01:35:51Z

There seem to be 417 language varieties represented in https://opus.nlpl.eu/JW300.php. This would imply 417C2 = 86,736 undirected language pairs. However, I only count 54,376 of them, and the paper confirms this number. Do you know where the missing 32,360 language pairs are, and would you be willing to provide them?
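For anyone who wants to double-check the arithmetic, here is a minimal sketch. The two input numbers are the ones quoted above; the variable names are just illustrative.

```python
from math import comb

n_languages = 417        # language varieties listed on the JW300 page
observed_pairs = 54_376  # pairs counted in the release (and reported in the paper)

# Number of undirected pairs among n languages is "n choose 2".
possible_pairs = comb(n_languages, 2)
missing_pairs = possible_pairs - observed_pairs

print(possible_pairs, missing_pairs)
```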

I notice that the graph described by the adjacency matrix seems to consist of a single connected component, so e.g. although ady has no parallel data with en, it has parallel data with "jw_rmv", which has parallel data with en. So it seems likely that ady and en can be aligned through a pivot language. Just to demonstrate that it's conceptually possible, I found these two pairs in the respective corpora:

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
en: What will we consider in the following article ?

Implication: the following is a sentence pair between English and Adyghe:

ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?
en: What will we consider in the following article ?
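A rough sketch of how this kind of pivoting could be automated, assuming each bitext is available as a list of (pivot sentence, other sentence) tuples and that the shared pivot sentence matches exactly as a string. The function name `pivot_align` and the in-memory layout are hypothetical; a real pipeline would more likely join on document and sentence IDs than on raw text.

```python
from collections import defaultdict

def pivot_align(pivot_to_a, pivot_to_b):
    """Join two bitexts sharing a pivot language on exact pivot-sentence match.

    Both arguments are lists of (pivot_sentence, other_sentence) tuples.
    Returns a list of (a_sentence, b_sentence) tuples.
    """
    a_by_pivot = defaultdict(list)
    for pivot, a in pivot_to_a:
        a_by_pivot[pivot].append(a)
    # Emit one (a, b) pair for every a that shares a pivot sentence with b.
    return [(a, b) for pivot, b in pivot_to_b for a in a_by_pivot.get(pivot, [])]

# The two sentence pairs quoted above, with jw_rmv as the pivot.
rmv_ady = [("Пала со амэ подаса дума андэ авэр статья ?",
            "Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?")]
rmv_en = [("Пала со амэ подаса дума андэ авэр статья ?",
           "What will we consider in the following article ?")]

ady_en = pivot_align(rmv_ady, rmv_en)
for ady, en in ady_en:
    print("ady:", ady)
    print("en:", en)
```

Exact string matching is the simplest possible join; it will miss pivot sentences that differ only in tokenization or whitespace between the two bitexts.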

(Interestingly, jw_rmv, which actually seems to be Vlax Romany in Cyrillic script, is the one language that is aligned with the most other languages -- more than English!)

icaswell · 2021-03-25T22:49:24Z

Useful theorem: every language in JW300 is parallel with English, Assamese, or both.
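A claim like this can be checked mechanically from the list of available pairs. Below is a minimal sketch, assuming the pairs are given as tuples of language codes; the function name and the toy pair list (with placeholder codes x1, x2, x3) are hypothetical and are not the actual JW300 data.

```python
def languages_not_covered(pairs, hubs):
    """Return languages that are neither a hub nor paired directly with a hub."""
    languages = {lang for pair in pairs for lang in pair}
    covered = set(hubs)
    for a, b in pairs:
        if a in hubs:
            covered.add(b)
        if b in hubs:
            covered.add(a)
    return languages - covered

# Toy data only: x3 is paired with x1 but with neither hub.
toy_pairs = [("en", "x1"), ("as", "x2"), ("x1", "x2"), ("x3", "x1")]
uncovered = languages_not_covered(toy_pairs, {"en", "as"})
print(uncovered)
```

An empty result on the real pair list would confirm that every language is parallel with at least one of the two hubs.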
