Skip to content

ganeshvictory/samananthar

Repository files navigation

samananthar

Samananthar v1
Samanantar dataset which is the largest publicly available parallel corpora collection for Indic languages.
Below is the link provide to download the parallel corpus data of required English-X (indic-language) :
https://indicnlp.ai4bharat.org/samanantar/
In our work we considered the below language pairs :

Samanantar En-X(indic) corpus Number of sentence pairs
English-Hindi (en-hi) 8.56 million
English-Telugu (en-te) 4.82 million

Before getting into the wider spectrum of work we performed two levels of filtration over the two language pair corpora to get the filtered data.

Filtration levels :

Filter 1 : Extraction of sentences between sentence (word) length of threshold 5 to 100. Below are the number of sentences extracted in each language pair.

Samanantar En-X(indic) corpus Number of sentence pairs
English-Hindi (en-hi) 79,61,610
English-Telugu (en-te) 30,36,823

Filter 2 : Removing the corresponding parallel sentences from both English-Hindi (en-hi) files which has english text in hindi file and junk in both english text file and hindi text files. Similarly removing the corresponding parallel sentences from both English-Telugu (en-te) files which has english text in telugu file and junk in both english text file and hindi text files. \

Samanantar En-X(indic) corpus Number of clean sentence pairs Number of pruned sentence pairs
English-Hindi (en-hi) 76,22,024 3,39,586
English-Telugu (en-te) 29,68,429 68,394

Filter 3 : By finding “average for Bleu, Chrf++ scores”, we extracted sentences which are greater than the average scores from En-Hi and En-Te corpora. The below number of sentences are only on 15,000 sentence pairs only.

Samanantar En-X(indic) corpus Number of clean sentence pairs Number of pruned sentence pairs
English-Hindi (en-hi) 2312 12,688
English-Telugu (en-te) 1985 13,015

By finding the average Cosine similarity scores, we extracted sentences from En-Hi and En-Te corpora whose **average values are greater than the average value.

Samanantar En-X(indic) corpus Number of sentence pairs
English-Hindi (en-hi) 98,149 (calculated scores only for 1,50,000 sentence pairs only)
English-Telugu (en-te) 1,92,291 (calculated scores only for 3,26,912 sentence pairs only)

Below are the number of sentences extracted from above filter3 i.e sentences greater than the average value of chrf++ and bleu scores and extracted sentences whose LaBSE Cosine similarity score is greater than it's average LaBSE cosine similarity score.

Samanantar En-X(indic) corpus Number of clean sentence pairs Number of pruned sentences pairs
English-Hindi (en-hi) 2061 251
English-Telugu (en-te) 1607 378

Link for all filtered text is mentioned in the below link :
https://drive.google.com/drive/folders/1er8tHCn99ETbQabD7MaA_KLBgTn8d5IC?usp=sharing

About

Samananthar v1

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages