samananthar

Samananthar v1
Samanantar is the largest publicly available collection of parallel corpora for Indic languages.
The English-X (Indic language) parallel corpus data can be downloaded from the link below:
https://indicnlp.ai4bharat.org/samanantar/
In our work we considered the following language pairs:

Samanantar En-X(indic) corpus	Number of sentence pairs
English-Hindi (en-hi)	8.56 million
English-Telugu (en-te)	4.82 million

Before moving on to the wider work, we applied several levels of filtering to the two language-pair corpora to obtain the filtered data.

Filtering levels:

Filter 1 : Keep only the sentences whose length is between 5 and 100 words. The number of sentence pairs retained for each language pair is shown below.

Samanantar En-X(indic) corpus	Number of sentence pairs
English-Hindi (en-hi)	79,61,610
English-Telugu (en-te)	30,36,823
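
A minimal sketch of a length filter like Filter 1 is shown below. It is not the repository's `filter1.py`. It assumes whitespace tokenisation, inclusive bounds, and that both sides of a pair must pass the check; none of these details are stated above. The function names are hypothetical.

```python
def within_length(sentence, min_words=5, max_words=100):
    """Return True if the whitespace-token count lies in [min_words, max_words]."""
    return min_words <= len(sentence.split()) <= max_words


def length_filter(src_lines, tgt_lines, min_words=5, max_words=100):
    """Keep only the pairs where BOTH sides pass the length check (an assumption)."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if within_length(src, min_words, max_words) and within_length(
            tgt, min_words, max_words
        ):
            kept.append((src, tgt))
    return kept
```

Filtering the two sides together, rather than each file separately, keeps the source and target files line-aligned.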

Filter 2 : Remove the parallel sentence pairs from both English-Hindi (en-hi) files in which the Hindi file contains English text, as well as pairs with junk in the English and Hindi text files. Similarly, remove the parallel sentence pairs from both English-Telugu (en-te) files in which the Telugu file contains English text, as well as pairs with junk in the English and Telugu text files.

Samanantar En-X(indic) corpus	Number of clean sentence pairs	Number of pruned sentence pairs
English-Hindi (en-hi)	76,22,024	3,39,586
English-Telugu (en-te)	29,68,429	68,394
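
One possible way to detect English text on the Indic side, or junk on either side, is to check which Unicode block the letters of each sentence fall in. The sketch below is an illustration under assumptions, not the repository's `filter2.py`: "junk" is approximated as a sentence whose letters are mostly outside the expected script, a pair is pruned if either side fails, and the 0.5 threshold and all function names are hypothetical.

```python
import unicodedata

# Unicode blocks for the target-side scripts.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "te": (0x0C00, 0x0C7F),  # Telugu
}


def _letters(text):
    """Letters and combining marks only; digits, spaces and punctuation are ignored."""
    return [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]


def script_ratio(text, lang):
    """Fraction of letters/marks that fall inside the script block for `lang`."""
    lo, hi = SCRIPT_RANGES[lang]
    chars = _letters(text)
    if not chars:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in chars) / len(chars)


def latin_ratio(text):
    """Fraction of letters/marks that are ASCII (basic Latin) letters."""
    chars = _letters(text)
    if not chars:
        return 0.0
    return sum(ord(ch) < 128 for ch in chars) / len(chars)


def split_clean_pruned(pairs, lang, min_ratio=0.5):
    """Split (english, target) pairs into clean and pruned lists."""
    clean, pruned = [], []
    for en, tgt in pairs:
        ok = latin_ratio(en) >= min_ratio and script_ratio(tgt, lang) >= min_ratio
        (clean if ok else pruned).append((en, tgt))
    return clean, pruned
```

Combining marks are counted alongside letters because Devanagari and Telugu vowel signs are marks, not letters, in Unicode.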

Filter 3 : We computed the average BLEU and chrF++ scores and kept the sentence pairs from the En-Hi and En-Te corpora that score above the average. The counts below cover only 15,000 sentence pairs per language pair.

Samanantar En-X(indic) corpus	Number of clean sentence pairs	Number of pruned sentence pairs
English-Hindi (en-hi)	2312	12,688
English-Telugu (en-te)	1985	13,015
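
The above-average selection can be sketched as follows. The per-sentence BLEU and chrF++ scores are taken as precomputed inputs, since how they were obtained is not described here. The sketch also assumes that a pair must beat the average on both metrics, which is one reading of the description above; the function name is hypothetical.

```python
from statistics import mean


def above_average_indices(bleu_scores, chrf_scores):
    """Indices of pairs whose BLEU and chrF++ both exceed the corpus averages."""
    if len(bleu_scores) != len(chrf_scores):
        raise ValueError("score lists must have the same length")
    bleu_avg = mean(bleu_scores)  # raises StatisticsError on empty input
    chrf_avg = mean(chrf_scores)
    return [
        i
        for i, (b, c) in enumerate(zip(bleu_scores, chrf_scores))
        if b > bleu_avg and c > chrf_avg
    ]
```

The returned indices can then be used to select the matching lines from both the English file and the Indic file.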

We also computed the average cosine similarity score and kept the sentence pairs from the En-Hi and En-Te corpora whose cosine similarity score is greater than that average.

Samanantar En-X(indic) corpus	Number of sentence pairs
English-Hindi (en-hi)	98,149 (scores calculated for only 1,50,000 sentence pairs)
English-Telugu (en-te)	1,92,291 (scores calculated for only 3,26,912 sentence pairs)
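
A minimal sketch of the mean-threshold cosine filter is given below. It assumes the sentence embeddings for both sides have already been computed (for example with a multilingual encoder) and are passed in as plain lists of floats. It is not the repository's `filter3_cosine.py`, and the function names are hypothetical.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_mean_similarity(src_vecs, tgt_vecs):
    """Return (kept indices, threshold), where threshold is the mean pair score."""
    scores = [cosine_similarity(u, v) for u, v in zip(src_vecs, tgt_vecs)]
    threshold = mean(scores)
    kept = [i for i, s in enumerate(scores) if s > threshold]
    return kept, threshold
```

Because the threshold is the mean of the scored pairs, the number of pairs kept depends on which subset of the corpus was scored.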

The table below gives the number of sentence pairs that passed Filter 3, i.e. pairs scoring above the average chrF++ and BLEU values, and whose LaBSE cosine similarity score is also greater than the average LaBSE cosine similarity score.

Samanantar En-X(indic) corpus	Number of clean sentence pairs	Number of pruned sentence pairs
English-Hindi (en-hi)	2061	251
English-Telugu (en-te)	1607	378

All of the filtered text is available at the link below:
https://drive.google.com/drive/folders/1er8tHCn99ETbQabD7MaA_KLBgTn8d5IC?usp=sharing

