This repository provides a dataset and a neural machine translation model nicknamed Tilmash for the paper
KazParC: Kazakh Parallel Corpus for Machine Translation
We collected data for our Kazakh Parallel Corpus (referred to as KazParC) from a diverse range of textual sources for Kazakh, English, Russian, and Turkish. These sources include:
- proverbs and sayings
- terminology glossaries
- phrasebooks
- literary works
- periodicals
- language learning materials, including the SCoRE corpus by Chujo et al. (2015)
- educational video subtitle collections, such as QED by Abdelali et al. (2014)
- news items, such as KazNERD (Yeshpanov et al., 2022) and WMT (Tiedemann, 2012)
- TED talks
- governmental and regulatory legal documents from Kazakhstan
- communications from the official website of the President of the Republic of Kazakhstan
- United Nations publications
- image captions from sources like COCO
We categorised the data acquired from these sources into five broad domains:
| Domain | # lines | % | # tokens (EN) | % | # tokens (KK) | % | # tokens (RU) | % | # tokens (TR) | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Mass media | 120,547 | 32.4 | 1,817,276 | 28.3 | 1,340,346 | 28.6 | 1,454,430 | 29.0 | 1,311,985 | 28.5 |
| General | 94,988 | 25.5 | 844,541 | 13.1 | 578,236 | 12.3 | 618,960 | 12.3 | 608,020 | 13.2 |
| Legal documents | 77,183 | 20.8 | 2,650,626 | 41.3 | 1,925,561 | 41.0 | 1,991,222 | 39.7 | 1,880,081 | 40.8 |
| Education and science | 46,252 | 12.4 | 522,830 | 8.1 | 392,348 | 8.4 | 444,786 | 8.9 | 376,484 | 8.2 |
| Fiction | 32,932 | 8.9 | 589,001 | 9.2 | 456,385 | 9.7 | 510,168 | 10.2 | 433,968 | 9.4 |
| Total | 371,902 | 100 | 6,424,274 | 100 | 4,692,876 | 100 | 5,019,566 | 100 | 4,610,538 | 100 |
We started the data collection process in July 2021, and it continued until September 2023. During this period, we collected a vast amount of text materials and their translations.
Our team of linguists played a crucial role in ensuring the quality of the data. They carefully reviewed the collected data, screening it for inappropriate content. The next step involved segmenting the data into individual sentences, with each sentence labelled with a domain identifier. We also paid close attention to grammar and spelling accuracy and removed any duplicate sentences.
Kazakh-Russian code-switching is a common practice in Kazakhstan, so we took steps to maintain uniformity. For sentences containing both Kazakh and Russian words, we initiated a modification process. This process involved translating the Russian elements into Kazakh while preserving the intended meaning of the sentences.
We organised the data into language pairs, removed unwanted characters, and replaced homoglyphs.
We also took care of formatting issues by eliminating line breaks (\n) and carriage returns (\r).
We identified and removed duplicate entries, making sure to filter out rows with identical text in both language columns.
However, to make our corpus more diverse and include a broader range of synonyms for different words and expressions, we decided to keep lines with duplicate text within a single language column.
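For illustration, these cleaning steps can be sketched roughly as follows. This is a minimal sketch, not the repository's actual preprocessing code: the file name, column names, and homoglyph mapping are assumptions for demonstration purposes.

```python
import pandas as pd

# Tiny example mapping of Cyrillic look-alikes to Latin letters for the English column;
# the real normalisation is language-aware and far more complete (illustrative assumption).
HOMOGLYPHS = {"\u0421": "C", "\u0415": "E", "\u041E": "O"}

def clean(text: str) -> str:
    text = text.replace("\n", " ").replace("\r", " ")   # strip line breaks and carriage returns
    return " ".join(text.split())                       # collapse repeated whitespace

df = pd.read_csv("kk_en_raw.csv")                       # hypothetical raw pair file with 'kk' and 'en' columns
df["kk"] = df["kk"].map(clean)
df["en"] = df["en"].map(clean)
df["en"] = df["en"].map(lambda s: "".join(HOMOGLYPHS.get(ch, ch) for ch in s))  # replace homoglyphs
df = df[df["kk"] != df["en"]]                           # drop rows with identical text in both columns
df = df.drop_duplicates(subset=["kk", "en"])            # drop duplicate pairs, but keep single-column duplicates
```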
In the table below, you will find statistics on the language pairs present in our corpus.
The column labelled '# lines' shows the total number of rows for each language pair.
The columns labelled '# sents', '# tokens', and '# types' give the counts of unique sentences, tokens, and word types for each language pair; in each cell, the number before the slash refers to the first language in the pair and the number after the slash to the second.
The token and type counts were obtained after processing the data with Moses Tokenizer 1.2.1 (a short counting sketch follows the table).
| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 363,594 | 362,230 / 361,087 | 4,670,789 / 6,393,381 | 184,258 / 59,062 |
| KK↔RU | 363,482 | 362,230 / 362,748 | 4,670,593 / 4,996,031 | 184,258 / 183,204 |
| KK↔TR | 362,150 | 362,230 / 361,660 | 4,668,852 / 4,586,421 | 184,258 / 175,145 |
| EN↔RU | 363,456 | 361,087 / 362,748 | 6,392,301 / 4,994,310 | 59,062 / 183,204 |
| EN↔TR | 362,392 | 361,087 / 361,660 | 6,380,703 / 4,579,375 | 59,062 / 175,145 |
| RU↔TR | 363,324 | 362,748 / 361,660 | 4,999,850 / 4,591,847 | 183,204 / 175,145 |
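The sketch below illustrates how token and type counts of this kind can be reproduced. It uses the sacremoses package as a convenient stand-in for Moses Tokenizer 1.2.1, so treat it as an approximation rather than the exact tooling behind the table.

```python
from collections import Counter
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")  # one tokenizer per language side
sentences = [
    "It is hot and windy.",
    "On September 1, the fifth maternal death was registered.",
]

tokens = [tok for sent in sentences for tok in mt.tokenize(sent)]
types = Counter(tokens)

print("# tokens:", len(tokens))   # total number of tokens
print("# types:", len(types))     # number of unique word types
```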
We began by creating a test set, randomly selecting 250 unique, non-repeating rows from each of the sources outlined above. The remaining data were organised into language pairs and split 80/20 into training and validation sets, preserving the domain distribution in both.
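A domain-stratified 80/20 split of this kind can be reproduced along the following lines. This is a minimal sketch under assumed file and column names, not the code we actually used:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kazparc_kk_en.csv")      # hypothetical pre-processed pair file with a 'domain' column

train_df, valid_df = train_test_split(
    df,
    test_size=0.2,            # 80/20 split
    stratify=df["domain"],    # keep the proportion of each domain in both sets
    random_state=42,
)
```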
| Pair | Train # lines | Train # sents | Train # tokens | Train # types | Valid # lines | Valid # sents | Valid # tokens | Valid # types | Test # lines | Test # sents | Test # tokens | Test # types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KK↔EN | 290,877 | 286,958 / 286,197 | 3,693,263 / 5,057,687 | 164,766 / 54,311 | 72,719 | 72,426 / 72,403 | 920,482 / 1,259,827 | 83,057 / 32,063 | 4,750 | 4,750 / 4,750 | 57,044 / 75,867 | 17,475 / 9,729 |
| KK↔RU | 290,785 | 286,943 / 287,215 | 3,689,799 / 3,945,741 | 164,995 / 165,882 | 72,697 | 72,413 / 72,439 | 923,750 / 988,374 | 82,958 / 87,519 | 4,750 | 4,750 / 4,750 | 57,044 / 61,916 | 17,475 / 18,804 |
| KK↔TR | 289,720 | 286,694 / 286,279 | 3,691,751 / 3,626,361 | 164,961 / 157,460 | 72,430 | 72,211 / 72,190 | 920,057 / 904,199 | 82,698 / 80,885 | 4,750 | 4,750 / 4,750 | 57,044 / 55,861 | 17,475 / 17,284 |
| EN↔RU | 290,764 | 286,185 / 287,261 | 5,058,530 / 3,950,362 | 54,322 / 165,701 | 72,692 | 72,377 / 72,427 | 1,257,904 / 982,032 | 32,208 / 87,541 | 4,750 | 4,750 / 4,750 | 75,867 / 61,916 | 9,729 / 18,804 |
| EN↔TR | 289,913 | 285,967 / 286,288 | 5,048,274 / 3,621,531 | 54,224 / 157,369 | 72,479 | 72,220 / 72,219 | 1,256,562 / 901,983 | 32,269 / 80,838 | 4,750 | 4,750 / 4,750 | 75,867 / 55,861 | 9,729 / 17,284 |
| RU↔TR | 290,899 | 287,241 / 286,475 | 3,947,809 / 3,626,436 | 165,482 / 157,470 | 72,725 | 72,455 / 72,362 | 990,125 / 909,550 | 87,831 / 80,962 | 4,750 | 4,750 / 4,750 | 61,916 / 55,861 | 18,804 / 17,284 |
To make our parallel corpus more extensive and diverse and to explore how well our translation models perform when dealing with a combination of human-translated and machine-translated content, we carried out web crawling to gather a total of 1,797,066 sentences from English-language websites. These sentences were then automatically translated into Kazakh, Russian, and Turkish using the Google Translate service. In the context of our research, we refer to this collection of data as 'SynC' (Synthetic Corpus).
| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 1,787,050 | 1,782,192 / 1,781,019 | 26,630,960 / 35,291,705 | 685,135 / 300,556 |
| KK↔RU | 1,787,448 | 1,782,192 / 1,777,500 | 26,654,195 / 30,241,895 | 685,135 / 672,146 |
| KK↔TR | 1,791,425 | 1,782,192 / 1,782,257 | 26,726,439 / 27,865,860 | 685,135 / 656,294 |
| EN↔RU | 1,784,513 | 1,781,019 / 1,777,500 | 35,244,800 / 30,175,611 | 300,556 / 672,146 |
| EN↔TR | 1,788,564 | 1,781,019 / 1,782,257 | 35,344,188 / 27,806,708 | 300,556 / 656,294 |
| RU↔TR | 1,788,027 | 1,777,500 / 1,782,257 | 30,269,083 / 27,816,210 | 672,146 / 656,294 |
We further divided the synthetic corpus into training and validation sets with a 90/10 ratio.
| Pair | Train # lines | Train # sents | Train # tokens | Train # types | Valid # lines | Valid # sents | Valid # tokens | Valid # types |
|---|---|---|---|---|---|---|---|---|
| KK↔EN | 1,608,345 | 1,604,414 / 1,603,426 | 23,970,260 / 31,767,617 | 650,144 / 286,372 | 178,705 | 178,654 / 178,639 | 2,660,700 / 3,524,088 | 208,838 / 105,517 |
| KK↔RU | 1,608,703 | 1,604,468 / 1,600,643 | 23,992,148 / 27,221,583 | 650,170 / 642,604 | 178,745 | 178,691 / 178,642 | 2,662,047 / 3,020,312 | 209,188 / 235,642 |
| KK↔TR | 1,612,282 | 1,604,793 / 1,604,822 | 24,053,671 / 25,078,688 | 650,384 / 626,724 | 179,143 | 179,057 / 179,057 | 2,672,768 / 2,787,172 | 209,549 / 221,773 |
| EN↔RU | 1,606,061 | 1,603,199 / 1,600,372 | 31,719,781 / 27,158,101 | 286,645 / 642,686 | 178,452 | 178,419 / 178,379 | 3,525,019 / 3,017,510 | 104,834 / 235,069 |
| EN↔TR | 1,609,707 | 1,603,636 / 1,604,545 | 31,805,393 / 25,022,782 | 286,387 / 626,740 | 178,857 | 178,775 / 178,796 | 3,538,795 / 2,783,926 | 105,641 / 221,372 |
| RU↔TR | 1,609,224 | 1,600,605 / 1,604,521 | 27,243,278 / 25,035,274 | 642,797 / 626,587 | 178,803 | 178,695 / 178,750 | 3,025,805 / 2,780,936 | 235,970 / 221,792 |
The data underwent vectorisation using HuggingFace's transformers and datasets libraries. Each language pair was vectorised individually based on the source and target languages within the pair. Subsequently, the vectorised data sets were combined into unified training and validation sets, each comprising 6 language pairs for bidirectional translation purposes. For more details, see data_tokenization.ipynb.
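As a rough illustration of that step, the snippet below tokenises one language pair with the NLLB tokenizer via the transformers and datasets libraries. It is a simplified sketch; data_tokenization.ipynb contains the actual code, and the base checkpoint and maximum length used here are assumptions (the column names follow the file description further below).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",  # assumed base checkpoint for illustration
    src_lang="kaz_Cyrl",
    tgt_lang="eng_Latn",
)

dataset = load_dataset("csv", data_files={"train": "kazparc/02_kazparc_train_kk_en.csv"})

def preprocess(batch):
    # 'source_lang' and 'target_lang' hold the source- and target-side text of the pair
    return tokenizer(
        batch["source_lang"],
        text_target=batch["target_lang"],
        max_length=128,
        truncation=True,
    )

vectorised = dataset.map(preprocess, batched=True)
```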
The corpus is organised into two groups based on file prefixes: files "01" through "19" carry the "kazparc" prefix, while files "20" through "32" carry the "sync" prefix.
├── kazparc
│   ├── 01_kazparc_all_entries.csv
│   ├── 02_kazparc_train_kk_en.csv
│   ├── 03_kazparc_train_kk_ru.csv
│   ├── 04_kazparc_train_kk_tr.csv
│   ├── 05_kazparc_train_en_ru.csv
│   ├── 06_kazparc_train_en_tr.csv
│   ├── 07_kazparc_train_ru_tr.csv
│   ├── 08_kazparc_valid_kk_en.csv
│   ├── 09_kazparc_valid_kk_ru.csv
│   ├── 10_kazparc_valid_kk_tr.csv
│   ├── 11_kazparc_valid_en_ru.csv
│   ├── 12_kazparc_valid_en_tr.csv
│   ├── 13_kazparc_valid_ru_tr.csv
│   ├── 14_kazparc_test_kk_en.csv
│   ├── 15_kazparc_test_kk_ru.csv
│   ├── 16_kazparc_test_kk_tr.csv
│   ├── 17_kazparc_test_en_ru.csv
│   ├── 18_kazparc_test_en_tr.csv
│   └── 19_kazparc_test_ru_tr.csv
└── sync
    ├── 20_sync_all_entries.csv
    ├── 21_sync_train_kk_en.csv
    ├── 22_sync_train_kk_ru.csv
    ├── 23_sync_train_kk_tr.csv
    ├── 24_sync_train_en_ru.csv
    ├── 25_sync_train_en_tr.csv
    ├── 26_sync_train_ru_tr.csv
    ├── 27_sync_valid_kk_en.csv
    ├── 28_sync_valid_kk_ru.csv
    ├── 29_sync_valid_kk_tr.csv
    ├── 30_sync_valid_en_ru.csv
    ├── 31_sync_valid_en_tr.csv
    └── 32_sync_valid_ru_tr.csv
KazParC files:
- File "01" contains the original, unprocessed text data for the four languages considered within KazParC.
- Files "02" through "19" represent pre-processed texts divided into language pairs for training (Files "02" to "07"), validation (Files "08" to "13"), and testing (Files "14" to "19"). Language pairs are indicated within the filenames using two-letter language codes (e.g., kk_en).
SynC files:
- File "20" contains raw, unprocessed text data for the four languages.
- Files "21" to "32" contain pre-processed text divided into language pairs for training (Files "21" to "26") and validation (Files "27" to "32") purposes.
In both "01" and "20", each line consists of specific components: a unique line identifier (id), texts in Kazakh (kk), English (en), Russian (ru), and Turkish (tr), along with accompanying domain information (domain). For the other files, the data fields are id, source_lang, target_lang, domain, and the language pair (e.g., kk_en.).
In our study, we used Facebook's NLLB model, which supports translation for a wide range of languages, including Kazakh, English, Russian, and Turkish. To assess the performance of the model, we initially tested two versions: the baseline and the distilled models. We fine-tuned both versions on KazParC data. After comparing their results, we found that the distilled model consistently outperformed the baseline, though the difference was small (an improvement of just 0.01 BLEU). Consequently, we focused our subsequent experiments exclusively on fine-tuning the distilled model.
We considered a total of four models:
- 'base', the off-the-shelf model.
- 'parc', fine-tuned on KazParC data.
- 'sync', fine-tuned on SynC data.
- 'parsync', fine-tuned on both KazParC and SynC data.
We fine-tuned these models using hyperparameters tuned with validation sets. We included synthetic data in the validation sets only when assessing the performance of the 'sync' and 'parsync' models. The best-performing models were then evaluated on the test sets.
In addition to the KazParC test set, we used the FLoRes dataset, merging its dev and devtest sets into a single evaluation set. We also evaluated control language pairs, such as German↔French, German↔Ukrainian, and French↔Uzbek, to assess how fine-tuning affected translation quality for language pairs outside the four KazParC languages.
All the models were fine-tuned using eight GPUs on an NVIDIA DGX A100 machine. We initially set a learning rate of 2 × 10⁻⁵ and used the Adafactor optimisation algorithm. The training process spanned three epochs, with both the training and evaluation batch sizes set to 8. To start training the model, create a virtual environment and install the necessary requirements from the environment.yaml file:
conda create --name kazparc python=3.8.17
conda env update --name kazparc --file environment.yaml
Once you have completed the above steps, you are ready to run the train.py script using the command:
python3 -m torch.distributed.launch --nproc_per_node 8 --nnodes 1 train.py
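The hyperparameters above translate roughly into the following training configuration. This is an illustrative sketch built on HuggingFace's Seq2SeqTrainingArguments, not a copy of train.py; the output directory and evaluation strategy are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="tilmash-checkpoints",   # assumed output location
    learning_rate=2e-5,                 # initial learning rate
    optim="adafactor",                  # Adafactor optimiser
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",        # assumed; not specified above
    save_strategy="epoch",
)
```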
In our evaluation of machine translation models, we used two widely recognised metrics:
- BLEU, based on modified n-gram precision (up to 4-grams), measures how closely machine-produced translations match human references.
- chrF evaluates translation quality by considering character n-grams, making it well-suited for languages with complex morphologies (e.g., Kazakh and Turkish). chrF calculates the harmonic mean of character-based precision and recall, offering a robust evaluation of translation performance.
We translated the test dataset using the translate_test_set.py script. To obtain the BLEU and chrF scores, we used evaluation.ipynb. Below are the results we obtained from evaluating the models on the KazParC and FLoRes test datasets.
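For reference, both metrics can be computed with the sacreBLEU library; the snippet below is a minimal sketch, while evaluation.ipynb holds the evaluation code we actually used. Note that sacreBLEU reports scores on a 0–100 scale, whereas the tables below use a 0–1 scale.

```python
import sacrebleu

hypotheses = ["It's hot and windy."]
references = [["It is hot and windy."]]   # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # n-gram precision up to 4-grams
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character n-gram F-score
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```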
| Pair | base BLEU | base chrF | parc BLEU | parc chrF | sync BLEU | sync chrF | parsync BLEU | parsync chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
EN↔KK | 0.11 | 0.49 | 0.14 | 0.56 | 0.20 | 0.60 | 0.20 | 0.60 | 0.18 | 0.58 | 0.20 | 0.60 |
EN↔RU | 0.25 | 0.56 | 0.26 | 0.58 | 0.28 | 0.60 | 0.28 | 0.60 | 0.32 | 0.63 | 0.31 | 0.62 |
EN↔TR | 0.19 | 0.58 | 0.22 | 0.61 | 0.27 | 0.65 | 0.27 | 0.65 | 0.29 | 0.66 | 0.30 | 0.66 |
KK↔EN | 0.28 | 0.59 | 0.32 | 0.62 | 0.31 | 0.62 | 0.32 | 0.63 | 0.30 | 0.62 | 0.36 | 0.65 |
KK↔RU | 0.15 | 0.49 | 0.17 | 0.51 | 0.18 | 0.52 | 0.18 | 0.52 | 0.18 | 0.52 | 0.20 | 0.53 |
KK↔TR | 0.09 | 0.48 | 0.13 | 0.52 | 0.14 | 0.54 | 0.14 | 0.54 | 0.12 | 0.52 | 0.17 | 0.56 |
RU↔EN | 0.31 | 0.62 | 0.32 | 0.63 | 0.32 | 0.63 | 0.32 | 0.63 | 0.33 | 0.64 | 0.35 | 0.65 |
RU↔KK | 0.08 | 0.49 | 0.10 | 0.52 | 0.13 | 0.53 | 0.13 | 0.54 | 0.12 | 0.54 | 0.13 | 0.54 |
RU↔TR | 0.10 | 0.49 | 0.12 | 0.52 | 0.14 | 0.54 | 0.14 | 0.54 | 0.13 | 0.54 | 0.17 | 0.56 |
TR↔EN | 0.34 | 0.64 | 0.35 | 0.65 | 0.36 | 0.66 | 0.36 | 0.66 | 0.38 | 0.67 | 0.39 | 0.67 |
TR↔KK | 0.07 | 0.45 | 0.10 | 0.51 | 0.13 | 0.54 | 0.13 | 0.54 | 0.12 | 0.53 | 0.13 | 0.54 |
TR↔RU | 0.15 | 0.48 | 0.17 | 0.51 | 0.18 | 0.52 | 0.19 | 0.53 | 0.20 | 0.54 | 0.21 | 0.54 |
Average | 0.18 | 0.53 | 0.20 | 0.56 | 0.22 | 0.58 | 0.22 | 0.58 | 0.23 | 0.58 | 0.25 | 0.59 |
BLEU and chrF scores of the models on the FLoRes test set
| Pair | base BLEU | base chrF | parc BLEU | parc chrF | sync BLEU | sync chrF | parsync BLEU | parsync chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
EN↔KK | 0.12 | 0.51 | 0.18 | 0.58 | 0.18 | 0.58 | 0.21 | 0.60 | 0.18 | 0.58 | 0.30 | 0.65 |
EN↔RU | 0.31 | 0.64 | 0.38 | 0.68 | 0.35 | 0.66 | 0.38 | 0.68 | 0.39 | 0.70 | 0.41 | 0.71 |
EN↔TR | 0.19 | 0.59 | 0.22 | 0.62 | 0.25 | 0.63 | 0.25 | 0.64 | 0.27 | 0.64 | 0.34 | 0.68 |
KK↔EN | 0.24 | 0.55 | 0.33 | 0.62 | 0.24 | 0.57 | 0.32 | 0.62 | 0.28 | 0.60 | 0.31 | 0.62 |
KK↔RU | 0.22 | 0.56 | 0.29 | 0.63 | 0.24 | 0.59 | 0.29 | 0.63 | 0.29 | 0.63 | 0.29 | 0.61 |
KK↔TR | 0.10 | 0.47 | 0.15 | 0.54 | 0.14 | 0.52 | 0.16 | 0.55 | 0.13 | 0.52 | 0.23 | 0.59 |
RU↔EN | 0.34 | 0.63 | 0.43 | 0.71 | 0.34 | 0.65 | 0.42 | 0.70 | 0.43 | 0.71 | 0.42 | 0.71 |
RU↔KK | 0.15 | 0.55 | 0.21 | 0.61 | 0.18 | 0.58 | 0.22 | 0.62 | 0.23 | 0.62 | 0.24 | 0.62 |
RU↔TR | 0.11 | 0.49 | 0.16 | 0.56 | 0.16 | 0.55 | 0.18 | 0.57 | 0.16 | 0.55 | 0.22 | 0.60 |
TR↔EN | 0.31 | 0.61 | 0.38 | 0.67 | 0.32 | 0.63 | 0.38 | 0.66 | 0.36 | 0.66 | 0.37 | 0.66 |
TR↔KK | 0.08 | 0.46 | 0.14 | 0.53 | 0.14 | 0.52 | 0.16 | 0.55 | 0.14 | 0.53 | 0.19 | 0.57 |
TR↔RU | 0.17 | 0.50 | 0.23 | 0.56 | 0.20 | 0.54 | 0.24 | 0.57 | 0.23 | 0.57 | 0.26 | 0.58 |
Average | 0.20 | 0.55 | 0.27 | 0.61 | 0.23 | 0.59 | 0.27 | 0.62 | 0.26 | 0.61 | 0.30 | 0.63 |
BLEU and chrF scores of the models on the KazParC test set
After a comprehensive analysis of both qualitative and quantitative outcomes, we have found that the 'parsync' model, which was fine-tuned on a mix of the KazParC corpus and synthetic data, emerged as the top-performing model. Let us simply call this model Tilmash, a Kazakh term that means 'interpreter' or 'translator'.
| Pair | base BLEU | Tilmash BLEU | base chrF | Tilmash chrF |
|---|---|---|---|---|
DE→FR | 0.33 | 0.28 | 0.61 | 0.58 |
FR→DE | 0.22 | 0.19 | 0.55 | 0.53 |
DE→UK | 0.15 | 0.04 | 0.49 | 0.36 |
UK→DE | 0.19 | 0.16 | 0.53 | 0.50 |
FR→UZ | 0.06 | 0.02 | 0.48 | 0.31 |
UZ→FR | 0.25 | 0.22 | 0.56 | 0.53 |
Results of the base and Tilmash models on the control language pairs on the FLoRes test set
| Pair | Type | Text | BLEU | chrF |
|---|---|---|---|---|
| KK→EN | source | Ыстық және желді. (Ystyq jane jeldi.) | | |
| | reference | It is hot and windy. | 1.00 | 1.00 |
| | Tilmash | It's hot and windy. | 0.55 | 0.81 |
| | Yandex | Hot and windy. | 0.00 | 0.66 |
| | Google | Hot and windy. | 0.00 | 0.66 |
| KK→EN | source | 1 қыркүйекте бесінші ана өлімі тіркелді. (1 qyrkuiekte besinshi ana olimi tirkeldi.) | | |
| | reference | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Tilmash | A fifth maternal death was recorded on 1 September. | 0.27 | 0.63 |
| | Yandex | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Google | On September 1, the fifth maternal death was recorded. | 0.81 | 0.86 |
A selection of translation outputs from Tilmash, Yandex, and Google
Below are the detailed tables of Tilmash, Yandex, and Google results per domain.
EDUCATION AND SCIENCE

| Pair | Tilmash BLEU | Tilmash chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|
EN→KK | 0.23 | 0.63 | 0.19 | 0.61 | 0.44 | 0.73 | |
EN→RU | 0.39 | 0.74 | 0.39 | 0.76 | 0.43 | 0.78 | |
EN→TR | 0.33 | 0.71 | 0.37 | 0.74 | 0.47 | 0.79 | |
KK→EN | 0.28 | 0.64 | 0.27 | 0.63 | 0.32 | 0.66 | |
KK→RU | 0.26 | 0.66 | 0.26 | 0.66 | 0.32 | 0.66 | |
KK→TR | 0.20 | 0.60 | 0.15 | 0.57 | 0.29 | 0.66 | |
RU→EN | 0.38 | 0.73 | 0.40 | 0.75 | 0.40 | 0.76 | |
RU→KK | 0.21 | 0.64 | 0.22 | 0.65 | 0.30 | 0.67 | |
RU→TR | 0.24 | 0.65 | 0.22 | 0.65 | 0.33 | 0.70 | |
TR→EN | 0.38 | 0.70 | 0.38 | 0.70 | 0.40 | 0.71 | |
TR→KK | 0.19 | 0.58 | 0.17 | 0.56 | 0.29 | 0.64 | |
TR→RU | 0.27 | 0.63 | 0.29 | 0.65 | 0.33 | 0.68 |
FICTION

| Pair | Tilmash BLEU | Tilmash chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|
EN→KK | 0.13 | 0.51 | 0.15 | 0.52 | 0.19 | 0.53 | |
EN→RU | 0.35 | 0.64 | 0.34 | 0.66 | 0.37 | 0.66 | |
EN→TR | 0.28 | 0.62 | 0.29 | 0.63 | 0.53 | 0.74 | |
KK→EN | 0.29 | 0.57 | 0.24 | 0.54 | 0.29 | 0.58 | |
KK→RU | 0.25 | 0.58 | 0.23 | 0.55 | 0.25 | 0.57 | |
KK→TR | 0.26 | 0.62 | 0.18 | 0.56 | 0.50 | 0.77 | |
RU→EN | 0.40 | 0.66 | 0.41 | 0.67 | 0.42 | 0.68 | |
RU→KK | 0.17 | 0.55 | 0.19 | 0.56 | 0.16 | 0.55 | |
RU→TR | 0.22 | 0.59 | 0.17 | 0.55 | 0.36 | 0.67 | |
TR→EN | 0.36 | 0.63 | 0.35 | 0.62 | 0.37 | 0.64 | |
TR→KK | 0.15 | 0.55 | 0.16 | 0.55 | 0.19 | 0.58 | |
TR→RU | 0.24 | 0.56 | 0.24 | 0.56 | 0.26 | 0.58 |
GENERAL

| Pair | Tilmash BLEU | Tilmash chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|
EN→KK | 0.26 | 0.68 | 0.17 | 0.62 | 0.45 | 0.77 | |
EN→RU | 0.46 | 0.76 | 0.44 | 0.77 | 0.48 | 0.79 | |
EN→TR | 0.12 | 0.54 | 0.12 | 0.54 | 0.12 | 0.55 | |
KK→EN | 0.39 | 0.68 | 0.29 | 0.64 | 0.33 | 0.65 | |
KK→RU | 0.32 | 0.68 | 0.29 | 0.66 | 0.30 | 0.66 | |
KK→TR | 0.10 | 0.52 | 0.08 | 0.47 | 0.11 | 0.51 | |
RU→EN | 0.45 | 0.74 | 0.39 | 0.71 | 0.38 | 0.70 | |
RU→KK | 0.22 | 0.66 | 0.18 | 0.63 | 0.22 | 0.65 | |
RU→TR | 0.11 | 0.52 | 0.09 | 0.49 | 0.09 | 0.51 | |
TR→EN | 0.32 | 0.62 | 0.27 | 0.59 | 0.28 | 0.60 | |
TR→KK | 0.14 | 0.55 | 0.10 | 0.50 | 0.16 | 0.56 | |
TR→RU | 0.22 | 0.57 | 0.18 | 0.57 | 0.21 | 0.58 |
LEGAL DOCUMENTS

| Pair | Tilmash BLEU | Tilmash chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|
EN→KK | 0.27 | 0.67 | 0.28 | 0.67 | 0.29 | 0.68 | |
EN→RU | 0.48 | 0.75 | 0.46 | 0.76 | 0.47 | 0.76 | |
EN→TR | 0.22 | 0.64 | 0.23 | 0.64 | 0.25 | 0.55 | |
KK→EN | 0.41 | 0.69 | 0.34 | 0.65 | 0.36 | 0.66 | |
KK→RU | 0.47 | 0.77 | 0.45 | 0.76 | 0.38 | 0.71 | |
KK→TR | 0.11 | 0.54 | 0.11 | 0.53 | 0.13 | 0.54 | |
RU→EN | 0.52 | 0.76 | 0.52 | 0.76 | 0.51 | 0.76 | |
RU→KK | 0.37 | 0.74 | 0.38 | 0.75 | 0.33 | 0.71 | |
RU→TR | 0.14 | 0.57 | 0.13 | 0.56 | 0.15 | 0.58 | |
TR→EN | 0.46 | 0.72 | 0.39 | 0.69 | 0.43 | 0.70 | |
TR→KK | 0.18 | 0.58 | 0.15 | 0.56 | 0.18 | 0.58 | |
TR→RU | 0.29 | 0.63 | 0.22 | 0.59 | 0.27 | 0.61 |
MASS MEDIA

| Pair | Tilmash BLEU | Tilmash chrF | Yandex BLEU | Yandex chrF | Google BLEU | Google chrF |
|---|---|---|---|---|---|---|
EN→KK | 0.18 | 0.58 | 0.17 | 0.58 | 0.19 | 0.59 | |
EN→RU | 0.35 | 0.67 | 0.38 | 0.70 | 0.40 | 0.70 | |
EN→TR | 0.30 | 0.66 | 0.31 | 0.67 | 0.41 | 0.72 | |
KK→EN | 0.32 | 0.62 | 0.32 | 0.62 | 0.33 | 0.62 | |
KK→RU | 0.27 | 0.61 | 0.29 | 0.62 | 0.26 | 0.59 | |
KK→TR | 0.18 | 0.57 | 0.16 | 0.55 | 0.26 | 0.62 | |
RU→EN | 0.48 | 0.73 | 0.53 | 0.76 | 0.50 | 0.74 | |
RU→KK | 0.21 | 0.60 | 0.22 | 0.62 | 0.20 | 0.59 | |
RU→TR | 0.22 | 0.60 | 0.18 | 0.58 | 0.26 | 0.63 | |
TR→EN | 0.40 | 0.68 | 0.40 | 0.68 | 0.41 | 0.69 | |
TR→KK | 0.15 | 0.55 | 0.14 | 0.54 | 0.17 | 0.57 | |
TR→RU | 0.22 | 0.57 | 0.24 | 0.58 | 0.25 | 0.59 |
To translate text, you can utilise the predict.py script. To get started, make sure to download Tilmash from our Hugging Face repository. In the script, you will need to specify the source and target languages using the src and trg variables. You can choose from the following language values:
- Kazakh: kaz_Cyrl
- Russian: rus_Cyrl
- English: eng_Latn
- Turkish: tur_Latn
Once you have set the languages, simply input the text you want to translate into the text variable.
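If you prefer to call the model directly rather than edit predict.py, the sketch below shows a typical NLLB-style translation call with transformers. The model identifier is a placeholder for the Tilmash checkpoint you download from our Hugging Face repository, and the generation settings are assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

src, trg = "kaz_Cyrl", "eng_Latn"                     # source and target language codes
text = "Ыстық және желді."

model_id = "path/to/tilmash"                          # placeholder: local path or Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=src)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(trg),  # force decoding into the target language
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```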
We wish to convey our deep appreciation to the diligent group of translators whose exceptional contributions have been crucial to the successful realisation of this study. Their tireless efforts to ensure the accuracy and faithful rendition of the source materials have indeed proved invaluable. Our sincerest thanks go to the following esteemed individuals: Aigerim Baidauletova, Aigerim Boranbayeva, Ainagul Akmuldina, Aizhan Seipanova, Askhat Kenzhegulov, Assel Kospabayeva, Assel Mukhanova, Elmira Nikiforova, Gaukhar Rayanova, Gulim Kabidolda, Gulzhanat Abduldinova, Indira Yerkimbekova, Moldir Orazalinova, Saltanat Kemaliyeva, and Venera Spanbayeva.
If you incorporate our dataset and/or model into your work, we kindly ask that you cite our paper. Proper citation upholds academic honesty, ensures that the authors' efforts are acknowledged, and supports the continued development of this line of research. Your endorsement and acknowledgement of our endeavours are genuinely appreciated.
@misc{yeshpanov2024kazparc,
title={KazParC: Kazakh Parallel Corpus for Machine Translation},
author={Rustem Yeshpanov and Alina Polonskaya and Huseyin Atakan Varol},
year={2024},
eprint={2403.19399},
archivePrefix={arXiv},
primaryClass={cs.CL}
}