This repository provides the dataset and pre-trained polarity and score classification models for the paper [KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes](https://arxiv.org/abs/2403.19335).
The source data for our dataset came from four domains:
- an online store for Android devices that offers a diverse range of applications (hereafter Appstore),
- an online library that serves as a source of books and audiobooks in Kazakh (hereafter Bookstore),
- digital mapping and navigation services (hereafter Mapping),
- online marketplaces (hereafter Market).
Domain | ⭐️ | ⭐️⭐️ | ⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️⭐️ | Total |
---|---|---|---|---|---|---|
Appstore | 22,547 | 4,202 | 5,758 | 7,949 | 94,617 | 135,073 |
Bookstore | 686 | 107 | 222 | 368 | 4,422 | 5,805 |
Mapping | 959 | 270 | 369 | 525 | 6,774 | 8,897 |
Market | 1,043 | 350 | 913 | 2,775 | 25,208 | 30,289 |
Total | 25,235 | 4,929 | 7,262 | 11,617 | 131,021 | 180,064 |
In Kazakhstan, people often switch between Kazakh and Russian, and there is also an ongoing shift from the Cyrillic script to the Latin script. As a result, the Kazakh reviews in our dataset can take various forms: (a) purely Kazakh words written in the Kazakh Cyrillic script, (b) Kazakh words in the Latin script, (c) a mix of Cyrillic and Latin characters, (d) a mix of Russian and Kazakh words, or (e) entirely in Cyrillic, but with Russian characters used in place of Kazakh ones.
Case | Actual review | Correct form (Kazakh) | Correct form (English)
---|---|---|---
a | керемет кітап | керемет кітап | a wonderful book |
b | keremet | керемет | wonderful |
c | jok кітап | кітап жоқ | no books |
d | Осы приложениеге көп рахмет! | Осы қолданбаға көп рақмет! | Many thanks to this app! |
e | Кушти! | Күшті! | Great! |
We utilised KazSAnDRA for two distinct tasks:
- polarity classification (PC), involving the prediction of whether a review is positive or negative (see the sketch after this list):
  - reviews with original scores of 1 or 2 were classified as negative and assigned a new score of 0,
  - reviews with original scores of 4 or 5 were classified as positive and assigned a new score of 1,
  - reviews with an original score of 3 were categorised as neutral and excluded from the task.
- score classification (SC), where the objective was to predict the score of a review on a scale from 1 to 5.
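A minimal sketch of the PC label mapping, assuming a pandas DataFrame with an integer `score` column; the actual conversion script used to build the dataset may differ:

```python
import pandas as pd

def to_polarity(score: int):
    """Map a 1-5 review score to a binary polarity label."""
    if score in (1, 2):
        return 0      # negative
    if score in (4, 5):
        return 1      # positive
    return None       # score 3: neutral, excluded from the PC task

df = pd.DataFrame({"score": [1, 2, 3, 4, 5]})
df["label"] = df["score"].map(to_polarity)
df = df.dropna(subset=["label"])  # neutral reviews are dropped
```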
During the data pre-processing stage, the following steps were undertaken (sketched in code after the list):
- Removal of emojis 🤓
- Lowercasing all reviews 🔠 ➙ 🔡
- Removal of punctuation marks ⁉️
- Removal of newline (\n), tab (\t), and carriage return (\r) characters ⇥ ↵
- Replacement of multiple spaces with a single space ␣
- Reduction of runs of three or more identical characters to two (e.g., "кееррреемееетт" to "кеерреемеетт") 🔂
- Removal of duplicate entries (i.e., reviews sharing identical text and scores) 👯♂️
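A minimal sketch of these pre-processing steps; the emoji ranges and punctuation set here are approximations, the exact rules used to build KazSAnDRA may differ, and duplicate removal happens at the dataset level rather than per review:

```python
import re
import string

def preprocess(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", text)  # strip (most) emojis
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"[\n\t\r]", " ", text)                             # newlines, tabs, carriage returns
    text = re.sub(r" {2,}", " ", text).strip()                        # collapse multiple spaces
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                        # cap character runs at two
    return text

print(preprocess("Кееррреемееетт!!!"))  # -> кеерреемеетт
```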
To ensure consistency and reproducibility of our experimental results across research groups, we partitioned KazSAnDRA into training (train), validation (valid), and test (test) sets, following an 80/10/10 ratio.
Task | Train, # | Train, % | Valid, # | Valid, % | Test, # | Test, % | Total, # | Total, %
---|---|---|---|---|---|---|---|---
PC | 134,368 | 80 | 16,796 | 10 | 16,797 | 10 | 167,961 | 100 |
SC | 140,126 | 80 | 17,516 | 10 | 17,516 | 10 | 175,158 | 100 |
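For illustration, an 80/10/10 partition can be produced with two successive scikit-learn splits; the actual partitioning of KazSAnDRA may have used different tooling and seeds:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"text": [f"review {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})
train, rest = train_test_split(df, test_size=0.2, random_state=42)    # 80% train
valid, test = train_test_split(rest, test_size=0.5, random_state=42)  # 10% valid, 10% test
```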
The distribution of reviews across the three sets based on their domains and scores for the PC task:
Domain | Train, # | Train, % | Valid, # | Valid, % | Test, # | Test, %
---|---|---|---|---|---|---
Appstore | 101,477 | 75.52 | 12,685 | 75.52 | 12,685 | 75.52 |
Market | 22,561 | 16.79 | 2,820 | 16.79 | 2,820 | 16.79 |
Mapping | 6,509 | 4.84 | 813 | 4.84 | 814 | 4.85 |
Bookstore | 3,821 | 2.84 | 478 | 2.85 | 478 | 2.85 |
Total | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |
Score | Train, # | Train, % | Valid, # | Valid, % | Test, # | Test, %
---|---|---|---|---|---|---
1 | 110,417 | 82.18 | 13,801 | 82.17 | 13,804 | 82.18 |
0 | 23,951 | 17.82 | 2,995 | 17.83 | 2,993 | 17.82 |
Total | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |
The distribution of reviews across the three sets based on their domains and scores for the SC task:
Domain | Train, # | Train, % | Valid, # | Valid, % | Test, # | Test, %
---|---|---|---|---|---|---
Appstore | 106,058 | 75.69 | 13,258 | 75.69 | 13,257 | 75.69 |
Market | 23,278 | 16.61 | 2,909 | 16.61 | 2,910 | 16.61 |
Mapping | 6,794 | 4.85 | 849 | 4.85 | 849 | 4.85 |
Bookstore | 3,996 | 2.85 | 500 | 2.85 | 500 | 2.85 |
Total | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |
Score | Train, # | Train, % | Valid, # | Valid, % | Test, # | Test, %
---|---|---|---|---|---|---
5 | 101,302 | 72.29 | 12,663 | 72.29 | 12,663 | 72.29 |
1 | 20,031 | 14.29 | 2,504 | 14.30 | 2,504 | 14.30 |
4 | 9,115 | 6.50 | 1,140 | 6.51 | 1,139 | 6.50 |
3 | 5,758 | 4.11 | 719 | 4.10 | 720 | 4.11 |
2 | 3,920 | 2.80 | 490 | 2.80 | 490 | 2.80 |
Total | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |
To address the class imbalance in our training data, we employed random oversampling (ROS) and random undersampling (RUS). ROS balances the class distribution by creating new samples for the minority classes until they match the count of the majority class, whereas RUS removes samples from the majority classes until they match the count of the minority class.
The balanced training sets for the PC task:
Score | Balanced (ROS) | Balanced (RUS) | Imbalanced
---|---|---|---
0 | 110,417 | 23,951 | 23,951 |
1 | 110,417 | 23,951 | 110,417 |
The balanced training sets for the SC task:
Score | Balanced (ROS) | Balanced (RUS) | Imbalanced
---|---|---|---
1 | 101,302 | 3,920 | 20,031 |
2 | 101,302 | 3,920 | 3,920 |
3 | 101,302 | 3,920 | 5,758 |
4 | 101,302 | 3,920 | 9,115 |
5 | 101,302 | 3,920 | 101,302 |
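A sketch of ROS and RUS using pandas resampling; the exact implementation used for the paper is not shown here, so treat this as illustrative:

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Duplicate minority-class samples until every class matches the majority count."""
    n_max = df[label_col].value_counts().max()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_max, replace=True, random_state=42)))

def random_undersample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Drop majority-class samples until every class matches the minority count."""
    n_min = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=42)))
```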
The dataset folder contains ten ZIP files, each containing a CSV file. Files "01" to "05" are associated with PC (polarity classification), while files "06" to "10" are related to SC (score classification). To align with the enumeration used for labelling in the classifier, which starts from 0 rather than 1, labels 1-5 in the SC task were transformed into 0-4. Different training set variations are indicated by the suffixes "ib" for imbalanced data, "ros" for random oversampling, and "rus" for random undersampling. Each file includes records containing a custom review identifier (custom_id), the original review text (text), the pre-processed review text (text_cleaned), the corresponding review score (label), and the domain information (domain).
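Loading one of the extracted CSV files with pandas; the file name below is hypothetical, but the column names follow the description above:

```python
import pandas as pd

df = pd.read_csv("01_pc_train_ib.csv")  # hypothetical file name
print(df[["custom_id", "text", "text_cleaned", "label", "domain"]].head())
```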
For the evaluation of KazSAnDRA, we utilised four multilingual machine learning models, all covering the Kazakh language and accessible through the Hugging Face Transformers framework: mBERT, XLM-R, RemBERT, and mBART-50.
The models were fine-tuned using both the balanced and imbalanced training sets, while the hyperparameters were tuned on the validation set. The best-performing models were then evaluated on the test sets. Fine-tuning was executed on a single A100 GPU hosted on an NVIDIA DGX A100 machine. The initial learning rate was set to 10⁻⁵ and the weight decay rate to 10⁻³. Early stopping was employed, triggered when the F1-score showed no improvement for three consecutive epochs. We set the batch size to 32 (mBERT, XLM-R, RemBERT) or 16 (mBART-50) and applied 800 warm-up steps.
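A sketch of this setup with the Hugging Face `Trainer`, wiring in the hyperparameters reported above; the tokenised datasets, the `compute_metrics` function, and the checkpoint choice are assumptions, and the repository's `finetune_evaluate.py` is the authoritative version:

```python
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # mBERT for PC; use num_labels=5 for SC

args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,               # initial learning rate
    weight_decay=1e-3,                # weight decay rate
    warmup_steps=800,                 # warm-up steps
    per_device_train_batch_size=32,   # 16 for mBART-50
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # assumed: tokenised training set
    eval_dataset=valid_dataset,       # assumed: tokenised validation set
    compute_metrics=compute_metrics,  # assumed: returns {"f1": ...} (macro-averaged)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 stale epochs
)
trainer.train()
```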
Model | PC (ROS) | PC (RUS) | PC (IB) | SC (ROS) | SC (RUS) | SC (IB)
---|---|---|---|---|---|---
mBERT | 4 | 7 | 6 | 8 | 10 | 11 |
XLM-R | 5 | 7 | 5 | 4 | 9 | 16 |
RemBERT | 4 | 5 | 5 | 6 | 6 | 9 |
mBART-50 | 5 | 7 | 5 | 8 | 7 | 5 |
Number of training epochs for each model
Several conventional metrics were used to evaluate model performance: accuracy (A), precision (P), recall (R), and F1-score (F1). Given the imbalanced nature of the dataset, where all classes carry equal importance, we opted for macro-averaging: the reported F1-score is the arithmetic (i.e., unweighted) mean of the per-class F1-scores. This ensures equal treatment of all classes during evaluation and penalises the model more strongly for poor performance on minority classes.
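The macro-averaged metrics can be reproduced with scikit-learn; the labels below are placeholders:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0]  # placeholder test labels
y_pred = [0, 1, 1, 1, 0, 0]  # placeholder predictions
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"A={accuracy_score(y_true, y_pred):.2f} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```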
- Download this repository and install the required packages:
```bash
git clone https://github.com/IS2AI/KazSAnDRA.git
cd KazSAnDRA/scripts
pip install -r requirements.txt
```
- To fine-tune and evaluate a model, select the necessary arguments in `finetune_evaluate.py` and run:
```bash
python finetune_evaluate.py
```
- To classify a review, select the necessary arguments and add a review in `predict.py` and run:
```bash
python predict.py
```
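Alternatively, a fine-tuned checkpoint can be used directly with the Transformers `pipeline` API; the model path below is a placeholder for a checkpoint you have trained or downloaded:

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="path/to/finetuned-model")  # placeholder checkpoint path
print(classifier("керемет кітап"))  # e.g. [{'label': 'LABEL_1', 'score': 0.99}]
```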
Model | A (ROS) | P (ROS) | R (ROS) | F1 (ROS) | A (RUS) | P (RUS) | R (RUS) | F1 (RUS) | A (IB) | P (IB) | R (IB) | F1 (IB)
---|---|---|---|---|---|---|---|---|---|---|---|---
mBERT | 0.84 | 0.74 | 0.83 | 0.77 | 0.85 | 0.76 | 0.82 | 0.78 | 0.89 | 0.82 | 0.79 | 0.80 |
XLM-R | 0.86 | 0.76 | 0.83 | 0.79 | 0.85 | 0.75 | 0.83 | 0.78 | 0.89 | 0.81 | 0.81 | 0.81 |
RemBERT | 0.88 | 0.79 | 0.82 | 0.81 | 0.87 | 0.78 | 0.82 | 0.80 | 0.89 | 0.81 | 0.82 | 0.81 |
mBART50 | 0.87 | 0.77 | 0.79 | 0.78 | 0.81 | 0.72 | 0.81 | 0.74 | 0.89 | 0.82 | 0.78 | 0.80 |
PC results on the test sets
Model | A (ROS) | P (ROS) | R (ROS) | F1 (ROS) | A (RUS) | P (RUS) | R (RUS) | F1 (RUS) | A (IB) | P (IB) | R (IB) | F1 (IB)
---|---|---|---|---|---|---|---|---|---|---|---|---
mBERT | 0.67 | 0.34 | 0.36 | 0.35 | 0.63 | 0.35 | 0.39 | 0.36 | 0.77 | 0.44 | 0.36 | 0.37 |
XLM-R | 0.58 | 0.36 | 0.42 | 0.36 | 0.66 | 0.36 | 0.41 | 0.37 | 0.77 | 0.42 | 0.37 | 0.39 |
RemBERT | 0.73 | 0.37 | 0.36 | 0.36 | 0.62 | 0.35 | 0.40 | 0.35 | 0.76 | 0.41 | 0.38 | 0.39 |
mBART50 | 0.74 | 0.36 | 0.34 | 0.35 | 0.55 | 0.36 | 0.41 | 0.34 | 0.77 | 0.42 | 0.37 | 0.38 |
SC results on the test sets
actual ↓ / predicted → | 0 | 1 | Total
---|---|---|---
0 | 2,155 | 838 | 2,993 |
1 | 1,036 | 12,768 | 13,804 |
RemBERT PC results
actual ↓ / predicted → | 1 | 2 | 3 | 4 | 5 | Total
---|---|---|---|---|---|---
1 | 1,379 | 145 | 132 | 64 | 784 | 2,504 |
2 | 182 | 55 | 56 | 25 | 172 | 490 |
3 | 173 | 54 | 118 | 65 | 310 | 720 |
4 | 110 | 39 | 90 | 169 | 731 | 1,139 |
5 | 564 | 59 | 165 | 297 | 11,578 | 12,663 |
RemBERT SC results
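Confusion matrices like the two above can be computed with scikit-learn; the labels below are placeholders for the test labels and model predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 2, 3, 4, 5, 5]  # placeholder test scores
y_pred = [1, 1, 3, 5, 5, 5]  # placeholder predictions
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5]))
```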
Domain | A | P | R | F1
---|---|---|---|---
Appstore | 0.87 | 0.80 | 0.81 | 0.80 |
Bookstore | 0.86 | 0.75 | 0.80 | 0.77 |
Mapping | 0.92 | 0.84 | 0.88 | 0.86 |
Market | 0.97 | 0.84 | 0.91 | 0.87 |
RemBERT PC results by domain
Domain | A | P | R | F1
---|---|---|---|---
Appstore | 0.74 | 0.41 | 0.37 | 0.38 |
Bookstore | 0.73 | 0.34 | 0.32 | 0.32 |
Mapping | 0.80 | 0.42 | 0.41 | 0.41 |
Market | 0.82 | 0.43 | 0.41 | 0.42 |
RemBERT SC results by domain
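Per-domain scores like those above can be obtained by grouping test predictions by domain; the column names and values here are assumptions for illustration:

```python
import pandas as pd
from sklearn.metrics import f1_score

results = pd.DataFrame({                       # assumed layout of test predictions
    "domain": ["Appstore", "Appstore", "Mapping", "Mapping"],
    "y_true": [1, 0, 1, 1],
    "y_pred": [1, 0, 1, 0],
})
for domain, group in results.groupby("domain"):
    print(domain, f1_score(group["y_true"], group["y_pred"], average="macro"))
```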
We sincerely thank Alma Murzagulova, Aizhan Seipanova, Meiramgul Akanova, Almas Aitzhan, Aigerim Boranbayeva, and Assel Kospabayeva, who acted as moderators during the review collection process. Their tireless efforts, diligence, and remarkable patience contributed significantly to the successful completion of this endeavour.
If you use our dataset and/or models in your work, please cite our paper. Proper citation upholds academic integrity and ensures the authors' efforts are acknowledged:
```bibtex
@misc{yeshpanov2024kazsandra,
      title={KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes},
      author={Rustem Yeshpanov and Huseyin Atakan Varol},
      year={2024},
      eprint={2403.19335},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```