This repository contains several experiments carried out on the DSLCC v4.0 dataset for discriminating between similar language variants.
Below is the list of items present in each folder:
Contains all the necessary datasets required for the experiments
- SVC - The file test_SVC.py executes the LinearSVC model in project_SVC.py
- SGDC - The file test_SGDC.py executes the Stochastic Gradient Descent model in project_SGDC.py
- MultinomialNB - The file test_NB.py executes the Multinomial Naive Bayes model in project_NB.py
- XGBoost - The file test_XGBoost.py executes the eXtreme Gradient Boosting model in project_XGBoost.py
- KNN - The file test_KNN.py executes the K-nearest neighbours model in project_KNN.py
- LogisticRegression - The file test_LR.py executes the Logistic Regression model in project_LR.py
- Shallow NeuralNetwork - The file shallow_NN_1.py is an implementation of a shallow neural network
Final code: run shallow_nn_1.py. Test accuracy achieved: ~96%.
Contains the required datasets for the model to train, validate and test
Contains my_evaluation.py, which evaluates the results of model predictions using confusion-matrix metrics: F1, accuracy, precision and recall
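
The internals of my_evaluation.py are not shown here, so the following is only a minimal sketch of how accuracy, precision, recall and F1 can be derived from confusion-matrix counts. The function name `evaluate`, the macro-averaged F1, and the toy labels are assumptions made for illustration, not the repository's actual code.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, per_label, macro_f1) for two equal-length label lists.

    per_label maps each label to a (precision, recall, f1) tuple.
    Hypothetical helper: not taken from my_evaluation.py.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    # Confusion matrix stored sparsely: (gold_label, predicted_label) -> count.
    cm = Counter(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))

    accuracy = sum(cm[(lab, lab)] for lab in labels) / len(gold)

    per_label = {}
    for lab in labels:
        tp = cm[(lab, lab)]
        fp = sum(cm[(g, lab)] for g in labels if g != lab)  # predicted lab, was not
        fn = sum(cm[(lab, p)] for p in labels if p != lab)  # was lab, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label[lab] = (precision, recall, f1)

    # Macro average: every label counts equally (an assumption of this sketch).
    macro_f1 = sum(f1 for _, _, f1 in per_label.values()) / len(labels)
    return accuracy, per_label, macro_f1


# Toy data, made up for this example.
gold = ["hr", "hr", "sr", "sr", "bs", "bs"]
predicted = ["hr", "sr", "sr", "sr", "bs", "hr"]
accuracy, per_label, macro_f1 = evaluate(gold, predicted)
```

On this toy input, four of the six predictions are correct, and the per-label scores differ because "sr" is over-predicted while "bs" is under-predicted.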
Contains the code documentation generated by Doxygen for the experiments
- Codes - Contains the classification model and the pre-processing source code as a .ipynb file.
Contains the code documentation generated by Doxygen for the experiments
This is the training and test data for the Discriminating between Similar Languages (DSL) task at VarDial 2017.
- DSL-TRAIN.txt - Training set for the DSL task
- DSL-DEV.txt - Development set for the DSL task
- DSL-TEST-UNLABELLED.txt - Unlabelled test set
- DSL-TEST-GOLD.txt - Test set with gold labels
- README.txt - Brief description of the DSL data
Each line in the .txt files is tab-delimited in the format: sentence[TAB]language-label
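
As a rough illustration of the format above, the labelled files could be read along these lines. The function name `read_dsl` and the sample sentences are hypothetical, and the sketch assumes UTF-8 text with the label after the last tab on each line; it does not apply to the unlabelled test set, which has no label column.

```python
import io


def read_dsl(stream):
    """Yield (sentence, label) pairs from a tab-delimited DSL text stream.

    Hypothetical helper: the name and behaviour are assumptions of this sketch.
    """
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the LAST tab so any tab inside the sentence is preserved.
        sentence, sep, label = line.rpartition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: expected a tab separator")
        yield sentence, label


# Made-up sample lines standing in for a real file such as DSL-TRAIN.txt.
sample = io.StringIO("Ovo je primjer rečenice.\thr\nEste é um exemplo.\tpt-BR\n")
pairs = list(read_dsl(sample))

# With a real file (path assumed to be in the working directory):
# with open("DSL-TRAIN.txt", encoding="utf-8") as f:
#     pairs = list(read_dsl(f))
```

Splitting on the last tab is a defensive choice: it keeps the parse correct even if a sentence happens to contain a tab character itself.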
For more details (like data stats) you can refer to the VarDial 2017 task paper:
Marcos Zampieri, Shervin Malmasi, Nikola Ljubesic, Preslav Nakov, Ahmed Ali, Jorg Tiedemann, Yves Scherrer, and Noemi Aepli. 2017. "Findings of the VarDial Evaluation Campaign 2017." In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain.