nairnish/DSCI-601

DSCI-601

Discriminating between similar languages

This repository contains several experiments carried out on the DSLCC v4.0 dataset for discriminating between similar language variants.

Below is a list of the items present in each folder:

Data

Contains all the necessary datasets required for the experiments

601 project experiments - 1

1. Codes - Contains the source code for the classification models and the associated test scripts
  1. SVC - The file test_SVC.py executes the LinearSVC model in project_SVC.py
  2. SGDC - The file test_SGDC.py executes the Stochastic Gradient Descent model in project_SGDC.py
  3. MultinomialNB - The file test_NB.py executes the Multinomial Naive Bayes model in project_NB.py
  4. XGBoost - The file test_XGBoost.py executes the eXtreme Gradient Boosting model in project_XGBoost.py
  5. KNN - The file test_KNN.py executes the K-nearest neighbours model in project_KNN.py
  6. LogisticRegression - The file test_LR.py executes the Logistic Regression model in project_LR.py
  7. Shallow NeuralNetwork - The file shallow_NN_1.py is an implementation of a shallow neural network

Final code: run shallow_nn_1.py. Test accuracy achieved: ~96%.
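To give a feel for what one of the classical baselines listed above involves, here is a minimal sketch of a Multinomial Naive Bayes classifier over character n-grams, written with the standard library only. It is not the code in project_NB.py: the class name `NgramNaiveBayes`, the helper `char_ngrams`, the trigram features, and the add-one smoothing are all assumptions made for illustration, since this README does not state which features or libraries the experiments use.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical Multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=3):
        self.n = n

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)          # sentences per label
        self.feature_counts = defaultdict(Counter)   # n-gram counts per label
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict_one(self, sentence):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size  # add-one smoothing
            score = math.log(doc_count / total_docs)   # log prior
            for gram in char_ngrams(sentence, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, sentences):
        return [self.predict_one(s) for s in sentences]
```

Calling `NgramNaiveBayes().fit(train_sentences, train_labels).predict(test_sentences)` returns one predicted label per test sentence. Character n-grams are a common choice for separating closely related languages because they capture spelling and morphology differences, but the experiments in this repository may rely on different features.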

2. DSLCC4 datasets

Contains the required datasets for the model to train, validate and test

3. Evaluation

Contains the code my_evaluation.py for evaluating the results of the model predictions using metrics derived from the confusion matrix: F1, accuracy, precision and recall
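As an illustration of how these metrics follow from a confusion matrix, the sketch below computes accuracy and per-label precision, recall and F1 from two lists of labels. It is not the contents of my_evaluation.py; the function names `confusion_counts` and `evaluate` and the shape of the returned dictionary are assumptions chosen for this example.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold_label, predicted_label) pairs: a sparse confusion matrix."""
    return Counter(zip(gold, predicted))


def evaluate(gold, predicted):
    """Return accuracy plus per-label precision, recall and F1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    matrix = confusion_counts(gold, predicted)
    labels = sorted(set(gold) | set(predicted))
    correct = sum(matrix[(label, label)] for label in labels)
    report = {"accuracy": correct / len(gold), "per_label": {}}
    for label in labels:
        tp = matrix[(label, label)]
        # Predicted as `label` although the gold label was something else.
        fp = sum(matrix[(other, label)] for other in labels if other != label)
        # Gold label was `label` but something else was predicted.
        fn = sum(matrix[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report["per_label"][label] = {
            "precision": precision, "recall": recall, "f1": f1,
        }
    return report
```

For a multi-class task such as this one, the per-label scores are often averaged (macro or weighted) into a single figure; which averaging my_evaluation.py uses is not stated here.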

doc - 1

Contains the code documentation generated by doxygen for the experiments

601 project experiments - 2

  1. Codes - Contains the model classification as well as the pre-processing source code as a .ipynb file.

doc - 2

Contains the code documentation generated by doxygen for the experiments

Information on the Dataset:

DSLCC 4.0 Corpus

This is the training and test data for the Discriminating between Similar Languages (DSL) task at VarDial 2017.

The package contains the following files:

  1. DSL-TRAIN.txt - Training set for the DSL task
  2. DSL-DEV.txt - Development set for the DSL task
  3. DSL-TEST-UNLABELLED.txt - Unlabelled test set
  4. DSL-TEST-GOLD.txt - Test set with gold labels
  5. README.txt - Brief description of the DSL data

Each line in the .txt files is tab-delimited in the format: sentence<tab>language-label
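A minimal sketch of reading such a file is shown below, assuming each line holds a sentence and a language label separated by a tab character and that the files are UTF-8 encoded (the encoding is an assumption). The function names `parse_dsl_lines` and `load_dsl_file` are hypothetical and do not come from this repository.

```python
def parse_dsl_lines(lines):
    """Split 'sentence<TAB>label' lines into parallel lists."""
    sentences, labels = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the last tab so a stray tab inside the sentence is kept.
        sentence, tab, label = line.rpartition("\t")
        if not tab:
            raise ValueError(f"line {number} has no tab separator")
        sentences.append(sentence)
        labels.append(label.strip())
    return sentences, labels


def load_dsl_file(path):
    """Read a labelled DSL file; UTF-8 encoding is assumed here."""
    with open(path, encoding="utf-8") as handle:
        return parse_dsl_lines(handle)
```

For example, `load_dsl_file("DSL-TRAIN.txt")` would return the training sentences and their labels as two lists of equal length. An unlabelled test file has no label column, so it would need to be read line by line instead.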

For more details (like data stats) you can refer to the VarDial 2017 task paper:

Marcos Zampieri, Shervin Malmasi, Nikola Ljubesic, Preslav Nakov, Ahmed Ali, Jorg Tiedemann, Yves Scherrer, and Noemi Aepli. 2017. "Findings of the VarDial Evaluation Campaign 2017." In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain.
