DSCI-601

Discriminating between similar languages

This repository contains several experiments carried out on the DSLCC v4.0 dataset for discriminating between similar languages and language variants.

Below is a list of the items present in each folder:

Data

Contains all the necessary datasets required for the experiments

601 project experiments - 1

1. Codes - Contains the source code for the classification models and the associated test scripts

SVC - The file test_SVC.py executes the LinearSVC model in project_SVC.py
SGDC - The file test_SGDC.py executes the Stochastic Gradient Descent model in project_SGDC.py
MultinomialNB - The file test_NB.py executes the Multinomial Naive Bayes model in project_NB.py
XGBoost - The file test_XGBoost.py executes the Extreme Gradient Boosting (XGBoost) model in project_XGBoost.py
KNN - The file test_KNN.py executes the K-nearest neighbours model in project_KNN.py
LogisticRegression - The file test_LR.py executes the Logistic Regression model in project_LR.py
Shallow NeuralNetwork - The file shallow_NN_1.py is an implementation of a shallow neural network

Final code: run shallow_nn_1.py. Test accuracy achieved: ~96%.

2. DSLCC4 datasets

Contains the datasets used to train, validate and test the models

3. Evaluation

Contains my_evaluation.py, which evaluates the model predictions using a confusion matrix and the metrics derived from it: F1, accuracy, precision and recall
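To illustrate how these metrics follow from a confusion matrix, here is a minimal standard-library sketch. It is not the code in my_evaluation.py; the function names (`confusion_counts`, `per_class_scores`, `accuracy`) and the toy labels are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count (gold_label, predicted_label) pairs, i.e. the confusion-matrix cells."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def per_class_scores(cm: Counter) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} computed from the confusion counts."""
    labels = sorted({g for g, _ in cm} | {p for _, p in cm})
    scores = {}
    for label in labels:
        tp = cm[(label, label)]  # predicted `label` and gold is `label`
        fp = sum(n for (g, p), n in cm.items() if p == label and g != label)
        fn = sum(n for (g, p), n in cm.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(cm: Counter) -> float:
    """Fraction of examples on the diagonal of the confusion matrix."""
    total = sum(cm.values())
    correct = sum(n for (g, p), n in cm.items() if g == p)
    return correct / total if total else 0.0


# Toy labels, made up for illustration (not taken from the corpus).
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
cm = confusion_counts(gold, pred)
scores = per_class_scores(cm)
acc = accuracy(cm)
```

The sketch reports per-class scores only; how they are averaged across languages (macro, micro or weighted) is left open here because the README does not say which scheme the evaluation script uses.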

doc - 1

Contains the code documentation generated by Doxygen for the experiments

601 project experiments - 2

Codes - Contains the model classification and pre-processing source code as a .ipynb notebook.

doc - 2

Contains the code documentation generated by Doxygen for the experiments

Information on the Dataset:

DSLCC 4.0 Corpus

This is the training and test data for the Discriminating between Similar Languages (DSL) task at VarDial 2017.

The package contains the following files:

DSL-TRAIN.txt - Training set for the DSL task
DSL-DEV.txt - Development set for the DSL task
DSL-TEST-UNLABELLED.txt - Unlabelled test set
DSL-TEST-GOLD.txt - Test set with gold labels
README.txt - Brief description of the DSL data

Each line in the .txt files is tab-delimited in the format: sentence<tab>language-label
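As a rough illustration of reading this format, the sketch below splits each line into a (sentence, label) pair using only the standard library. The function name `parse_dsl_lines` and the sample lines are hypothetical, and the code assumes a labelled file; lines from an unlabelled file would have no tab and would raise an error here.

```python
from typing import Iterable, List, Tuple


def parse_dsl_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'sentence<TAB>label' line into a (sentence, label) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so a stray tab inside the sentence is preserved.
        sentence, sep, label = line.rpartition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        pairs.append((sentence, label))
    return pairs


# Made-up sample lines; the label strings are placeholders, not corpus labels.
sample = ["This is a sentence.\tlabel-a\n", "Another sentence.\tlabel-b\n"]
pairs = parse_dsl_lines(sample)
```

To read a real file, the same function can be given an open file handle, for example `parse_dsl_lines(open("DSL-TRAIN.txt", encoding="utf-8"))`; the UTF-8 encoding is an assumption.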

For more details (like data stats) you can refer to the VarDial 2017 task paper:

Marcos Zampieri, Shervin Malmasi, Nikola Ljubesic, Preslav Nakov, Ahmed Ali, Jorg Tiedemann, Yves Scherrer, and Noemi Aepli. 2017. "Findings of the VarDial Evaluation Campaign 2017." In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain.
