add README
Lukas Garbas committed Dec 19, 2019
0 parents commit ca574f9
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions README.md
@@ -0,0 +1,53 @@
# Emotion Classification in Short Messages

A multi-class sentiment analysis problem: classify text into five emotion categories (joy, sadness, anger, fear, neutral). This is a fun weekend project that walks through different text classification techniques, including dataset preparation, traditional machine learning with scikit-learn, LSTM neural networks, and transfer learning using BERT (TensorFlow's Keras).

# Datasets

## Datasets overview

**Summary Table**

| Dataset | Year | Content | Size | Emotion categories | Balanced |
| :--------------: | :--: | :-------: | ------------ | ------------------ | :-------: |
|dailydialog| 2017 | dialogues |102k sentences|neutral, joy, surprise, sadness, anger, disgust, fear| No |
|emotion-stimulus|2015|dialogues|2.5k sentences|sadness, joy, anger, fear, surprise, disgust| No |
|isear|1990|emotional situations|7.5k sentences|joy, fear, anger, sadness, disgust, shame, guilt| Yes |

links: [dailydialog](http://yanran.li/dailydialog.html), [emotion-stimulus](http://www.site.uottawa.ca/~diana/resources/emotion_stimulus_data), [isear](http://www.affective-sciences.org/index.php/download_file/view/395/296/)


## Combined dataset

The dataset combines dailydialog, isear, and emotion-stimulus into a balanced dataset with the labels joy, sadness, anger, fear, disgust, surprise, and neutral. The texts consist mainly of short messages and dialog utterances.
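
A minimal sketch of how such a combination might be done, assuming each source is already loaded as a list of `(text, label)` pairs. The label map, the helper name `combine_and_balance`, the toy sentences, and balancing by downsampling to the smallest class are all illustrative assumptions; the README does not say how the balancing was actually performed.

```python
import random
from collections import defaultdict

# Hypothetical mapping from source-specific label names to shared names.
LABEL_MAP = {
    "happiness": "joy",
    "sad": "sadness",
    "no emotion": "neutral",
}


def combine_and_balance(sources, keep_labels, seed=0):
    """Merge (text, label) pairs from several sources and downsample
    every kept label to the size of the smallest one."""
    by_label = defaultdict(list)
    for pairs in sources:
        for text, label in pairs:
            label = LABEL_MAP.get(label, label)  # normalize label names
            if label in keep_labels:             # drop labels outside the target set
                by_label[label].append(text)

    if not by_label:
        return []

    rng = random.Random(seed)
    target = min(len(texts) for texts in by_label.values())
    combined = []
    for label in sorted(by_label):
        for text in rng.sample(by_label[label], target):
            combined.append((text, label))
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the real datasets.
dailydialog = [
    ("What a great day!", "happiness"),
    ("Pass the salt.", "no emotion"),
    ("See you at five.", "no emotion"),
]
isear = [
    ("I lost my keys before the exam.", "fear"),
    ("I felt ashamed.", "shame"),
]
data = combine_and_balance([dailydialog, isear], keep_labels={"joy", "fear", "neutral"})
```

Downsampling keeps the sketch simple but discards data; oversampling or class weights would be alternative ways to handle imbalance.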

# Experiments

### Traditional Machine Learning
* Data preprocessing: noise and punctuation removal, tokenization, stemming
* Text Representation: TF-IDF
* Classifiers: Naive Bayes, Random Forest, Logistic Regression, SVM
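
The preprocessing and TF-IDF steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the regular expressions, the function names `tokenize` and `tfidf`, and the choice of smoothed IDF with L2 normalization (the weighting scikit-learn's `TfidfVectorizer` documents as its default) are not taken from the project itself. Stemming is left out because it would need an external library such as NLTK.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, strip URLs and punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # noise: URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation and symbols
    return text.split()


def tfidf(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in row])
    return vocab, vectors


vocab, vectors = tfidf(["I am so happy today!", "I am scared of the dark..."])
```

Terms that occur in every document (here `i` and `am`) receive a lower weight than terms unique to one document, which is what makes the representation useful for the classifiers listed above.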

| Approach | F1-Score |
| :------------------ | :------: |
| Naive Bayes | 0.6702 |
| Random Forest | 0.6372 |
| Logistic Regression | 0.6935 |
| SVM | 0.7271 |
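
The README does not state which averaging the F1-Score column uses. As a hedged illustration, the sketch below computes a macro-averaged F1 (the unweighted mean of per-class F1); the function name `macro_f1` and the toy labels are assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "anger"]
score = macro_f1(y_true, y_pred)
```

If the reported numbers were instead micro- or weighted-averaged, the per-class scores would be combined differently, so this should be read as one possible definition rather than the one used in the table.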

### Neural Networks
* Data preprocessing: noise and punctuation removal, tokenization
* Word Embeddings: pretrained 300-dimensional word vectors (word2vec text format, from fastText; [link](https://fasttext.cc/docs/en/english-vectors.html))
* Deep Network: LSTM and biLSTM
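
A rough sketch of how pretrained vectors might be turned into an embedding matrix, assuming the file uses the word2vec text format (a `count dim` header line, then one `word v1 v2 ...` entry per line). The helper names `load_vectors` and `build_embedding_matrix`, the `word_index` dictionary, and the convention that row 0 is reserved for padding are assumptions for illustration, not details confirmed by this README.

```python
import io


def load_vectors(lines, wanted):
    """Parse word2vec text format and keep only the words in `wanted`."""
    lines = iter(lines)
    _, dim = map(int, next(lines).split())   # header: vocabulary size, dimension
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in wanted and len(values) == dim:
            vectors[word] = [float(v) for v in values]
    return dim, vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; row 0 (padding)
    and out-of-vocabulary words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Tiny in-memory stand-in for a real vector file (3 dimensions instead of 300).
fake_file = io.StringIO("2 3\nhappy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
word_index = {"happy": 1, "angry": 2}   # hypothetical tokenizer output
dim, vectors = load_vectors(fake_file, wanted=set(word_index))
embedding_matrix = build_embedding_matrix(word_index, vectors, dim)
```

A matrix like this could then be used to initialize the embedding layer in front of the LSTM or biLSTM; filtering by `wanted` keeps memory use small, because only words that occur in the training data are stored.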

| Approach | F1-Score |
| :------------------ | :------: |
| LSTM + w2v_wiki | 0.7395 |
| biLSTM + w2v_wiki | 0.7414 |

### Transfer learning with BERT
Fine-tuning BERT for text classification

| Approach | F1-Score |
| :------------------ | :------: |
| fine-tuned BERT | 0.8320 |
