- Why this name? Kocasm is a blend word: Korean + sarcasm.
Because sarcasm inverts or distorts the literal meaning of a sentence, it is closely related to sentiment classification.
- HTML data gathered from Twitter.
- Each example is labeled 1 or 0.
- Label 1: sarcasm; label 0: randomly gathered tweets.
- Korean data: tweets returned by queries for hashtags such as 역설, 아무말, 운수좋은날, 笑, 뭐래 아닙니다, 그럴리없다, 어그로, irony sarcastic, sarcasm were labeled as True (so the data still contains a lot of noise).
- The dataset was pre-processed by (1) anonymizing users, (2) removing hashtags, and (3) removing URLs.
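The labeling rule and the three pre-processing steps above can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: the regular expressions, the `<user>` placeholder, the function names, and the subset of hashtags are all assumptions, and the code assumes Python 3, where `\w` matches Korean characters by default.

```python
import re

# Illustrative subset of the query hashtags listed above (assumption).
SARCASM_TAGS = {"역설", "아무말", "운수좋은날", "어그로", "sarcasm"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
USER_TOKEN = "<user>"  # hypothetical placeholder for anonymized users


def has_sarcasm_hashtag(tweet):
    """Return True if the raw tweet carries one of the query hashtags."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(tweet)}
    return bool(tags & SARCASM_TAGS)


def preprocess(tweet):
    """Apply the three cleaning steps to one raw tweet."""
    text = URL_RE.sub("", tweet)              # (3) remove URLs first; they may contain '@' or '#'
    text = MENTION_RE.sub(USER_TOKEN, text)   # (1) anonymize user mentions
    text = HASHTAG_RE.sub("", text)           # (2) remove hashtags
    return " ".join(text.split())             # collapse leftover whitespace


raw = "@friend 오늘도 야근이라니 정말 행복하다 #역설 https://t.co/abc"
label = int(has_sarcasm_hashtag(raw))  # decide the label before the hashtag is stripped
clean = preprocess(raw)
```

The label has to be derived from the raw text, because step (2) removes the very hashtag that marks a tweet as sarcastic.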
If you want to compare with other datasets, refer to: [English]
- ghosh: This English dataset was collected by Aniruddha Ghosh and Tony Veale. See their repository and paper, "Fracking Sarcasm using Neural Network".
- bag_of_words.py: Basic Bayesian model
- dl_models.py: Model classes for a general transformer
- tf_attention_models.py: TensorFlow attentive RNN model
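For readers unfamiliar with the bag-of-words Bayesian baseline, the idea can be sketched as a multinomial naive Bayes classifier with add-one smoothing. This is not the code in bag_of_words.py: the class name, the whitespace tokenization, and the toy English sentences are assumptions made for illustration, and the code assumes Python 3 division.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.split():
                if word in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data, invented for this sketch (1 = sarcasm, 0 = not sarcasm).
train_texts = [
    "oh great another meeting",
    "wow great job genius",
    "lunch was good today",
    "the meeting starts at noon",
]
train_labels = [1, 1, 0, 0]
model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict(text)` returns the label with the highest smoothed log-probability; word order is ignored, which is the main limitation the attention models address.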
I was strongly inspired by MirunaPislar's code and referred to it a lot, but I tried to make my code more Pythonic and PyTorch-style. The code is still being modified.
Kocasm is compatible with Python 2.7-3.7.
export DATA_DIR=/path/to/data
export PREP_DIR=/path/to/preprocess
export SAVE_DIR=/path/to/save
python tf_attention_models.py \
--mode train \
--model_cfg config/attention_base.json \
--data_file $DATA_DIR/jiwon/train.csv \
--test_file $DATA_DIR/jiwon/test.csv \
--pretrain_file $BERT_PRETRAIN \
--vocab $PREP_DIR/vocab.txt \
--save_dir $SAVE_DIR \
--max_len 128
If you find this dataset useful, please cite it as:
@misc{kim2019kocasm,
author = {Kim, Jiwon and Cho, Won Ik},
title = {Kocasm: Korean Automatic Sarcasm Detection},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SpellOnYou/korean-sarcasm}}
}
- Universal irony detection model with Czech
- Chinese and attentive-RNN
- Focus on meaning conflict with hashtags
Implementation as proposed by Yang et al. in "Hierarchical Attention Networks for Document Classification" (2016)
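The attention layer in that paper scores each hidden state h_t by projecting it to u_t = tanh(W h_t + b), comparing u_t with a context vector u_w, normalizing the scores with a softmax, and summing the hidden states with the resulting weights. The pure-Python sketch below illustrates only that pooling step; it is not the code in tf_attention_models.py, the function name is hypothetical, and the parameters are fixed by hand here whereas in a real model they are learned.

```python
import math


def attention_pool(hidden, W, b, context):
    """Attention pooling in the style of Yang et al. (2016).

    hidden:  list of T hidden vectors h_t, each of length d
    W, b:    projection parameters (k x d matrix, length-k bias)
    context: length-k context vector u_w
    Returns (weights, pooled), where pooled = sum_t alpha_t * h_t.
    """
    scores = []
    for h in hidden:
        # u_t = tanh(W h_t + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # score_t = u_t . u_w
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))

    # Softmax over time steps, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the hidden states.
    dim = len(hidden[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, hidden)) for j in range(dim)]
    return weights, pooled


# Hand-picked toy values (assumptions, not learned parameters).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
context = [1.0, 1.0]
weights, pooled = attention_pool(hidden, W, b, context)
```

With these values the third hidden state aligns best with the context vector, so it receives the largest weight.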