Repository for Incendiary News Detection paper submitted to FLAIRS-32:

Minimal demo website: https://incendiarynews.firebaseapp.com

Notebooks according to their names:

word2vec: word vectors as features with different classifiers.
BBC_BBC : we are training with BBC positive samples and testing with BBC again.
BBC_CNN : we are training with BBC positive samples and testing with CNN positive samples.

Other files:

./data/clean_data.p : is a pickle file, includes all datasets cleaned by cleantext.py
word2vec.txt : stores vectors for each word exists in the corpus, generated with fasttext. It is tab separated and each line have a word followed with 300 floating point numbers.
cleantext.py : This code cleans corpus and generates clean_data.p pickle. Cleaning includes removing website urls, source names, words related to crawling process of specific sources, characters that exists only specific resources. Also articles with less than 100 characters are removed from corpus. For details please check the file. After cleaning data, it also create list of all words to be used with fasttext for word2vec and saves to ./data/wordlist.txt

Original data collected is stored in data folder:

article_text_pos_1063_57531.json Incendiary News
article_text_neg_bbc_iter1_07272.json Non-Incendiary News from BBC, first iteration
iter2_text_neg_bbc_12981.json Non-Incendiary News from BBC, second iteration
iter1_article_text_neg_CNN_08109.json Non-Incendiary News from CNN, first iteration

Authors:

Requirements:
python=3.7.1
scikit-learn=0.20.0
nltk=3.3.0
numpy=1.15.2

#my_tagger.yaml and pos_tagger.py files from turkish-pos-tagger
#turkish-stemmer-python

How to get notebooks working:

Add following lines to beginning of the notebook, downloands all required files


!git clone https://github.com/EnisBerk/Incendiary_news.git
%cd Incendiary_news

!git clone https://github.com/otuncelli/turkish-stemmer-python.git
%cd turkish-stemmer-python/
!git reset --hard 1f60006c023152e46e5704065cdc51e68d63240a
%cd ../

!git clone https://github.com/onuryilmaz/turkish-pos-tagger.git
%cd turkish-pos-tagger
!git reset --hard a889bc2e633561f5050035cd1ffaf91b3ef38fe5
%cd ../

!cp -r turkish-pos-tagger/* ./
!cp -r turkish-stemmer-python/* ./

!rm -r turkish-pos-tagger
!rm -r turkish-stemmer-python

!curl -O https://storage.googleapis.com/deep_learning_enis/Incendiary_news/word2vec.txt
!cp word2vec.txt ./data/

import nltk
nltk.download('punkt')

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
data		data
word2vec		word2vec
.gitignore		.gitignore
BBC-BBC.ipynb		BBC-BBC.ipynb
BBC-CNN.ipynb		BBC-CNN.ipynb
README.md		README.md
cleantext.py		cleantext.py
word2vec_classifiers-BBC_CNN.ipynb		word2vec_classifiers-BBC_CNN.ipynb
word2vec_classifiers_BBC_BBC.ipynb		word2vec_classifiers_BBC_BBC.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Repository for Incendiary News Detection paper submitted to FLAIRS-32:

How to get notebooks working:

About

Releases

Packages

Languages

EnisBerk/Incendiary_news

Folders and files

Latest commit

History

Repository files navigation

Repository for Incendiary News Detection paper submitted to FLAIRS-32:

How to get notebooks working:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages