DEMO VIDEO: https://youtu.be/pdWoBxBu9-k
This repository contains the implementation of the paper "No Rumours Please! A Multi-Indic-Lingual Approach for Covid Fake-Tweet Detection", accepted at GHCI 2020 in the original research track. The system classifies in real time whether a tweet contains a verifiable claim, and has been specifically trained to detect COVID-19 related fake news. We use AI-based techniques to process the tweet text and combine it with user features to classify tweets as either REAL or FAKE. We handle tweets in three languages: English, Hindi, and Bengali.
Each folder is equipped with a detailed README on how to run its scripts.
- For the dataset, refer to the data folder
- To scrape and annotate more data, refer to the scraping_tools folder (we encourage extending the dataset to accommodate more annotations in languages both explored and unexplored in this work)
- For the transformer based classifiers, refer to the transformer_classifiers folder
- For ML based models and GUI implementation, refer to the GUI_MLModels folder
The following sections give a brief overview of the dataset and the methods used in our work.
We create the Indic-covidemic tweet dataset and use it for training and testing purposes. We take the English tweets from the Infodemic dataset and scrape COVID-19 related Bengali and Hindi tweets from Twitter. Fresh annotations were then added to create the larger Indic dataset for this task. The scraping and parsing tools built for this purpose may also be helpful for further mining Indic data. Our annotated dataset has been published for research purposes and can be found here.
We experimented with two model settings for tweet classification. In the first, a mono-lingual model handles English tweets. We then extend this by replacing the classifier with a multi-lingual one, which currently covers tweets in English, Hindi, and Bengali. The essence of our approach lies in the features used for the classification task, the different classifiers, and the adaptations made to identify fake tweets.
The architecture of the classifier is as shown below.
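As a rough illustration of this pipeline (not the paper's exact architecture), the per-tweet feature blocks described below can be concatenated and fed to a downstream classifier. The feature dimensions, the synthetic data, and the choice of logistic regression here are all assumptions for the sketch:

```python
# Illustrative sketch: concatenate per-tweet feature blocks and train a
# simple classifier on them. Dimensions and classifier are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 40
txt_embd = rng.normal(size=(n, 8))   # stand-in for BERT sentence embeddings (TxtEmbd)
twt_txt  = rng.normal(size=(n, 3))   # tweet-level text features (twttxt)
twt_usr  = rng.normal(size=(n, 3))   # user-level features (twtusr)
fact_ver = rng.random(size=(n, 1))   # link score in [0, 1] (FactVer)
bias     = rng.random(size=(n, 1))   # offensive-language probability (Bias)

# Stack all feature blocks column-wise into one design matrix.
X = np.hstack([txt_embd, twt_txt, twt_usr, fact_ver, bias])
y = rng.integers(0, 2, size=n)       # synthetic labels: 0 = REAL, 1 = FAKE

clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X)
print(X.shape, preds.shape)
```

In the actual system, the classifier is swapped between mono-lingual and multi-lingual variants while the feature layout stays the same.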
We have used various textual and user-related features for the classification task:
- BERT-based sentence encoding of the tweets (TxtEmbd)
- tweet features (twttxt)
- user features (twtusr)
- link score (FactVer) - the ratio of similarity between a given tweet and the titles in a verified-URL list obtained by querying the tweet on the Google Search Engine (algorithm given below). We maintain a list of 50 URLs from verified sources.
- bias score (Bias) - The probability of a tweet containing offensive language.
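A minimal sketch of how a FactVer-style link score could be computed; the word-level Jaccard similarity and the threshold used here are illustrative assumptions, not the paper's exact algorithm:

```python
# Hypothetical FactVer-style link score: compare a tweet against titles
# retrieved from verified sources and report the fraction that match.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link_score(tweet: str, verified_titles: list, thresh: float = 0.3) -> float:
    """Fraction of verified-source titles sufficiently similar to the tweet."""
    if not verified_titles:
        return 0.0
    hits = sum(jaccard(tweet, t) >= thresh for t in verified_titles)
    return hits / len(verified_titles)

titles = [
    "COVID-19 vaccine approved by health authority",
    "New covid variant spreads faster says study",
]
score = link_score("health authority approves covid-19 vaccine", titles)
print(round(score, 2))  # -> 0.5: one of the two titles clears the threshold
```

In the real system the candidate titles would come from Google Search results filtered to the 50 verified-source URLs.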
We design a simple static HTML page that takes a tweet ID/URL as user input and detects whether the tweet is real or fake. Although our monolingual English classifier gave the best performance, even beating the SOTA, we chose the multi-lingual classifier for its wider applicability. Some snapshots of our demo are shown below:
The GUI has been hosted on an IBM server (http://pca.sl.cloud9.ibm.com:1999/) which is accessible within the IBM domain.
process.py is working code that hosts the GUI on localhost; it can easily be modified to host the demo on any other server.
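A hedged sketch of how a script like process.py might serve such a demo with Flask; the route name, form field, and placeholder decision logic are assumptions, not the repo's actual code:

```python
# Minimal Flask sketch of a tweet-checking endpoint. The real system
# would fetch the tweet, extract features, and run the classifier.
from flask import Flask, request

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    tweet_url = request.form.get("tweet_url", "")
    # Placeholder decision; the actual demo calls the multi-lingual model.
    label = "REAL" if tweet_url else "FAKE"
    return {"tweet": tweet_url, "label": label}

# Exercise the route without starting a server, via Flask's test client.
client = app.test_client()
resp = client.post("/classify", data={"tweet_url": "https://twitter.com/x/status/1"})
print(resp.get_json()["label"])
```

To actually host it, one would call `app.run(host=..., port=...)` with the desired bind address instead of using the test client.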
If you find our work useful, please cite our work as:
```
@misc{kar2020rumours,
      title={No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection},
      author={Debanjana Kar and Mohit Bhardwaj and Suranjana Samanta and Amar Prakash Azad},
      year={2020},
      eprint={2010.06906},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```