In this research project, we built a claim news headlines dataset and proposed a methodology that spans claim detection through knowledge graph (KG) construction.
The dataset is saved in the dataset folder under the name "Claim News Headline Dataset". It was generated from news headlines from the ARY News and Express Tribune websites and contains 5200 claim headlines and 52 non-claim headlines.
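A minimal sketch of loading the dataset, assuming it is stored as a CSV with headline and label columns (the file extension and column names are assumptions, not confirmed by the repository):

```python
import pandas as pd

# Path, extension, and column names are assumed; adjust to the actual file.
df = pd.read_csv("dataset/Claim News Headline Dataset.csv")

# Expected columns: "headline" (text) and "label" (claim vs. non-claim).
print(df["label"].value_counts())
```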
The pipeline takes news headlines as input. Claim classification is performed as a binary task on the headlines, and the claim headlines are passed to an OpenIE system for triple generation. The triples are filtered and linked to DBpedia through entity linking, and the final triples are stored in the knowledge graph for downstream tasks. The methodology steps are:
- Claim Classification
- OpenIE triples extraction
- Triple Filtering
- Entity Linking
- KG Construction
We use five algorithms: SVM, Gaussian Naive Bayes, Logistic Regression, Decision Tree, and AdaBoost. We combine TF-IDF features with numerical features (headline length, number of nouns, number of verbs) and feed the combined features to the machine learning classifiers, as sketched below.
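A minimal sketch of the feature combination and classification step, assuming scikit-learn and spaCy for the POS-based counts (file path, column names, preprocessing, and hyperparameters are assumptions, not the exact settings used in the project):

```python
import pandas as pd
import spacy
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

nlp = spacy.load("en_core_web_sm")

# Path and column names are assumed (see the loading sketch above).
df = pd.read_csv("dataset/Claim News Headline Dataset.csv")
headlines, labels = df["headline"].tolist(), df["label"].tolist()

def numerical_features(headline):
    """Headline length plus noun and verb counts from spaCy POS tags."""
    doc = nlp(headline)
    nouns = sum(1 for t in doc if t.pos_ in ("NOUN", "PROPN"))
    verbs = sum(1 for t in doc if t.pos_ == "VERB")
    return [len(headline), nouns, verbs]

# Combine the sparse TF-IDF matrix with the three numerical features.
X_tfidf = TfidfVectorizer().fit_transform(headlines)
X_num = csr_matrix([numerical_features(h) for h in headlines])
X = hstack([X_tfidf, X_num]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

classifiers = {
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    # GaussianNB needs a dense matrix; the other models accept sparse input.
    dense = name == "Gaussian Naive Bayes"
    clf.fit(X_train.toarray() if dense else X_train, y_train)
    preds = clf.predict(X_test.toarray() if dense else X_test)
    print(f"{name}: {accuracy_score(y_test, preds):.3f}")
```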
The baseline model is dependency parsing, and the three deep learning models used for triple extraction are:
- OpenIE6: https://github.com/dair-iitd/openie6
- IMOJIE: https://github.com/dair-iitd/imojie
- Gen2OIE: https://github.com/dair-iitd/moie
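These extractors are run from their own repositories; the rest of the pipeline only consumes the (subject, relation, object) triples they produce. A minimal sketch of the representation assumed in the sketches below (the example extraction is illustrative, not actual model output):

```python
# Each extraction is kept as a (subject, relation, object) triple,
# together with the claim headline it came from.
triples = [
    # (subject, relation, object, source headline) - illustrative example only
    ("Government", "announces", "new tax policy",
     "Government announces new tax policy"),
]
```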
First, we created a lexicon of the 100 most frequent nouns in the claim dataset. We then extracted noun phrases from the triple arguments and matched them against the lexicon: if a lexicon noun matches a noun phrase in an argument, we keep the triple; otherwise, we discard it. A sketch of this filter follows.
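A minimal sketch of the lexicon-based filter, assuming spaCy for noun-phrase extraction; lemmatization, lowercasing, and the example inputs are assumptions for illustration:

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def build_lexicon(claim_headlines, top_n=100):
    """The top_n most frequent nouns across the claim headlines."""
    counts = Counter()
    for doc in nlp.pipe(claim_headlines):
        counts.update(t.lemma_.lower() for t in doc if t.pos_ in ("NOUN", "PROPN"))
    return {noun for noun, _ in counts.most_common(top_n)}

def argument_nouns(argument):
    """Nouns occurring inside the noun phrases of one triple argument."""
    doc = nlp(argument)
    nouns = set()
    for chunk in doc.noun_chunks:
        nouns.update(t.lemma_.lower() for t in chunk if t.pos_ in ("NOUN", "PROPN"))
    return nouns

def filter_triples(triples, lexicon):
    """Keep a triple only if one of its arguments shares a noun with the lexicon."""
    return [
        (subj, rel, obj, headline)
        for subj, rel, obj, headline in triples
        if (argument_nouns(subj) | argument_nouns(obj)) & lexicon
    ]

# Illustrative inputs; in the project these come from the dataset and the extractors.
claim_headlines = ["Government announces new tax policy"]
triples = [("Government", "announces", "new tax policy",
            "Government announces new tax policy")]
filtered = filter_triples(triples, build_lexicon(claim_headlines))
```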
We link the filtered triples to the DBpedia knowledge base for entity disambiguation, using the Falcon tool.
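A minimal sketch of this step, assuming Falcon is queried over HTTP; the endpoint URL and the response format shown are assumptions based on the public Falcon demo API, so consult the Falcon documentation for the exact interface:

```python
import requests

# ASSUMPTION: endpoint and response format follow the public Falcon demo API.
FALCON_URL = "https://labs.tib.eu/falcon/api?mode=long"

def link_to_dbpedia(text):
    """Ask Falcon for DBpedia entity URIs mentioned in the given text."""
    resp = requests.post(FALCON_URL, json={"text": text}, timeout=30)
    resp.raise_for_status()
    # Falcon is expected to return candidate (URI, surface form) pairs under "entities".
    return [candidate[0] for candidate in resp.json().get("entities", [])]

# Illustrative filtered triple; in the project these come from the filtering step.
filtered = [("Government", "announces", "new tax policy",
             "Government announces new tax policy")]
linked = [(triple, link_to_dbpedia(f"{triple[0]} {triple[1]} {triple[2]}"))
          for triple in filtered]
```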
The linked triples are saved with their DBpedia URIs, and we use the Neo4j database to store the triples.
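A minimal sketch of storing the linked triples in Neo4j with the official Python driver, using a simple (subject)-[:RELATION]->(object) graph model; the connection details, node labels, and property names are assumptions, not the project's exact schema:

```python
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_triple(tx, subject, relation, obj, subject_uri=None, object_uri=None):
    """MERGE subject/object nodes (with optional DBpedia URIs) and a relation edge."""
    tx.run(
        """
        MERGE (s:Entity {name: $subject})
          SET s.dbpedia_uri = coalesce($subject_uri, s.dbpedia_uri)
        MERGE (o:Entity {name: $obj})
          SET o.dbpedia_uri = coalesce($object_uri, o.dbpedia_uri)
        MERGE (s)-[:RELATION {type: $relation}]->(o)
        """,
        subject=subject, obj=obj, relation=relation,
        subject_uri=subject_uri, object_uri=object_uri,
    )

# Illustrative linked triple; URIs would normally come from the entity-linking step.
linked = [(("Government", "announces", "new tax policy", "headline text"),
           ["http://dbpedia.org/resource/Government"])]

with driver.session() as session:
    for (subject, relation, obj, _headline), uris in linked:
        # Simplification: attach the first returned URI to the subject node.
        session.execute_write(store_triple, subject, relation, obj,
                              uris[0] if uris else None, None)
driver.close()
```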
Claim classification results:
- SVM: 96%
- Logistic Regression: 95.2%
- Gaussian Naive Bayes: 93%
- AdaBoost: 95.3%
- Decision Tree: 95.1%
Triple extraction results (F1 score):
- Baseline (Dependency Parsing): 39.7%
- OpenIE6: 65.9%
- IMOJIE: 54.4%
- Gen2OIE: 62.2%