CANNOT dataset

Compilation of ANnotated, Negation-Oriented Text-pairs



Introduction

CANNOT is a dataset that focuses on negated textual pairs. It currently contains 77,376 samples, of which roughly half are negated pairs of sentences and the other half are not (they are paraphrased versions of each other).

The most frequent negation that appears in the dataset is verbal negation (e.g., will → won't), although it also contains pairs with antonyms (cold → hot).


Format

The dataset is given as a .tsv file with the following structure:

premise        hypothesis                                            label
A sentence.    An equivalent, non-negated sentence (paraphrased).    0
A sentence.    The sentence negated.                                 1

The dataset can be easily loaded into a Pandas DataFrame by running:

import pandas as pd

dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
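Once loaded, the label column can be used to check the balance between negated and paraphrased pairs, or to filter one kind out. The following is a minimal sketch that reads a tiny in-memory stand-in instead of the real file, so it runs without downloading anything; the three example rows are made up for illustration, and the column names are assumed to match the Format section.

```python
import io

import pandas as pd

# Tiny stand-in for the real .tsv file, laid out as in the Format section.
# These rows are illustrative only; they are not taken from the dataset.
sample_tsv = (
    "premise\thypothesis\tlabel\n"
    "It will rain.\tIt won't rain.\t1\n"
    "It will rain.\tRain is coming.\t0\n"
    "The soup is cold.\tThe soup is hot.\t1\n"
)

dataset = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Share of negated (label 1) versus paraphrased (label 0) pairs.
label_share = dataset["label"].value_counts(normalize=True)

# Keep only the negated pairs.
negated = dataset[dataset["label"] == 1]

print(label_share)
print(negated)
```

To run this against the actual data, replace `io.StringIO(sample_tsv)` with the path to the .tsv file.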

Construction

The dataset has been created by cleaning up and merging the following datasets:

  1. Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation (see datasets/nan-nli).

  2. GLUE Diagnostic Dataset (see datasets/glue-diagnostic).

  3. Automated Fact-Checking of Claims from Wikipedia (see datasets/wikifactcheck-english).

  4. From Group to Individual Labels Using Deep Features (see datasets/sentiment-labelled-sentences). In this case, the negated sentences were obtained by using the Python module negate.

  5. It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With Antonyms and Negation Using the New SemAntoNeg Benchmark (see datasets/antonym-substitution).


Once processed, the number of remaining samples in each of the datasets above is:

Dataset                                                Samples
Not another Negation Benchmark                         118
GLUE Diagnostic Dataset                                154
Automated Fact-Checking of Claims from Wikipedia       14,970
From Group to Individual Labels Using Deep Features    2,110
It Is Not Easy To Detect Paraphrases                   8,597
Total                                                  25,949

Additionally, for each of the negated samples, another pair of non-negated sentences has been added by paraphrasing them with the pre-trained model 🤗tuner007/pegasus_paraphrase.

Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been included, and any duplicates have been removed.
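The swap-and-deduplicate step can be sketched with pandas as follows. This is an illustration under assumptions, not the script used to build the dataset: it assumes the swapped pair keeps the label of the original pair, and the example sentences are made up.

```python
import pandas as pd

# Illustrative pairs; the third row is already the swap of the first.
pairs = pd.DataFrame(
    {
        "premise": ["It will rain.", "The soup is cold.", "It won't rain."],
        "hypothesis": ["It won't rain.", "The soup is hot.", "It will rain."],
        "label": [1, 1, 1],
    }
)

# Swap premise and hypothesis; the label is assumed to stay the same.
swapped = pairs.rename(columns={"premise": "hypothesis", "hypothesis": "premise"})
swapped = swapped[["premise", "hypothesis", "label"]]

# Append the swapped pairs and drop exact duplicates.
augmented = (
    pd.concat([pairs, swapped], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

print(augmented)
```

Because one of the input pairs is already the mirror of another, the six concatenated rows collapse to four unique ones.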

With this, the number of premises/hypotheses in the CANNOT dataset that appear in the original datasets is:

Dataset                                                Sentences
Not another Negation Benchmark                         552       (0.36 %)
GLUE Diagnostic Dataset                                586       (0.38 %)
Automated Fact-Checking of Claims from Wikipedia       89,728    (57.98 %)
From Group to Individual Labels Using Deep Features    12,626    (8.16 %)
It Is Not Easy To Detect Paraphrases                   17,198    (11.11 %)
Total                                                  120,690   (77.99 %)

The percentages above are relative to the total number of premises and hypotheses in the CANNOT dataset. The remaining 22.01 % (34,062 sentences) are the novel premises/hypotheses added through paraphrasing and rule-based negation.
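The figures above can be reproduced with a few lines of arithmetic. This sketch assumes that every sample contributes exactly two sentences (one premise and one hypothesis), so the total is twice the 77,376 samples mentioned in the introduction.

```python
total_pairs = 77_376
total_sentences = 2 * total_pairs  # one premise plus one hypothesis per pair

from_original = 120_690  # sentences that appear in the source datasets
novel = total_sentences - from_original

original_pct = round(100 * from_original / total_sentences, 2)
novel_pct = round(100 * novel / total_sentences, 2)

print(total_sentences, novel, original_pct, novel_pct)
```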


Contributions

Questions? Bugs...? Then feel free to open a new issue.


Acknowledgments

We thank all the previous authors who have made this dataset possible:

Thinh Hung Truong, Yulia Otmakhova, Timothy Baldwin, Trevor Cohn, Jey Han Lau, Karin Verspoor, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, Aalok Sathe, Salar Ather, Tuan Manh Le, Nathan Perry, Joonsuk Park, Dimitrios Kotzias, Misha Denil, Nando De Freitas, Padhraic Smyth, Teemu Vahtola, Mathias Creutz, and Jörg Tiedemann.


License

The CANNOT dataset is released under CC BY-SA 4.0.



Citation

@misc{anschütz2023correct,
      title={This is not correct! Negation-aware Evaluation of Language Generation Systems}, 
      author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
      year={2023},
      eprint={2307.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}