This directory constrains the dataset files
README.md
- this fileThis-is-not-a-dataset.zip
- the dataset in json lines format.This-is-not-a-dataset_RAW_DATA.zip
- the dataset in txt format. These files contain the raw data, and include information about the wordnet synsets used to generate the sentences.upload_to_huggingface.py
- script to upload the dataset to huggingface. This script also performs some preprocessing steps to facilitate the usage of the dataset.patterns.py
- a script with a dictionary of the patterns in the dataset
We want to prevent the dataset from being used for training LMs. To do so, we encrypt the dataset with a password. The password is hitz
The dataset is available on huggingface at:
To use it in your code, you can use the following snippet:
from datasets import load_dataset
dataset = load_dataset("HiTZ/This-is-not-a-dataset")
- pattern_id (int): The ID of the pattern,in range [1,11]
- pattern (str): The name of the pattern
- test_id (int): For each pattern we use a set of templates to instanciate the triples. Examples are grouped in triples by test id
- negation_type (str): Affirmation, verbal, non-verbal
- semantic_type (str): None (for affirmative sentences), analytic, synthetic
- syntactic_scope (str): None (for affirmative sentences), clausal, subclausal
- isDistractor (bool): We use distractors (randonly selectec synsets) to generate false kwoledge.
- sentence (str): The sentence. This is the input of the model
- label (bool): The label of the example, True if the statement is true, False otherwise. This is the target of the model
For the pattern files in This-is-not-a-dataset.zip
test_id is a string
with the format template_id
-template_variation
, for example 3-1
, 3-2
. For the train.jsonl
, dev.jsonl
, test.jsonl
and test.jsonl
files, as well as the dataset in the HuggingFace Hub we have replaced this string with a unique identifier (int) for each triple to facilitate grouping triples and evaluation.
The paper will be presented at EMNLP 2023, the citation will be available soon. For now, you can use the following bibtex:
@inproceedings{this-is-not-a-dataset,
title = "This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models",
author = "Iker García-Ferrero, Begoña Altuna, Javier Alvez, Itziar Gonzalez-Dios, German Rigau",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2023",
publisher = "Association for Computational Linguistics",
}