DATASET

This directory constrains the dataset files

README.md - this file
This-is-not-a-dataset.zip - the dataset in json lines format.
This-is-not-a-dataset_RAW_DATA.zip - the dataset in txt format. These files contain the raw data, and include information about the wordnet synsets used to generate the sentences.
upload_to_huggingface.py - script to upload the dataset to huggingface. This script also performs some preprocessing steps to facilitate the usage of the dataset.
patterns.py - a script with a dictionary of the patterns in the dataset

PASSWORD

We want to prevent the dataset from being used for training LMs. To do so, we encrypt the dataset with a password. The password is hitz

Huggingface

The dataset is available on huggingface at:

HiTZ/This-is-not-a-dataset

To use it in your code, you can use the following snippet:

from datasets import load_dataset

dataset = load_dataset("HiTZ/This-is-not-a-dataset")

Data explanation

pattern_id (int): The ID of the pattern,in range [1,11]
pattern (str): The name of the pattern
test_id (int): For each pattern we use a set of templates to instanciate the triples. Examples are grouped in triples by test id
negation_type (str): Affirmation, verbal, non-verbal
semantic_type (str): None (for affirmative sentences), analytic, synthetic
syntactic_scope (str): None (for affirmative sentences), clausal, subclausal
isDistractor (bool): We use distractors (randonly selectec synsets) to generate false kwoledge.
sentence (str): The sentence. This is the input of the model
label (bool): The label of the example, True if the statement is true, False otherwise. This is the target of the model

For the pattern files in This-is-not-a-dataset.zip test_id is a string with the format template_id-template_variation, for example 3-1, 3-2. For the train.jsonl, dev.jsonl, test.jsonl and test.jsonl files, as well as the dataset in the HuggingFace Hub we have replaced this string with a unique identifier (int) for each triple to facilitate grouping triples and evaluation.

Citation

The paper will be presented at EMNLP 2023, the citation will be available soon. For now, you can use the following bibtex:

@inproceedings{this-is-not-a-dataset,
    title = "This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models",
    author = "Iker García-Ferrero, Begoña Altuna, Javier Alvez, Itziar Gonzalez-Dios, German Rigau",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

DATASET

PASSWORD

Huggingface

Data explanation

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

DATASET

PASSWORD

Huggingface

Data explanation

Citation