DATASET

This directory contains the dataset files:

  • README.md - this file
  • This-is-not-a-dataset.zip - the dataset in JSON Lines format.
  • This-is-not-a-dataset_RAW_DATA.zip - the dataset in txt format. These files contain the raw data and include information about the WordNet synsets used to generate the sentences.
  • upload_to_huggingface.py - script to upload the dataset to Hugging Face. This script also performs some preprocessing steps to make the dataset easier to use.
  • patterns.py - a script with a dictionary of the patterns in the dataset

PASSWORD

We want to prevent the dataset from being used for training LMs. To do so, we encrypt the dataset with a password. The password is: hitz
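As a minimal sketch, the archive can be unpacked from Python with the standard-library zipfile module. This assumes the archive uses the legacy zip encryption that zipfile can decrypt; if it was created with AES encryption, zipfile cannot open it and an external tool such as 7-Zip is needed instead. The helper name extract_dataset and the output directory are illustrative and not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, out_dir, password="hitz"):
    """Extract a password-protected archive and return its member names.

    Hypothetical helper: assumes legacy zip encryption (not AES).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes; it is ignored for
        # members that are not encrypted.
        archive.extractall(path=out_dir, pwd=password.encode("utf-8"))
        return archive.namelist()


# Example call, once the archive is in the working directory:
# extract_dataset("This-is-not-a-dataset.zip", "This-is-not-a-dataset")
```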

Huggingface

The dataset is available on the Hugging Face Hub as HiTZ/This-is-not-a-dataset.

To use it in your code, you can use the following snippet:

from datasets import load_dataset

dataset = load_dataset("HiTZ/This-is-not-a-dataset")

Data explanation

  • pattern_id (int): The ID of the pattern, in the range [1, 11]
  • pattern (str): The name of the pattern
  • test_id (int): For each pattern we use a set of templates to instantiate the triples. Examples are grouped in triples by test ID
  • negation_type (str): Affirmation, verbal, non-verbal
  • semantic_type (str): None (for affirmative sentences), analytic, synthetic
  • syntactic_scope (str): None (for affirmative sentences), clausal, subclausal
  • isDistractor (bool): We use distractors (randomly selected synsets) to generate false knowledge.
  • sentence (str): The sentence. This is the input of the model
  • label (bool): The label of the example, True if the statement is true, False otherwise. This is the target of the model
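The fields listed above can be checked when reading the JSON Lines files directly. The following is a rough sketch that assumes each line is one JSON object whose keys match the field names in the list; the helper read_examples is hypothetical and not part of this repository.

```python
import json

# Field names taken from the list above; assumed to match the keys on disk.
EXPECTED_FIELDS = {
    "pattern_id", "pattern", "test_id", "negation_type", "semantic_type",
    "syntactic_scope", "isDistractor", "sentence", "label",
}


def read_examples(lines):
    """Parse JSON Lines text into dicts, checking that every field is present."""
    examples = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        missing = EXPECTED_FIELDS - set(example)
        if missing:
            raise ValueError(f"line {number}: missing fields {sorted(missing)}")
        examples.append(example)
    return examples


# Example call on an extracted split:
# with open("train.jsonl", encoding="utf-8") as handle:
#     examples = read_examples(handle)
```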

For the pattern files in This-is-not-a-dataset.zip, test_id is a string with the format template_id-template_variation, for example 3-1, 3-2. For the train.jsonl, dev.jsonl and test.jsonl files, as well as the dataset on the Hugging Face Hub, we have replaced this string with a unique identifier (int) for each triple to facilitate grouping triples and evaluation.
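Because test_id is unique per triple in these files, grouping by it is straightforward. The sketch below computes per-example accuracy together with the fraction of test_id groups in which every prediction is correct. This group-level score is an illustrative choice and not necessarily the metric used in the paper; score_by_triple and the predictions list are hypothetical names.

```python
from collections import defaultdict


def score_by_triple(examples, predictions):
    """Return (accuracy, fraction of test_id groups answered entirely correctly).

    `examples` is a sequence of dicts with "test_id" and "label" keys;
    `predictions` is a parallel sequence of booleans (hypothetical model output).
    """
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        raise ValueError("no examples to score")

    # Collect one correct/incorrect flag per example, keyed by test_id.
    groups = defaultdict(list)
    for example, predicted in zip(examples, predictions):
        groups[example["test_id"]].append(predicted == example["label"])

    accuracy = sum(sum(hits) for hits in groups.values()) / len(examples)
    group_score = sum(all(hits) for hits in groups.values()) / len(groups)
    return accuracy, group_score
```

A group-level score of this kind is stricter than accuracy, since a single wrong answer within a group counts against the whole group.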

Citation

The paper will be presented at EMNLP 2023; the citation will be available soon. For now, you can use the following BibTeX:

@inproceedings{this-is-not-a-dataset,
    title = "This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models",
    author = "Iker García-Ferrero and Begoña Altuna and Javier Alvez and Itziar Gonzalez-Dios and German Rigau",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
}