ConEL: Conversational Entity Linking Datasets

This repository provides the resources related to entity linking annotations in conversational settings.
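These resources are built on existing conversational datasets: MWOZ, QuAC, WoW, and TREC-CAsT 2020.

They were developed for the paper "Conversational Entity Linking: Problem Definition and Datasets" (SIGIR '21); see the Cite section for the full reference.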

The repository is structured as follows:

  • ./data: MTurk entity annotations
  • ./mturk_interfaces: MTurk interface used to collect the entity annotations

Data

MTurk entity annotation data is stored in ./data.

  • ./data/ConEL_Concept_Named_Entity/: Stratified samples
    • ./data/ConEL_Concept_Named_Entity/ConEL_CNE.json: Entity annotations for 25 dialogues from each dataset (i.e., MWOZ, QuAC, WoW, and TREC-CAsT 2020).
  • ./data/ConEL_Personal_Entity/: WoW with personal entities
    • ./data/ConEL_Personal_Entity/ConEL_PE.json: 25 WoW dialogues, each of which contains personal entities.
  • The run folders contain the results of EL tools.

Statistics

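|                   | Stratified samples (ConEL_CNE.json) | WoW with personal entities (ConEL_PE.json) |
|-------------------|-------------------------------------|--------------------------------------------|
| # dialogues       | 100                                 | 25                                         |
| # user utterances | 708                                 | 113                                        |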
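
The counts above can be recomputed from a ground truth file. The following is a minimal sketch, assuming the files are plain JSON lists of dialogue dicts as described under Data Format; the helper names `load_dialogues`, `count_stats`, and `dialogues_per_dataset` are illustrative and not part of this repository.

```python
import json
from collections import Counter


def load_dialogues(path):
    """Read a ground truth file (assumed to be a JSON list of dialogue dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_stats(dialogues):
    """Return (number of dialogues, number of USER turns)."""
    n_user_utterances = sum(
        1
        for dialogue in dialogues
        for turn in dialogue["turns"]
        if turn["speaker"] == "USER"
    )
    return len(dialogues), n_user_utterances


def dialogues_per_dataset(dialogues):
    """Count dialogues per value of the dataset_name field."""
    return Counter(dialogue["dataset_name"] for dialogue in dialogues)
```

For example, `count_stats(load_dialogues("./data/ConEL_Personal_Entity/ConEL_PE.json"))` should reproduce the second column of the table if the assumptions above hold.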

Data Format

This section describes the data format of the ground truth files (ConEL_CNE.json and ConEL_PE.json).
Each file is a list whose elements are dicts with the following structure:

{
    "dialogue_id": "10060",
    "dataset_name": "wow", # or "quac", "mwoz", "cast20raw", "cast20manu"
    "turns": [
        {
            "speaker": "USER", # or "SYSTEM"
            "utterance": "Blue is my favorite color, by far. What's yours?",
            "turn_number": 0, 
            "el_annotations": [ # Ground truth annotations
                {
                    "mention": "Blue",
                    "entity": "Blue",
                    "entity_type": "concept", # or "named_entity"
                }
            ],
            "personal_entity_annotations": [ # Personal entity annotations
                {
                    "personal_entity_mention": "my favorite color",
                    "explicit_entity_mention": "Blue",
                    "turn_number_of_explicit_entity_mention": 0,
                    "entity": "Blue"
                }
            ]
        },
    ]
}
  • dialogue_id: dialogue id provided by each original dataset (i.e., MWOZ, QuAC, WoW, and TREC-CAsT 2020).
  • dataset_name: name of the dataset the dialogue was taken from (cast20raw and cast20manu both come from TREC-CAsT 2020)
  • turns: each element is a user or system turn
    • speaker: USER or SYSTEM
    • utterance: utterance taken from the dataset. (Note that for TREC-CAsT 2020 system turns, only manual_canonical_result_id is shown.)
    • el_annotations: entity linking annotations collected from MTurk workers
    • personal_entity_annotations: personal entity annotations. Note that only ConEL_PE.json has these annotations.
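
The nested structure above can be flattened into one record per annotation. This is a minimal sketch rather than code from this repository: the function names are hypothetical, and it assumes that a turn may lack `el_annotations` or `personal_entity_annotations` (for instance, in ConEL_CNE.json), so both are read with an empty default.

```python
def iter_el_annotations(dialogues):
    """Yield one flat dict per entity linking annotation."""
    for dialogue in dialogues:
        for turn in dialogue["turns"]:
            # Assumption: turns without annotations may omit the key entirely.
            for ann in turn.get("el_annotations", []):
                yield {
                    "dialogue_id": dialogue["dialogue_id"],
                    "dataset_name": dialogue["dataset_name"],
                    "turn_number": turn["turn_number"],
                    "mention": ann["mention"],
                    "entity": ann["entity"],
                    "entity_type": ann["entity_type"],
                }


def iter_personal_entity_annotations(dialogues):
    """Yield one flat dict per personal entity annotation (ConEL_PE.json only)."""
    for dialogue in dialogues:
        for turn in dialogue["turns"]:
            for ann in turn.get("personal_entity_annotations", []):
                yield {
                    "dialogue_id": dialogue["dialogue_id"],
                    "turn_number": turn["turn_number"],
                    "personal_entity_mention": ann["personal_entity_mention"],
                    "explicit_entity_mention": ann["explicit_entity_mention"],
                    "explicit_turn_number": ann["turn_number_of_explicit_entity_mention"],
                    "entity": ann["entity"],
                }


# The sample element from this section, written as a Python literal.
example = [
    {
        "dialogue_id": "10060",
        "dataset_name": "wow",
        "turns": [
            {
                "speaker": "USER",
                "utterance": "Blue is my favorite color, by far. What's yours?",
                "turn_number": 0,
                "el_annotations": [
                    {"mention": "Blue", "entity": "Blue", "entity_type": "concept"}
                ],
                "personal_entity_annotations": [
                    {
                        "personal_entity_mention": "my favorite color",
                        "explicit_entity_mention": "Blue",
                        "turn_number_of_explicit_entity_mention": 0,
                        "entity": "Blue",
                    }
                ],
            }
        ],
    }
]

for row in iter_el_annotations(example):
    print(row["mention"], "->", row["entity"], f"({row['entity_type']})")
```

Yielding flat dicts keeps the sketch independent of any particular table library; the records can be passed to a CSV writer or a data frame constructor as needed.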

MTurk Interfaces

The MTurk interfaces used to collect the entity annotations are stored in the ./mturk_interfaces directory.

Conversational Dataset List

Conversational Dataset List: A comprehensive list of around 130 conversational datasets released by different research communities

Cite

@inproceedings{Joko:2021:CEL,
 author =    {Joko, Hideaki and Hasibi, Faegheh and Balog, Krisztian and de Vries, Arjen P.},
 title =     {Conversational Entity Linking: Problem Definition and Datasets},
 booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series =    {SIGIR '21},
 year =      {2021},
 publisher = {ACM}
}

Contact

If you have any questions, please contact Hideaki Joko at [email protected]