This repository provides the resources related to entity linking annotations in conversational settings.
These resources are built on top of the following existing datasets:
- MultiWOZ (MWOZ)
- Question Answering in Context (QuAC)
- Wizard-Of-Wikipedia (WoW)
- TREC-CAsT 2020
These resources were developed as part of the following paper:
- Hideaki Joko, Faegheh Hasibi, Krisztian Balog, and Arjen P. de Vries. “Conversational Entity Linking: Problem Definition and Datasets”.
The repository is structured as follows:
- ./data: MTurk entity annotations
- ./mturk_interfaces: MTurk interface used to collect the entity annotations

MTurk entity annotation data is stored in ./data.
- ./data/ConEL_Concept_Named_Entity/: Stratified samples
  - ./data/ConEL_Concept_Named_Entity/ConEL_CNE.json: Entity annotations for 25 dialogues from each dataset (i.e., MWOZ, QuAC, WoW, and TREC-CAsT 2020).
- ./data/ConEL_Personal_Entity/: WoW with personal entities
  - ./data/ConEL_Personal_Entity/ConEL_PE.json: 25 WoW dialogues, each of which contains personal entities.
- The run folders contain the results of EL tools.
| | Stratified samples (ConEL_CNE.json) | WoW with personal entities (ConEL_PE.json) |
|---|---|---|
| # dialogues | 100 | 25 |
| # user utterances | 708 | 113 |
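The counts in the table can be re-derived from the annotation files. The following is a minimal sketch, assuming each file is a valid JSON list of dialogues in the format described in this README. The helper names `count_stats` and `count_stats_from_file` are hypothetical and not part of this repository.

```python
import json
from collections import Counter


def count_stats(dialogues):
    """Count dialogues and USER utterances in a list of dialogue dicts."""
    n_dialogues = len(dialogues)
    n_user_utterances = sum(
        1
        for dialogue in dialogues
        for turn in dialogue["turns"]
        if turn["speaker"] == "USER"
    )
    # Number of dialogues per source dataset (e.g., "wow", "quac").
    per_dataset = Counter(d["dataset_name"] for d in dialogues)
    return n_dialogues, n_user_utterances, per_dataset


def count_stats_from_file(path):
    """Load one annotation file (assumed to be a JSON list) and count it."""
    with open(path, encoding="utf-8") as f:
        return count_stats(json.load(f))
```

Calling `count_stats_from_file` on ./data/ConEL_Concept_Named_Entity/ConEL_CNE.json or ./data/ConEL_Personal_Entity/ConEL_PE.json gives numbers that can be compared against the table.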
This section explains the data format of the ground truth files (ConEL_CNE.json and ConEL_PE.json).
Each file contains a list, and each element of the list is a dict structured as follows:
{
    "dialogue_id": "10060",
    "dataset_name": "wow",  # or "quac", "mwoz", "cast20raw", "cast20manu"
    "turns": [
        {
            "speaker": "USER",  # or "SYSTEM"
            "utterance": "Blue is my favorite color, by far. What's yours?",
            "turn_number": 0,
            "el_annotations": [  # Ground truth annotations
                {
                    "mention": "Blue",
                    "entity": "Blue",
                    "entity_type": "concept",  # or "named_entity"
                }
            ],
            "personal_entity_annotations": [  # Personal entity annotations
                {
                    "personal_entity_mention": "my favorite color",
                    "explicit_entity_mention": "Blue",
                    "turn_number_of_explicit_entity_mention": 0,
                    "entity": "Blue"
                }
            ]
        },
    ]
}
- dialogue_id: Dialogue ID provided by each original dataset (i.e., MWOZ, QuAC, WoW, and TREC-CAsT 2020).
- dataset_name: The name of the dataset from which the conversation was taken (cast20raw and cast20manu represent )
- turns: Each element is a user or system turn.
  - speaker: USER or SYSTEM
  - utterance: Utterance taken from the dataset. (Note that for TREC-CAsT 2020 system turns, only manual_canonical_result_id is shown.)
  - el_annotations: Annotations made by MTurk workers
  - personal_entity_annotations: Personal entity annotations. Note that only ConEL_PE.json has these annotations.
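To illustrate how these fields fit together, here is a minimal sketch that walks a list of dialogues and yields the entity linking and personal entity annotations. It assumes the structure shown in this README; the function names are hypothetical, and `.get()` is used on the assumption that some turns may lack an annotation key (for example, personal_entity_annotations is only present in ConEL_PE.json).

```python
def iter_el_annotations(dialogues):
    """Yield (dialogue_id, turn_number, mention, entity, entity_type)."""
    for dialogue in dialogues:
        for turn in dialogue["turns"]:
            # A turn without annotations is treated as an empty list (assumption).
            for ann in turn.get("el_annotations", []):
                yield (
                    dialogue["dialogue_id"],
                    turn["turn_number"],
                    ann["mention"],
                    ann["entity"],
                    ann.get("entity_type"),
                )


def iter_personal_entity_annotations(dialogues):
    """Yield (dialogue_id, turn_number, personal_entity_mention, entity)."""
    for dialogue in dialogues:
        for turn in dialogue["turns"]:
            for ann in turn.get("personal_entity_annotations", []):
                yield (
                    dialogue["dialogue_id"],
                    turn["turn_number"],
                    ann["personal_entity_mention"],
                    ann["entity"],
                )


# The example record from this README, written as a Python literal.
sample_dialogues = [
    {
        "dialogue_id": "10060",
        "dataset_name": "wow",
        "turns": [
            {
                "speaker": "USER",
                "utterance": "Blue is my favorite color, by far. What's yours?",
                "turn_number": 0,
                "el_annotations": [
                    {"mention": "Blue", "entity": "Blue", "entity_type": "concept"}
                ],
                "personal_entity_annotations": [
                    {
                        "personal_entity_mention": "my favorite color",
                        "explicit_entity_mention": "Blue",
                        "turn_number_of_explicit_entity_mention": 0,
                        "entity": "Blue",
                    }
                ],
            }
        ],
    }
]

for row in iter_el_annotations(sample_dialogues):
    print(row)
for row in iter_personal_entity_annotations(sample_dialogues):
    print(row)
```

Generators are used here so that the same helpers work on a full file loaded with `json.load` without building intermediate lists.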
The MTurk interfaces used to collect the entity annotations are stored in the ./mturk_interfaces directory.
Conversational Dataset List: A comprehensive list of around 130 conversational datasets released by different research communities
@inproceedings{Joko:2021:CEL,
author = {Joko, Hideaki and Hasibi, Faegheh and Balog, Krisztian and de Vries, Arjen P.},
title = {Conversational Entity Linking: Problem Definition and Datasets},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
series = {SIGIR '21},
year = {2021},
publisher = {ACM}
}
If you have any questions, please contact Hideaki Joko at [email protected]