Original dataset | |
Document type | newspaper (mid-19C to mid-20C) |
Languages | English, French, German |
Annotation guidelines | |
Annotation tool | INCEpTION |
Original format and tagging scheme | .tsv, IOB |
Annotations | NERC, EL (towards Wikidata, dump of 2019.11.13) |
Version (used in HIPE-2022) | v1.4 |
Related publication | Overview of CLEF-HIPE-2020, Extended Overview of CLEF-HIPE-2020 |
License |
| Coarse-grained tagset | Fine-grained tagset | Nesting applies | Linking applies |
|---|---|---|---|
| pers | pers.ind | yes | yes |
| | pers.coll | yes | yes |
| | pers.ind.articleauthor | yes | yes |
| org | org.adm | yes | yes |
| | org.ent | yes | yes |
| | org.ent.pressagency | yes | yes |
| prod | prod.media | yes | yes |
| | prod.doctr | yes | yes |
| time | time.date.abs | yes | yes |
| loc | loc.adm.town | yes | yes |
| | loc.adm.reg | yes | yes |
| | loc.adm.nat | yes | yes |
| | loc.adm.sup | yes | yes |
| | loc.phys.geo | yes | yes |
| | loc.phys.hydro | yes | yes |
| | loc.phys.astro | yes | yes |
| | loc.oro | yes | yes |
| | loc.fac | yes | yes |
| | loc.add.phys | yes | yes |
| | loc.add.elec | yes | yes |
| | loc.unk | yes | yes |
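
In the tagset above, every fine-grained tag starts with its coarse-grained tag followed by a dot. The sketch below shows how a fine-grained label could be collapsed to its coarse-grained counterpart. It assumes labels are written in IOB form as `B-<tag>` / `I-<tag>` / `O`; the function names are hypothetical and not part of any HIPE tooling.

```python
from typing import Optional, Tuple


def split_iob(label: str) -> Tuple[str, Optional[str]]:
    """Split an IOB label such as 'B-pers.ind' into (prefix, tag).

    Assumption: labels look like 'B-<tag>', 'I-<tag>' or 'O'.
    """
    if label == "O":
        return "O", None
    prefix, _, tag = label.partition("-")
    return prefix, tag


def coarse_of(fine_tag: str) -> str:
    """Return the coarse-grained tag, i.e. the part before the first dot."""
    return fine_tag.split(".", 1)[0]


def fine_to_coarse_label(label: str) -> str:
    """Map a fine-grained IOB label to the matching coarse-grained IOB label."""
    prefix, tag = split_iob(label)
    if tag is None:
        return "O"
    return f"{prefix}-{coarse_of(tag)}"


print(fine_to_coarse_label("B-pers.ind.articleauthor"))
print(fine_to_coarse_label("I-loc.adm.town"))
```

This only relies on the naming pattern visible in the table; it does not check that a tag actually belongs to the tagset.
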
The hipe2020 dataset can be used for:
- Tasks: NERC-Coarse, NERC-Fine, NEL.
- Challenges: Multilingual Newspaper Coarse, Multilingual Newspaper Fine, Global Adaptation Coarse.
- Annotation guidelines: mostly compatible with letemps and newseye datasets.
- Documents: hipe2020 documents correspond to newspaper articles.
- Train set: for this dataset, there is no training set, only a dev set that is representative of the test set in terms of newspapers and periods.
- Sentence splitting: performed automatically on OCRed text using pySBD (results are not perfect).
- Metonymic sense: literal and metonymic annotations are in separate columns.
- Known glitches:
  - some negative offsets in Partial are wrong/off
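
Since the literal and metonymic annotations sit in separate columns of the `.tsv` files, a reader can compare them token by token. The following is a minimal sketch, not an official loader. The column names (`TOKEN`, `NE-COARSE-LIT`, `NE-COARSE-METO`), the `# ` comment-line convention, and the sample rows are all assumptions made for illustration; check them against the actual files before reuse.

```python
import csv

# Illustrative sample only: column names and rows are assumptions,
# not an excerpt from the dataset.
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\n"
    "# document_id = example-doc\n"
    "Paris\tB-loc\tB-org\n"
    "announced\tO\tO\n"
    "reforms\tO\tO\n"
)


def read_rows(text):
    """Parse tab-separated token rows into dictionaries.

    Blank lines and lines starting with '# ' are skipped (assumed to be
    sentence separators and metadata comments, respectively).
    """
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.startswith("# ")
    ]
    # QUOTE_NONE keeps quote characters in OCRed tokens untouched.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def metonymic_tokens(rows):
    """Return (token, literal label, metonymic label) for tokens that
    carry a metonymic annotation."""
    return [
        (r["TOKEN"], r["NE-COARSE-LIT"], r["NE-COARSE-METO"])
        for r in rows
        if r["NE-COARSE-METO"] != "O"
    ]


rows = read_rows(SAMPLE)
print(metonymic_tokens(rows))
```

Keeping both labels in the returned tuple makes it easy to inspect cases where the literal and metonymic senses differ, such as a place name used to refer to an institution.
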
HIPE-2022 v1.0 release notes