Datasets

Listed here are prominent datasets centered on the tasks of event detection, event extraction, and event-event relation extraction.

Tables / Column Meaning

This page has tables which present details about the datasets. The following section explains what each column means.

Data Source

The source of the documents in the corpus (e.g., news articles, Wikipedia).

Annotation

The kinds of annotation the corpus holds (e.g., events, entities, coreference).

Density

Annotating events and event relations is considered a very challenging and expensive task. Consequently, some datasets are exhaustively annotated so that all events in a given text are covered, while other datasets annotate only some of the events that exist in the text.

ℹ️ Partial-exhaustive annotation means that only part of the text is annotated (for example, only the first x sentences in any given document in the corpus), and in those selected sentences, the annotation is exhaustive.

ℹ️ In the non-exhaustive case, the task of event detection/extraction usually cannot be performed, and the event mention spans are given as part of the input to the model.
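The three density settings can be illustrated with a minimal Python sketch. The function names and the `first_x` parameter are hypothetical and chosen only for illustration; they are not part of any dataset's tooling. The sketch assumes that partial-exhaustive annotation covers the first `first_x` sentences, as in the example above.

```python
def scorable_sentences(num_sentences: int, density: str, first_x: int = 0) -> list[int]:
    """Return indices of sentences in which every event is assumed to be annotated.

    Only these sentences can be used to score event detection fairly, because
    an unannotated event elsewhere would be counted as a false positive.
    """
    if density == "exhaustive":
        # Every sentence in the document is fully annotated.
        return list(range(num_sentences))
    if density == "partial-exhaustive":
        # Assumption: only the first `first_x` sentences are annotated, exhaustively.
        return list(range(min(first_x, num_sentences)))
    if density == "non-exhaustive":
        # No sentence is guaranteed to have all of its events annotated.
        return []
    raise ValueError(f"unknown density: {density!r}")


def needs_gold_mentions(density: str) -> bool:
    """In the non-exhaustive case, mention spans are given to the model as input."""
    return density == "non-exhaustive"
```

For a five-sentence document, `scorable_sentences(5, "partial-exhaustive", first_x=2)` keeps only sentences 0 and 1, while the non-exhaustive setting keeps none and relies on gold mention spans instead.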

Scope

Two main settings exist for the event-event relation extraction task:

ℹ️ Within Document (WD) event-event relation extraction is the task of identifying event relations between pairs of event mentions within a single document.

ℹ️ Cross Document (CD) event-event relation extraction is the task of identifying event relations between pairs of event mentions within a single document and across multiple documents.
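The difference between the two settings shows up in which mention pairs are considered as candidates. The following is a minimal sketch; the `Mention` type and the `candidate_pairs` function are hypothetical names introduced here for illustration, and real systems typically restrict CD pairs further (for example, to documents on the same topic).

```python
from itertools import combinations
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical event mention, identified by its document and a mention id."""
    doc_id: str
    mention_id: str


def candidate_pairs(mentions: list[Mention], scope: str) -> list[tuple[Mention, Mention]]:
    """Enumerate unordered mention pairs for the WD or CD setting."""
    if scope not in ("WD", "CD"):
        raise ValueError(f"unknown scope: {scope!r}")
    pairs = []
    for a, b in combinations(mentions, 2):
        same_doc = a.doc_id == b.doc_id
        # WD keeps only pairs inside one document;
        # CD keeps pairs inside one document and across documents.
        if scope == "WD" and not same_doc:
            continue
        pairs.append((a, b))
    return pairs
```

With two mentions in one document and one mention in another, the WD setting yields a single candidate pair, while the CD setting yields all three.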

Lang

The languages for which the dataset contains annotations.

License

While some datasets are free and open, others are more restricted. This might be a factor when considering whether to use a particular dataset.
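When comparing datasets programmatically, the columns described above can be modeled as a small record type. This is only a sketch: `DatasetRow`, `ROWS`, and `select` are hypothetical names, the scope labels `"WD"` and `"WD+CD"` are shorthand chosen here, and the three example rows copy their values from the tables on this page (CaTeRS, ECB+, MAVEN-ERE).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRow:
    """One table row; field names mirror the column names explained above."""
    name: str
    data_source: str
    documents: int
    events: int
    density: str
    annotation: tuple[str, ...]
    scope: str
    lang: tuple[str, ...]
    license: Optional[str]  # None when the table lists no license


ROWS = [
    DatasetRow("CaTeRS", "ROCStories", 320, 2708, "exhaustive",
               ("events", "causal", "temporal"), "WD", ("eng",), None),
    DatasetRow("ECB+", "News", 982, 6833, "partial-exhaustive",
               ("events", "entities", "coreference"), "WD+CD", ("eng",), "CC-BY"),
    DatasetRow("MAVEN-ERE", "Wikipedia", 4480, 103193, "exhaustive",
               ("events", "coreference", "temporal", "causal", "sub-events"),
               "WD", ("eng",), "CC BY-NC-SA 3.0"),
]


def select(rows, density=None, annotation=None):
    """Filter rows by density and/or by a required annotation type."""
    out = []
    for row in rows:
        if density is not None and row.density != density:
            continue
        if annotation is not None and annotation not in row.annotation:
            continue
        out.append(row)
    return out
```

For instance, `select(ROWS, density="exhaustive", annotation="causal")` keeps the rows that are both exhaustively annotated and include causal relations.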

Datasets

Automatic Content Extraction (ACE)

ACE is a thorough set of event annotation guidelines, with versions for multiple languages. Additionally, ACE serves as the base annotation scheme for many subsequent event annotation guidelines.

References

CaTeRS: Causal and Temporal Relation Scheme

A novel semantic annotation framework, called Causal and Temporal Relation Scheme (CaTeRS), which is unique in simultaneously capturing a comprehensive set of temporal and causal relations between events.

The corpus annotates a total of 1,600 sentences in the context of 320 five-sentence short stories sampled from the ROCStories corpus.

References (2016)

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| ROCStories | 320 | 2,708 | exhaustive | events, causal, temporal | within documents | eng | --- |

EventCorefBank Extension (ECB+)

An extended version of the EventCorefBank (ECB), this dataset is the most commonly used dataset for training and testing models for the CD event coreference task. ECB+ consists of documents partitioned into 43 clusters, each corresponding to a certain news topic.

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 982 | 6,833 | partial-exhaustive | events, entities, coreference | within and cross documents | eng | CC-BY |

Entities, Relations and Events (ERE)

Event-Event Relations (EER)

EER Annotation focuses on relations between events in the ERE/ACE taxonomy, both within document and cross-document.

References

| Data Source | Documents | Events | Density | Annotation | Lang | License |
|---|---|---|---|---|---|---|
| News | 125 | 863 | partial-exhaustive | events, coreference, temporal, causal, subevent | TPD | Free |

Event StoryLine Corpus (ESC)

An annotation scheme and benchmark dataset for temporal and causal relation detection. The annotation builds on and extends the ECB+ annotation scheme.

References (2017)

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 258 | 7,275 | partial-exhaustive | events, entities, coreference, temporal, causal | within and cross document | eng | CC-BY |

Gun Violence Corpus (GVC)

GVC is an automatically annotated dataset for the cross-document coreference task.

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Police Reports | 510 | 7,298 | non-exhaustive | events, event arguments, coreference | within and cross document | eng | CC |

HiEve

A corpus for recognizing relations of spatiotemporal containment between events. The narratives are represented as hierarchies of events based on relations of spatiotemporal containment (i.e., superevent–subevent relations).

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 100 | ~32 per-doc | non-exhaustive | events, coreference, sub-events | within document | eng | CC BY-NC-SA 3.0 |

HyperCoref

A method for collecting a large-scale cross-document event coreference dataset from news articles, leveraging hyperlinks from event mentions that point to the same news article.

References

MAVEN

The MAssive eVENt detection dataset (MAVEN) alleviates the data scarcity problem and covers much more general event types.

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikipedia | 4,480 | 118,732 | exhaustive | events | within document | eng | ?? |

MAVEN-ERE

A unified, large-scale, human-annotated dataset (built on top of the MAVEN dataset), containing events, event coreference chains, temporal relations, causal relations, and subevent relations.

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikipedia | 4,480 | 103,193 | exhaustive | events, coreference, temporal, causal, sub-events | within document | eng | CC BY-NC-SA 3.0 |

MATRES

MATRES proposes a new multi-axis modeling to better capture the temporal structure of events. In addition, its authors identify event end-points as a major source of confusion in annotation, and therefore propose annotating TempRels based on start-points only.

References

MEANTIME

The MEANTIME corpus is a semantically annotated corpus of Wikinews articles. MEANTIME and ECB+ use the same NewsReader annotation guidelines. The corpus consists of 480 news articles in English, Spanish, Italian, and Dutch.

References

| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikinews | 480 | 2,107 | exhaustive | events, entities, coreference | within and cross document | eng, it, nl, es | CC-BY |

Richer Event Description (RED)

Richer Event Description is an attempt to bring together a number of existing and well-researched veins of document annotation into a single representation of the events and participants in a discourse. It is not concerned with semantic role annotation in the traditional sense.

References

| Data Source | Docs | Events | Density | Annotation | License |
|---|---|---|---|---|---|
| News | 95 | 8,731 | exhaustive | entities, events, coreference, temporal, causal, subevent | LDC |

The Penn Discourse TreeBank (PDTB)

The Penn Discourse Treebank (PDTB) is a discourse-level annotation over the 1M-word Wall Street Journal corpus. The annotation consists of events, event arguments (entities), and the relations between them (event-event, event-entity, and entity-entity).

References

Wikipedia Event Coreference (WEC)

WEC is an automatic annotation method for extracting a large-scale corpus from Wikipedia articles (in supported languages). WEC-Eng is the corpus generated by WEC from the English Wikipedia.

References

| Data Source | Docs | Events | Density | Annotation | Scope | License |
|---|---|---|---|---|---|---|
| Wikipedia | NA | 43,672 | non-exhaustive | events, coreference | cross-document | CC BY-SA |