Listed here are prominent datasets centered on the tasks of event detection, event extraction, and event-event relation extraction.
This page has tables which present details about the datasets. The following section explains what each column means.
The source of the documents in the corpus (e.g., news articles, Wikipedia, etc.)
What kind of annotation the corpus holds (e.g., events, entities, coreference, etc.)
Annotating events and event relations is considered a very challenging and expensive task. Consequently, some datasets are exhaustively annotated so that every event in a given text is covered, while others annotate only some of the events that appear in the text.
ℹ️ Partial-exhaustive annotation means that only part of the text is annotated (for example, only the first x sentences in any given document in the corpus), and in those selected sentences, the annotation is exhaustive.
ℹ️ In the non-exhaustive case, the task of event detection/extraction usually cannot be performed, and the event mention spans are given as part of the input to the model.
Two main settings exist for the event-event relation extraction task:
ℹ️ Within Document (WD) event-event relation extraction is the task of identifying event relations between pairs of event mentions within a single document.
ℹ️ Cross Document (CD) event-event relation extraction is the task of identifying event relations between pairs of event mentions within a single document and across multiple documents.
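The difference between the two settings can be illustrated by the candidate mention pairs each one considers. The following is a minimal sketch, not taken from any of the listed datasets: the `mentions` list, the `candidate_pairs` helper, and the `"WD"`/`"CD"` setting strings are all hypothetical names chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy mentions: (document id, mention id).
mentions = [
    ("doc1", "m1"),
    ("doc1", "m2"),
    ("doc2", "m3"),
]


def candidate_pairs(mentions, setting):
    """Return candidate event mention pairs for the "WD" or "CD" setting."""
    if setting not in ("WD", "CD"):
        raise ValueError(f"unknown setting: {setting!r}")
    pairs = []
    for (doc_a, m_a), (doc_b, m_b) in combinations(mentions, 2):
        # WD keeps only pairs from the same document;
        # CD keeps same-document pairs and pairs across documents.
        if setting == "CD" or doc_a == doc_b:
            pairs.append((m_a, m_b))
    return pairs


wd_pairs = candidate_pairs(mentions, "WD")
cd_pairs = candidate_pairs(mentions, "CD")
print(wd_pairs)
print(cd_pairs)
```

In this toy input, the WD setting yields only the pair from `doc1`, while the CD setting also pairs each `doc1` mention with the `doc2` mention.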
The languages for which the dataset contains annotations.
While some datasets are free and open, others are more restricted. This might be a factor when considering whether to use a particular dataset.
- Automatic Content Extraction (ACE)
- CaTeRS
- EventCorefBank Extension (ECB+)
- Entities, Relations and Events (ERE)
- Event-Event Relations (EER)
- Event StoryLine Corpus (ESC)
- Gun Violence Corpus (GVC)
- HiEve
- HyperCoref
- Richer Event Description (RED)
- TB-Dense
- The Penn Discourse TreeBank
- Wikipedia Event Coreference (WEC)
ACE is a thorough set of event annotation guidelines, available for multiple languages. Additionally, ACE serves as the base annotation scheme for many later event annotation guidelines.
The Causal and Temporal Relation Scheme (CaTeRS) is a semantic annotation framework that simultaneously captures a comprehensive set of temporal and causal relations between events.
The corpus annotates a total of 1,600 sentences in 320 five-sentence short stories sampled from the ROCStories corpus.
- CaTeRS: Causal and Temporal Relation Scheme for Semantic Annotation of Event Structures
- CaTeRS Dataset
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| ROCStories | 320 | 2,708 | exhaustive | events, causal, temporal | within documents | eng | --- |
An extended version of the EventCorefBank (ECB), ECB+ is the most commonly used dataset for training and testing models for the CD event coreference task. It consists of documents partitioned into 43 clusters, each corresponding to a news topic.
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 982 | 6,833 | partial-exhaustive | events, entities, coreference | within and cross documents | eng | CC-BY |
- Light ERE: designed as a lighter-weight version of ACE and a simple approach to entity, relation, and event annotation, with the goal of making annotation easier and more consistent.
- TBD link to paper and dataset
- Rich ERE: expands on both the inventories and the taggability of Light ERE.
EER Annotation focuses on relations between events in the ERE/ACE taxonomy, both within document and cross-document.
- Building a Cross-document Event-Event Relation Corpus
- Corpus-TBD link in paper doesn't work
| Data Source | Documents | Events | Density | Annotation | Lang | License |
|---|---|---|---|---|---|---|
| News | 125 | 863 | partial-exhaustive | events, coreference, temporal, causal, subevent | TPD | Free |
An annotation scheme and benchmark dataset for temporal and causal relation detection. The annotation builds on and extends the ECB+ annotation scheme.
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 258 | 7,275 | partial-exhaustive | events, entities, coreference, temporal, causal | within and cross document | eng | CC-BY |
GVC is an automatically annotated dataset for the cross-document coreference task.
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Police Reports | 510 | 7,298 | non-exhaustive | events, event arguments, coreference | within and cross document | eng | CC |
A corpus for recognizing relations of spatiotemporal containment between events. The narratives are represented as hierarchies of events based on relations of spatiotemporal containment (i.e., superevent–subevent relations).
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| News | 100 | ~32 per-doc | non-exhaustive | events, coreference, sub-events | within document | eng | CC BY-NC-SA 3.0 |
A method for collecting a large scale cross-document event coreference dataset from news articles, leveraging the hyperlinks of events that point to the same news article.
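The core idea, treating mentions whose hyperlinks point to the same article as coreferent, can be sketched as a simple grouping step. This is only an illustration under stated assumptions, not the authors' pipeline: the `linked_mentions` input, the example URLs, and the `cluster_by_link_target` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy input: (mention text, hyperlink target URL).
linked_mentions = [
    ("the earthquake", "https://example.org/news/quake"),
    ("Tuesday's tremor", "https://example.org/news/quake"),
    ("the election", "https://example.org/news/vote"),
]


def cluster_by_link_target(linked_mentions):
    """Group mentions into coreference clusters keyed by link target."""
    clusters = defaultdict(list)
    for text, url in linked_mentions:
        # Mentions that link to the same article land in the same cluster.
        clusters[url].append(text)
    return dict(clusters)


clusters = cluster_by_link_target(linked_mentions)
for url, texts in clusters.items():
    print(url, texts)
```

A real collection process would also need to fetch and filter the pages; the sketch covers only the clustering signal described above.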
The MAssive eVENt detection dataset (MAVEN) alleviates the data scarcity problem and covers much more general event types.
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikipedia | 4,480 | 118,732 | exhaustive | events | within document | eng | ?? |
A unified, large-scale, human-annotated dataset (built on top of the MAVEN dataset), containing events, event coreference chains, temporal relations, causal relations, and subevent relations.
- MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction
- MAVEN-ERE Github
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikipedia | 4,480 | 103,193 | exhaustive | events, coreference, temporal, causal, sub-events | within document | eng | CC BY-NC-SA 3.0 |
MATRES proposes a multi-axis modeling scheme to better capture the temporal structure of events. Its authors also identify event end-points as a major source of confusion in annotation, and therefore annotate temporal relations (TempRels) based on start-points only.
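To make the start-point-only idea concrete, the sketch below labels a pair of events by comparing only when each one starts. It is an illustration under assumptions, not the MATRES annotation procedure: in the corpus the labels are human judgments made on text, whereas here the numeric start times, the `start_point_relation` helper, and the label strings are hypothetical.

```python
def start_point_relation(start_a, start_b):
    """Label the temporal relation between two events from start points only.

    Start times are hypothetical numeric timestamps; None marks a start
    point that cannot be determined.
    """
    if start_a is None or start_b is None:
        return "VAGUE"
    if start_a < start_b:
        return "BEFORE"
    if start_a > start_b:
        return "AFTER"
    return "EQUAL"


# Event A runs from 1 to 10, event B from 3 to 5. Their end-points
# (10 vs. 5) play no role: only the start-points 1 and 3 are compared.
relation = start_point_relation(1, 3)
print(relation)
```

Ignoring end-points means that overlapping events such as A and B above still receive a single, unambiguous label.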
MEANTIME is a semantically annotated corpus of Wikinews articles. MEANTIME and ECB+ use the same NewsReader annotation guidelines. The corpus consists of 480 news articles in English, Spanish, Italian, and Dutch.
| Data Source | Documents | Events | Density | Annotation | Scope | Lang | License |
|---|---|---|---|---|---|---|---|
| Wikinews | 480 | 2,107 | exhaustive | events, entities, coreference | within and cross document | eng, ita, spa, nld | |
Richer Event Description is an attempt to bring together a number of existing and well-researched veins of document annotation into a single representation of the events and participants in a discourse. It is not concerned with semantic role annotation in the traditional sense.
- Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation
- RED Annotation Guidelines
- RED Corpus
| Data Source | Docs | Events | Density | Annotation | License |
|---|---|---|---|---|---|
| News | 95 | 8,731 | exhaustive | entities, events, coreference, temporal, causal, subevent | |
The Penn Discourse Treebank (PDTB) is a discourse-level annotation over the 1M-word Wall Street Journal corpus. The annotation consists of events, event arguments (entities), and the relations between them (event-event, event-entity, and entity-entity).
WEC is an automatic annotation method for extracting a large-scale corpus from Wikipedia articles (in supported languages). WEC-Eng is the corpus generated by WEC from the English Wikipedia.
- WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia
- WEC Annotation Process
- WEC-Eng Corpus
| Data Source | Docs | Events | Density | Annotation | Scope | License |
|---|---|---|---|---|---|---|
| Wikipedia | NA | 43,672 | non-exhaustive | events, coreference | cross-document | CC BY-SA |