This repository provides the models and the source code for the master's thesis "Extraction and Classification of Time in Unstructured Data" (2023). Furthermore, it describes the steps required to reproduce the thesis results. The transformer-based models are finetuned to extract and classify temporal expressions in unstructured text.
The models produced for the thesis build on two frameworks, UIE and MaChAmp. Both repositories were forked in August 2023 and modified for the task of temporal extraction and classification. Most changes were applied to the evaluation and dataset preprocessing scripts; the scripts for finetuning and inference remain very close to the original versions:
- Unified Structure Generation for Universal Information Extraction (UIE) [Lu et al., 2022] - GitHub Link
- UIE is a sequence-to-sequence framework that extracts various information extraction targets (such as entities, relations, and events) into a graph structure called "Structured Extraction Language" (SEL; see the short sketch after this list). It is based on the T5 model [Raffel et al., 2020].
- Massive Choice, Ample Tasks (MaChAmp) [van der Goot et al., 2020] - GitHub Link
- MaChAmp is a multitask learning framework. In this thesis, it is used to train BERT-based models in a single-task fashion.
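For illustration, the following is a minimal sketch of how temporal entities might be linearized into a SEL-style target sequence. The exact bracket tokens and class names are assumptions based on the general format in [Lu et al., 2022], not the output of the thesis's preprocessing scripts:

```python
# Rough illustration of a SEL-style target sequence for temporal entities.
# The bracketing scheme follows the shape described in [Lu et al., 2022];
# the concrete tokens used for temporal classes are assumptions.
def to_sel(entities):
    # entities: list of (temporal_class, surface_span) pairs
    spans = " ".join(f"( {cls}: {span} )" for cls, span in entities)
    return f"( {spans} )"

print(to_sel([("date", "next Monday"), ("duration", "two weeks")]))
# ( ( date: next Monday ) ( duration: two weeks ) )
```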
In the thesis, a 10-fold cross-validation approach was chosen to evaluate the two frameworks. This documentation describes both a quickstart and a full reproduction of all the steps; both approaches are described separately for each of the two frameworks.
The steps for the two frameworks are almost the same, but they are achieved using different scripts and conventions, described in detail in the respective documentation. The documentation for UIE and MaChAmp can be found in the folders with the same name. This page gives some general information, introduces the project, and refers to the necessary pages.
The overall project structure looks like this:
temporal-extraction
├── uie # Contains all the scripts and documentation related to UIE
├── machamp # Contains all the scripts and documentation related to MaChAmp
├── results # Contains the result tables and log files of the finetuned models used in the thesis
├── temporal-data # Contains the datasets, as well as the scripts required for conversion
├── docs # Contains assets for the documentation
This is the main directory (temporal-extraction). Both uie and machamp contain the framework-specific documentation required to both use the models and fully reproduce the steps in the thesis.
The results folder contains the result tables and log files produced for the thesis. In particular, it shows the results for every dataset and every fold of the cross-validation. Furthermore, it contains files that list the exact error cases, i.e., the sentences for which the model produced incorrect predictions.
The temporal-data folder contains all the converted datasets and the publicly available original versions. It also contains the scripts to convert the original datasets into the required format.
Multiple data formats can be found in this repository. Each of the used datasets follows a general XML or JSON format, but most of them differ in the details and are therefore not directly comparable.
The graphic shows the datasets, formats, and the relations between them.
The temporal-data documentation describes the different formats in more detail, as well as the scripts to convert them into a uniform format. In summary, the thesis uses four datasets, some of which consist of multiple subsets (for example, TempEval-3 is a union of AQUAINT and TimeBank). The MaChAmp framework requires a BIO format, while UIE uses its own JSON-based format. Furthermore, the author of the thesis introduced a JSONLINES format, which serves as an intermediate step when converting to the other formats.
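To make the relation between the formats concrete, here is a minimal sketch of converting a JSONLINES-style entry into BIO lines. The field names ("text", "entity", "type", "offset") are assumptions for illustration; the actual keys and conversion logic are defined by the scripts in the temporal-data folder:

```python
import json

def jsonlines_to_bio(line: str) -> str:
    # Field names below are illustrative assumptions, not the exact schema.
    entry = json.loads(line)
    tokens = entry["text"].split()          # assumes pre-tokenized text
    labels = ["O"] * len(tokens)
    for entity in entry.get("entity", []):  # assumed span annotations
        start, end = entity["offset"]       # assumed token offsets [start, end)
        labels[start] = f"B-{entity['type']}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity['type']}"
    # One "token<TAB>label" pair per line, as required by the BIO format
    return "\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))

example = '{"text": "See you next Monday", "entity": [{"type": "date", "offset": [2, 4]}]}'
print(jsonlines_to_bio(example))
```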
To use this repository, it is recommended to use Anaconda. With Anaconda, a separate environment can be created for each of the two frameworks. From this directory, the following commands may be used:
UIE:
conda create -n uie python=3.8
conda activate uie
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r uie/requirements.txt
MaChAmp:
conda create -n machamp python=3.8
conda activate machamp
pip install -r machamp/requirements.txt
Before the models are used, it is recommended to prepare the data first.
Quickstart:
- Setup Anaconda environment
- Prepare the data
- Download the finetuned models
- Select the dataset and run the inference script
Full reproduction:
- Setup Anaconda environment
- Prepare the data
- Prepare the cross-validation approach (see the sketch after this list)
- Download the clean UIE models
- Finetune the models on each of the folds
- Run the cross-validation evaluation scripts to get the results
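The cross-validation preparation can be pictured with the following minimal sketch, assuming the data is available as a single JSONLINES file with one sample per line. Paths and file names are illustrative; the actual fold splitting is done by the scripts in the framework folders:

```python
import random
from pathlib import Path

def make_folds(dataset_path: str, out_dir: str, k: int = 10, seed: int = 42):
    # Shuffle once with a fixed seed so the folds are reproducible
    lines = Path(dataset_path).read_text(encoding="utf-8").splitlines()
    random.Random(seed).shuffle(lines)
    for fold in range(k):
        test = lines[fold::k]  # every k-th sample belongs to this fold's test set
        train = [l for i, l in enumerate(lines) if i % k != fold]
        fold_dir = Path(out_dir) / f"fold_{fold}"
        fold_dir.mkdir(parents=True, exist_ok=True)
        (fold_dir / "train.jsonlines").write_text("\n".join(train), encoding="utf-8")
        (fold_dir / "test.jsonlines").write_text("\n".join(test), encoding="utf-8")
```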
The following two models should be used for finetuning. These are the same models as proposed in the original paper [Lu et al., 2022].
- UIE Base: ZFDM Download, Google Drive Download
- UIE Large: ZFDM Download, Google Drive Download
The thesis tested a single-class and a multi-class setup on all datasets and their subsets. Due to the large number of models, only the multi-class (date, time, duration, set) models for the four temporal datasets are shared and made available for download. Generally speaking, the single-class models do not perform much better despite the easier task, which makes them obsolete in practice. The following table shows the multi-class results on the different datasets:
The table shows the temporal extraction and classification performance of the models produced in the thesis. M stands for MaChAmp models. The bottom part of the table shows the performance of related work. "Strict" means an exact match, while "Type" means a match where at least one token overlaps (also known as a "relaxed" match) and the temporal class is correct.
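For illustration, the two match criteria can be sketched as follows, modeling predictions and gold annotations as (start, end, type) token spans. This is a simplified sketch; the actual evaluation scripts in this repository may differ in detail:

```python
def strict_match(pred, gold):
    # Exact same span boundaries and the same temporal class
    return pred == gold

def type_match(pred, gold):
    # "Relaxed": at least one overlapping token, plus the correct class
    (ps, pe, pt), (gs, ge, gt) = pred, gold
    return pt == gt and max(ps, gs) < min(pe, ge)

# Gold span "next Monday" vs. a prediction covering only "Monday":
print(strict_match((3, 4, "date"), (2, 4, "date")))  # False
print(type_match((3, 4, "date"), (2, 4, "date")))    # True
```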
UIE GitHub Link, [Lu et al., 2022]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Devlin et al., 2018]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Liu et al., 2019]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Conneau et al., 2019]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, [Devlin et al., 2018]
Dataset | Base | Citation |
---|---|---|
TempEval-3 | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | [Zarcone et al., 2020] |
Even though the dataset was not explicitly used in the thesis, scripts to convert the TempEval-3 relation extraction dataset to the UIE format were created and tested prototypically. The converted dataset and the conversion scripts are available in the temporal-data folder. This dataset may be used for future work.
- [van der Goot et al., 2020] van der Goot, R., Üstün, A., Ramponi, A., Sharaf, I., and Plank, B. (2020). Massive choice, ample tasks (MaChAmp): A toolkit for multi-task learning in NLP. arXiv preprint arXiv:2005.14672.
- [Derczynski et al., 2012] Derczynski, L., Llorens, H., and Saquete, E. (2012). Massively increasing TIMEX3 resources: A transduction approach. arXiv preprint arXiv:1203.5076.
- [Mazur and Dale, 2010] Mazur, P. and Dale, R. (2010). WikiWars: A new corpus for research on temporal expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 913–922.
- [Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.