MaChAmp and UIE frameworks applied to the task of extracting temporal information from unstructured data.

skonline90/Temporal-Extraction

Introduction

This repository provides the models and the source code for the master's thesis "Extraction and Classification of Time in Unstructured Data" (2023). Furthermore, it describes the steps required to reproduce the thesis results. The transformer-based models are finetuned to extract and classify temporal expressions in unstructured text.

The models produced by the thesis utilize the two frameworks, UIE and MaChAmp. Both repositories were forked in August 2023 and modified to suit the problem of temporal extraction and classification. Most changes were applied to the evaluation and dataset preprocessing scripts. The scripts for finetuning and inference remain very close to the original versions:

  • Unified Structure Generation for Universal Information Extraction (UIE) [Lu et al., 2022] - GitHub Link
    • UIE is a sequence-to-sequence framework that encodes various information extraction targets (such as entities, relations, and events) in a uniform structure called the "structured extraction language" (SEL). It is based on the T5 model [Raffel et al., 2020].
  • Massive Choice, Ample Tasks (MACHAMP) [van der Goot et al., 2020] - GitHub Link
    • MaChAmp is a multitask learning framework. In this thesis, it is used to train BERT-based models in a single-task fashion.

In the thesis, 10-fold cross-validation was used to test the two frameworks. This documentation describes both a quickstart and a full reproduction of all the steps. Each approach is described separately for each of the two frameworks.
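As a rough illustration of what a 10-fold split involves, the following sketch partitions a list of documents into ten held-out folds. It is not the repository's own script: the function name `k_fold_splits`, the fixed seed, and the round-robin assignment are assumptions made for this example, and the framework-specific documentation describes the actual procedure.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(items: Sequence, k: int = 10, seed: int = 0) -> List[Tuple[list, list]]:
    """Return k (train, test) pairs; every item lands in exactly one test fold."""
    indices = list(range(len(items)))
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one item of each other.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = set(folds[held_out])
        train = [items[i] for i in indices if i not in test_idx]
        test = [items[i] for i in folds[held_out]]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Hypothetical document identifiers, used only to show the fold sizes.
    docs = [f"doc_{i}" for i in range(25)]
    for fold_id, (train, test) in enumerate(k_fold_splits(docs)):
        print(fold_id, len(train), len(test))
```

Each model is then finetuned on the training part of a fold and evaluated on the held-out part, so that every document is used for testing exactly once.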

The steps for the two frameworks are almost the same, but they are achieved using different scripts and conventions, described in detail in the respective documentation. The documentation for UIE and MaChAmp can be found in the folders with the same name. This page gives some general information, introduces the project, and refers to the necessary pages.

Project Overview

The overall project structure looks like this:

temporal-extraction
├── uie                 # Contains all the scripts and documentation related to UIE
├── machamp             # Contains all the scripts and documentation related to MaChAmp
├── results             # Contains the result tables and log files of the finetuned models used in the thesis 
├── temporal-data       # Contains the datasets, as well as the scripts required for conversion
├── docs                # Contains assets for the documentation

This is the main directory (temporal-extraction). Both uie and machamp contain the framework-specific documentation required to both use the models and fully reproduce the steps in the thesis.

The results folder shows the result tables and log files produced by the thesis. In particular, it shows the results for every dataset and every fold in the cross-validation. Furthermore, it contains the files that display the exact error cases, i.e., where the model mispredicted the sentence.

The temporal-data folder contains all the converted datasets and the publicly available original versions. It also contains the scripts to convert the original datasets into the required format.

Data

Multiple data formats can be found in this repository. Each of the original datasets is distributed as XML or JSON. However, most datasets structure their annotations differently and are therefore not directly comparable in their original form.

Temporal Conversion Formats Overview

The graphic shows the datasets, formats, and the relations between them.

The temporal-data section describes the different formats in more detail, along with the scripts that convert them into a uniform format. In summary, the thesis uses four datasets, some of which consist of multiple subsets (for example, TempEval-3 is a union of AQUAINT and TimeBank). The MaChAmp framework requires a BIO format, while UIE uses its own JSON-based format. Furthermore, the author of the thesis introduced a JSONLINES format, which serves as an intermediate step when converting to the other formats.
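To make the relation between a line-based JSON record and BIO tags more concrete, here is a minimal sketch. The record layout (`tokens`, `spans`, `start`, `end`, `type`) is hypothetical and chosen only for this example; the actual JSONLINES schema and the conversion scripts are documented in the temporal-data folder.

```python
import json
from typing import Dict, List

# Hypothetical record: one JSON object per line, with token-level spans
# whose "end" index is exclusive. The real schema may differ.
EXAMPLE_LINE = (
    '{"tokens": ["The", "meeting", "lasted", "two", "hours", "on", "Monday", "."], '
    '"spans": [{"start": 3, "end": 5, "type": "duration"}, '
    '{"start": 6, "end": 7, "type": "date"}]}'
)


def spans_to_bio(tokens: List[str], spans: List[Dict]) -> List[str]:
    """Tag the first token of each span with B-<type>, the rest with I-<type>."""
    tags = ["O"] * len(tokens)
    for span in spans:
        start, end, label = span["start"], span["end"], span["type"]
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


if __name__ == "__main__":
    record = json.loads(EXAMPLE_LINE)
    for token, tag in zip(record["tokens"], spans_to_bio(record["tokens"], record["spans"])):
        print(f"{token}\t{tag}")
```

In this sketch, "two hours" becomes `B-duration I-duration` and "Monday" becomes `B-date`, while all other tokens receive `O`.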

Anaconda

To use this repository, it is recommended to use Anaconda. With Anaconda, a separate environment can be created for each of the two frameworks. From this directory, the following commands may be used:

UIE:

conda create -n uie python=3.8
conda activate uie
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r uie/requirements.txt

MaChAmp:

conda create -n machamp python=3.8
conda activate machamp
pip install -r machamp/requirements.txt

Reproduction Steps Summary

Before the models are used, it is recommended to prepare the data first.

Quickstart

  • Set up the Anaconda environment
  • Prepare the data
  • Download the finetuned models
  • Select the dataset and run the inference script

Reproduce the thesis steps

  • Set up the Anaconda environment
  • Prepare the data
  • Prepare the cross-validation approach
  • Download the clean UIE models
  • Finetune the models on each of the folds
  • Run the cross-validation evaluation scripts to get the results

Clean UIE Models

The following two models should be used for finetuning. These are the same models as proposed in the original paper [Lu et al., 2022].

Finetuned Models

The thesis tested a single-class and a multi-class setup on all datasets and their subsets. Due to the large number of models, only the multi-class (date, time, duration, set) models for the four temporal datasets are available for download. Generally speaking, the single-class models do not perform much better despite solving an easier task, which makes them obsolete in practice. The following table shows the multi-class results on the different datasets:

This table compares the most important metrics, “Strict-F1” and “RelaxedType-F1,” for the temporal extraction and classification tasks across all datasets and all models. The best three values per column are highlighted with bold font.

The table shows the temporal extraction and classification performance of the models produced in the thesis. M stands for MaChAmp models. The bottom part of the table shows the performance of related work. "Strict" means an exact match. "Type" means a "relaxed" match, i.e., at least one token overlaps, in which the temporal class is also correct.
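The two match criteria can be sketched as follows. This is a simplified illustration, not the evaluation code used in the thesis: the span representation, the function names, and the way precision and recall are counted are assumptions, and "Strict" is interpreted here as an exact match of the span boundaries only.

```python
from typing import Callable, List, Tuple

# (start token, end token exclusive, temporal class); representation assumed for this sketch.
Span = Tuple[int, int, str]


def strict_match(gold: Span, pred: Span) -> bool:
    """Exact boundary match (the class is not compared in this sketch)."""
    return gold[0] == pred[0] and gold[1] == pred[1]


def relaxed_type_match(gold: Span, pred: Span) -> bool:
    """At least one shared token and the same temporal class."""
    overlaps = gold[0] < pred[1] and pred[0] < gold[1]
    return overlaps and gold[2] == pred[2]


def f1(gold: List[Span], pred: List[Span], match: Callable[[Span, Span], bool]) -> float:
    """F1 where a prediction counts if it matches any gold span, and vice versa."""
    if not gold or not pred:
        return 0.0
    precision = sum(any(match(g, p) for g in gold) for p in pred) / len(pred)
    recall = sum(any(match(g, p) for p in pred) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy spans invented for illustration; they are not thesis results.
    gold = [(3, 5, "duration"), (6, 7, "date")]
    pred = [(4, 5, "duration"), (6, 7, "time")]
    print("Strict-F1:", f1(gold, pred, strict_match))
    print("RelaxedType-F1:", f1(gold, pred, relaxed_type_match))
```

In the toy example, the first prediction overlaps the gold span and has the right class but the wrong boundaries, so it counts only under the relaxed criterion. The second prediction has exact boundaries but the wrong class, so it counts only under the strict criterion.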

UIE Models

UIE GitHub Link, [Lu et al., 2022]

| Dataset | Base | Large | Citation |
| --- | --- | --- | --- |
| TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
| WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
| Tweets | Download Link | Download Link | [Zhong et al., 2017] |
| Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |

MaChAmp Models

MaChAmp-BERT Models

Base Model Huggingface Link, Large Model Huggingface Link, [Devlin et al., 2018]

| Dataset | Base | Large | Citation |
| --- | --- | --- | --- |
| TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
| WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
| Tweets | Download Link | Download Link | [Zhong et al., 2017] |
| Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |

MaChAmp-RoBERTa Models

Base Model Huggingface Link, Large Model Huggingface Link, [Liu et al., 2019]

| Dataset | Base | Large | Citation |
| --- | --- | --- | --- |
| TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
| WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
| Tweets | Download Link | Download Link | [Zhong et al., 2017] |
| Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |

MaChAmp-XLM-RoBERTa Models

Base Model Huggingface Link, Large Model Huggingface Link, [Conneau et al., 2019]

| Dataset | Base | Large | Citation |
| --- | --- | --- | --- |
| TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
| WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
| Tweets | Download Link | Download Link | [Zhong et al., 2017] |
| Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |

MaChAmp-mBERT Models

Base Model Huggingface Link, [Devlin et al., 2018]

| Dataset | Base | Citation |
| --- | --- | --- |
| TempEval-3 | Download Link | [UzZaman et al., 2013] |
| WikiWars | Download Link | [Derczynski et al., 2012] |
| Tweets | Download Link | [Zhong et al., 2017] |
| Fullpate | Download Link | [Zarcone et al., 2020] |

Relation Extraction Dataset

Although the TempEval-3 relation extraction dataset was not used in the thesis itself, scripts to convert it to the UIE format were created and tested as a prototype. The converted dataset and the conversion scripts are available in the temporal-data folder. This dataset may be used for future work.

References
