This repository provides the models and the source code for the master's thesis "Extraction and Classification of Time in Unstructured Data" (2023). Furthermore, it describes the steps required to reproduce the thesis results. The transformer-based models are finetuned to extract and classify temporal expressions in unstructured text.
The models produced for the thesis build on two frameworks, UIE and MaChAmp. Both repositories were forked in August 2023 and modified for the task of temporal extraction and classification. Most changes were applied to the evaluation and dataset preprocessing scripts; the scripts for finetuning and inference remain very close to the original versions:
- Unified Structure Generation for Universal Information Extraction (UIE) [Lu et al., 2022] - GitHub Link
- UIE is a sequence-to-sequence framework that extracts various information extraction targets (such as entities, relations, and events) into a graph structure called "Structured Extraction Language" (SEL; see the short sketch after this list). It is based on the T5 model [Raffel et al., 2020].
- Massive Choice, Ample Tasks (MaChAmp) [van der Goot et al., 2020] - GitHub Link
- MaChAmp is a multitask learning framework. In this thesis, it is used to train BERT-based models in a single-task fashion.
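For illustration, the following is a minimal sketch of how temporal entities might be linearized into a SEL-style target sequence. The exact bracket tokens and class names are assumptions based on the general format in [Lu et al., 2022], not the output of the thesis's preprocessing scripts:

```python
# Rough illustration of a SEL-style target sequence for temporal entities.
# The bracketing scheme follows the shape described in [Lu et al., 2022];
# the concrete tokens used for temporal classes are assumptions.
def to_sel(entities):
    # entities: list of (temporal_class, surface_span) pairs
    spans = " ".join(f"( {cls}: {span} )" for cls, span in entities)
    return f"( {spans} )"

print(to_sel([("date", "next Monday"), ("duration", "two weeks")]))
# ( ( date: next Monday ) ( duration: two weeks ) )
```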
In the thesis, a 10-fold cross-validation approach was chosen to evaluate the two frameworks. This documentation describes both a quickstart and a full reproduction of all the steps; both approaches are described separately for each of the two frameworks.
The steps for the two frameworks are almost the same, but they are achieved using different scripts and conventions, described in detail in the respective documentation. The documentation for UIE and MaChAmp can be found in the folders with the same name. This page gives some general information, introduces the project, and refers to the necessary pages.
The overall project structure looks like this:
temporal-extraction
├── uie # Contains all the scripts and documentation related to UIE
├── machamp # Contains all the scripts and documentation related to MaChAmp
├── results # Contains the result tables and log files of the finetuned models used in the thesis
├── temporal-data # Contains the datasets, as well as the scripts required for conversion
├── docs # Contains assets for the documentation
This is the main directory (temporal-extraction). Both uie and machamp contain the framework-specific documentation required to both use the models and fully reproduce the steps in the thesis.
The results folder contains the result tables and log files produced for the thesis. In particular, it shows the results for every dataset and every fold of the cross-validation. Furthermore, it contains files that list the exact error cases, i.e., the sentences for which the model produced incorrect predictions.
The temporal-data folder contains all the converted datasets and the publicly available original versions. It also contains the scripts to convert the original datasets into the required format.
Multiple data formats can be found in this repository. Each of the used datasets follows a general XML or JSON format, but most of them differ in the details and are therefore not directly comparable.
The graphic shows the datasets, formats, and the relations between them.
The temporal-data documentation describes the different formats in more detail, as well as the scripts to convert them into a uniform format. In summary, the thesis uses four datasets, some of which consist of multiple subsets (for example, TempEval-3 is a union of AQUAINT and TimeBank). The MaChAmp framework requires a BIO format, while UIE uses its own JSON-based format. Furthermore, the author of the thesis introduced a JSONLINES format, which serves as an intermediate step when converting to the other formats.
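To make the relation between the formats concrete, here is a minimal sketch of converting a JSONLINES-style entry into BIO lines. The field names ("text", "entity", "type", "offset") are assumptions for illustration; the actual keys and conversion logic are defined by the scripts in the temporal-data folder:

```python
import json

def jsonlines_to_bio(line: str) -> str:
    # Field names below are illustrative assumptions, not the exact schema.
    entry = json.loads(line)
    tokens = entry["text"].split()          # assumes pre-tokenized text
    labels = ["O"] * len(tokens)
    for entity in entry.get("entity", []):  # assumed span annotations
        start, end = entity["offset"]       # assumed token offsets [start, end)
        labels[start] = f"B-{entity['type']}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity['type']}"
    # One "token<TAB>label" pair per line, as required by the BIO format
    return "\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))

example = '{"text": "See you next Monday", "entity": [{"type": "date", "offset": [2, 4]}]}'
print(jsonlines_to_bio(example))
```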
To use this repository, it is recommended to use Anaconda. With Anaconda, a separate environment can be created for each of the two frameworks. From this directory, the following commands may be used:
UIE:
conda create -n uie python=3.8
conda activate uie
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r uie/requirements.txt
MaChAmp:
conda create -n machamp python=3.8
conda activate machamp
pip install -r machamp/requirements.txt
Before the models are used, it is recommended to prepare the data first.
Quickstart:
- Setup Anaconda environment
- Prepare the data
- Download the finetuned models
- Select the dataset and run the inference script
Full reproduction:
- Setup Anaconda environment
- Prepare the data
- Prepare the cross-validation approach (see the sketch after this list)
- Download the clean UIE models
- Finetune the models on each of the folds
- Run the cross-validation evaluation scripts to get the results
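The cross-validation preparation can be pictured with the following minimal sketch, assuming the data is available as a single JSONLINES file with one sample per line. Paths and file names are illustrative; the actual fold splitting is done by the scripts in the framework folders:

```python
import random
from pathlib import Path

def make_folds(dataset_path: str, out_dir: str, k: int = 10, seed: int = 42):
    # Shuffle once with a fixed seed so the folds are reproducible
    lines = Path(dataset_path).read_text(encoding="utf-8").splitlines()
    random.Random(seed).shuffle(lines)
    for fold in range(k):
        test = lines[fold::k]  # every k-th sample belongs to this fold's test set
        train = [l for i, l in enumerate(lines) if i % k != fold]
        fold_dir = Path(out_dir) / f"fold_{fold}"
        fold_dir.mkdir(parents=True, exist_ok=True)
        (fold_dir / "train.jsonlines").write_text("\n".join(train), encoding="utf-8")
        (fold_dir / "test.jsonlines").write_text("\n".join(test), encoding="utf-8")
```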
The following two models should be used for finetuning. These are the same models as proposed in the original paper [Lu et al., 2022].
- UIE Base: ZFDM Download, Google Drive Download
- UIE Large: ZFDM Download, Google Drive Download
The thesis tested a single-class and a multi-class setup on all datasets and their subsets. Due to the large number of models, only the multi-class (date, time, duration, set) models for the four temporal datasets are shared and made available for download. Generally speaking, the single-class models do not perform much better despite the easier task, which makes them obsolete in practice. The following table shows the multi-class results on the different datasets:
The table shows the temporal extraction and classification performance of the models produced in the thesis. M stands for MaChAmp models. The bottom part of the table shows the performance of related work. "Strict" means an exact match, while "Type" means a match where at least one token overlaps (also known as a "relaxed" match) and the temporal class is correct.
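For illustration, the two match criteria can be sketched as follows, modeling predictions and gold annotations as (start, end, type) token spans. This is a simplified sketch; the actual evaluation scripts in this repository may differ in detail:

```python
def strict_match(pred, gold):
    # Exact same span boundaries and the same temporal class
    return pred == gold

def type_match(pred, gold):
    # "Relaxed": at least one overlapping token, plus the correct class
    (ps, pe, pt), (gs, ge, gt) = pred, gold
    return pt == gt and max(ps, gs) < min(pe, ge)

# Gold span "next Monday" vs. a prediction covering only "Monday":
print(strict_match((3, 4, "date"), (2, 4, "date")))  # False
print(type_match((3, 4, "date"), (2, 4, "date")))    # True
```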
UIE GitHub Link, [Lu et al., 2022]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Devlin et al., 2018]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Liu et al., 2019]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, Large Model Huggingface Link, [Conneau et al., 2019]
Dataset | Base | Large | Citation |
---|---|---|---|
TempEval-3 | Download Link | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | Download Link | [Zarcone et al., 2020] |
Base Model Huggingface Link, [Devlin et al., 2018]
Dataset | Base | Citation |
---|---|---|
TempEval-3 | Download Link | [UzZaman et al., 2013] |
WikiWars | Download Link | [Derczynski et al., 2012] |
Tweets | Download Link | [Zhong et al., 2017] |
Fullpate | Download Link | [Zarcone et al., 2020] |
Even though the dataset was not explicitly used in the thesis, scripts to convert the TempEval-3 relation extraction dataset to the UIE format were created and tested prototypically. The converted dataset and the conversion scripts are available in the temporal-data folder. This dataset may be used for future work.
- [van der Goot et al., 2020] van der Goot, R., Üstün, A., Ramponi, A., Sharaf, I., and Plank, B. (2020). Massive choice, ample tasks (MaChAmp): A toolkit for multi-task learning in NLP. arXiv preprint arXiv:2005.14672.
- [Derczynski et al., 2012] Derczynski, L., Llorens, H., and Saquete, E. (2012). Massively increasing TIMEX3 resources: A transduction approach. arXiv preprint arXiv:1203.5076.
- [Mazur and Dale, 2010] Mazur, P. and Dale, R. (2010). WikiWars: A new corpus for research on temporal expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 913–922.
- [Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.