
semi-structured-annotations

This repository contains the code and links to the data and trained models for the paper A Corpus and Evaluation for Predicting Semi-Structured Human Annotations presented at the GEM workshop at EMNLP 2022.

Contents

  1. Short Description
  2. Data
  3. Models
  4. Installation
  5. Usage
  6. Contact
  7. Authors and Acknowledgments
  8. Citation

Short Description

Our goal is to teach seq2seq models to interpret policy announcements. We present the FOMC dataset on the monetary policy of the Federal Reserve, where source documents are policy announcements and targets are selected and annotated sentences from New York Times articles. We train seq2seq models (Transformer, BERT, BART) to generate the annotated targets conditioned on the source documents. We also introduce an evaluation method called equivalence classes evaluation. Equivalence classes group semantically interchangeable values from a specific annotation category. The seq2seq model then has to identify the true continuation among two possibilities from different equivalence classes.

Data

Please contact me at andreas.marfurt [at] idiap.ch to get access to the data.

Models

We provide the following checkpoints of models finetuned on the FOMC dataset:

  • Transformer: Randomly initialized Transformer
  • BERT: BERT encoder, Transformer decoder
  • BART: BART model
  • FilterBERT: BERT-based model for filtering source documents

The models are shared under a CC BY 4.0 license.

Installation

First, install conda, e.g. from Miniconda. Then create and activate the environment:

conda env create -f environment.yml
conda activate semi-structured-annotations

Usage

Training

To train a model, use the main.py script. The default arguments are set to the hyperparameter values from our experiments. Here is an example that trains BART with the parameters we used:

python main.py \
--model bart \
--data_dir data_fomc_bart \
--model_dir models/bart \
--default_root_dir logs/bart \
--deterministic \
--gpus 1 \
--batch_size 2 \
--accumulate_grad_batches 2 \
--max_epochs 20 \
--min_epochs 10 \
--max_steps 16000
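
The other models are trained with the same script; as a sketch, a BERT run might look like the following (the --model value for BERT is an assumption, so check the choices accepted by main.py):

# Assumed: main.py accepts --model bert; other hyperparameters keep their defaults
python main.py \
--model bert \
--data_dir data_fomc_bert \
--model_dir models/bert \
--default_root_dir logs/bert \
--deterministic \
--gpus 1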

Filtering

You can filter source documents with FilterBERT or with the Oracle/Lead strategies. For FilterBERT, use the filter_bert.py script:

python filter_bert.py \
--model_dir models/filterbert \
--pretrained_dir bert-base-uncased \
--data_dir data_fomc_bert \
--default_root_dir logs/filterbert \
--deterministic \
--gpus 1 \
--batch_size 5 \
--max_epochs 10 \
--min_epochs 5 \
--max_steps 17000

For Oracle/Lead filtering, use filter_source_docs_with_tokenizer.py and specify a HuggingFace tokenizer.
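
As a sketch, an Oracle filtering run could look like the following (the flag names and paths here are assumptions, so check the script's arguments before running):

# Assumed flag names and placeholder paths; consult filter_source_docs_with_tokenizer.py for the actual arguments
python filter_source_docs_with_tokenizer.py \
--data_dir data_fomc \
--tokenizer bert-base-uncased \
--strategy oracle \
--output_dir data_fomc_filtered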

Saving Model Outputs

To run the text generation evaluation, you first have to save a model's outputs in text format. Run save_model_outputs.py for Transformer/BERT models or save_bart_outputs.py for a BART model with the default parameters. Don't forget to specify model_dir and output_dir.
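
For a BART model, this could look like the following (the output directory is a placeholder):

# model_dir and output_dir as described above; outputs/bart is a placeholder path
python save_bart_outputs.py \
--model_dir models/bart \
--output_dir outputs/bart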

Text Generation Evaluation

Use the evaluations.py script to run the text generation evaluations, specifying the path to your model outputs as the input_dir. Results are saved as a JSON file in the same directory.
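
For example, with the outputs saved above (the exact flag name is an assumption based on the description):

# input_dir points to the saved model outputs; outputs/bart is a placeholder path
python evaluations.py \
--input_dir outputs/bart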

Equivalence Classes Evaluation

Our definition of equivalence classes can be found in equivalence_classes.json. We provide the evaluation instances we used in data_fomc_equiv. If you want to generate your own, use the create_equivalance_classes_examples.py script.

Run the evaluation with the equivalence_classes_evaluation.py file by specifying the path to your evaluation data, the model directory and the output path.
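
As a sketch, using the provided evaluation instances and the BART checkpoint (the flag names and output path are assumptions, so check the script's arguments):

# Assumed flag names and placeholder output path; consult equivalence_classes_evaluation.py for the actual arguments
python equivalence_classes_evaluation.py \
--data_dir data_fomc_equiv \
--model_dir models/bart \
--output_path results/equivalence_classes.json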

Contact

In case of problems or questions, open a GitHub issue or write an email to andreas.marfurt [at] idiap.ch.

Authors and Acknowledgments

Our paper was written by Andreas Marfurt, Ashley Thornton, David Sylvan, Lonneke van der Plas and James Henderson.

The work was supported as a part of the grant Automated interpretation of political and economic policy documents: Machine learning using semantic and syntactic information, funded by the Swiss National Science Foundation (grant number CRSII5_180320), and led by the co-PIs James Henderson, Jean-Louis Arcand and David Sylvan. We would also like to thank Maria Kamran, Alessandra Romani, Julia Greene, Clarisse Labbé, Shekhar Hari Kumar, Claire Ransom, Daniele Rinaldo, Eugenia Zena and Raphael Leduc for their invaluable data collection and annotation efforts.

Citation

If you use our code, data or models, please cite us:

@inproceedings{marfurt-etal-2022-corpus,
    title = "A Corpus and Evaluation for Predicting Semi-Structured Human Annotations",
    author = "Marfurt, Andreas  and
      Thornton, Ashley  and
      Sylvan, David  and
      van der Plas, Lonneke  and
      Henderson, James",
    booktitle = "Proceedings of the Second Workshop on Generation, Evaluation and Metrics",
    month = dec,
    year = "2022",
    publisher = "Association for Computational Linguistics",    
}
