This repository contains the code, data, and templates for the crowdsourcing protocols described in the paper: Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries.
calculate.ipynb: computes the score distribution, Krippendorff's alpha reliability, and split-half reliability (SHR).
We release our evaluation templates and annotations to promote future work on factual consistency evaluation. The annotations for both the CNN/DM and XSUM data, along with the templates, are included in this repository.
The code for BART, ProphetNet, PEGASUS, and BERTSUM is based on Fairseq(-py). Our pretrained models are available for both the CNN/DM and XSUM data.
If you use our code in your research, please cite our work:
@inproceedings{tang2022investigating,
  title={Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries},
  author={Tang, Xiangru and Fabbri, Alexander R and Mao, Ziming and Adams, Griffin and Wang, Borui and Li, Haoran and Mehdad, Yashar and Radev, Dragomir},
  booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2022}
}