Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space
We provide a PyTorch implementation of the following paper:
Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space. Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Jiancheng Lv, Nan Duan, and Ming Zhou. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). [paper]
- Python 3.6
- TensorFlow 1.10.0+
- PyTorch 1.3.0+
- nltk 3.3+
- CUDA 9.0
Please install the Huggingface transformers locally as follows:
```bash
cd pytorch-transformers-master
python setup.py install
```
Download the SQuAD 2.0 dataset files (`train-v2.0.json` and `dev-v2.0.json`) here. The Transformer autoencoder can be trained with the questions in `train-v2.0.json`.
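For reference, here is a minimal sketch of how the question strings could be extracted from the SQuAD 2.0 training file; the output file name `squad_questions.txt` is only illustrative, since the training scripts in this repo read the data in their own way:

```python
import json

# Minimal sketch: collect every question string from the SQuAD 2.0 training file.
with open("train-v2.0.json", encoding="utf-8") as f:
    squad = json.load(f)

questions = [
    qa["question"]
    for article in squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
]

# "squad_questions.txt" is an illustrative output name, not one used by the repo.
with open("squad_questions.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(questions))

print(f"Extracted {len(questions)} questions")
```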
We can also pretrain the Transformer autoencoder on our collected corpus of about 2M questions, drawn from the training sets of several MRC and QA datasets, including SQuAD 2.0, Natural Questions, NewsQA, QuAC, TriviaQA, CoQA, HotpotQA, DuoRC, and MS MARCO. This 2M-question corpus can be downloaded here.
In addition, we can pretrain the Transformer autoencoder on the large-scale English Wikipedia and BookCorpus corpora; please refer to here for downloading and preprocessing the data. After that, you will obtain a text file `wikicorpus_en_one_article_per_line.txt` for Transformer autoencoder pre-training.
We adopt the BERT (`BertForQuestionAnswering`) and RoBERTa (`RobertaForQuestionAnswering`) based models from Huggingface as the SQuAD 2.0 MRC models.
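As a rough illustration of how such an MRC model can be loaded and queried with the current `transformers` API (the bundled `pytorch-transformers-master` copy may differ slightly, and the checkpoint directory is assumed to contain both the model weights and the tokenizer files):

```python
import torch
from transformers import RobertaTokenizer, RobertaForQuestionAnswering

# Assumes crqda/data/mrc_model holds both model weights and tokenizer files.
model_dir = "crqda/data/mrc_model"
tokenizer = RobertaTokenizer.from_pretrained(model_dir)
model = RobertaForQuestionAnswering.from_pretrained(model_dir)
model.eval()

question = "When was the University of Notre Dame founded?"
context = "The University of Notre Dame was founded in 1842 by Father Edward Sorin."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring start/end positions as the predicted answer span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```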
We provide a well-trained RoBERTa SQuAD 2.0 MRC model whose checkpoint can be downloaded here.
Before training the Transformer-based Autoencoder, please put the checkpoint files of the well-trained RoBERTa SQuAD 2.0 MRC model into the default directory `crqda/data/mrc_model`, and put `wikicorpus_en_one_article_per_line.txt` (or another dataset, such as the 2M question corpus) into the default directory `crqda/data/`.
Then train the Transformer-based Autoencoder with this script:
```bash
cd crqda
./run_train.sh
```
The Transformer-based Autoencoder will be saved at `data/ae_models`.
To rewrite the questions and obtain the augmented dataset, please run this script:
```bash
cd crqda
python inference.py \
    --OS_ID 0 \
    --GAP 33000 \
    --NEG \
    --ae_model_path 'data/ae_models/pytorch_model.bin'
```
Set `--NEG` to generate unanswerable questions, and `--para` to generate answerable questions. Since the rewriting process is slow, we provide a manual parallel rewriting mechanism: set `OS_ID` to indicate which GPU should be used for the rewriting, and `GAP` to the number of original training samples to be rewritten on that GPU.
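One possible way to drive this manual parallelism, run from the `crqda` directory, is to launch one `inference.py` process per GPU with a different `OS_ID`. This is only a sketch under the assumption that process `i` handles the `GAP` samples starting at `i * GAP`; the GPU count below is illustrative, so please check `inference.py` for the exact semantics:

```python
import subprocess

# Illustrative launcher: one inference.py process per GPU.
# Assumption: process i rewrites the GAP training samples starting at i * GAP,
# so NUM_GPUS * GAP should cover the whole training set.
NUM_GPUS = 4
GAP = 33000

procs = []
for os_id in range(NUM_GPUS):
    cmd = [
        "python", "inference.py",
        "--OS_ID", str(os_id),
        "--GAP", str(GAP),
        "--NEG",
        "--ae_model_path", "data/ae_models/pytorch_model.bin",
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for all rewriting processes to finish.
for p in procs:
    p.wait()
```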
Here we provide a SQuAD 2.0 augmented dataset that contains the original SQuAD 2.0 training data pairs and the unanswerable question data pairs generated by CRQDA. It can be downloaded here.
After question data augmentation with CRQDA, we can fine-tune the BERT-large model on the augmented dataset with this script:
```bash
cd pytorch-transformers-master/examples
./run_fine_tune_bert_with_crqda.sh
```
You may obtain results similar to the following:
```
"best_exact": 80.56093657879222, "best_f1": 83.3359726931614, "exact": 80.03032089615093, "f1": 82.97608915068454
```
We also provide implementations of the baselines, including EDA, Back-Translation, and Text-VAE, which can be found in `baselines/EDA`, `baselines/Style-Transfer-Through-Back-Translation`, and `baselines/Mu-Forcing-VRAE`, respectively.
```bibtex
@inproceedings{liu2020crqda,
  title = "Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space",
  author = "Liu, Dayiheng and Gong, Yeyun and Fu, Jie and Yan, Yu and Chen, Jiusheng and Lv, Jiancheng and Duan, Nan and Zhou, Ming",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2020"
}
```