This is a PyTorch implementation of abstractive summarization methods on top of OpenNMT. It features vanilla attentional sequence-to-sequence LSTMs, pointer-generator networks ("copy attention"; See et al., 2017), and Transformer networks ("Attention is all you need"; Vaswani et al., 2017), together with instructions to run the models on both the Gigaword and the CNN/Daily Mail datasets.
Install the dependencies with:
pip install -r requirements.txt
The following models are implemented:
- Vanilla attention LSTM encoder-decoder
- Pointer-generator networks: "Get To The Point: Summarization with Pointer-Generator Networks", See et al., 2017
- Transformer networks: "Attention is all you need", Vaswani et al., 2017
To preprocess the data, run:
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo -share_vocab -dynamic_dict -src_vocab_size 50000
The data can be either Gigaword or the CNN/Daily Mail dataset. For CNN/Daily Mail, it is also recommended to truncate inputs and outputs: -src_seq_length_trunc 400 -tgt_seq_length_trunc 100
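For example, a CNN/Daily Mail preprocessing run might look like the following sketch; the data paths under data/cnndm/ are placeholders and should point to wherever your formatted files live:

```bash
# Sketch: preprocessing CNN/Daily Mail with truncation
# (paths under data/cnndm/ are placeholders for your own formatted files)
python preprocess.py \
    -train_src data/cnndm/src-train.txt -train_tgt data/cnndm/tgt-train.txt \
    -valid_src data/cnndm/src-val.txt -valid_tgt data/cnndm/tgt-val.txt \
    -save_data data/cnndm \
    -share_vocab -dynamic_dict -src_vocab_size 50000 \
    -src_seq_length_trunc 400 -tgt_seq_length_trunc 100
```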
The data consists of parallel source (src) and target (tgt) data containing one example per line, with tokens separated by a space:
- src-train.txt
- tgt-train.txt
- src-val.txt
- tgt-val.txt
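Concretely, line *i* of a source file holds one tokenized document and line *i* of the matching target file holds its summary. The contents shown below are purely illustrative, not taken from the data:

```bash
# Purely illustrative example of the expected one-example-per-line format
head -1 data/src-train.txt
# e.g. "officials said on tuesday that the new rail link will open next spring ..."
head -1 data/tgt-train.txt
# e.g. "new rail link to open next spring"
```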
For Gigaword, download the data from https://github.com/harvardnlp/sent-summary. Then, extract it (tar -xzf summary.tar.gz).
For CNN/Daily Mail, we assume access to the formatted files; otherwise, they can be built from https://github.com/OpenNMT/cnn-dailymail.
Validation files are required and used to evaluate the convergence of the training.
After running the preprocessing, the following files are generated:
- demo.train.pt: serialized PyTorch file containing training data
- demo.valid.pt: serialized PyTorch file containing validation data
- demo.vocab.pt: serialized PyTorch file containing vocabulary data
Internally the system never touches the words themselves, but uses these indices.
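If you want a quick sanity check of the generated files, they can be loaded with PyTorch directly (a sketch; the exact layout of the serialized objects depends on the OpenNMT-py version in use):

```bash
# Sketch: inspect the serialized vocabulary object
python -c "import torch; vocab = torch.load('data/demo.vocab.pt'); print(type(vocab))"
```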
The basic training command would be:
python train.py -data data/demo -save_model demo_model -share_embeddings
The main relevant parameters to be changed for summarization are:
- pointer_gen to enable the pointer-generator ("copy attention")
- -encoder_type transformer -decoder_type transformer to enable Transformer networks
- word_vec_size (128 has given good results)
- rnn_size (256 or 512 work well in practice)
- encoder_type (brnn works best on most models)
- layers (1 or 2, up to 6 on transformer)
- gpuid (0 for the first GPU, -1 for CPU)
The parameters used for our trained models are listed in the tables below.
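For example, a pointer-generator training run combining the options above might look like the following sketch (the exact hyperparameters we used per dataset are in the settings tables further down):

```bash
# Sketch: pointer-generator training with the parameters discussed above
python train.py -data data/demo -save_model demo_model -share_embeddings \
    -pointer_gen -word_vec_size 128 -rnn_size 256 \
    -encoder_type brnn -layers 1 -gpuid 0
```

Once training has converged, the saved checkpoint can be used for prediction as shown next.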
To generate summaries on the test set, run:
python translate.py -model demo_model_epochX_PPL.pt -src data/src-test.txt -o output_pred.txt -beam_size 10 -dynamic_dict -share_vocab
Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into output_pred.txt.
Perplexity and accuracy are not the main evaluation metrics for summarization. Rather, the field uses ROUGE. To compute ROUGE, we use files2rouge, which itself uses pythonrouge.
Installation instructions:
pip install git+https://github.com/tagucci/pythonrouge.git
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
To run evaluation, simply run:
files2rouge summaries.txt references.txt
In the case of CNN/Daily Mail, evaluation should be done with the beginning- and end-of-sentence tokens stripped, as sketched below.
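A sketch of this stripping step, assuming the sentence-boundary markers in your data are `<t>` and `</t>` and that the tokenized references live in data/tgt-test.txt; substitute whatever tokens and paths your preprocessing actually produced:

```bash
# Strip sentence-boundary tokens (assumed to be <t> / </t>) before scoring
sed -e 's/ <\/t>//g' -e 's/<t> //g' output_pred.txt > output_pred.clean.txt
sed -e 's/ <\/t>//g' -e 's/<t> //g' data/tgt-test.txt > references.clean.txt
files2rouge output_pred.clean.txt references.clean.txt
```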
Results on Gigaword:

| Model | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| Attention LSTM | 35.59 | 17.63 | 33.46 |
| Pointer-Generator | 33.44 | 16.55 | 31.43 |
| Transformer | 35.10 | 17.01 | 33.09 |
| lead-8w baseline | 21.31 | 7.34 | 19.95 |
Detailed settings:
| Setting | Attention | Pointer-Generator | Transformer |
| --- | --- | --- | --- |
| Vocabulary size | 50k | 50k | 50k |
| Word embedding size | 128 | 128 | 512 |
| Attention | MLP | Copy | Multi-head |
| Encoder layers | 1 | 1 | 4 |
| Decoder layers | 1 | 1 | 4 |
| Enc/Dec type | BiLSTM | BiLSTM | Transformer |
| Enc units | 512 | 256 | 512 |
| Optimizer | SGD | SGD | Adam |
| Learning rate | 1 | 1 | 1 |
| Dropout | 0.3 | 0.3 | 0.2 |
| Max grad norm | 2 | 2 | n/a |
| Batch size | 64 | 32 | 32 |
The Transformer network also has the following additional settings during training:
param_init=0 position_encoding warmup_steps=4000 decay_method=noam
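Expressed as command-line options, these settings roughly correspond to the sketch below (assuming the standard OpenNMT-py flag names; demo_transformer is just a placeholder model name):

```bash
# Sketch: Transformer training flags matching the settings above
python train.py -data data/demo -save_model demo_transformer -share_embeddings \
    -encoder_type transformer -decoder_type transformer -layers 4 \
    -word_vec_size 512 -rnn_size 512 -position_encoding -param_init 0 \
    -warmup_steps 4000 -decay_method noam -optim adam \
    -dropout 0.2 -batch_size 32 -gpuid 0
```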
Results on CNN/Daily Mail:

| Model | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| Attention LSTM | 30.25 | 12.41 | 22.93 |
| Pointer-Generator | 34.00 | 14.70 | 36.57 |
| Transformer | 23.90 | 5.85 | 17.36 |
| lead-3 baseline | 40.34 | 17.70 | 36.57 |
Detailed settings:
| Setting | Attention | Pointer-Generator | Transformer |
| --- | --- | --- | --- |
| Vocabulary size | 50k | 50k | 50k |
| Word embedding size | 128 | 128 | 256 |
| Attention | MLP | Copy | Multi-head |
| Encoder layers | 1 | 1 | 4 |
| Decoder layers | 1 | 1 | 4 |
| Enc/Dec type | BiLSTM | BiLSTM | Transformer |
| Enc units | 256 | 256 | 256 |
| Optimizer | SGD | SGD | Adam |
| Learning rate | 1 | 1 | 1 |
| Dropout | 0.3 | 0.3 | 0.2 |
| Max grad norm | 2 | 2 | n/a |
| Batch size | 32 | 32 | 64 |
The Transformer network also has the following additional settings during training:
param_init=0 position_encoding warmup_steps=4000 decay_method=noam