This is the implementation of *Improving Constituency Parsing with Span Attention* at Findings of EMNLP 2020.
Please contact us at [email protected] if you have any questions.
Visit our homepage to find more of our recent research and software for NLP (e.g., pre-trained LMs, POS tagging, NER, sentiment analysis, relation extraction, datasets, etc.).
We are actively improving SAPar. For updates, please visit HERE.
If you use or extend our work, please cite our paper at Findings of EMNLP 2020:
```
@inproceedings{tian-etal-2020-improving,
    title = "Improving Constituency Parsing with Span Attention",
    author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    pages = "1691--1703",
}
```
Our code works with the following environment:
- `python 3.6`
- `pytorch 1.1`
Install Python dependencies by running:
```
pip install -r requirements.txt
```
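To confirm that your environment matches, you can run a quick check (assuming `python` points to your Python 3.6 interpreter):
```
python --version                                    # expect Python 3.6.x
python -c "import torch; print(torch.__version__)"  # expect 1.1.x
```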
`EVALB` and `EVALB_SPMRL` contain the code to evaluate the parsing results for English and for other languages, respectively. Before running evaluation, you need to go to the `EVALB` (for English) or `EVALB_SPMRL` (for other languages) directory and run `make`.
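For example, from the repository root (assuming both directories sit at the top level, as in this repo):
```
cd EVALB && make && cd ..          # evaluation tool for English
cd EVALB_SPMRL && make && cd ..    # evaluation tool for other languages
```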
In our paper, we use BERT, ZEN, and XLNet as the encoder.
For BERT, please download the pre-trained BERT models from Google and convert them from the TensorFlow version to the PyTorch version (a conversion sketch follows the list below).
- For Arabic, we use BERT-Base, Multilingual Cased.
- For Chinese, we use BERT-Base, Chinese.
- For English, we use BERT-Large, Cased and BERT-Large, Uncased.
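A minimal conversion sketch, assuming you use the converter shipped with Hugging Face's `transformers` package (the older `pytorch-pretrained-bert` script works similarly); the checkpoint directory name below is hypothetical:
```
# Hypothetical directory of a downloaded Google TensorFlow checkpoint
BERT_DIR=cased_L-24_H-1024_A-16
transformers-cli convert --model_type bert \
  --tf_checkpoint $BERT_DIR/bert_model.ckpt \
  --config $BERT_DIR/bert_config.json \
  --pytorch_dump_output $BERT_DIR/pytorch_model.bin
```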
For ZEN, you can download the pre-trained model from here.
For XLNet, you can download the pre-trained model from here.
For our pre-trained models, you can download them from Baidu Wangpan (passcode: 2o1n) or Google Drive.
To train a model on a small dataset, run:
```
./run.sh
```
We use datasets in three languages: Arabic, Chinese, and English.
- Arabic: we use ATB 2.0, parts 1-3 (LDC2003T06, LDC2004T02, and LDC2005T20).
- Chinese: we use CTB5 (LDC2005T01).
- English: we use PTB (LDC99T42).
To preprocess the data, please go to the `data_processing` directory and follow the instructions there. You need to obtain the official datasets yourself before running our code.
Ideally, all data will appear in the `./data` directory. The data with gold POS tags are located in folders named after the datasets (i.e., ATB, CTB, and PTB); the data with predicted POS tags are located in folders whose names have a "_POS" suffix (i.e., ATB_POS, CTB_POS, and PTB_POS).
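Based on that naming scheme, the expected `./data` layout looks like this:
```
data/
├── ATB        # Arabic, gold POS tags
├── ATB_POS    # Arabic, predicted POS tags
├── CTB        # Chinese, gold POS tags
├── CTB_POS    # Chinese, predicted POS tags
├── PTB        # English, gold POS tags
└── PTB_POS    # English, predicted POS tags
```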
You can find the command lines to train and test models on a specific dataset in `run.sh`.
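For reference, a training invocation typically looks like the sketch below; every flag name here is hypothetical and for illustration only, so check `run.sh` for the exact arguments our code accepts:
```
# Hypothetical flag names -- see run.sh for the real command lines
python main.py --do_train \
    --dataset ./data/PTB \
    --bert_model ./models/bert-large-cased
```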
- Regular maintenance.
You can leave comments in the Issues section if you want us to implement any functions.
You can check our updates at updates.md.