This is the code for the paper NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition.
We advocate a novel lexical enhancement method, InterFormer, that effectively reduces computational and memory costs by constructing non-flat lattices. Furthermore, with InterFormer as the backbone, we implement NFLAT for Chinese NER. NFLAT decouples lexicon fusion from context feature encoding. Compared with FLAT, it avoids the unnecessary "word-character" and "word-word" attention computations, which reduces memory usage by about 50% and allows more extensive lexicons or larger batch sizes for network training.
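Concretely, the lexicon fusion step lets the character sequence attend to the matched lexicon words, so the inter-attention matrix only covers character-word pairs; the "word-character" and "word-word" blocks of a flat lattice are never computed. The sketch below is only an illustration of that idea: it assumes a single attention head and omits the multi-head splitting and relative position information used in the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAttentionSketch(nn.Module):
    """Illustrative sketch of the non-flat "character attends to word" idea.

    Characters supply the queries; matched lexicon words supply the keys and
    values, so attention is computed only for character-word pairs. This is a
    simplified single-head version without the relative position encoding of
    the real InterAttention module.
    """

    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, char_hidden, word_hidden, word_mask=None):
        # char_hidden: [batch, n_char, d], word_hidden: [batch, n_word, d]
        q, k, v = self.q(char_hidden), self.k(word_hidden), self.v(word_hidden)
        scores = torch.matmul(q, k.transpose(-1, -2)) / self.scale   # [batch, n_char, n_word]
        if word_mask is not None:
            # mask out padded words (assumes every sentence has at least one matched word)
            scores = scores.masked_fill(~word_mask.unsqueeze(1), float('-inf'))
        attn = F.softmax(scores, dim=-1)
        return char_hidden + torch.matmul(attn, v)                   # lexicon-fused character features
```

The fused character representations can then be passed through an ordinary character-level Transformer encoder for context feature encoding, which is the decoupling of the two stages described above.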
The code has been tested under Python 3.7. The required packages are as follows:
- torch==1.5.1
- numpy==1.18.5
- FastNLP==0.5.0
- fitlog==0.3.2
See the FastNLP documentation to learn more about FastNLP, and the fitlog documentation to learn more about fitlog.
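For example, the pinned versions can be installed with `pip install torch==1.5.1 numpy==1.18.5 FastNLP==0.5.0 fitlog==0.3.2` (a CUDA-specific torch wheel may be preferable for GPU training).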
- Download the pretrained character embeddings and word embeddings and put them in the `data` folder (a short sketch of how such `.vec` files can be read follows this list).
    - Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec): Google Drive or Baidu Pan
    - Bi-gram embeddings (gigaword_chn.all.a2b.bi.ite50.vec): Baidu Pan
    - Word (lattice) embeddings (ctb.50d.vec): Baidu Pan
    - If you want to use larger word embeddings, you can refer to Chinese Word Vectors 中文词向量 and Tencent AI Lab Embedding.
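These embedding files are plain-text `.vec` files with one token per line followed by its vector values. The sketch below only illustrates the file format, assuming whitespace-separated fields and an optional `count dim` header line; the training code is not required to use it.

```python
import numpy as np

def load_vec(path):
    """Read a plain-text .vec embedding file: one token per line followed by its vector."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) <= 2:            # skip an optional "count dim" header or blank line
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Example: char_vectors = load_vec('data/gigaword_chn.all.a2b.uni.ite50.vec')
```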
- Modify `utils/paths.py` to add the paths of the pretrained embeddings and the datasets.
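For illustration, such a path file usually contains plain module-level constants. The variable names and directory names below are hypothetical; the actual names used in `utils/paths.py` may differ.

```python
# Hypothetical sketch of utils/paths.py entries; the real variable names may differ.
char_emb_path = 'data/gigaword_chn.all.a2b.uni.ite50.vec'    # character embeddings
bigram_emb_path = 'data/gigaword_chn.all.a2b.bi.ite50.vec'   # bi-gram embeddings
word_emb_path = 'data/ctb.50d.vec'                           # word (lattice) embeddings

# Dataset locations (directory names are placeholders)
weibo_path = 'data/WeiboNER'
resume_path = 'data/ResumeNER'
ontonotes_path = 'data/OntoNotes4'
msra_path = 'data/MSRA'
```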
- Long sentence clipping for MSRA and OntoNotes: run `python sentence_clip.py`.
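The clipping script itself is not reproduced here. Conceptually, this preprocessing step splits sentences that exceed a maximum length; the rough, hypothetical sketch below prefers punctuation positions as cut points. The actual `sentence_clip.py` may use different rules and length limits.

```python
def clip_sentence(chars, labels, max_len=200, split_chars=('。', '，', '；', '！', '？')):
    """Split one over-long character/label sequence into shorter pieces,
    preferring punctuation positions as cut points (illustrative only)."""
    pieces, start = [], 0
    while len(chars) - start > max_len:
        end = start + max_len
        # back up to the last punctuation mark inside the current window, if any
        cut = next((i for i in range(end - 1, start, -1) if chars[i] in split_chars), end - 1)
        pieces.append((chars[start:cut + 1], labels[start:cut + 1]))
        start = cut + 1
    pieces.append((chars[start:], labels[start:]))
    return pieces
```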
- Merge the char embeddings and word embeddings: run `python char_word_mix.py`.
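Conceptually, this step builds a single mixed embedding file covering both the character vocabulary and the lexicon words. The sketch below simply concatenates two `.vec` files while dropping duplicate tokens; it is an assumption about the idea, not the actual `char_word_mix.py`.

```python
def merge_vec_files(char_vec_path, word_vec_path, out_path):
    """Write a mixed embedding file containing every token from both inputs,
    keeping the first vector seen for any duplicated token (illustrative only)."""
    seen = set()
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in (char_vec_path, word_vec_path):
            with open(path, encoding='utf-8') as f:
                for line in f:
                    parts = line.split()
                    if len(parts) <= 2:          # skip header or blank lines
                        continue
                    if parts[0] not in seen:
                        seen.add(parts[0])
                        out.write(line.rstrip('\n') + '\n')

# Example: merge_vec_files('data/gigaword_chn.all.a2b.uni.ite50.vec', 'data/ctb.50d.vec',
#                          'data/char_and_word_mix.vec')
```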
- Model training and evaluation (a minimal sketch of the `--dataset` option follows this list):
    - Weibo dataset: `python main.py --dataset weibo`
    - Resume dataset: `python main.py --dataset resume`
    - OntoNotes dataset: `python main.py --dataset ontonotes`
    - MSRA dataset: `python main.py --dataset msra`
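The `--dataset` flag selects which corpus, lexicon, and embeddings are loaded. The snippet below is only a hypothetical sketch of how such a flag is typically parsed; the real `main.py` accepts many more hyperparameter options.

```python
import argparse

parser = argparse.ArgumentParser(description='Train and evaluate NFLAT on a Chinese NER dataset.')
parser.add_argument('--dataset', choices=['weibo', 'resume', 'ontonotes', 'msra'], default='weibo',
                    help='which corpus to train on')
args = parser.parse_args()
print('Training NFLAT on the %s dataset' % args.dataset)
```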
- Thanks to Dr. Li and his team for contributing the FLAT source code.
- Thanks to the authors and contributors of the TENER source code.
- Thanks to the authors and contributors of FastNLP.