jneto04/ner-pt

Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

Modern approaches to Named Entity Recognition (NER) use neural networks (NN) to automatically extract features from text and seamlessly integrate them with sequence taggers in an end-to-end fashion. Word embeddings, which are a side product of pretrained neural language models (LMs), are key ingredients to boost the performance of NER systems. More recently, contextual word embeddings, which adapt according to the context where the word appears, have proved to be an invaluable resource to improve NER systems. In this work, we assess how different combinations of (shallow) word embeddings and contextual embeddings impact NER for the Portuguese language. We show a comparative study of 16 different combinations of shallow and contextual embeddings and explore how textual diversity and the size of training corpora used in LMs impact our NER results. We evaluate NER performance using the HAREM corpus. Our best NER system outperforms the state of the art in Portuguese NER by 5.99 absolute percentage points. State-of-the-art results were evaluated with the CoNLL-2002 script.

Results for the Total Scenario (HAREM)

| Approach | Precision | Recall | F1 |
|---|---|---|---|
| BiLSTM-CRF+FlairBBP | 74.91% | 74.37% | 74.64% |
| BiLSTM-CRF (Castro, et al.) | 72.28% | 68.03% | 70.33% |
| CharWNN (dos Santos, et al.) | 67.16% | 63.74% | 65.41% |

Results for the Selective Scenario (HAREM)

| Approach | Precision | Recall | F1 |
|---|---|---|---|
| BiLSTM-CRF+FlairBBP | 83.38% | 81.17% | 82.26% |
| BiLSTM-CRF (Castro, et al.) | 78.26% | 74.39% | 76.27% |
| CharWNN (dos Santos, et al.) | 73.98% | 68.68% | 71.23% |
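
The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 for the BiLSTM-CRF+FlairBBP rows and the 5.99-point gain in the Selective Scenario. The function name `f1_score` is our own choice for illustration; it is not part of this repository or of the CoNLL-2002 script.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BiLSTM-CRF+FlairBBP rows from the two result tables.
total_f1 = f1_score(74.91, 74.37)
selective_f1 = f1_score(83.38, 81.17)

print(f"Total scenario F1:     {total_f1:.2f}%")      # 74.64%
print(f"Selective scenario F1: {selective_f1:.2f}%")  # 82.26%

# Gain over the previous best Selective Scenario F1 (BiLSTM-CRF, 76.27%).
print(f"Absolute gain: {82.26 - 76.27:.2f} points")   # 5.99 points
```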

Reproduce our tests for NER

Before you begin, install the Flair library. Flair is an NLP library developed by Zalando Research that achieves state-of-the-art results in sequence labeling. See the Flair GitHub repository for details.

  • Paper: Contextual String Embeddings for Sequence Labeling (Akbik, et al.)

STEP 1: Download our language model FlairBBP (backward and forward);

STEP 2: Clone this repository;

STEP 3: Install Flair. See how to install here;

STEP 4: Download NILC's word embeddings. You need the Word2Vec Skip-Gram model with 300 dimensions. Put the file inside the cloned folder;

STEP 5: Run our script: python3.6 ner_flair.py

Tagging your Portuguese text with our NER model

Tag your text using our best NER model, which combines FlairBBP with NILC-Word2Vec-Skpg-300d. It recognizes the following categories: PERSON, LOCATION, ORGANIZATION, TIME and VALUE. You need to install the latest version of Flair.

STEP 1: Download our NER model (Download Here!);

STEP 2: Use pToolNER to label your text.

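# PortugueseToolNER is provided by the pToolNER project; import it as described there.
pToolNER = PortugueseToolNER()

# Load the NER model downloaded in STEP 1.
pToolNER.loadNamedEntityModel('best-model.pt')

# Tag every .txt file under rootFolderPath and write the results to outputFilePath.
pToolNER.sequenceTaggingOnText(
    rootFolderPath='./PredictablesFiles',
    fileExtension='.txt',
    useTokenizer=True,
    maskNamedEntity=False,
    createOutputFile=True,
    outputFilePath='./TaggedTexts',
    outputFormat='plain',
    createOutputListSpans=True
)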
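
For readers curious about what happens underneath, the sketch below shows one way the downloaded model could be loaded directly with Flair's `SequenceTagger`. It is an illustration under assumptions, not the pToolNER implementation: the helper names `tag_text` and `format_spans` are hypothetical, the label type is assumed to be `ner`, and whether the model file loads cleanly may depend on the Flair version installed.

```python
from typing import List, Tuple


def format_spans(spans: List[Tuple[str, str]]) -> str:
    """Render (entity text, label) pairs as one 'text<TAB>label' line each."""
    return "\n".join(f"{text}\t{label}" for text, label in spans)


def tag_text(model_path: str, text: str) -> List[Tuple[str, str]]:
    """Load a Flair sequence tagger and return (entity text, label) pairs."""
    # Imported lazily so format_spans can be used without Flair installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model_path)
    sentence = Sentence(text)
    tagger.predict(sentence)
    # Assumption: the model stores its predictions under the 'ner' label type.
    return [(span.text, span.tag) for span in sentence.get_spans("ner")]


# Example call (requires Flair and the downloaded model file):
# print(format_spans(tag_text("best-model.pt", "A Maria mora em Lisboa.")))
```

pToolNER remains the recommended route, since it also handles tokenization, file traversal, and output formatting.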

Alternative use (we strongly recommend using pToolNER instead):

STEP 1: Download our NER model (Download Here!);

STEP 2: Clone this repository;

STEP 3: Run our script: python3.6 tagging_ner.py [input_file_name.txt] [output_file_name.txt] [mode]

Modes:

  • conll - input text in CoNLL format
  • plain - input text in plain format
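
To make the difference between the two modes concrete, the sketch below converts plain text into a CoNLL-style layout. It assumes the usual CoNLL convention of one token per line with a blank line between sentences; the exact columns expected by tagging_ner.py are not stated here, so treat the layout as an assumption. The function `plain_to_conll` and its regex-based tokenizer are hypothetical and deliberately simple (hyphenated words, for instance, are split into separate tokens).

```python
import re
from typing import List


def plain_to_conll(text: str) -> str:
    """Convert plain text to one token per line, blank line between sentences."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines: List[str] = []
    for sentence in sentences:
        if not sentence:
            continue
        # Words (including accented letters) and single punctuation marks.
        lines.extend(re.findall(r"\w+|[^\w\s]", sentence))
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines).rstrip("\n") + "\n"


print(plain_to_conll("A Maria mora em Lisboa. Ela trabalha na PUCRS."))
```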

Language Models

Flair Embeddings - FlairBBP

You can download our Flair Embeddings models (FlairBBP) from the following links:

Word Embeddings

You can download our word embedding models from the following links. All models were trained with 300 dimensions:

| Algorithm | Architecture | Downloads |
|---|---|---|
| Word2Vec | Skip-Gram | Word2Vec_skpg_300d |
| Word2Vec | CBOW | Word2Vec_cbow_300d |
| FastText | Skip-Gram | Fasttext_skpg_300d |
| FastText | CBOW | Fasttext_cbow_300d |

NILC Word Embeddings

You can download the Word Embeddings provided by NILC in the following link: http://nilc.icmc.usp.br/embeddings

  • Paper: Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks (Hartmann, et al.)
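
Before training, it can be useful to confirm that a downloaded embedding file really has 300 dimensions. The sketch below reads the first few vectors from a file in the plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its values per line). That the downloaded file uses this format is an assumption, and `read_word2vec_text` and the file name in the usage comment are hypothetical.

```python
from typing import Dict, List


def read_word2vec_text(path: str, limit: int = 5) -> Dict[str, List[float]]:
    """Read up to `limit` vectors from a plain-text word2vec file."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        # Header line: "<vocabulary size> <dimension>".
        _vocab_size, dim = map(int, handle.readline().split())
        for line in handle:
            if len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
            vectors[word] = values
    return vectors


# Example call (hypothetical file name):
# sample = read_word2vec_text("skip_s300.txt")
# print({word: len(vec) for word, vec in sample.items()})
```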

Language Models Corpora

BlogSet-BR

BlogSet-BR is a large corpus built from millions of sentences taken from Brazilian Portuguese web blogs.

brWaC

brWaC is another large Portuguese corpus.

ptwiki-20190301

ptwiki-20190301 is a corpus formed by texts from the Portuguese Wikipedia.

Language Model Corpora Size Details (after pre-processing):

| Corpus | Sentences | Tokens |
|---|---|---|
| brWaC | 127,272,109 | 2,930,573,938 |
| BlogSet-BR | 58,494,090 | 1,807,669,068 |
| ptwiki-20190301 | 7,053,954 | 162,109,057 |
| All Corpora | 192,820,153 | 4,900,352,063 |
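
The "All Corpora" row is the sum of the three individual corpora. The short sketch below recomputes the totals from the figures in the table and also prints each corpus's average sentence length and share of tokens; the variable names are ours and the script is only an illustration.

```python
# (sentences, tokens) per corpus, copied from the table above.
corpora = {
    "brWaC": (127_272_109, 2_930_573_938),
    "BlogSet-BR": (58_494_090, 1_807_669_068),
    "ptwiki-20190301": (7_053_954, 162_109_057),
}

total_sentences = sum(sentences for sentences, _ in corpora.values())
total_tokens = sum(tokens for _, tokens in corpora.values())

print(f"All Corpora: {total_sentences:,} sentences, {total_tokens:,} tokens")
# All Corpora: 192,820,153 sentences, 4,900,352,063 tokens

for name, (sentences, tokens) in corpora.items():
    avg_len = tokens / sentences          # average tokens per sentence
    share = 100 * tokens / total_tokens   # share of all tokens, in percent
    print(f"{name}: {avg_len:.1f} tokens/sentence, {share:.1f}% of tokens")
```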

Citing our Paper

@inproceedings{santos2019assessing,
  author    = {Joaquim Santos and
               Bernardo Consoli and
               Cicero dos Santos and
               Juliano Terra and
               Sandra Collovini and
               Renata Vieira},
  title     = {Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition},
  booktitle = {Proceedings of the 8th Brazilian Conference on Intelligent Systems},
  pages     = {437--442},
  year      = {2019}
}
