NameTag 3

NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc.

NameTag 3 offers state-of-the-art or near state-of-the-art performance in English, German, Spanish, Dutch, Czech and Ukrainian.

NameTag 3 is free software licensed under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. NameTag is versioned using Semantic Versioning.

Copyright 2024 Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic.

Current Release

NameTag 3 can be used either as a command-line tool or by sending requests to the NameTag web service.

NameTag 3 source code can be found at GitHub.

The NameTag website provides download links for both the released packages and the trained models, hosts the documentation, and links to the demo and the online web service.

License

Copyright 2024 Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic.

NameTag 3 is free software licensed under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. NameTag is versioned using Semantic Versioning.

Please Cite as (How to Cite)

If you use this software, please give us credit by referencing Straková et al. (2019):

@inproceedings{strakova-etal-2019-neural,
    title = "Neural Architectures for Nested {NER} through Linearization",
    author = "Strakov{\'a}, Jana  and
      Straka, Milan  and
      Hajic, Jan",
    editor = "Korhonen, Anna  and
      Traum, David  and
      M{\`a}rquez, Llu{\'\i}s",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1527",
    doi = "10.18653/v1/P19-1527",
    pages = "5326--5331",
}

Versions

In contrast to NameTag 2, NameTag 3 is a fine-tuned large language model (LLM) with either a classification head for flat NEs (e.g., the CoNLL-2003 English data) or a seq2seq decoding head for nested NEs (e.g., the CNEC 2.0 Czech data). The seq2seq decoding head is the one proposed by Straková et al. (2019).

Requirements

The software has been developed and tested on Linux and is run from the command line.

NameTag 3 without Installation with curl

For basic use without installation, the simple script nametag3_with_curl.sh accesses the NameTag 3 web service from the command line using curl. The script sends your input to a remote server. Do not send personal or private data unless you are authorized and comfortable with it being processed by NameTag 3.

Usage:

  1. Get the nametag3_with_curl.sh script either by cloning the entire NameTag 3 repository:
git clone https://github.com/ufal/nametag3

or by downloading just the script from the NameTag 3 repository: open

https://github.com/ufal/nametag3/blob/main/nametag3_with_curl.sh

and click the download button ("Download raw file").

  2. Save your text in a plaintext file (see the example in examples/cs_input.txt). At the command line, type the following command:
./nametag3_with_curl.sh examples/cs_input.txt
  3. The output is printed to the standard output. To redirect the output into a file, type:
./nametag3_with_curl.sh examples/cs_input.txt > output_file.xml
  4. Additionally, you can specify the language of your data. The options are english, german, dutch, spanish, ukrainian, and czech (lowercased):
./nametag3_with_curl.sh examples/en_input.txt english > output_file.xml

NameTag 3 Client without Installation with basic Python

The nametag3_client.py script requires only basic Python: it needs no additional packages and no downloaded models. By default, the script calls the NameTag 3 server. Do not send personal or private data unless you are authorized and comfortable with it being processed by NameTag 3.

Usage:

  1. Get this script either by cloning the entire NameTag 3 repository:
git clone https://github.com/ufal/nametag3

or by downloading just nametag3_client.py from the NameTag 3 repository: open

https://github.com/ufal/nametag3/blob/main/nametag3_client.py

and click the download button ("Download raw file").

  2. Save your text in a plaintext file (see the example in examples/cs_input.txt). At the command line, type the following command:
./nametag3_client.py examples/cs_input.txt
  3. The output is printed to the standard output. To redirect the output into a file, type:
./nametag3_client.py examples/cs_input.txt > output_file.xml

Or you can specify the output filename:

./nametag3_client.py examples/cs_input.txt --outfile=output_file.xml
  4. Additionally, you can specify the language of your data or the exact model to use. The language options are english, german, dutch, spanish, ukrainian, and czech (lowercased):
./nametag3_client.py examples/en_input.txt --model=english > output_file.xml

The list of available models can be obtained by:

./nametag3_client.py --list_models

E.g.:

./nametag3_client.py examples/cs_input.txt --model=nametag3-czech-cnec2.0-240830

For other available input and output formats, as well as other options, see the script command-line arguments.

Installation

  1. Clone the repository:
git clone https://github.com/ufal/nametag3
  2. Create a Python virtual environment called venv, with torch, in the root of the cloned repository:
python3 -m venv venv
venv/bin/pip3 install -r requirements.txt
  3. Download the NameTag 3 Models:

Download the latest version of NameTag 3 models.

  4. The nametag3.py script is then called using the Python installed in your virtual environment:
venv/bin/python3 ./nametag3.py [--argument=value]

Running NER Prediction with NameTag 3

The main NameTag 3 script is called nametag3.py. Example NER prediction usage:

venv/bin/python3 nametag3.py \
  --load_checkpoint=models/nametag3-multilingual-conll-240830/ \
  --test_data=examples/en_input.conll

The input data file is a vertical file with one token and its label(s) per line. Columns are separated by a tab, multiple labels are separated by a |, and sentences are separated by empty lines (the format corresponds to the first and the fourth column of the well-known CoNLL-2003 shared task data). A line containing -DOCSTART- with the label O, as seen in the CoNLL-2003 shared task data, can be used to mark document boundaries. Input examples can be found in nametag3.py and in examples.
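
The format described above can be read with a few lines of Python. The following is a minimal sketch, not the reader used by nametag3.py: the function name read_vertical and the sample labels (B-PER, B-LOC) are assumptions made for illustration, and sentences are assumed to be separated by empty lines.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, List[str]]  # (token form, list of labels)


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences from a vertical file (hypothetical helper).

    Each non-empty line holds a token and its labels separated by a tab;
    multiple labels are joined by '|'; an empty line ends a sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Empty line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        form, _, label_field = line.partition("\t")
        labels = label_field.split("|") if label_field else []
        sentence.append((form, labels))
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence


# Illustrative input; the labels are assumed, not taken from the examples directory.
sample = "John\tB-PER\nlives\tO\n\nPrague\tB-LOC\n"
sentences = list(read_vertical(sample.splitlines()))
```

A -DOCSTART- line is returned like any other token here, so a caller that needs document boundaries would have to check for that token form explicitly.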

Training NameTag 3

The main NameTag 3 script nametag3.py can also be used to train a model on a custom corpus. Training is triggered by the --train_data parameter. Optionally, --dev_data and training hyperparameters can be provided.

The input data file is a vertical file with one token and its label(s) per line. Columns are separated by a tab, multiple labels are separated by a |, and sentences are separated by empty lines (the format corresponds to the first and the fourth column of the well-known CoNLL-2003 shared task data). A line containing -DOCSTART- with the label O, as seen in the CoNLL-2003 shared task data, can be used to mark document boundaries. Input examples can be found in nametag3.py and in examples.
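
When preparing a custom training corpus, the vertical format can be produced from tokenized, labeled documents. The sketch below is an illustration only: the helper write_vertical, the in-memory data layout, and the sample labels are assumptions, and following each -DOCSTART- line with an empty line mirrors the CoNLL-2003 convention rather than a documented requirement of NameTag 3.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]   # (token form, list of labels)
Sentence = List[Token]
Document = List[Sentence]


def write_vertical(documents: List[Document], mark_documents: bool = True) -> str:
    """Serialize documents into the vertical format (hypothetical helper)."""
    lines: List[str] = []
    for document in documents:
        if mark_documents:
            # Document boundary: -DOCSTART- with the label O, then an empty line.
            lines.append("-DOCSTART-\tO")
            lines.append("")
        for sentence in document:
            for form, labels in sentence:
                # Token and labels are tab-separated; labels are joined by '|'.
                lines.append(form + "\t" + "|".join(labels))
            lines.append("")  # an empty line ends the sentence
    return "\n".join(lines)


# Illustrative data; the label names are assumed.
doc = [[("John", ["B-PER"]), ("lives", ["O"])], [("Prague", ["B-LOC"])]]
text = write_vertical([doc])
```

The resulting string can be written to a file and passed to --train_data.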

Example usage of multilingual training for flat NER with a softmax classification head:

venv/bin/python3 nametag3.py \
  --batch_size=8 \
  --context_type="split_document" \
  --corpus="english-CoNLL2003-conll,german-CoNLL2003-conll,spanish-CoNLL2002-conll,dutch-CoNLL2002-conll,czech-cnec2.0-conll,ukrainian-languk-conll" \
  --decoding="classification" \
  --dev_data=data/english-CoNLL2003-conll/dev.conll,data/german-CoNLL2003-conll/dev.conll,data/spanish-CoNLL2002-conll/dev.conll,data/dutch-CoNLL2002-conll/dev.conll,data/czech-cnec2.0-conll/dev.conll,data/ukrainian-languk-conll/dev.conll \
  --dropout=0.5 \
  --epochs=20 \
  --evaluate_test_data \
  --hf_plm="xlm-roberta-large" \
  --learning_rate=2e-5 \
  --logdir="logs/" \
  --name="multilingual" \
  --sampling="temperature" \
  --save_best_checkpoint \
  --test_data=data/english-CoNLL2003-conll/test.conll,data/german-CoNLL2003-conll/test.conll,data/spanish-CoNLL2002-conll/test.conll,data/dutch-CoNLL2002-conll/test.conll,data/czech-cnec2.0-conll/test.conll,data/ukrainian-languk-conll/test.conll \
  --threads=4 \
  --train_data=data/english-CoNLL2003-conll/train.conll,data/german-CoNLL2003-conll/train.conll,data/spanish-CoNLL2002-conll/train.conll,data/dutch-CoNLL2002-conll/train.conll,data/czech-cnec2.0-conll/train.conll,data/ukrainian-languk-conll/train.conll \
  --warmup_epochs=1
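
The long comma-separated values in the command above follow a regular pattern, so they can be generated rather than typed. This is a small sketch under the assumption that every corpus lives in data/<corpus>/ with train.conll, dev.conll and test.conll files, as in the example; the helper build_data_args is hypothetical and not part of NameTag 3.

```python
from typing import List


def build_data_args(corpora: List[str], data_root: str = "data") -> List[str]:
    """Build the --corpus and --{train,dev,test}_data arguments (hypothetical helper).

    Assumes the layout <data_root>/<corpus>/<split>.conll used in the example above.
    """
    args = ["--corpus=" + ",".join(corpora)]
    for split in ("train", "dev", "test"):
        paths = ",".join(f"{data_root}/{corpus}/{split}.conll" for corpus in corpora)
        args.append(f"--{split}_data={paths}")
    return args


corpora = ["english-CoNLL2003-conll", "german-CoNLL2003-conll"]
data_args = build_data_args(corpora)
```

The returned list can be appended to the remaining training arguments, for instance when launching nametag3.py through subprocess.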

NameTag 3 Server

See nametag3_server.py.

The mandatory arguments are given in this order:

  • port
  • default model name
  • each following triple of arguments defines a model:
    • the first argument is the model name
    • the second argument is the model directory
    • the third argument is the acknowledgements to append

A single trained model stored on disk can be served under several names. In the following example, one model (models/nametag3-multilingual-conll-240830/) is served both as nametag3-multilingual-conll-240830 and as nametag3-english-CoNLL2003-conll-240830. The first name is also available as multilingual-conll, and the second as eng and en (additional names are appended to the model name, separated by colons):

venv/bin/python3 nametag3_server.py 8001 multilingual-conll \
  nametag3-multilingual-conll-240830:multilingual-conll models/nametag3-multilingual-conll-240830/ multilingual_acknowledgements \
  nametag3-english-CoNLL2003-conll-240830:eng:en models/nametag3-multilingual-conll-240830/ english_acknowledgements

Example server usage with three monolingual models:

venv/bin/python3 nametag3_server.py 8001 cs \
    czech-cnec2.0-240830:cs:ces models/nametag3-czech-cnec2.0-240830/ czech-cnec2_acknowledgements \
    english-CoNLL2003-conll-240830:en:eng models/nametag3-english-CoNLL2003-conll-240830/ english-CoNLL2003-conll_acknowledgements \
    spanish-CoNLL2002-conll-240830:es:spa models/nametag3-spanish-CoNLL2002-conll-240830/ spanish-CoNLL2002-conll_acknowledgements
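
The argument layout described in this section (port, default model name, then one triple per model) can be illustrated with a short Python sketch. It only mirrors the layout shown in the examples and is not the parsing code of nametag3_server.py; the names ModelSpec and parse_server_args are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ModelSpec:
    name: str               # primary model name
    aliases: List[str]      # further names after the first ':'
    directory: str          # model directory
    acknowledgements: str   # acknowledgements to append


def parse_server_args(argv: List[str]) -> Tuple[int, str, List[ModelSpec]]:
    """Split positional server arguments into port, default model and model triples."""
    if len(argv) < 5 or (len(argv) - 2) % 3 != 0:
        raise ValueError("expected: port default_model (names directory acknowledgements)+")
    port = int(argv[0])
    default_model = argv[1]
    models = []
    for i in range(2, len(argv), 3):
        names, directory, acknowledgements = argv[i:i + 3]
        # "name:alias1:alias2" -> primary name plus aliases
        name, *aliases = names.split(":")
        models.append(ModelSpec(name, aliases, directory, acknowledgements))
    return port, default_model, models


port, default_model, models = parse_server_args([
    "8001", "multilingual-conll",
    "nametag3-multilingual-conll-240830:multilingual-conll",
    "models/nametag3-multilingual-conll-240830/", "multilingual_acknowledgements",
    "nametag3-english-CoNLL2003-conll-240830:eng:en",
    "models/nametag3-multilingual-conll-240830/", "english_acknowledgements",
])
```

In this example both ModelSpec entries point to the same directory, which corresponds to serving one stored model under several names.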

Authors and Contact: