NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc.
NameTag 3 offers state-of-the-art or near state-of-the-art performance in English, German, Spanish, Dutch, Czech and Ukrainian.
NameTag 3 is free software under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. NameTag is versioned using Semantic Versioning.
Copyright 2024 Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic.
NameTag 3 can be used either as a command-line tool or by sending requests to the NameTag web service:
- LINDAT/CLARIN hosts the NameTag Web Application,
- LINDAT/CLARIN also hosts the NameTag REST Web Service.
NameTag 3 source code can be found at GitHub.
The NameTag website provides download links for both the released packages and the trained models, hosts the documentation, and links to the demo and the online web service.
If you use this software, please give us credit by referencing Straková et al. (2019):
@inproceedings{strakova-etal-2019-neural,
title = "Neural Architectures for Nested {NER} through Linearization",
author = "Strakov{\'a}, Jana and
Straka, Milan and
Hajic, Jan",
editor = "Korhonen, Anna and
Traum, David and
M{\`a}rquez, Llu{\'\i}s",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1527",
doi = "10.18653/v1/P19-1527",
pages = "5326--5331",
}
Unlike NameTag 2, NameTag 3 is a fine-tuned large language model (LLM) with either a classification head for flat NEs (e.g., the CoNLL-2003 English data) or a seq2seq decoding head for nested NEs (e.g., the CNEC 2.0 Czech data). The seq2seq decoding head is the one proposed by Straková et al. (2019).
The software has been developed and tested on Linux and is run from the command line.
For basic use without installation, see the simple script nametag3_with_curl.sh,
which accesses the NameTag 3 web service from the command line using curl. The
script calls a remote server. Do not send personal or private data unless you are
authorized and comfortable with it being processed by NameTag 3.
Usage:
- Get the nametag3_with_curl.sh script either by cloning the entire NameTag 3 repository:
git clone https://github.com/ufal/nametag3
or by downloading just the script from the NameTag 3 repository: open
https://github.com/ufal/nametag3/blob/main/nametag3_with_curl.sh
and click the download button ("Download raw file").
- Save your text in a plaintext file (see the example in examples/cs_input.txt). At the command line, type the following command:
./nametag3_with_curl.sh examples/cs_input.txt
- The output will be printed to the standard output. To redirect the output into a file, you can type:
./nametag3_with_curl.sh examples/cs_input.txt > output_file.xml
- Additionally, you can specify the language of your data. The options are english, german, dutch, spanish, ukrainian, and czech (lowercased):
./nametag3_with_curl.sh examples/en_input.txt english > output_file.xml
The nametag3_client.py script only requires basic Python. It needs neither
additional installed packages nor downloaded trained models. By default, the
script calls the NameTag 3 server. Do not send personal or private data unless
you are authorized and comfortable with it being processed by NameTag 3.
Usage:
- Get this script either by cloning the entire NameTag 3 repository:
git clone https://github.com/ufal/nametag3
or by downloading just nametag3_client.py from the NameTag 3 repository: open
https://github.com/ufal/nametag3/blob/main/nametag3_client.py
and click the download button ("Download raw file").
- Save your text in a plaintext file (see the example in examples/cs_input.txt). At the command line, type the following command:
./nametag3_client.py examples/cs_input.txt
- The output will be printed to the standard output. To redirect the output into a file, you can type:
./nametag3_client.py examples/cs_input.txt > output_file.xml
Or you can specify the output filename:
./nametag3_client.py examples/cs_input.txt --outfile=output_file.xml
- Additionally, you can specify the language of your data or the exact model to
use. The language options are english, german, dutch, spanish, ukrainian, and czech (lowercased):
./nametag3_client.py examples/en_input.txt --model=english > output_file.xml
The list of available models can be obtained by:
./nametag3_client.py --list_models
E.g.:
./nametag3_client.py examples/cs_input.txt --model=nametag3-czech-cnec2.0-240830
For other available input and output formats, as well as other options, see the script command-line arguments.
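When many files need to be processed, the client can also be driven from another Python program. The following is a minimal sketch, assuming nametag3_client.py sits in the current directory and using only the --model and --outfile options shown above. The helper names build_client_command and run_client are hypothetical and are not part of NameTag 3.

```python
import subprocess
import sys


def build_client_command(input_path, model=None, outfile=None,
                         script="./nametag3_client.py"):
    """Build the argument list for one nametag3_client.py call (hypothetical helper)."""
    command = [sys.executable, script, input_path]
    if model is not None:
        # Either a language name (e.g. "english") or a full model name.
        command.append(f"--model={model}")
    if outfile is not None:
        command.append(f"--outfile={outfile}")
    return command


def run_client(input_path, model=None, outfile=None):
    """Run the client once; the text in input_path is sent to the NameTag 3 server."""
    subprocess.run(build_client_command(input_path, model, outfile), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_client(...) to actually contact the server.
    print(build_client_command("examples/en_input.txt", model="english",
                               outfile="output_file.xml"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.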
- Clone the repository:
git clone https://github.com/ufal/nametag3
- Create a Python virtual environment called venv in the root of the cloned repository and install the requirements (including torch):
python3 -m venv venv
venv/bin/pip3 install -r requirements.txt
- Download the latest version of the NameTag 3 models.
- The nametag3.py script is then called using the Python installed in your virtual environment:
venv/bin/python3 ./nametag3.py [--argument=value]
The main NameTag 3 script is called nametag3.py. Example NER prediction usage:
venv/bin/python3 nametag3.py \
--load_checkpoint=models/nametag3-multilingual-conll-240830/ \
--test_data=examples/en_input.conll
The input data file is a vertical file with one token and its label(s) per
line. Columns are separated by a tab, multiple labels are separated by a |, and
sentences are delimited by empty lines (the layout corresponds to the first and
the fourth column of the well-known CoNLL-2003 shared task data). A line
containing -DOCSTART- with the label O, as seen in the CoNLL-2003 shared task
data, can be used to mark document boundaries. Input examples can be found in
nametag3.py and in the examples directory.
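To make the vertical format concrete, here is a minimal sketch of a reader for it. It is not the loader used by nametag3.py. The function name read_vertical and the sample labels are illustrative assumptions, and the reader only relies on the layout described above (tab-separated columns, labels joined by |, empty lines between sentences, -DOCSTART- lines between documents).

```python
# Illustrative sample; the labels are made up for demonstration purposes.
SAMPLE = (
    "-DOCSTART-\tO\n"
    "\n"
    "John\tB-PER\n"
    "lives\tO\n"
    "in\tO\n"
    "Prague\tB-LOC\n"
    "\n"
    "-DOCSTART-\tO\n"
    "\n"
    "Charles\tB-ORG|B-PER\n"
    "University\tI-ORG\n"
)


def read_vertical(lines):
    """Parse vertical-format lines into documents -> sentences -> (token, labels)."""
    documents = [[]]   # list of documents, each a list of sentences
    sentence = []      # tokens of the sentence currently being read
    for raw in lines:
        line = raw.rstrip("\r\n")
        columns = line.split("\t") if line.strip() else []
        is_docstart = bool(columns) and columns[0] == "-DOCSTART-"
        if not columns or is_docstart:
            # An empty line or a -DOCSTART- line ends the current sentence.
            if sentence:
                documents[-1].append(sentence)
                sentence = []
            # -DOCSTART- additionally opens a new document.
            if is_docstart and documents[-1]:
                documents.append([])
            continue
        labels = columns[1].split("|") if len(columns) > 1 else []
        sentence.append((columns[0], labels))
    if sentence:
        documents[-1].append(sentence)
    return [document for document in documents if document]


if __name__ == "__main__":
    for number, document in enumerate(read_vertical(SAMPLE.splitlines()), start=1):
        print(f"document {number}: {document}")
```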
The main NameTag 3 script nametag3.py can also be used for training on a custom
corpus. It does so when the --train_data parameter is provided. Optionally,
--dev_data and training hyperparameters can be provided as well.
The input data file is a vertical file with one token and its label(s) per
line. Columns are separated by a tab, multiple labels are separated by a |, and
sentences are delimited by empty lines (the layout corresponds to the first and
the fourth column of the well-known CoNLL-2003 shared task data). A line
containing -DOCSTART- with the label O, as seen in the CoNLL-2003 shared task
data, can be used to mark document boundaries. Input examples can be found in
nametag3.py and in the examples directory.
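When preparing a custom training corpus, annotated sentences have to be serialized into this vertical format. The following is a minimal sketch of such a writer, assuming the data are held as documents, each a list of sentences, each a list of (token, labels) pairs. The function name write_vertical and the sample labels are hypothetical and are not part of NameTag 3.

```python
def write_vertical(documents, mark_documents=True):
    """Serialize documents -> sentences -> (token, labels) into vertical-format text."""
    lines = []
    for document in documents:
        if mark_documents:
            # Optional document boundary, as in the CoNLL-2003 data.
            lines.append("-DOCSTART-\tO")
            lines.append("")
        for sentence in document:
            for token, labels in sentence:
                # One token per line; multiple labels are joined by "|".
                lines.append(token + "\t" + "|".join(labels))
            lines.append("")  # an empty line closes the sentence
    return "\n".join(lines)


if __name__ == "__main__":
    corpus = [[[("John", ["B-PER"]), ("sleeps", ["O"])]]]
    print(write_vertical(corpus), end="")
```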
Example usage of multilingual training for flat NER with a softmax classification head:
venv/bin/python3 nametag3.py \
--batch_size=8 \
--context_type="split_document" \
--corpus="english-CoNLL2003-conll,german-CoNLL2003-conll,spanish-CoNLL2002-conll,dutch-CoNLL2002-conll,czech-cnec2.0-conll,ukrainian-languk-conll" \
--decoding="classification" \
--dev_data=data/english-CoNLL2003-conll/dev.conll,data/german-CoNLL2003-conll/dev.conll,data/spanish-CoNLL2002-conll/dev.conll,data/dutch-CoNLL2002-conll/dev.conll,data/czech-cnec2.0-conll/dev.conll,data/ukrainian-languk-conll/dev.conll \
--dropout=0.5 \
--epochs=20 \
--evaluate_test_data \
--hf_plm="xlm-roberta-large" \
--learning_rate=2e-5 \
--logdir="logs/" \
--name="multilingual" \
--sampling="temperature" \
--save_best_checkpoint \
--test_data=data/english-CoNLL2003-conll/test.conll,data/german-CoNLL2003-conll/test.conll,data/spanish-CoNLL2002-conll/test.conll,data/dutch-CoNLL2002-conll/test.conll,data/czech-cnec2.0-conll/test.conll,data/ukrainian-languk-conll/test.conll \
--threads=4 \
--train_data=data/english-CoNLL2003-conll/train.conll,data/german-CoNLL2003-conll/train.conll,data/spanish-CoNLL2002-conll/train.conll,data/dutch-CoNLL2002-conll/train.conll,data/czech-cnec2.0-conll/train.conll,data/ukrainian-languk-conll/train.conll \
--warmup_epochs=1
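The comma-separated values of --corpus, --train_data, --dev_data and --test_data in the command above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes the data/<corpus>/<split>.conll directory layout seen in the example. The function name data_arguments is hypothetical and is not part of NameTag 3.

```python
def data_arguments(corpora, data_dir="data"):
    """Build --corpus and the comma-separated data arguments for the given corpora."""
    arguments = ["--corpus=" + ",".join(corpora)]
    for split in ("train", "dev", "test"):
        # Assumed layout: <data_dir>/<corpus>/<split>.conll, as in the example above.
        paths = [f"{data_dir}/{corpus}/{split}.conll" for corpus in corpora]
        arguments.append(f"--{split}_data=" + ",".join(paths))
    return arguments


if __name__ == "__main__":
    for argument in data_arguments(["english-CoNLL2003-conll",
                                    "german-CoNLL2003-conll"]):
        print(argument)
```

The order of the corpora is the same in every generated argument, which keeps the training, development and test files aligned with the names given in --corpus.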
To run your own NameTag 3 server, see nametag3_server.py.
The mandatory arguments are given in this order:
- port
- default model name
- each following triple of arguments defines a model:
  - the first argument is the model name,
  - the second argument is the model directory,
  - the third argument is the acknowledgements to append.
A single trained model stored on disk can be served under several names, as in
the following example, in which one model
(models/nametag3-multilingual-conll-240830/) is served both as
nametag3-multilingual-conll-240830 and as
nametag3-english-CoNLL2003-conll-240830. The first is also known as
multilingual-conll, and the second is also available as eng and en:
venv/bin/python3 nametag3_server.py 8001 multilingual-conll \
nametag3-multilingual-conll-240830:multilingual-conll models/nametag3-multilingual-conll-240830/ multilingual_acknowledgements \
nametag3-english-CoNLL2003-conll-240830:eng:en models/nametag3-multilingual-conll-240830/ english_acknowledgements
Example server usage with three monolingual models:
venv/bin/python3 nametag3_server.py 8001 cs \
czech-cnec2.0-240830:cs:ces models/nametag3-czech-cnec2.0-240830/ czech-cnec2_acknowledgements \
english-CoNLL2003-conll-240830:en:eng models/nametag3-english-CoNLL2003-conll-240830/ english-CoNLL2003-conll_acknowledgements \
spanish-CoNLL2002-conll-240830:es:spa models/nametag3-spanish-CoNLL2002-conll-240830/ spanish-CoNLL2002-conll_acknowledgements
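The positional layout of the server arguments (port, default model, then triples of names, directory and acknowledgements) can be illustrated with a small parser. This is a sketch of the layout described above, not the parsing code of nametag3_server.py. The function name parse_server_arguments and the returned dictionary keys are hypothetical, and treating the colon-separated names as "first name plus aliases" is an interpretation of the examples.

```python
def parse_server_arguments(argv):
    """Split [port, default_model, (names, directory, acknowledgements)...]."""
    if len(argv) < 5 or (len(argv) - 2) % 3 != 0:
        raise ValueError(
            "expected: port default_model (names directory acknowledgements)+")
    port, default_model = int(argv[0]), argv[1]
    models = []
    for start in range(2, len(argv), 3):
        names, directory, acknowledgements = argv[start:start + 3]
        # Names are colon-separated: the model name followed by its aliases.
        name, *aliases = names.split(":")
        models.append({
            "name": name,
            "aliases": aliases,
            "directory": directory,
            "acknowledgements": acknowledgements,
        })
    return port, default_model, models


if __name__ == "__main__":
    print(parse_server_arguments([
        "8001", "cs",
        "czech-cnec2.0-240830:cs:ces",
        "models/nametag3-czech-cnec2.0-240830/",
        "czech-cnec2_acknowledgements",
    ]))
```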