title | description | prev | next | type | id |
---|---|---|---|---|---|
Chapter 4: Training a neural network model | In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll train your own model from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful. | /chapter3 | | chapter | 4 |
To train a model, you typically need training data and development data for evaluation. What is this evaluation data used for?
During training, the model will only be updated from the training data. The development data is used to evaluate the model by comparing its predictions on unseen examples to the correct annotations. This is then reflected in the accuracy score.
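As a rough illustration of keeping the two sets apart, the sketch below holds out a fraction of the examples as development data. The helper `split_train_dev`, the 20% fraction, and the fixed seed are assumptions made for this example, not something the course prescribes.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as development data."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    # The held-out examples are never used to update the model
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical placeholder examples
examples = [f"example {i}" for i in range(10)]
train_data, dev_data = split_train_dev(examples)
print(len(train_data), len(dev_data))
```

The point of the split is that no example appears in both sets, so the score on the development data reflects how the model handles text it has not been updated on.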
spaCy's rule-based `Matcher` is a great way to quickly create training data for
named entity models. A list of sentences is available as the variable `TEXTS`.
You can print it to inspect it. We want to find all mentions of different iPhone
models, so we can create training data to teach a model to recognize them as
`"GADGET"`.
- Write a pattern for two tokens whose lowercase forms match `"iphone"` and `"x"`.
- Write a pattern for two tokens: one token whose lowercase form matches `"iphone"` and a digit.
- To match the lowercase form of a token, you can use the `"LOWER"` attribute. For example: `{"LOWER": "apple"}`.
- To find a digit token, you can use the `"IS_DIGIT"` flag. For example: `{"IS_DIGIT": True}`.
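The two patterns described above could look like the following sketch. The three sentences in `TEXTS` are hypothetical stand-ins for the course's variable, and a blank English pipeline is assumed because only the tokenizer is needed here.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# A blank English pipeline is enough: the Matcher only needs tokenization
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Two tokens: lowercase form "iphone", followed by a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])

# Hypothetical stand-ins for the TEXTS variable provided by the course
TEXTS = [
    "How to preorder the iPhone X",
    "iPhone 8 reviews are here",
    "Your phone does not need an update",
]

docs = []
for doc in nlp.pipe(TEXTS):
    # Turn every match into an entity span labelled with the match ID ("GADGET")
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]
    doc.ents = spans
    docs.append(doc)

for doc in docs:
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])
```

Texts without a match are kept with empty `doc.ents`, so the model also sees examples in which nothing should be predicted.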
After creating the data for our corpus, we need to save it out to a `.spacy`
file. The code from the previous example is already available.
- Instantiate the `DocBin` with the list of `docs`.
- Save the `DocBin` to a file called `train.spacy`.
- You can initialize the `DocBin` with a list of docs by passing them in as the keyword argument `docs`.
- The `DocBin`'s `to_disk` method takes one argument: the path of the file to save the binary data to. Make sure to use the file extension `.spacy`.
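A minimal sketch of the save step follows. The single annotated `Doc` is a hypothetical stand-in for the `docs` list from the previous exercise, and the file is written to a temporary directory (an assumption made so the example leaves nothing behind) rather than to `./train.spacy`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Hypothetical stand-in for the docs created with the Matcher
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]
docs = [doc]

# Pass the docs in as the keyword argument "docs"
doc_bin = DocBin(docs=docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # to_disk takes the path of the file to write the binary data to
    doc_bin.to_disk(path)
    # Read the file back to check that the annotations survived
    loaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print([(ent.text, ent.label_) for ent in loaded[0].ents])
```

Loading the file back is not part of the exercise. It is only there to show that the entity annotations are stored along with the text.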
The `config.cfg` file is the "single source of truth" for training a pipeline
with spaCy. Which of the following is not true about the config?
The config file includes all settings for the training process, including hyperparameters.
Because the config includes all settings and no hidden defaults, it can help make your training experiments more reproducible and others will be able to re-run your experiments with the exact same settings.
The config file includes all settings related to training and how to set up the
pipeline, but it doesn't package your pipeline. To create an installable Python
package, you can use the `spacy package` command.
The `[components]` block of the config file includes all pipeline components and
their settings, including the model implementations used.
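To make the block structure more concrete, here is a heavily simplified fragment parsed with Thinc's `Config` class, which spaCy's config system builds on. The fragment is an illustrative assumption: a real generated config contains many more sections and settings, including the model architecture under the component.

```python
from thinc.api import Config

# Simplified, hypothetical fragment. A real config.cfg has many more settings.
CONFIG_FRAGMENT = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
factory = "ner"
"""

config = Config().from_str(CONFIG_FRAGMENT)

# The pipeline lists component names, and each name has its own [components] sub-block
print(config["nlp"]["pipeline"])
print(config["components"]["ner"]["factory"])
```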
The `init config` command auto-generates
a config file for training with the default settings. We want to train a named
entity recognizer, so we'll generate a config file for one pipeline component,
`ner`. Because we're executing the command in a Jupyter environment in this
course, we're using the prefix `!`. If you're running the command in your local
terminal, you can leave this out.
- Use spaCy's `init config` command to auto-generate a config for an English pipeline.
- Save the config to a file `config.cfg`.
- Use the `--pipeline` argument to specify one pipeline component, `ner`.
- The argument `--lang` defines the language class, e.g. `en` for English.
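In a terminal, the steps above correspond to `python -m spacy init config ./config.cfg --lang en --pipeline ner`. The sketch below runs the same command from Python through `subprocess`. Writing to a temporary directory is an assumption made here to keep the example self-contained.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    command = [
        sys.executable, "-m", "spacy", "init", "config",
        str(config_path),     # where to save the generated config
        "--lang", "en",       # language class: English
        "--pipeline", "ner",  # one pipeline component: the entity recognizer
    ]
    subprocess.run(command, check=True)
    config_text = config_path.read_text(encoding="utf8")

# The generated file should contain a block for the ner component
print("[components.ner]" in config_text)
```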
Let's take a look at the config spaCy just generated! You can run the command below to print the config to the terminal and inspect it.
Let's use the config file generated in the previous exercise and the training corpus we've created to train a named entity recognizer!
The `train` command lets you train a model
from a training config file. A file `config_gadget.cfg` is already available in
the directory `exercises/en`, as well as a file `train_gadget.spacy` containing
the training examples, and a file `dev_gadget.spacy` containing the evaluation
examples. Because we're executing the command in a Jupyter environment in this
course, we're using the prefix `!`. If you're running the command in your local
terminal, you can leave this out.
- Call the `train` command with the file `exercises/en/config_gadget.cfg`.
- Save the trained pipeline to a directory `output`.
- Pass in the `exercises/en/train_gadget.spacy` and `exercises/en/dev_gadget.spacy` paths.
- The first argument of the `spacy train` command is the path to the config file.
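Put together, the call could be assembled as in the sketch below. The helper `build_train_command` is hypothetical and only constructs the argument list. It relies on the `--output`, `--paths.train` and `--paths.dev` options of `spacy train`, and training is only started if the exercise files actually exist at the given paths.

```python
import subprocess
import sys
from pathlib import Path


def build_train_command(config_path, output_dir, train_path, dev_path):
    """Assemble the arguments for spaCy's train command (hypothetical helper)."""
    return [
        sys.executable, "-m", "spacy", "train",
        str(config_path),                 # first argument: the config file
        "--output", str(output_dir),      # where to save the trained pipeline
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
    ]


config_path = "exercises/en/config_gadget.cfg"
train_path = "exercises/en/train_gadget.spacy"
dev_path = "exercises/en/dev_gadget.spacy"
command = build_train_command(config_path, "output", train_path, dev_path)
print(" ".join(command[1:]))

# Only start training if the files are available in the working directory
if all(Path(p).exists() for p in (config_path, train_path, dev_path)):
    subprocess.run(command, check=True)
```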
Let's see how the model performs on unseen data! To speed things up a little, we
already ran a trained pipeline for the label `"GADGET"` over some text. Here are
some of the results:
Text | Entities |
---|---|
Apple is slowing down the iPhone 8 and iPhone X - how to stop it | (iPhone 8, iPhone X) |
I finally understand what the iPhone X 'notch' is for | (iPhone X,) |
Everything you need to know about the Samsung Galaxy S9 | (Samsung Galaxy,) |
Looking to compare iPad models? Here’s how the 2018 lineup stacks up | (iPad,) |
The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple | (iPhone 8, iPhone 8) |
what is the cheapest ipad, especially ipad pro??? | (ipad, ipad) |
Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics | (Samsung Galaxy,) |
Out of all the entities in the texts, how many did the model get correct? Keep in mind that incomplete entity spans count as mistakes, too! Tip: Count the number of entities that the model should have predicted. Then count the number of entities it actually predicted correctly and divide it by the number of total correct entities.
Try counting the number of correctly predicted entities and divide it by the number of total correct entities the model should have predicted.
On our test data, the model achieved an accuracy of 70%.
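The counting can be sketched as follows. The predicted entities are copied from the table above. The gold entities are an assumption read off the texts (for example, "Samsung Galaxy S9" and "iPhone 8 Plus" as complete spans), and matching by entity text instead of token offsets is a simplification.

```python
from collections import Counter

# (predicted entities from the table, assumed correct entities) per text
results = [
    (["iPhone 8", "iPhone X"], ["iPhone 8", "iPhone X"]),
    (["iPhone X"], ["iPhone X"]),
    (["Samsung Galaxy"], ["Samsung Galaxy S9"]),
    (["iPad"], ["iPad"]),
    (["iPhone 8", "iPhone 8"], ["iPhone 8", "iPhone 8 Plus"]),
    (["ipad", "ipad"], ["ipad", "ipad pro"]),
    (["Samsung Galaxy"], ["Samsung Galaxy"]),
]

correct = 0
total_gold = 0
for predicted, gold in results:
    # Multiset intersection: an incomplete span such as "iPhone 8" for
    # "iPhone 8 Plus" does not count as correct
    overlap = Counter(predicted) & Counter(gold)
    correct += sum(overlap.values())
    total_gold += len(gold)

accuracy = correct / total_gold
print(f"{correct} of {total_gold} correct: {accuracy:.0%}")
# prints "7 of 10 correct: 70%"
```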
Here's an excerpt from a training set that labels the entity type
`TOURIST_DESTINATION` in traveler reviews.
doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="TOURIST_DESTINATION")]
doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="TOURIST_DESTINATION")]
doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = []
doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="TOURIST_DESTINATION")]
Why is this data and label scheme problematic?
A much better approach would be to only label `"GPE"` (geopolitical entity) or
`"LOCATION"` and then use a rule-based system to determine whether the entity is
a tourist destination in this context. For example, you could resolve the
entity types back to a knowledge base or look them up in a travel wiki.
While it's possible that Paris, AR is also a tourist attraction, this only highlights how subjective the label scheme is and how difficult it will be to decide whether the label applies or not. As a result, this distinction will also be very difficult to learn for the entity recognizer.
Even very uncommon words or misspellings can be labelled as entities. In fact, being able to predict categories in misspelled text based on the context is one of the big advantages of statistical named entity recognition.
- Rewrite the `doc.ents` to only use spans of the label `"GPE"` (cities, states, countries) instead of `"TOURIST_DESTINATION"`.
- Don't forget to add spans for the `"GPE"` entities that weren't labeled in the old data.
- For the spans that are already labelled, you'll only need to change the label
  name from `"TOURIST_DESTINATION"` to `"GPE"`.
- One text includes a city and a state that aren't labeled yet. To add the
  entity spans, count the tokens to find out where the entity span starts and
  where it ends. Keep in mind that the last token index is exclusive! Then add
  a new `Span` to the `doc.ents`.
- Keep an eye on the tokenization! Print the tokens in the `Doc` if you're not sure.
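One way the relabelled data could look is sketched below, assuming a blank English pipeline for tokenization. The token indices for the third text depend on how the tokenizer splits "There's", which is why the tokens are printed first.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="GPE")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="GPE")]

doc3 = nlp("There's also a Paris in Arkansas, lol")
# Print the token indices to check the tokenization before adding spans
print([(token.i, token.text) for token in doc3])
# The city and the state were not labeled in the old data
doc3.ents = [Span(doc3, 4, 5, label="GPE"), Span(doc3, 6, 7, label="GPE")]

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="GPE")]

for doc in (doc1, doc2, doc3, doc4):
    print([(ent.text, ent.label_) for ent in doc.ents])
```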
Here's a small sample of a dataset created to train a new entity type
`"WEBSITE"`. The original dataset contains a few thousand sentences. In this
exercise, you'll be doing the labeling by hand. In real life, you probably want
to automate this and use an annotation tool – for example,
Brat, a popular open-source solution, or
Prodigy, our own annotation tool that integrates with spaCy.
- Complete the token offsets for the `"WEBSITE"` entities in the data.
- Keep in mind that the end token of a span is exclusive. So an entity that
  starts at token 2 and ends at token 3 will have a start of `2` and an end of `4`.
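The exclusive end index can be checked on a small example. The sentence below is hypothetical and not taken from the course dataset. It is built from a list of words so that the token boundaries are explicit.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hypothetical sentence, pre-tokenized so that the indices are unambiguous
words = ["I", "read", "Hacker", "News", "every", "morning"]
doc = Doc(nlp.vocab, words=words)

# The entity covers tokens 2 and 3, so the start is 2 and the (exclusive) end is 4
website = Span(doc, 2, 4, label="WEBSITE")
doc.ents = [website]

print(website.text, website.start, website.end)
```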
A model was trained with the data you just labelled, plus a few thousand similar
examples. After training, it's doing great on `"WEBSITE"`, but doesn't recognize
`"PERSON"` anymore. Why could this be happening?
It's definitely possible for a model to learn about very different categories. For example, spaCy's pre-trained English models can recognize persons, but also organizations or percentages.
If `"PERSON"` entities occur in the training data but aren't labelled, the model
will learn that they shouldn't be predicted. Similarly, if an existing entity
type isn't present in the training data, the model may "forget" and stop
predicting it.
While the hyperparameters can influence a model's accuracy, they're likely not the problem here.
- Update the training data to include annotations for the `"PERSON"`
  entities "PewDiePie" and "Alexis Ohanian".
- To add more entities, add another `Span` to the `doc.ents`.
- Keep in mind that the end token of a span is exclusive. So an entity that
  starts at token 2 and ends at token 3 will have a start of `2` and an end of `4`.
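A sketch of a training example that carries both labels is shown below. The sentence is a hypothetical illustration, not one of the course's texts, and it is pre-tokenized to keep the offsets easy to follow.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hypothetical sentence, pre-tokenized so that the indices are unambiguous
words = ["Reddit", "co-founder", "Alexis", "Ohanian", "answered", "questions", "on", "Reddit"]
doc = Doc(nlp.vocab, words=words)

# Label every entity type the model should keep predicting, not only the new one
doc.ents = [
    Span(doc, 0, 1, label="WEBSITE"),
    Span(doc, 2, 4, label="PERSON"),   # tokens 2 and 3, end index exclusive
    Span(doc, 7, 8, label="WEBSITE"),
]

print([(ent.text, ent.label_) for ent in doc.ents])
```

Annotating the persons alongside the websites gives the model positive examples for both types, so learning the new label does not come at the cost of the existing one.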