---
title: 'Chapter 3: Processing Pipelines'
description: "This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens."
prev: /chapter2
next: /chapter4
type: chapter
id: 3
---
What does spaCy do when you call `nlp` on a string of text?

```python
doc = nlp("This is a sentence.")
```
The tokenizer is always run before all other pipeline components, because it transforms a string of text into a `Doc` object. The pipeline also doesn't have to consist of the tagger, parser and entity recognizer.
The tokenizer turns a string of text into a `Doc` object. spaCy then applies every component in the pipeline to the document, in order.
spaCy computes everything locally on your machine and doesn't need to connect to any server.
When you call `spacy.load()` to load a pipeline, spaCy will initialize the language, add the pipeline and load in the binary model weights. When you call the `nlp` object on a text, the pipeline is already loaded.
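For example, a minimal sketch, assuming the `en_core_web_sm` package is installed:

```python
import spacy

# Loading initializes the language, adds the pipeline and loads the weights
nlp = spacy.load("en_core_web_sm")

# Calling the nlp object runs the tokenizer and all pipeline components
doc = nlp("This is a sentence.")
print([token.text for token in doc])
```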
Let's inspect the small English pipeline!
- Load the `en_core_web_sm` pipeline and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.
The list of component names is available as the `nlp.pipe_names` attribute. The full pipeline consisting of `(name, component)` tuples is available as `nlp.pipeline`.
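A minimal sketch of what this could look like, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Names of the pipeline components, in order
print(nlp.pipe_names)

# (name, component) tuples for the full pipeline
print(nlp.pipeline)
```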
Which of these problems can be solved by custom pipeline components? Choose all that apply!
- Updating the trained pipelines and improving their predictions
- Computing your own values based on tokens and their attributes
- Adding named entities, for example based on a dictionary
- Implementing support for an additional language
Custom components can only modify the `Doc` and can't be used to update weights of other components directly.

Custom components can only modify the `Doc` and can't be used to update weights of other components directly.

Custom components can only modify the `Doc` and can't be used to update weights of other components directly. They're also added to the pipeline after the language class is already initialized and after tokenization, so they're not suitable to add new languages.

Custom components are great for adding custom values to documents, tokens and spans, and customizing the `doc.ents`.
Custom components are added to the pipeline after the language class is already initialized and after tokenization, so they're not suitable to add new languages.
Custom components are added to the pipeline after the language class is already initialized and after tokenization, so they're not suitable to add new languages.
The example shows a custom component that prints the number of tokens in a document. Can you complete it?
- Complete the component function with the `doc`'s length.
- Add the `"length_component"` to the existing pipeline as the first component.
- Try out the new pipeline and process any text with the `nlp` object, for example "This is a sentence.".
- To get the length of a `Doc` object, you can call Python's built-in `len()` function on it.
- Use the `nlp.add_pipe` method to add the component to the pipeline. Remember to use the string name of the component and to set the `first` keyword argument to `True` to make sure it's added before all other components.
- To process a text, call the `nlp` object on it.
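One possible solution could look like the sketch below; the example text is just an illustration:

```python
import spacy
from spacy.language import Language

@Language.component("length_component")
def length_component_function(doc):
    # Print the doc's length and return the doc unchanged
    print(f"This document is {len(doc)} tokens long.")
    return doc

nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

doc = nlp("This is a sentence.")
```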
In this exercise, you'll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to the `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as the variable `matcher`.
- Define the custom component and apply the `matcher` to the `doc`.
- Create a `Span` for each match, assign the label ID for `"ANIMAL"` and overwrite the `doc.ents` with the new spans.
- Add the new component to the pipeline after the `"ner"` component.
- Process the text and print the entity text and entity label for the entities in `doc.ents`.
- Remember that the matches are a list of `(match_id, start, end)` tuples.
- The `Span` class takes four arguments: the parent `doc`, the start index, the end index and the label.
- To add a component after another, use the `after` keyword argument on `nlp.add_pipe`. Remember to use the string name of the component to add it.
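A sketch of one possible solution; since the exercise already provides the `matcher` as a variable, the animal patterns below are just a stand-in:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Stand-in for the matcher created by the exercise
animals = ["Golden Retriever", "cat", "turtle"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", [nlp.make_doc(animal) for animal in animals])

@Language.component("animal_component")
def animal_component_function(doc):
    # Create a Span for each match and overwrite the doc.ents
    doc.ents = [
        Span(doc, start, end, label="ANIMAL")
        for match_id, start, end in matcher(doc)
    ]
    return doc

# Add the component after the built-in "ner" component
nlp.add_pipe("animal_component", after="ner")

doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
```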
Let's practice setting some extension attributes.
- Use `Token.set_extension` to register `"is_country"` (default `False`).
- Update it for `"Spain"` and print it for all tokens.
Remember that extension attributes are available via the `._` property. For example, `doc._.has_color`.
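A minimal sketch, assuming a blank English pipeline; the example text is an illustration:

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register the token extension with a default value
Token.set_extension("is_country", default=False)

doc = nlp("I live in Spain.")
# Overwrite the value for the token "Spain"
doc[3]._.is_country = True

print([(token.text, token._.is_country) for token in doc])
```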
- Use `Token.set_extension` to register `"reversed"` (getter function `get_reversed`).
- Print its value for each token.
Remember that extension attributes are available via the `._` property. For example, `doc._.has_color`.
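A sketch of one possible solution; the example text is an illustration:

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Getter: computes the attribute value dynamically
def get_reversed(token):
    return token.text[::-1]

Token.set_extension("reversed", getter=get_reversed)

doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)
```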
Let's try setting some more complex attributes using getters and method extensions.
- Complete the `get_has_number` function.
- Use `Doc.set_extension` to register `"has_number"` (getter `get_has_number`) and print its value.
- Remember that extension attributes are available via the `._` property. For example, `doc._.has_color`.
- The `get_has_number` function should return whether any of the tokens in the `doc` return `True` for `token.like_num` (whether the token resembles a number).
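A minimal sketch, assuming a blank English pipeline:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Return True if any of the tokens resembles a number
def get_has_number(doc):
    return any(token.like_num for token in doc)

Doc.set_extension("has_number", getter=get_has_number)

doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)
```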
- Use `Span.set_extension` to register `"to_html"` (method `to_html`).
- Call it on `doc[0:2]` with the tag `"strong"`.
- Method extensions can take one or more arguments. For example: `doc._.some_method("argument")`.
- The first argument passed to the method is always the `Doc`, `Token` or `Span` object the method was called on.
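A sketch of one possible solution:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Method extension: the span is passed in as the first argument
def to_html(span, tag):
    return f"<{tag}>{span.text}</{tag}>"

Span.set_extension("to_html", method=to_html)

doc = nlp("Hello world, this is a sentence.")
print(doc[0:2]._.to_html("strong"))  # <strong>Hello world</strong>
```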
In this exercise, you'll combine custom extension attributes with the statistical predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.
- Complete the `get_wikipedia_url` getter so it only returns the URL if the span's label is in the list of labels.
- Set the `Span` extension `"wikipedia_url"` using the getter `get_wikipedia_url`.
- Iterate over the entities in the `doc` and output their Wikipedia URL.
- To get the string label of a span, use the `span.label_` attribute. This is the label predicted by the entity recognizer if the span is an entity span.
- Remember that extension attributes are available via the `._` property. For example, `doc._.has_color`.
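A sketch of what a solution could look like; the exact list of labels and the search URL format are assumptions for illustration:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    # Only return a URL for people, organizations and locations
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp("David Bowie moved to the United States.")
for ent in doc.ents:
    print(ent.text, ent._.wikipedia_url)
```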
Extension attributes are especially powerful if they're combined with custom pipeline components. In this exercise, you'll write a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.
A phrase matcher with all countries is available as the variable `matcher`. A dictionary of countries mapped to their capital cities is available as the variable `CAPITALS`.
- Complete the `countries_component_function` and create a `Span` with the label `"GPE"` (geopolitical entity) for all matches.
- Add the component to the pipeline.
- Register the `Span` extension attribute `"capital"` with the getter `get_capital`.
- Process the text and print the entity text, entity label and entity capital for each entity span in `doc.ents`.
- The `Span` class takes four arguments: the `doc`, the `start` and `end` token index of the span and the `label`.
- Calling the `PhraseMatcher` on a `doc` returns a list of `(match_id, start, end)` tuples.
- To register a new extension attribute, use the `set_extension` method on the global class, e.g. `Doc`, `Token` or `Span`. To define a getter, use the `getter` keyword argument.
- Remember that extension attributes are available via the `._` property. For example, `doc._.has_color`.
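A sketch of one possible solution; the `matcher` and `CAPITALS` variables below are small stand-ins for the ones provided by the exercise:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")

# Stand-ins for the matcher and CAPITALS provided by the exercise
CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", [nlp.make_doc(country) for country in CAPITALS])

@Language.component("countries_component")
def countries_component_function(doc):
    # Create a GPE span for each match and overwrite the doc.ents
    doc.ents = [
        Span(doc, start, end, label="GPE")
        for match_id, start, end in matcher(doc)
    ]
    return doc

nlp.add_pipe("countries_component")

# Getter: look up the span text in the dictionary of capitals
Span.set_extension("capital", getter=lambda span: CAPITALS.get(span.text))

doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])
```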
In this exercise, you'll be using `nlp.pipe` for more efficient text processing. The `nlp` object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable `TEXTS`.
- Rewrite the example to use `nlp.pipe`. Instead of iterating over the texts and processing them, iterate over the `doc` objects yielded by `nlp.pipe`.
- Using `nlp.pipe` lets you merge the first two lines of code into one.
- `nlp.pipe` takes the `TEXTS` and yields `doc` objects that you can loop over.
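A sketch of the rewritten loop; the tweets below are just stand-ins for the `TEXTS` variable provided by the exercise:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Stand-in for the TEXTS variable provided by the exercise
TEXTS = [
    "McDonalds is my favorite restaurant.",
    "People really still eat McDonalds :(",
]

# Process the texts as a stream and loop over the resulting docs
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])
```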
- Rewrite the example to use `nlp.pipe`. Don't forget to call `list()` around the result to turn it into a list.
- Rewrite the example to use `nlp.pipe`. Don't forget to call `list()` around the result to turn it into a list.
In this exercise, you'll be using custom attributes to add author and book meta information to quotes.
A list of `[text, context]` examples is available as the variable `DATA`. The texts are quotes from famous books, and the contexts are dictionaries with the keys `"author"` and `"book"`.
- Use the `set_extension` method to register the custom attributes `"author"` and `"book"` on the `Doc`, which default to `None`.
- Process the `[text, context]` pairs in `DATA` using `nlp.pipe` with `as_tuples=True`.
- Overwrite the `doc._.book` and `doc._.author` with the respective info passed in as the context.
- The `Doc.set_extension` method takes two arguments: the string name of the attribute, and a keyword argument indicating the default, getter, setter or method. For example, `default=True`.
- If `as_tuples` is set to `True`, the `nlp.pipe` method takes a list of `(text, context)` tuples and yields `(doc, context)` tuples.
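A sketch of one possible solution; the single `[text, context]` pair below is a stand-in for the `DATA` variable provided by the exercise:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Stand-in for the DATA variable provided by the exercise
DATA = [
    [
        "One morning, when Gregor Samsa woke from troubled dreams, he found "
        "himself transformed in his bed into a horrible vermin.",
        {"author": "Franz Kafka", "book": "Metamorphosis"},
    ],
]

# Register the extensions on the Doc with default None
Doc.set_extension("author", default=None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Overwrite the attributes with the metadata from the context
    doc._.author = context["author"]
    doc._.book = context["book"]
    print(f"{doc.text}\n - '{doc._.book}' by {doc._.author}\n")
```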
In this exercise, you'll use the `nlp.make_doc` and `nlp.select_pipes` methods to only run selected components when processing a text.
- Rewrite the code to only tokenize the text using `nlp.make_doc`.
The `nlp.make_doc` method can be called on a text and returns a `Doc`, just like the `nlp` object.
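A minimal sketch; the example text is an illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Chick-fil-A is an American fast food restaurant chain."

# Only tokenize the text, without running any other components
doc = nlp.make_doc(text)
print([token.text for token in doc])
```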
- Disable the tagger and lemmatizer using the `nlp.select_pipes` method.
- Process the text and print all entities in the `doc`.
The `nlp.select_pipes` method accepts the keyword arguments `enable` or `disable` that take a list of component names to enable or disable. For example, `nlp.select_pipes(disable="ner")` will disable the named entity recognizer.
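When used as a context manager, `nlp.select_pipes` restores the disabled components after the `with` block. A minimal sketch; the example text is an illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The American Red Cross was founded in Washington, D.C."

# Temporarily disable the tagger and lemmatizer inside the block
with nlp.select_pipes(disable=["tagger", "lemmatizer"]):
    doc = nlp(text)
    print(doc.ents)
```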