Tutorial 1: NLP Base Types

This is part 1 of the tutorial, in which we look into some of the base types used in this library.

Creating a Sentence

Two types of objects are central to this library: Sentence and Token. A Sentence holds a textual sentence and is essentially a list of Token objects.

Let's start by making a Sentence object for an example sentence.

# The Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# Make a sentence object by passing a whitespace tokenized string
sentence = Sentence('The grass is green .')

# Print the object to see what's in there
print(sentence)

This should print:

Sentence: "The grass is green ."   [− Tokens: 5]

The print-out tells us that the sentence consists of 5 tokens. You can access the tokens of a sentence either via their token id, which starts at 1, or via their list index, which starts at 0:

# using the token id
print(sentence.get_token(4))
# using the index itself
print(sentence[3])

Both calls should print:

Token: 4 green

This print-out includes the token id (4) and the lexical value of the token ("green"). You can also iterate over all tokens in a sentence.

for token in sentence:
    print(token)

This should print:

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .
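The relationship between token ids and indices can be sketched in plain Python, without Flair. The snippet assumes, as the print-outs above suggest, that token ids count from 1 while indexing counts from 0; `words` and `token_ids` are illustrative names and not part of the library.

```python
# Plain-Python illustration (no Flair needed): token ids are 1-based,
# while list-style indexing is 0-based.
words = 'The grass is green .'.split(' ')

# Pair every word with the id it carries in the print-out above
token_ids = {index + 1: word for index, word in enumerate(words)}

print(token_ids[4])  # lookup by token id -> green
print(words[3])      # lookup by index    -> green
```

So `sentence.get_token(4)` and `sentence[3]` refer to the same token because the id is always one larger than the index.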

Tokenization

In some use cases, you might not have your text already tokenized. For this case, we added a simple tokenizer using the lightweight segtok library.

If you want to use this tokenizer, simply set the use_tokenizer flag when instantiating your Sentence with an untokenized string:

from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)

Adding Custom Tokenizers

You can also pass custom tokenizers to the initialization method. Instead of passing a boolean True value to the use_tokenizer parameter, you can pass a tokenization method, like this:

from flair.data import Sentence, segtok_tokenizer

# Make a sentence object by passing an untokenized string and a tokenizer
sentence = Sentence('The grass is green.', use_tokenizer=segtok_tokenizer)

# Print the object to see what's in there
print(sentence)

This should print:

Sentence: "The grass is green ."   [− Tokens: 5]

The second way allows you to write your own wrapper around the tokenizer you want to use. The wrapper is a function with the same signature as flair.data.segtok_tokenizer: it takes a string and returns a List[Token]. Check the code of flair.data.space_tokenizer, which is very simple, to get an idea of how to implement such a wrapper.
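As a rough illustration of the shape of such a wrapper, here is a minimal regex-based tokenizer. It is only a sketch: `SimpleToken` and `regex_tokenizer` are hypothetical names, and `SimpleToken` is a stand-in so the example runs without Flair installed. A real wrapper would build flair.data.Token objects instead.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleToken:
    """Hypothetical stand-in for flair.data.Token, used only in this sketch."""
    text: str
    start_position: int


def regex_tokenizer(text: str) -> List[SimpleToken]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is
    # neither whitespace nor a word character (e.g. '.' or ',')
    return [
        SimpleToken(match.group(), match.start())
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


tokens = regex_tokenizer('The grass is green.')
print([token.text for token in tokens])
# ['The', 'grass', 'is', 'green', '.']
```

To pass a function like this to use_tokenizer, replace the stand-in class with Flair's own Token and check its constructor arguments in your installed version, since they are not shown in this tutorial.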

Adding Labels

In Flair, any data point can be labeled. For instance, you can label a single word or a whole sentence.

Adding Labels to Tokens

A Token has fields for linguistic annotation, such as lemmas, part-of-speech tags or named entity tags. You can add a tag by specifying the tag type and the tag value. In this example, we're adding an NER tag of type 'color' to the word 'green'. This means that we've tagged this word as an entity of type color.

# add a tag to a word in the sentence
sentence[3].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_tagged_string())

This should print:

The grass is green <color> .

Each tag is a Label object that carries, next to its value, a score indicating confidence. You can print both like this:

# get the token at index 3 (token id 4) in the sentence
token = sentence[3]

# get the 'ner' tag of the token
tag = token.get_tag('ner')

# print token
print(f'"{token}" is tagged as "{tag.value}" with confidence score "{tag.score}"')

This should print:

"Token: 4 green" is tagged as "color" with confidence score "1.0"

Our color tag has a score of 1.0 since we manually added it. If a tag is predicted by our sequence labeler, the score value will indicate classifier confidence.

Adding Labels to Sentences

You can also add a Label to a whole Sentence. For instance, the example below shows how we add the label 'sports' to a sentence, thereby labeling it as belonging to the sports "topic".

sentence = Sentence('France is the current world cup winner.')

# add a label to a sentence
sentence.add_label('topic', 'sports')

print(sentence)

# Alternatively, you can create a sentence with a label in one line
sentence = Sentence('France is the current world cup winner.').add_label('topic', 'sports')

print(sentence)

This should print:

Sentence: "France is the current world cup winner."   [− Tokens: 7  − Sentence-Labels: {'topic': [sports (1.0)]}]

This indicates that the sentence belongs to the topic 'sports' with confidence 1.0.

Multiple Labels

Any data point can be labeled multiple times. A sentence, for instance, might belong to two topics. In this case, add two labels with the same label name:

sentence = Sentence('France is the current world cup winner.')

# this sentence has multiple topic labels
sentence.add_label('topic', 'sports')
sentence.add_label('topic', 'soccer')

You might want to add different layers of annotation for the same sentence. Next to topic you might also want to predict the "language" of a sentence. In this case, add a label with a different label name:

sentence = Sentence('France is the current world cup winner.')

# this sentence has multiple "topic" labels
sentence.add_label('topic', 'sports')
sentence.add_label('topic', 'soccer')

# this sentence has one "language" label
sentence.add_label('language', 'English')

print(sentence)

This should print:

Sentence: "France is the current world cup winner."   [− Tokens: 7  − Sentence-Labels: {'topic': [sports (1.0), soccer (1.0)], 'language': [English (1.0)]}]

This indicates that the sentence has two "topic" labels and one "language" label.

Accessing a Sentence's Labels

You can access these labels like this:

for label in sentence.labels:
    print(label)

Remember that each label is a Label object, so you can also access the label's value and score fields directly:

print(sentence.to_plain_string())
for label in sentence.labels:
    print(f' - classified as "{label.value}" with score {label.score}')

This should print:

France is the current world cup winner.
 - classified as "sports" with score 1.0
 - classified as "soccer" with score 1.0
 - classified as "English" with score 1.0

If you are interested only in the labels of one layer of annotation, you can access them like this:

for label in sentence.get_labels('topic'):
    print(label)

This gives you only the "topic" labels.
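One way to picture the difference between all labels and the labels of a single layer is a mapping from label name to a list of labels. The sketch below is a plain-Python model of that idea, not Flair's actual implementation; `annotation_layers` and this `add_label` helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical stand-in: maps a label name (layer) to its (value, score) pairs
annotation_layers: Dict[str, List[Tuple[str, float]]] = defaultdict(list)


def add_label(label_type: str, value: str, score: float = 1.0) -> None:
    """Append a label to the given layer; manual labels default to score 1.0."""
    annotation_layers[label_type].append((value, score))


add_label('topic', 'sports')
add_label('topic', 'soccer')
add_label('language', 'English')

# Labels of one layer only
topic_labels = annotation_layers['topic']

# Labels of every layer, flattened into one list
all_labels = [label for labels in annotation_layers.values() for label in labels]

print(topic_labels)  # [('sports', 1.0), ('soccer', 1.0)]
print(all_labels)    # [('sports', 1.0), ('soccer', 1.0), ('English', 1.0)]
```

Under this model, asking for one layer is a dictionary lookup, while asking for all labels walks every layer in insertion order.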

Next

Now, let us look at how to use pre-trained models to tag your text.