Parts of Speech:
- Words are grouped into word classes, also called parts of speech (POS).
- The basic classes are: noun, verb, pronoun, preposition, adverb, conjunction, participle, article.
- Open vs. closed classes: closed classes (e.g. articles and prepositions) rarely admit new words; open classes (e.g. nouns and verbs) constantly gain new members.
- Classes can be defined syntactically (by distribution) or semantically.
Semantic function:
- Nouns perform the semantic function of identifying objects, adjectives qualify those objects, etc.
Distributional regularities:
- Words fall into the same class if they appear in the same constructions.
- E.g. "It was very ...": new and black fit the slot, but computer does not.
- The distinction is not clear-cut: in "That black is very attractive", black (normally an adjective) appears in a noun position.
Why useful?
- Gives information about the word and its neighbours.
- E.g. it tells us the pronunciation: cóntent (noun) vs. contént (adjective).
- Determines which morphological affixes a word can take.
- Allows writing patterns over tags, e.g. Mr. PNOUN PNOUN for names (see the sketch after this list).
- Useful for recognizing times, dates, etc.
- Word disambiguation (is bridge a noun or a verb?).
- Helps predict the word class of the next word, e.g. possessive pronouns (my, your, his) are followed by nouns; personal pronouns (I, you) by a verb.
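A minimal sketch of the tag-pattern idea above. The tagged sentence and the tagset (NOUN, PNOUN, VERB, PUNCT) are made up for illustration; real taggers use richer tagsets.

```python
# Match the "Mr. PNOUN PNOUN" pattern over a sequence of (word, tag) pairs.
tagged = [("Yesterday", "NOUN"), ("Mr.", "NOUN"), ("John", "PNOUN"),
          ("Smith", "PNOUN"), ("arrived", "VERB"), (".", "PUNCT")]

def find_names(tokens):
    """Return spans where the word 'Mr.' is followed by two proper nouns."""
    names = []
    for i in range(len(tokens) - 2):
        word, _ = tokens[i]
        if word == "Mr." and tokens[i + 1][1] == "PNOUN" and tokens[i + 2][1] == "PNOUN":
            names.append(" ".join(w for w, _ in tokens[i:i + 3]))
    return names

print(find_names(tagged))   # ['Mr. John Smith']
```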
Tagsets for English:
- Several have been developed, ranging from 45 tags to 146 tags.
- Larger tagsets split nouns into singular/plural/proper, verbs by tense, etc. (examples below).
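For instance, the 45-tag set (the Penn Treebank tagset) makes splits like these:

```python
# A few Penn Treebank tags showing how coarse classes are subdivided.
PTB_EXAMPLES = {
    "NN":  "noun, singular (dog)",
    "NNS": "noun, plural (dogs)",
    "NNP": "proper noun, singular (London)",
    "VB":  "verb, base form (eat)",
    "VBD": "verb, past tense (ate)",
    "VBN": "verb, past participle (eaten)",
    "VBZ": "verb, 3rd person singular present (eats)",
}
```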
Why is it difficult?
- Words are ambiguous between several tags.
- Some PoS classes overlap.
- Labelling noun modifiers is hard (is cotton in "cotton shirt" a noun or an adjective?).
- Simple past, past participle and adjective readings overlap (he damaged it / it was damaged / the damaged car).
- About 11.5% of word types are ambiguous, but over 40% of word tokens are.
Automating the task:
- Input: a tokenized sequence of words and a tagset.
- Output: the same sequence with a tag attached to each word (see the example below).
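An illustration of the input/output format using NLTK's built-in tagger. This assumes nltk is installed and its tagger model has been downloaded (e.g. via nltk.download('averaged_perceptron_tagger')); the exact tags produced may vary.

```python
import nltk

tokens = ["The", "old", "bridge", "creaks", "."]   # tokenized input
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('old', 'JJ'), ('bridge', 'NN'), ('creaks', 'VBZ'), ('.', '.')]
```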
Algorithms:
- Rule-based vs. stochastic/probabilistic.
- Hand-crafted vs. machine learning
- Supervised learning only learns the tag classes it is given; unsupervised learning can come up with its own word classes.
Rule-based tagging:
- Two-stage architecture: first use a dictionary to assign every possible tag to each word, then apply rules to eliminate all but one tag (see the sketch after this list).
- Example: EngCG (1995, 1999).
- Its lexicon contains approximately 56,000 word entries.
- Points are assigned depending on the surrounding words, and the correct tag is deduced from them.
- Doesn't work well with very old or very new text.
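A toy sketch of the two-stage idea (not EngCG itself): a tiny hand-written dictionary plus a couple of elimination rules, all invented for illustration.

```python
# Stage 1: dictionary lookup of all possible tags.
# Stage 2: hand-written rules eliminate tags until one remains.
LEXICON = {
    "the":  {"DT"},
    "old":  {"JJ", "NN"},        # adjective, or noun as in "the old"
    "can":  {"MD", "NN", "VB"},  # modal, noun or verb
    "race": {"NN", "VB"},        # noun or verb
}

def tag(tokens):
    candidates = [set(LEXICON.get(w.lower(), {"NN"})) for w in tokens]
    for i, cands in enumerate(candidates):
        prev = candidates[i - 1] if i > 0 else set()
        if prev == {"DT"}:       # rule: no verbs or modals right after a determiner
            cands -= {"VB", "MD"}
        if prev == {"MD"}:       # rule: no noun reading right after a modal
            cands -= {"NN"}
        # fallback: if still ambiguous, keep one tag arbitrarily
        candidates[i] = {sorted(cands)[0]} if cands else {"NN"}
    return [(w, next(iter(c))) for w, c in zip(tokens, candidates)]

print(tag(["the", "old", "can", "race"]))
# [('the', 'DT'), ('old', 'JJ'), ('can', 'MD'), ('race', 'VB')]
```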
Hidden Markov Model Tagging:
- The first probabilistic / machine-learning approach to PoS tagging.
- Special case of Bayesian inference.
- What is the best sequence of tags corresponding to a sequence of words?
- t̂1:n = argmax over all tag sequences t1:n of P(t1:n | w1:n).
Bayes' rule:
- P(x | y) = P (y | x) P(x) / P(y)
- In our case, t̂1:n = argmax over t1:n of P(w1:n | t1:n) P(t1:n); the denominator P(w1:n) is dropped because it is the same for every candidate tag sequence (written out below). Estimating these probabilities directly is still too hard.
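The same derivation written out as equations:

```latex
\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
             = \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
             = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
```

The last step holds because P(w1:n) does not depend on the tag sequence.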
HMM simplifying assumptions:
- Assume the probability of a word depends only on its own tag, not on the other words: P(w1:n | t1:n) is approximated by the product of P(wi | ti) over all i.
- Assume the probability of a tag depends only on the preceding tag (bigram assumption): P(t1:n) is approximated by the product of P(ti | ti-1) over all i.
- P(ti | ti-1) is the tag transition probability, e.g. P(NN | DT), P(JJ | DT), ...
- It can be estimated by counting: count how many times ti-1 is followed by ti and divide by the number of times ti-1 occurs (see the counting sketch after this list).
- E.g. P(NN | DT) is the proportion of DT occurrences that are followed by an NN.
- The word likelihood probabilities P(wi | ti) represent the probability of a word given a particular tag.
- E.g. P(is | VBZ) = C(VBZ, is) / C(VBZ), i.e. the proportion of VBZ occurrences associated with the word "is".
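A sketch of both counts on a tiny, made-up tagged corpus:

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented for illustration).
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("is", "VBZ"), ("black", "JJ")],
    [("the", "DT"), ("black", "JJ"), ("dog", "NN"), ("barks", "VBZ")],
]

tag_count = Counter()                    # C(t)
transition_count = defaultdict(Counter)  # C(t_prev, t)
emission_count = defaultdict(Counter)    # C(t, w)

for sentence in tagged_sentences:
    prev_tag = "<s>"                     # sentence-start pseudo-tag
    tag_count[prev_tag] += 1
    for word, tag in sentence:
        transition_count[prev_tag][tag] += 1
        emission_count[tag][word] += 1
        tag_count[tag] += 1
        prev_tag = tag

def p_transition(tag, prev_tag):
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return transition_count[prev_tag][tag] / tag_count[prev_tag]

def p_emission(word, tag):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emission_count[tag][word] / tag_count[tag]

print(p_transition("NN", "DT"))  # C(DT, NN) / C(DT) = 1/2 = 0.5
print(p_emission("is", "VBZ"))   # C(VBZ, is) / C(VBZ) = 1/2 = 0.5
```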
Representation:
- The model can be represented as a finite-state automaton (FSA) with weights: states correspond to tags and the weights are the probabilities above.
The Viterbi algorithm:
- Solves the decoding task.
- Uses dynamic programming.
- It finds the path through the graph of PoS states that maximizes the likelihood of the tagged sequence; each PoS node carries the probabilities of the words emitted by that category (see the sketch below).
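A compact sketch of Viterbi decoding for the bigram HMM tagger; the transition and emission tables are made up for illustration and would normally be estimated by counting as above.

```python
import math

TAGS = ["DT", "NN", "VB"]

# P(tag | previous tag); "<s>" is the sentence-start pseudo-tag.
TRANS = {
    "<s>": {"DT": 0.8, "NN": 0.1, "VB": 0.1},
    "DT":  {"DT": 0.01, "NN": 0.9, "VB": 0.09},
    "NN":  {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB":  {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}

# P(word | tag); unseen words get a tiny smoothing probability.
EMIT = {
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"dog": 0.4, "bridge": 0.4, "bark": 0.2},
    "VB": {"barks": 0.5, "bark": 0.3, "bridge": 0.2},
}
SMOOTH = 1e-6

def viterbi(words):
    # best[i][t] = log-probability of the best tag path ending in tag t at position i
    best = [{t: math.log(TRANS["<s>"][t]) + math.log(EMIT[t].get(words[0], SMOOTH))
             for t in TAGS}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in TAGS:
            # pick the previous tag that maximizes the path probability
            prev = max(TAGS, key=lambda p: best[i - 1][p] + math.log(TRANS[p][t]))
            best[i][t] = (best[i - 1][prev] + math.log(TRANS[prev][t])
                          + math.log(EMIT[t].get(words[i], SMOOTH)))
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    last = max(TAGS, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(zip(words, reversed(path)))

print(viterbi(["the", "dog", "barks"]))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VB')]
```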