
Tokenization

  • Text arrives as a sequence of bytes.
  • Convert it into a sequence of tokens: roughly, word-like units.
  • Splitting on spaces is not enough.

Edge cases:

  • There is a long tail of edge cases.
  • Punctuation: should doesn't map to doesn, apostrophe, t?
  • Abbreviations: should Mr. Gates become Mr, dot, and Gates?
  • What if there are no spaces? E.g. hashtags.
  • There are no spaces between words in Chinese.
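A minimal sketch of why naive whitespace splitting runs into these edge cases (the example sentences are illustrative assumptions):

```python
def whitespace_tokenize(text):
    # Naive approach: split on whitespace only.
    return text.split()

# Punctuation sticks to words, and the hashtag stays one blob.
print(whitespace_tokenize("Mr. Gates doesn't like #machinelearning"))
# ['Mr.', 'Gates', "doesn't", 'like', '#machinelearning']

# Chinese has no spaces, so the whole sentence comes back as one token.
print(whitespace_tokenize("今天天气很好"))
# ['今天天气很好']
```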

Gazetteer lookup:

  • The dictionary has to be very large.
  • Splitting can be ambiguous (da, huo, and ji mean different things on their own, but dahuoji means lighter).
  • Solution...
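The ambiguity can be made concrete by enumerating all dictionary-compatible splits. This is a toy sketch: the vocabulary and the brute-force recursion are illustrative assumptions, not a real gazetteer.

```python
def segmentations(text, vocab):
    # Return every way to split text into words from the vocabulary.
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in vocab:
            for rest in segmentations(text[i:], vocab):
                results.append([prefix] + rest)
    return results

# Toy vocabulary: the syllables alone and the compound word.
vocab = {"da", "huo", "ji", "dahuo", "dahuoji"}
print(segmentations("dahuoji", vocab))
# [['da', 'huo', 'ji'], ['dahuo', 'ji'], ['dahuoji']]
```

Three valid splits for one string: the dictionary alone cannot decide which one is meant.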

Greedy methods:

  • Find the longest match first.
  • Problems: out-of-vocabulary (OOV) words:
    • Paracetamol can be poisonous.
    • New words will keep arising.
    • Misalignment: now that vs. nowt hat.
  • Regex-based rules (as used in the 1960s) can do a reasonably good job, but these problems remain.
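The greedy longest-match-first idea can be sketched as follows; the toy vocabulary and the character-by-character fallback for OOV text are assumptions for illustration. Note how it reproduces the misalignment above:

```python
def greedy_tokenize(text, vocab):
    # Longest-match-first: at each position, take the longest vocabulary word.
    tokens, i = [], 0
    max_len = max(len(w) for w in vocab)
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # OOV fallback: no vocabulary word starts here, emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"now", "nowt", "that", "hat"}
print(greedy_tokenize("nowthat", vocab))
# ['nowt', 'hat']  -- greedy grabs 'nowt' first, never recovering 'now that'
```

Because "nowt" is longer than "now", the greedy scan commits to it immediately and the rest of the string is forced into "hat": a locally optimal choice producing a globally wrong split.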