Tokenization:
- Text comes as a sequence of bytes.
- Convert the bytes to a sequence of tokens, roughly corresponding to words.
- Splitting on spaces is not enough.
Edge cases:
- There is a long tail of edge cases.
- Punctuation: should "doesn't" map to doesn, apostrophe, t?
- Abbreviations: should "Mr. Gates" become Mr, dot, and Gates?
- What if there are no spaces? E.g. hashtags.
- There are no spaces in Chinese at all.
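A minimal sketch of why naive whitespace splitting fails on the cases above (the example sentence is illustrative, not from a real corpus):

```python
# Naive tokenization: split on whitespace only.
text = "Mr. Gates doesn't like #machinelearning"
tokens = text.split()
print(tokens)
# ['Mr.', 'Gates', "doesn't", 'like', '#machinelearning']
# Problems: "Mr." keeps its period, "doesn't" stays fused,
# and "#machinelearning" contains two words with no space between them.
```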
Gazetteer lookup:
- The dictionary is large.
- Can be difficult to split (da, huo, ji mean different things on their own, but dahuoji means lighter).
- Solution...
Greedy methods:
- Find the longest match first.
- Problems: out of vocabulary words (OOV):
- E.g. "Paracetamol can be poisonous." — a drug name may be missing from the dictionary.
- New words will keep arising.
- Misalignment: "now that" vs. "nowt hat".
- Regex-based tokenizers (dating back to the 1960s) do reasonably well, but still expose these problems.
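The greedy longest-match idea above can be sketched as a small MaxMatch tokenizer. The toy vocabulary below is an assumption built from the notes' own examples; it reproduces both the dahuoji success case and the "nowt hat" misalignment:

```python
def max_match(text, vocab):
    """Greedy longest-match (MaxMatch) tokenization.
    At each position, take the longest prefix found in the vocab;
    an unknown single character falls through as its own token."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy vocabulary from the notes' examples (illustrative only).
vocab = {"da", "huo", "ji", "dahuoji", "now", "nowt", "that", "hat"}

print(max_match("dahuoji", vocab))  # ['dahuoji'] — longest match wins
print(max_match("nowthat", vocab))  # ['nowt', 'hat'] — greedy misalignment
```

Note how the same greediness that correctly prefers "dahuoji" over "da huo ji" also produces the wrong split "nowt hat", since "nowt" is a longer match than "now".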