
Tokenization

  • Text arrives as a sequence of bytes.
  • Convert it into a sequence of tokens: roughly, word-like units.
  • Splitting on spaces is not enough.

Edge cases:

  • There is a long tail of edge cases.
  • Punctuation: should doesn't map to doesn, apostrophe, t?
  • Abbreviations: should Mr. Gates become Mr, dot, and Gates?
  • What if there are no spaces? E.g. hashtags.
  • There are no spaces between words in Chinese.
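A minimal sketch of why naive whitespace splitting runs into these edge cases (the example sentences are illustrative assumptions):

```python
def whitespace_tokenize(text):
    # Naive approach: split on whitespace only.
    return text.split()

# Punctuation sticks to words, and the hashtag stays one blob.
print(whitespace_tokenize("Mr. Gates doesn't like #machinelearning"))
# ['Mr.', 'Gates', "doesn't", 'like', '#machinelearning']

# Chinese has no spaces, so the whole sentence comes back as one token.
print(whitespace_tokenize("今天天气很好"))
# ['今天天气很好']
```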

Gazetteer lookup:

  • The dictionary has to be very large.
  • Splitting can be ambiguous (da, huo, and ji mean different things on their own, but dahuoji means lighter).
  • Solution...
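The ambiguity can be made concrete by enumerating all dictionary-compatible splits. This is a toy sketch: the vocabulary and the brute-force recursion are illustrative assumptions, not a real gazetteer.

```python
def segmentations(text, vocab):
    # Return every way to split text into words from the vocabulary.
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in vocab:
            for rest in segmentations(text[i:], vocab):
                results.append([prefix] + rest)
    return results

# Toy vocabulary: the syllables alone and the compound word.
vocab = {"da", "huo", "ji", "dahuo", "dahuoji"}
print(segmentations("dahuoji", vocab))
# [['da', 'huo', 'ji'], ['dahuo', 'ji'], ['dahuoji']]
```

Three valid splits for one string: the dictionary alone cannot decide which one is meant.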

Greedy methods:

  • Find the longest match first.
  • Problems: out-of-vocabulary (OOV) words:
    • Paracetamol can be poisonous.
    • New words will keep arising.
    • Misalignment: now that vs. nowt hat.
  • Regex-based rules (as used in the 1960s) can do a reasonably good job, but these problems remain.
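The greedy longest-match-first idea can be sketched as follows; the toy vocabulary and the character-by-character fallback for OOV text are assumptions for illustration. Note how it reproduces the misalignment above:

```python
def greedy_tokenize(text, vocab):
    # Longest-match-first: at each position, take the longest vocabulary word.
    tokens, i = [], 0
    max_len = max(len(w) for w in vocab)
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # OOV fallback: no vocabulary word starts here, emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"now", "nowt", "that", "hat"}
print(greedy_tokenize("nowthat", vocab))
# ['nowt', 'hat']  -- greedy grabs 'nowt' first, never recovering 'now that'
```

Because "nowt" is longer than "now", the greedy scan commits to it immediately and the rest of the string is forced into "hat": a locally optimal choice producing a globally wrong split.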