Releases: ThuraAung1601/myTokenize

myTokenize-v0.1.1

22 Nov 14:49
5a6e96b

Features

  • Syllable Tokenization: Break text into syllables using regex rules.
  • BPE and Unigram Tokenization: Leverage SentencePiece models for tokenization.
  • Word Tokenization: Segment text into words using:
    • myWord: Dictionary-based tokenization.
    • CRF: Conditional Random Fields-based tokenization.
    • BiLSTM: Neural network-based tokenization.
  • Phrase Tokenization: Identify phrases in text using normalized pointwise mutual information (NPMI).
  • Sentence Tokenization: Use a BiLSTM model to segment text into sentences.
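The NPMI score behind phrase tokenization can be illustrated with a small sketch. This is not myTokenize's own implementation or API. The function name `npmi_scores` and the toy English tokens are assumptions made for illustration, and the probabilities are plain relative frequencies from a single token list. It uses the standard definition NPMI(x, y) = PMI(x, y) / -log p(x, y), where PMI(x, y) = log(p(x, y) / (p(x) p(y))).

```python
import math
from collections import Counter


def npmi_scores(tokens):
    """Score each adjacent token pair with NPMI (hypothetical helper).

    Probabilities are simple relative frequencies, so on very small
    inputs the scores can land slightly outside the usual [-1, 1] range.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        if p_xy == 1.0:
            # Only one distinct pair exists; -log(1) is 0, so avoid dividing by it.
            scores[(x, y)] = 1.0
            continue
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores


if __name__ == "__main__":
    toy = ["new", "york", "is", "big", "new", "york", "is", "far"]
    for pair, score in sorted(npmi_scores(toy).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

Pairs that co-occur more often than their individual frequencies predict score higher. A phrase tokenizer could then join adjacent words whose score clears a chosen threshold, though the threshold and merging strategy used by myTokenize are not stated in these notes.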