Releases: ThuraAung1601/myTokenize
myTokenize-v0.1.1
Features
- Syllable Tokenization: Break text into syllables using regex rules.
- BPE and Unigram Tokenization: Leverage SentencePiece models for tokenization.
- Word Tokenization: Segment text into words using:
  - myWord: Dictionary-based tokenization.
  - CRF: Conditional Random Fields-based tokenization.
  - BiLSTM: Neural network-based tokenization.
- Phrase Tokenization: Identify phrases in text using normalized pointwise mutual information (NPMI).
- Sentence Tokenization: Use a BiLSTM model to segment text into sentences.
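
To illustrate the NPMI score mentioned under Phrase Tokenization, here is a minimal sketch of how it can be computed for adjacent word pairs. It is not myTokenize's implementation: the function name `npmi_scores`, the toy corpus, and the choice to normalize both unigram and bigram counts by the total token count are assumptions made for this example. The score is `log(p(a,b) / (p(a) * p(b))) / -log(p(a,b))`, so pairs that always occur together score near 1 and unrelated pairs score near 0 or below.

```python
import math
from collections import Counter

def npmi_scores(sentences):
    """Return an NPMI score for every adjacent token pair.

    `sentences` is a list of token lists. Hypothetical helper for
    illustration only; not part of the myTokenize API.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    # Assumption: all probabilities are estimated against the total token count.
    total = sum(unigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / total
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        pmi = math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores

# Toy corpus: "new york" always co-occurs, while "is" pairs with several words.
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "far"],
    ["it", "is", "big"],
    ["it", "is", "far"],
]
for pair, score in sorted(npmi_scores(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A phrase tokenizer built on this idea would typically merge adjacent tokens whose score exceeds a chosen threshold; the threshold value is a tuning decision and is not specified in these release notes.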