
Tokenizers

1. Character Tokenizer

See the librispeech config

This splits the text into characters and then maps each character to an index. The index starts from 1, and 0 is reserved for the blank token. This tokenizer is only used for languages that have a small number of characters and where each character is not a combination of other characters, for example English, Vietnamese, etc.
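For illustration, here is a minimal sketch of such a character-to-index mapping in plain Python. The vocabulary below is a made-up example rather than the one from the actual config, and index 0 is reserved for the blank token as described above.

```python
# Minimal character tokenizer sketch; the vocabulary here is illustrative only.
vocab = [" ", "a", "b", "c", "d", "e"]  # hypothetical character set
char_to_index = {c: i + 1 for i, c in enumerate(vocab)}  # index 0 = blank
index_to_char = {i: c for c, i in char_to_index.items()}

def encode(text: str) -> list[int]:
    # Map each character to its index; unknown characters are dropped here.
    return [char_to_index[c] for c in text.lower() if c in char_to_index]

def decode(indices: list[int]) -> str:
    # Map indices back to characters, skipping the blank index 0.
    return "".join(index_to_char[i] for i in indices if i in index_to_char)

print(encode("cab e"))  # [4, 2, 3, 1, 6]
print(decode([4, 2, 3, 1, 6]))  # "cab e"
```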

2. Wordpiece Tokenizer

See the librispeech config for wordpiece split by whitespace

See the librispeech config for wordpiece where whitespace is a separate token

This splits the text into words and then splits each word into subwords. The subwords are then mapped to indices. The blank token can be set to index 0. This tokenizer is used for languages that have a large number of words, where each word can be a combination of other words; therefore it can be applied to any language.
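As a rough illustration (not the project's own implementation), the sketch below trains a small WordPiece tokenizer with the Hugging Face `tokenizers` library, splitting on whitespace first and reserving index 0 for a blank token by listing it first among the special tokens. The corpus and vocabulary size are placeholders.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Placeholder corpus; in practice this would be the training transcripts.
corpus = ["hello world", "hello there", "tokenizers split words into subwords"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split into words before wordpiece

# Listing "<blank>" first gives it index 0.
trainer = WordPieceTrainer(vocab_size=100, special_tokens=["<blank>", "[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello subwords")
print(encoding.tokens)  # subword pieces, e.g. ['hello', 'sub', '##words']
print(encoding.ids)     # their integer indices
```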

3. Sentencepiece Tokenizer

See the librispeech config

This splits the whole sentence into subwords and then maps each subword to an index. The blank token can be set to index 0. This tokenizer is used for languages that have a large number of words, where each word can be a combination of other words; therefore it can be applied to any language.
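Here is a minimal sketch with the `sentencepiece` library, assuming a placeholder transcript file. The pad token is used to reserve index 0 for the blank, which is one possible convention rather than necessarily the project's exact setup.

```python
import sentencepiece as spm

# "transcripts.txt" is a placeholder file with one sentence per line.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="spm_demo",
    vocab_size=1000,
    model_type="unigram",
    pad_id=0, pad_piece="<blank>",   # reserve index 0 for the blank token
    unk_id=1, bos_id=-1, eos_id=-1,  # move unk to 1, disable bos/eos
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("hello world", out_type=str))  # subword pieces
print(sp.encode("hello world", out_type=int))  # their indices
```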
