BERT Tokens Tools

A useful tool for handling multilingual tokens when you use BERT.

Installation

With pip

This repository is tested on Python 3.6+ and can be installed using pip as follows:

pip install bert-tokens

Usage

Tokenization and token-span-convert

WordPiece tokenization for BERT, universally applicable across the different language versions of BERT and a wide range of BERT checkpoints.

Token-span-convert

Convert a token span from char level to WordPiece level. This is usually needed in multilingual scenarios.

For example, for the query "播放mylove", the char-level span of "mylove" is [2, 8], while the token-level span after BERT tokenization is [2, 4].

Converting a token span from WordPiece level back to char level is also supported, as the reverse of the procedure above.
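The underlying idea can be illustrated with a short, self-contained sketch. This is not the library's actual implementation; the helper name char_span_to_token_span, the half-open span convention, and the assumption that every WordPiece covers a contiguous run of the lower-cased text are illustrative only:

# A minimal sketch, not the library's implementation: map a half-open
# char-level span onto wordpiece tokens (special tokens excluded).
def char_span_to_token_span(text, tokens, char_span):
    start_char, end_char = char_span
    lowered = text.lower()
    spans, cursor = [], 0
    for tok in tokens:
        piece = tok[2:] if tok.startswith("##") else tok
        begin = lowered.index(piece, cursor)   # locate this piece in the text
        cursor = begin + len(piece)
        spans.append((begin, cursor))          # char range covered by the piece
    token_start = next(i for i, (b, e) in enumerate(spans) if e > start_char)
    token_end = next(i for i, (b, e) in enumerate(spans) if e >= end_char) + 1
    return [token_start, token_end]

tokens = ["播", "放", "my", "##love"]   # tokenizer output without [CLS]/[SEP]
print(char_span_to_token_span("播放MYLOVE", tokens, [2, 8]))  ## [2, 4]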

Example

from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span

dict_path = "vocab/vocab.txt"  # path to the vocab file of a BERT checkpoint
tokenizer = Tokenizer(dict_path, do_lower_case=True)
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']
convert_word_span("播放MYLOVE", [2,8], tokenizer)
## [2, 4]
convert_char_span("播放MYLOVE", [2,4], tokenizer)
## [2, 8]
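
As a quick sanity check (assuming the API shown above behaves as in the printed outputs), the two conversions are inverses of each other:

span = convert_word_span("播放MYLOVE", [2, 8], tokenizer)           # char -> wordpiece
assert convert_char_span("播放MYLOVE", span, tokenizer) == [2, 8]   # wordpiece -> char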
