A set of useful tools for handling common problems when working with BERT.
This repository is tested on Python 3.6+, and can be installed using pip as follows:
pip install bert-tokens
WordPiece tokenization for BERT, applicable across different language versions of BERT. Supported BERT checkpoints include, but are not limited to:
Convert a token span from char-level to wordpiece-level, which is often needed in multilingual scenarios.
For example, for query="播放mylove", the char-level span of the substring "mylove" is [2, 8], while the wordpiece-level span after BERT tokenization is [2, 4].
Convert a token span from wordpiece-level back to char-level, i.e., the reverse of the procedure above.
from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']
convert_word_span("播放MYLOVE", [2,8], tokenizer)
## [2, 4]
convert_char_span("播放MYLOVE", [2,4], tokenizer)
## [2, 8]
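
For reference, the char-to-wordpiece mapping can be sketched as follows: walk over the wordpieces (stripping the "##" prefix), accumulate their character lengths, and record the token indices where the accumulated offsets meet the span boundaries. This is only an illustrative sketch, not the actual implementation of convert_word_span in this repository; it assumes end-exclusive spans aligned with wordpiece boundaries, no characters dropped or added by tokenization, and special tokens excluded from the indexing. The function name char_span_to_wordpiece_span is introduced here purely for illustration.

def char_span_to_wordpiece_span(text, char_span, tokenizer):
    # Illustrative sketch only; assumes the tokenizer neither drops nor adds
    # characters and that span boundaries fall on wordpiece boundaries.
    tokens = [t for t in tokenizer.tokenize(text) if t not in ("[CLS]", "[SEP]")]
    start_char, end_char = char_span          # end-exclusive char span
    token_start = token_end = None
    cursor = 0                                # number of chars consumed so far
    for i, tok in enumerate(tokens):
        piece = tok[2:] if tok.startswith("##") else tok
        if cursor == start_char:
            token_start = i                   # span starts at this wordpiece
        cursor += len(piece)
        if cursor == end_char:
            token_end = i + 1                 # end-exclusive token index
            break
    return [token_start, token_end]

char_span_to_wordpiece_span("播放MYLOVE", [2, 8], tokenizer)
## [2, 4]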