A set of useful tools for handling common problems when working with BERT.
This repository is tested on Python 3.6+, and can be installed using pip as follows:
pip install bert-tokens
WordPiece tokenization for BERT, applicable across different language versions of BERT. Supported BERT checkpoints include, but are not limited to:
Convert a token span from char-level to wordpiece-level, which is often needed in multilingual scenarios.
For example, for query="播放mylove", the char-level span of the substring "mylove" is [2, 8], while the wordpiece-level span after BERT tokenization is [2, 4].
Convert a token span from wordpiece-level back to char-level, i.e., the reverse of the procedure above.
from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']
convert_word_span("播放MYLOVE", [2,8], tokenizer)
## [2, 4]
convert_char_span("播放MYLOVE", [2,4], tokenizer)
## [2, 8]
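
For reference, the char-to-wordpiece mapping can be sketched as follows: walk over the wordpieces (stripping the "##" prefix), accumulate their character lengths, and record the token indices where the accumulated offsets meet the span boundaries. This is only an illustrative sketch, not the actual implementation of convert_word_span in this repository; it assumes end-exclusive spans aligned with wordpiece boundaries, no characters dropped or added by tokenization, and special tokens excluded from the indexing. The function name char_span_to_wordpiece_span is introduced here purely for illustration.

def char_span_to_wordpiece_span(text, char_span, tokenizer):
    # Illustrative sketch only; assumes the tokenizer neither drops nor adds
    # characters and that span boundaries fall on wordpiece boundaries.
    tokens = [t for t in tokenizer.tokenize(text) if t not in ("[CLS]", "[SEP]")]
    start_char, end_char = char_span          # end-exclusive char span
    token_start = token_end = None
    cursor = 0                                # number of chars consumed so far
    for i, tok in enumerate(tokens):
        piece = tok[2:] if tok.startswith("##") else tok
        if cursor == start_char:
            token_start = i                   # span starts at this wordpiece
        cursor += len(piece)
        if cursor == end_char:
            token_end = i + 1                 # end-exclusive token index
            break
    return [token_start, token_end]

char_span_to_wordpiece_span("播放MYLOVE", [2, 8], tokenizer)
## [2, 4]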