Skip to content

Latest commit

 

History

History
37 lines (31 loc) · 1.01 KB

README.md

File metadata and controls

37 lines (31 loc) · 1.01 KB

BytePairEncoding-for-NLP

A simple python BPE implementation

  • vocab is sorted by token lengh and frequency
  • py_wiki.vocab is a pretrained tokenizer, it was trained on 1k wikipedia articles and 4k python scripts. It has a vocab size of 10_000
  • hard_stop=False splits a sequnce over the max_len into batches to preserve the original text
  • with hard_stop=False, max_len=512, it encodes ~100 wikipedia articles per second

Train

text = """
Byte pair encoding[1][2] or digram coding[3] is a simple
form of data compression in which the most common pair of consecutive
bytes of data is replaced with a byte that does not occur within that data."""

from BPE import BPEtokenizer
bpe = BPEtokenizer()
bpe.train(text,1000,75,'test.vocab')

Load

bpe = bpe.load('test.vocab')

Encode and Decode

token_list = bpe.encode('Hello World this is the BPE tokenizer',4,False,True,True)
print(token_list)

for token in token_list:
   print(bpe.decode(token))

Print Vocab

print(bpe.tokens)