
are our tokenizers initialized correctly? #99

Open

kjappelbaum opened this issue Aug 15, 2024 · 3 comments

kjappelbaum (Contributor) commented Aug 15, 2024

perhaps not for batch inference

@kjappelbaum changed the title from "are our tokenized initialized correctly?" to "are our tokenizers initialized correctly?" on Aug 16, 2024
n0w0f (Collaborator) commented Aug 21, 2024

For the Llama runs we do not use mattext tokenizers, though.

n0w0f (Collaborator) commented Aug 21, 2024

Ah, I see now.
There was the issue of the Llama tokenizer not including a pad token, so we set `tokenizer.pad_token = tokenizer.eos_token` (ref.).

We also tried adding a dedicated pad token instead, but that resized the vocab and created its own set of problems.
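For reference, a minimal sketch of that workaround with a Hugging Face transformers tokenizer (the checkpoint name is only an example, not necessarily the one used in our runs):

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute whichever Llama variant the run actually uses.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Llama tokenizers ship without a pad token. Reusing the EOS token avoids adding
# a new token, which would grow the vocab and force an embedding resize.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```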

kjappelbaum (Contributor, Author) commented

This is not an issue for the serial interface that is in the code at the moment. For batched inference it might become important in the future.
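A small sketch of why this only bites in batch mode (reusing the `tokenizer` from the snippet above; the prompts are placeholders):

```python
# One prompt at a time: no padding needed, so a missing pad token never surfaces.
single = tokenizer("first prompt", return_tensors="pt")

# A batch must be padded to a common length; with no pad token configured,
# padding=True raises an error asking for one, hence the EOS workaround above.
batch = tokenizer(
    ["first prompt", "a second, longer prompt"],
    padding=True,
    return_tensors="pt",
)
```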
