
are our tokenizers initialized correctly? #99

Open

kjappelbaum opened this issue Aug 15, 2024 · 3 comments

kjappelbaum (Contributor) commented Aug 15, 2024

perhaps not for batch inference

@kjappelbaum changed the title from "are our tokenized initialized correctly?" to "are our tokenizers initialized correctly?" on Aug 16, 2024
n0w0f (Collaborator) commented Aug 21, 2024

For the Llama runs we do not use mattext tokenizers, though.

n0w0f (Collaborator) commented Aug 21, 2024

Ah, I see now.
There was the issue of the Llama tokenizer not including a pad token, so we set `tokenizer.pad_token = tokenizer.eos_token` (ref.).

We also tried adding a dedicated pad token instead, but that resized the vocab and created its own set of problems.
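For reference, a minimal sketch of that workaround with a Hugging Face transformers tokenizer (the checkpoint name is only an example, not necessarily the one used in our runs):

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute whichever Llama variant the run actually uses.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Llama tokenizers ship without a pad token. Reusing the EOS token avoids adding
# a new token, which would grow the vocab and force an embedding resize.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```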

kjappelbaum (Contributor, Author) commented

This is not an issue for the serial interface that is in the code at the moment. For batched inference it might become important in the future.
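A small sketch of why this only bites in batch mode (reusing the `tokenizer` from the snippet above; the prompts are placeholders):

```python
# One prompt at a time: no padding needed, so a missing pad token never surfaces.
single = tokenizer("first prompt", return_tensors="pt")

# A batch must be padded to a common length; with no pad token configured,
# padding=True raises an error asking for one, hence the EOS workaround above.
batch = tokenizer(
    ["first prompt", "a second, longer prompt"],
    padding=True,
    return_tensors="pt",
)
```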
