
Incorrect token id for <|image|> token #219

Open
vancoykendall opened this issue Nov 14, 2024 · 3 comments

vancoykendall commented Nov 14, 2024

In this repo, the Llama3 tokenizer sets the <|image|> special token to 128011:

```python
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",  # end of message
    "<|eot_id|>",  # end of turn
    "<|python_tag|>",
    "<|image|>",
]
reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(self.num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens
self.special_tokens = {
    token: num_base_tokens + i for i, token in enumerate(special_tokens)
}
```
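The id assignment can be reproduced with a small standalone sketch (assuming `num_base_tokens = 128000`, the size of Llama 3's base BPE vocabulary): `<|image|>` is at index 11 in the list, so it lands at 128011.

```python
# Reproduce the special-token id assignment from the snippet above.
# Assumption: num_base_tokens = 128000 (Llama 3's base BPE vocab size).
num_base_tokens = 128000
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
    "<|image|>",
]
# Each special token gets id = base vocab size + its position in the list.
ids = {token: num_base_tokens + i for i, token in enumerate(special_tokens)}
print(ids["<|image|>"])  # -> 128011
```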

However, in the tokenizer_config.json uploaded to the Hugging Face repo meta-llama/Llama-3.2-11B-Vision-Instruct, the <|image|> token is mapped to 128256.


I also checked the norms of the model's embedding layer for tokens 128011 and 128256: 128011 has a norm near zero, while 128256 has a regular norm. This makes me think 128256 is the correct embedding index for the <|image|> token.
[screenshot: embedding norms for tokens 128011 and 128256]
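The norm check itself is simple to illustrate. Below is a toy, pure-Python stand-in for the real torch embedding matrix (the row values are made up for illustration; an untrained row sits near its tiny random init, while a trained row has typical magnitude):

```python
import math

# Toy stand-in for the model's embedding matrix: token id -> embedding row.
# Values are illustrative only, not the real Llama weights.
embeddings = {
    128011: [1e-6, -2e-6, 0.0, 1e-6],        # untrained row: near-zero norm
    128256: [0.012, -0.034, 0.051, -0.008],  # trained row: regular norm
}

def l2_norm(vec):
    """Euclidean norm of one embedding row."""
    return math.sqrt(sum(x * x for x in vec))

norms = {tok: l2_norm(vec) for tok, vec in embeddings.items()}
print(norms[128011] < 1e-4)            # True: looks untrained
print(norms[128011] < norms[128256])   # True: 128256 is the trained row
```

With a real checkpoint the same comparison would be run over the actual embedding weight rows, which is what the screenshot above shows.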


vancoykendall commented Nov 14, 2024

Also if you download the weights from meta

llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct

There are no embeddings for 128256 and up, so it looks like it's just missing the image token embedding altogether.

I also checked the norms of the embedding tokens from the meta checkpoint here to confirm 128011 is an untrained embedding vector:


ashwinb commented Nov 14, 2024

@vancoykendall you have identified a most terrible wart in the llama3 vision model.

See https://github.com/meta-llama/llama-models/blob/main/models/llama3/api/chat_format.py#L226

Essentially, the <|image|> token does correspond to 128011 in the tokenizer ... however, the special token that actually got trained is the last token, 128256. Why that happened comes down to a very esoteric detail of our training process.
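So the fix at inference time is a remap between the tokenizer's id and the trained embedding row. A minimal sketch of that idea (a hypothetical helper, not the actual chat_format.py code linked above): swap the tokenizer's <|image|> id for the id whose embedding was actually trained before feeding token ids to the model.

```python
TOKENIZER_IMAGE_ID = 128011  # id the tokenizer assigns to <|image|>
TRAINED_IMAGE_ID = 128256    # id whose embedding row was actually trained

def remap_image_token(token_ids):
    """Replace the tokenizer's <|image|> id with the trained one."""
    return [TRAINED_IMAGE_ID if t == TOKENIZER_IMAGE_ID else t for t in token_ids]

print(remap_image_token([128000, 128011, 9906]))  # -> [128000, 128256, 9906]
```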

cc @abhimanyudubey

vancoykendall (Author) commented

@ashwinb Gotcha, but shouldn't the checkpoint downloaded from Meta contain the token embedding for the <|image|> token? The HF checkpoint has it at embedding index 128256, but from what I can tell the Meta checkpoint just doesn't have it.
