
Incorrect token id for <|image|> token #219

Open
vancoykendall opened this issue Nov 14, 2024 · 3 comments

vancoykendall commented Nov 14, 2024

In this repo, the Llama3 tokenizer sets the <|image|> special token to 128011:

```python
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",  # end of message
    "<|eot_id|>",  # end of turn
    "<|python_tag|>",
    "<|image|>",
]
reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(self.num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens
self.special_tokens = {
    token: num_base_tokens + i for i, token in enumerate(special_tokens)
}
```
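The id assignment can be reproduced with a small standalone sketch (assuming `num_base_tokens = 128000`, the size of Llama 3's base BPE vocabulary): `<|image|>` is at index 11 in the list, so it lands at 128011.

```python
# Reproduce the special-token id assignment from the snippet above.
# Assumption: num_base_tokens = 128000 (Llama 3's base BPE vocab size).
num_base_tokens = 128000
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
    "<|image|>",
]
# Each special token gets id = base vocab size + its position in the list.
ids = {token: num_base_tokens + i for i, token in enumerate(special_tokens)}
print(ids["<|image|>"])  # -> 128011
```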

However, in the tokenizer_config.json uploaded to the Hugging Face repo meta-llama/Llama-3.2-11B-Vision-Instruct, the <|image|> token is mapped to 128256.


I also checked the norms of the model's embedding layer for tokens 128011 and 128256: 128011 has a norm near zero, while 128256 has a regular norm. This makes me think 128256 is the correct embedding index for the <|image|> token.
[screenshot: embedding norms for tokens 128011 and 128256]
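The norm check itself is simple to illustrate. Below is a toy, pure-Python stand-in for the real torch embedding matrix (the row values are made up for illustration; an untrained row sits near its tiny random init, while a trained row has typical magnitude):

```python
import math

# Toy stand-in for the model's embedding matrix: token id -> embedding row.
# Values are illustrative only, not the real Llama weights.
embeddings = {
    128011: [1e-6, -2e-6, 0.0, 1e-6],        # untrained row: near-zero norm
    128256: [0.012, -0.034, 0.051, -0.008],  # trained row: regular norm
}

def l2_norm(vec):
    """Euclidean norm of one embedding row."""
    return math.sqrt(sum(x * x for x in vec))

norms = {tok: l2_norm(vec) for tok, vec in embeddings.items()}
print(norms[128011] < 1e-4)            # True: looks untrained
print(norms[128011] < norms[128256])   # True: 128256 is the trained row
```

With a real checkpoint the same comparison would be run over the actual embedding weight rows, which is what the screenshot above shows.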


vancoykendall commented Nov 14, 2024

Also if you download the weights from meta

llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct

There are no embeddings for 128256 and up, so it looks like it's just missing the image token embedding altogether.

I also checked the norms of the embedding tokens from the meta checkpoint here to confirm 128011 is an untrained embedding vector:


ashwinb commented Nov 14, 2024

@vancoykendall you have identified a most terrible wart in the llama3 vision model.

See https://github.com/meta-llama/llama-models/blob/main/models/llama3/api/chat_format.py#L226

Essentially, the <|image|> token does correspond to 128011 in the tokenizer ... however, the special token that actually got trained is the last token, 128256. Why that happened comes down to a very esoteric detail of our training process.
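So the fix at inference time is a remap between the tokenizer's id and the trained embedding row. A minimal sketch of that idea (a hypothetical helper, not the actual chat_format.py code linked above): swap the tokenizer's <|image|> id for the id whose embedding was actually trained before feeding token ids to the model.

```python
TOKENIZER_IMAGE_ID = 128011  # id the tokenizer assigns to <|image|>
TRAINED_IMAGE_ID = 128256    # id whose embedding row was actually trained

def remap_image_token(token_ids):
    """Replace the tokenizer's <|image|> id with the trained one."""
    return [TRAINED_IMAGE_ID if t == TOKENIZER_IMAGE_ID else t for t in token_ids]

print(remap_image_token([128000, 128011, 9906]))  # -> [128000, 128256, 9906]
```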

cc @abhimanyudubey

vancoykendall (Author) commented

@ashwinb Gotcha, but shouldn't the checkpoint downloaded from Meta contain the token embedding for the <|image|> token? The HF checkpoint has it at embedding index 128256, but from what I can tell the Meta checkpoint just doesn't have it.
