[Volta] [No flash attention] Llama 3.1 8B Instruct failed to start - "< not supported between instances of 'NoneType' and 'int'" #2440
Comments
This doesn't seem to be the case with a flash-attention-enabled Ada-generation GPU, so it appears to be specific to the lack of flash attention.
For anyone wondering about this: it is due to the fact that pad_token is not present in Llama's tokenizer_config.json. Something as simple as adding "pad_token": "<|eot_id|>" to the end of the JSON works. For some reason (code branching?) this doesn't bother FA-enabled GPUs / is fixed within that branch, but it does bother setups that need to disable FA.
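For illustration, here is a minimal Python sketch of that workaround using the Transformers tokenizer API instead of editing tokenizer_config.json directly; the model id and token choice are taken from this thread, and this is not the TGI code path itself:

```python
# Minimal sketch of the pad_token workaround described above (not the TGI code path).
from transformers import AutoTokenizer

# Model id taken from the issue; the repo is gated, so this assumes you have access.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

if tokenizer.pad_token is None:
    # Reuse the end-of-turn token as the pad token, mirroring the
    # `"pad_token": "<|eot_id|>"` addition to tokenizer_config.json.
    tokenizer.pad_token = "<|eot_id|>"

print(tokenizer.pad_token, tokenizer.pad_token_id)
```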
I see the same issue running
I believe these XPU containers should have attention, but it's a different attention implementation vs. CUDA, so it might still be a branch difference that makes XPU step into this. With XPU this is also a regression, because I see this working fine with
Further, I ran again on an Intel GPU, but with stock PyTorch this time. This variant definitely does not have attention (the setup differs from the docker xpu runs). There is a behavior change coming after this PR in Transformers: without the above commit, the original issue can be reproduced:
while after the commit I pointed to, the behavior changes and TGI fails earlier, on model initialization:
The failure in the second case is here: printing also the values we get, I see that
i.e., we have a list of tokens instead of a single string value.
#2702 (which has been merged) means that to avoid this issue you should use the image ghcr.io/huggingface/text-generation-inference:latest-intel-xpu. meta-llama/Llama-3.2-3B-Instruct should work in the latest TGI XPU image.
@sywangyi: thank you for pointing this out. I missed this warning. Indeed
Basically, here is a simplified script to reproduce the issue. That's what TGI is doing around
Script:
Output:
Unfortunately, my knowledge of Transformers is not enough to say what's wrong and where.
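(The script and output above were lost in this copy of the thread. Below is a hedged sketch, not the original reproducer, of the config/tokenizer mismatch the later comments point to; the model id and the expected printed values are assumptions.)

```python
# Hedged sketch (not the original script): Llama 3.x ships eos_token_id as a
# list in its model config, while the tokenizer itself has no pad token at all.
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed for illustration
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("tokenizer.pad_token_id:", tokenizer.pad_token_id)  # expected: None
print("config.eos_token_id:", config.eos_token_id)        # expected: a list of ids

# Per the discussion above, assigning the list-valued config eos_token_id as the
# pad token is what ends up breaking the tokenizer downstream. Depending on the
# Transformers version (see the PR mentioned above), this either silently
# produces a list-valued pad_token or fails during the assignment.
tokenizer.pad_token_id = config.eos_token_id
print("tokenizer.pad_token:", tokenizer.pad_token)
```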
Hm. I like this place in the attention path (which works): text-generation-inference/server/text_generation_server/models/flash_causal_lm.py, lines 1261 to 1263 (at ab7ccf5).
This was introduced by the following PR: @Narsil: do you recall details on the
I have filed an issue/question on the Transformers side:
Cool, thanks for the reproducer. I will check it out and comment under the Transformers issue.
I didn't post a follow-up, but if you disable flash attention on newer NVIDIA generations through the TGI env variable USE_FLASH_ATTENTION=False, you can reproduce it there as well.
Llama 3 has a list of values as eos_token_id: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer since it expects a single value. This commit uses tokenizer.eos_token_id instead in such a case.
Fixes: huggingface#2440
Signed-off-by: Dmitry Rogozhkin <[email protected]>
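In rough terms, the change described in that commit message amounts to something like the sketch below; the function name and structure are illustrative, not the exact TGI diff:

```python
def resolve_pad_token_id(tokenizer, config):
    """Illustrative sketch of the fix described in the commit message above.

    If the model config carries a list of eos token ids (as Llama 3.x does),
    fall back to the tokenizer's own single eos_token_id instead of handing
    the list to the tokenizer as a pad token.
    """
    if tokenizer.pad_token_id is not None:
        return tokenizer.pad_token_id
    eos = getattr(config, "eos_token_id", None)
    if isinstance(eos, (list, tuple)):
        return tokenizer.eos_token_id
    return eos
```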
@zucchini-nlp: thank you for the feedback. I've posted #2774 with the proposed fix.
@zucchini-nlp: indeed. After #2774 this case starts to work on NVIDIA as well.
System Info
Hi everyone, when trying to update from Llama 3 8B Instruct to Llama 3.1 8B Instruct, I noticed a crash:
Deployment mode: Docker compose
Container settings:
OS: Ubuntu 22.04.4 LTS
Rust version: N/A
Container version: sha256:b49037cef8d0c61ec022d4d7c5baad22357e34bce7970148a457a11f8f8d7e36
Model being used: meta-llama/Meta-Llama-3.1-8B-Instruct
GPUs: 2x Volta V100, hence flash attention is disabled
Information
Tasks
Reproduction
Expected behavior
Llama 3.1 8B Instruct should work