[Usage]: Cannot use max_model_len greater than 8192 Tokens for llama 3.1 70B #510
Comments
llama3.1 is supported starting from 1.18.0. To avoid OOM/functional issues on 1.18.0, please set the following flags; other flags depend on the context length. Example flags for a 32K context length:
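(The exact flag values from this comment are not reproduced in this excerpt. As an illustrative sketch only, the knobs being referred to are the Gaudi fork's bucketing and graph-memory environment variables; the values below are assumptions sized for a ~32K context, not the maintainer's exact recommendation.)

```bash
# Illustrative sketch only: Gaudi bucketing / graph-memory knobs for a ~32K
# context window. Values are assumptions, not the recommended settings.
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=1
export VLLM_PROMPT_BS_BUCKET_MAX=4
export VLLM_PROMPT_SEQ_BUCKET_MIN=128
export VLLM_PROMPT_SEQ_BUCKET_STEP=1024
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BS_BUCKET_STEP=16
export VLLM_DECODE_BS_BUCKET_MAX=64
export VLLM_DECODE_BLOCK_BUCKET_MIN=128
export VLLM_DECODE_BLOCK_BUCKET_STEP=128
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048
export VLLM_GRAPH_RESERVED_MEM=0.1
```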
@ppatel-eng Hi, did you try the flags described by @iboiko-habana? Did they fix the OOM issue?
Hi @michalkuligowski, we were able to resolve the OOM issue by upgrading to 1.18 and updating the bucket configuration. Thanks!
Your current environment
Environment Details
Running in a Kubernetes environment with Habana Gaudi2 accelerators:
Hardware: Habana Gaudi2 accelerators
Deployment: Kubernetes cluster
Node Resources:
vLLM Version: 1.17
Python Version: 3.10
How would you like to use vllm
I want to run Meta-Llama-3-1-70B-Instruct model with a large context window (ideally up to 132k tokens but at least 50k tokens) on Habana Gaudi2 accelerators. Currently, I can only achieve a context length of 8192 tokens before encountering OOM issues.
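For scale, here is a rough sketch of why headroom runs out: at tensor-parallel 2, the bf16 weights of a 70B model already take roughly 70 GB per HPU out of Gaudi2's 96 GB, and the KV cache grows linearly with context length. The model-shape numbers below (80 layers, 8 KV heads, head dim 128) are taken from the public Llama 3.1 70B config; the rest is back-of-the-envelope arithmetic, not a measurement.

```bash
# Rough KV-cache sizing per HPU for Llama 3.1 70B in bf16 at TP=2.
LAYERS=80; KV_HEADS=8; HEAD_DIM=128; BYTES=2; TP=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES / TP))  # key + value bytes per token per HPU
for CTX in 8192 16384 32768 131072; do
  echo "ctx=${CTX}: ~$((CTX * PER_TOKEN / 1024 / 1024)) MiB of KV cache per HPU for one full-length sequence"
done
```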
Current Configuration
```yaml
Meta-Llama-3-1-70B-Instruct:
  extraParams:
    --tensor-parallel-size 2
    --max-model-len 16384
    --gpu-memory-utilization 0.90
    --enable-chunked-prefill True
  gpuLimit: 2
  numGPU: 2
```
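For reference, assuming the chart passes `extraParams` straight through to vLLM's OpenAI-compatible entrypoint (an assumption; only the flag values come from the configuration above), the launch is roughly equivalent to:

```bash
# Sketch of the equivalent launch. The model path and entrypoint wiring are
# assumed; the flag values mirror the extraParams above.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill
```

Depending on the vLLM version, `--enable-chunked-prefill` may need to be passed with or without an explicit boolean value.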
Issue Description
Without max_model_len:
With max_model_len > 8192 but less than 132k:
Working Configuration:
max-model-len <= 8192 and 2 HPUs
Attempted Solutions
--gpu-memory-utilization
--swap-space
--block-size
Questions
Additional Context
Before submitting a new issue...