[Usage]: Cannot use max_model_len greater than 8192 Tokens for llama 3.1 70B #510

Closed

ppatel-eng opened this issue Nov 16, 2024 · 3 comments
ppatel-eng commented Nov 16, 2024

Your current environment

Environment Details

Running in a Kubernetes environment with Habana Gaudi2 accelerators:

  • Hardware: Habana Gaudi2 accelerators
  • Deployment: Kubernetes cluster
  • Node Resources:
    • CPU: 160 cores
    • Memory: 734 GB
    • Gaudi2 Accelerators: 8 per node
  • vLLM Version: 1.17
  • Python Version: 3.10

How would you like to use vllm

I want to run the Meta-Llama-3-1-70B-Instruct model with a large context window (ideally up to 132k tokens, but at least 50k) on Habana Gaudi2 accelerators. Currently, I can only achieve a context length of 8192 tokens before encountering OOM issues.

Current Configuration

```yaml
Meta-Llama-3-1-70B-Instruct:
  extraParams:
    --tensor-parallel-size 2
    --max-model-len 16384
    --gpu-memory-utilization 0.90
    --enable-chunked-prefill True
  gpuLimit: 2
  numGPU: 2
```

Issue Description

  1. Without max_model_len:

    • Single HPU: OOM error
    • Two HPUs: Warning about potential OOM during profiling, and requests still fail
  2. With max_model_len > 8192 but less than 132k:

    • Model never completes the warm-up phase, or freezes immediately afterwards
    • Tried various combinations of memory optimization parameters without success
  3. Working Configuration:

    • Only works with max-model-len <= 8192 and 2 HPUs

Attempted Solutions

  1. Increased tensor parallelism to 4 GPUs
  2. Modified memory optimization parameters based on similar issues in the vLLM and TGI repos (see the sketch after this list):
    • --gpu-memory-utilization
    • --swap-space
    • --block-size
  3. Tried various combinations of CPU offloading and swap space
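
For reference, a minimal sketch of one such attempted variant in the same extraParams format as the configuration above; the specific values here are illustrative assumptions, not the exact settings that were tested:

```yaml
# Illustrative sketch only: memory-related flags combined in the extraParams
# format used above. All values below are assumptions, not tested settings.
Meta-Llama-3-1-70B-Instruct:
  extraParams:
    --tensor-parallel-size 4
    --max-model-len 32768
    --gpu-memory-utilization 0.85
    --swap-space 16
    --block-size 128
  gpuLimit: 4
  numGPU: 4
```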

Questions

  1. Is there a recommended configuration for running 70B models with large context windows on Gaudi2?
  2. Are there specific memory optimization parameters we should adjust for large context windows on Gaudi2?

Additional Context

  • No resource quotas or limit ranges are configured in the namespace
  • Node has sufficient free resources (CPU, Memory, and Gaudi accelerators)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
iboiko-habana commented Nov 20, 2024

Llama 3.1 is supported starting from 1.18.0.

Please set the following flags in 1.18.0 to avoid OOM and functional issues:
```
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
VLLM_RPC_TIMEOUT=100000
VLLM_PROMPT_USE_FUSEDSDPA=1
PT_HPU_ENABLE_LAZY_COLLECTIVES=true
```
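
Since the deployment runs on Kubernetes, here is a minimal sketch of how these flags might be wired in as container environment variables; the container name `vllm-server` is a placeholder, not part of the original report:

```yaml
# Sketch only: base flags as env entries on the vLLM server container.
spec:
  containers:
    - name: vllm-server          # placeholder container name
      env:
        - name: VLLM_ENGINE_ITERATION_TIMEOUT_S
          value: "3600"
        - name: VLLM_RPC_TIMEOUT
          value: "100000"
        - name: VLLM_PROMPT_USE_FUSEDSDPA
          value: "1"
        - name: PT_HPU_ENABLE_LAZY_COLLECTIVES
          value: "true"
```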

Other flags depend on the context length. Example flags for a 32K context length (a combined sketch follows the list):

  1. Decrease VLLM_GRAPH_RESERVED_MEM; the value depends on the model and on the long context: VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b, VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
  2. VLLM_PROMPT_BS_BUCKET_MIN=1 # suggested starting point; depends on the model. Can be increased if there is no OOM
  3. VLLM_PROMPT_BS_BUCKET_STEP=16 # suggested starting point; depends on the model. Can be increased while there is no OOM, or decreased if OOM occurs
  4. VLLM_PROMPT_BS_BUCKET_MAX=16 # suggested starting point; depends on the model. Can be increased while there is no OOM, or decreased if OOM occurs
  5. VLLM_PROMPT_SEQ_BUCKET_MIN=24576 # suggested starting point; depends on warmup results
  6. VLLM_PROMPT_SEQ_BUCKET_STEP=2048 # suggested starting point; depends on warmup results
  7. VLLM_PROMPT_SEQ_BUCKET_MAX=32768 # for a 32K context length; use 16384 for 16K
  8. VLLM_DECODE_BLOCK_BUCKET_MIN=1024 # suggested starting point; depends on warmup results
  9. VLLM_DECODE_BLOCK_BUCKET_STEP=1024 # suggested starting point; depends on warmup results
  10. VLLM_DECODE_BLOCK_BUCKET_MAX=max_num_seqs * max_decode_seq // block_size # e.g. 128 * (32 * 1024) / 128 = 32768, or 32 * (32 * 1024) / 128 = 8192
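
Putting the 32K-context values for the 70B model together, a sketch of the corresponding env entries, continuing the container spec sketched above (the bucket values should still be tuned according to warmup results):

```yaml
# Sketch only: 32K-context flags for llama3.1-70b as additional env entries.
        - name: VLLM_GRAPH_RESERVED_MEM
          value: "0.1"              # 0.02 for llama3.1-8b
        - name: VLLM_PROMPT_BS_BUCKET_MIN
          value: "1"
        - name: VLLM_PROMPT_BS_BUCKET_STEP
          value: "16"
        - name: VLLM_PROMPT_BS_BUCKET_MAX
          value: "16"
        - name: VLLM_PROMPT_SEQ_BUCKET_MIN
          value: "24576"
        - name: VLLM_PROMPT_SEQ_BUCKET_STEP
          value: "2048"
        - name: VLLM_PROMPT_SEQ_BUCKET_MAX
          value: "32768"            # 32K context; 16384 for 16K
        - name: VLLM_DECODE_BLOCK_BUCKET_MIN
          value: "1024"
        - name: VLLM_DECODE_BLOCK_BUCKET_STEP
          value: "1024"
        - name: VLLM_DECODE_BLOCK_BUCKET_MAX
          # max_num_seqs * max_decode_seq // block_size = 128 * (32 * 1024) / 128
          value: "32768"
```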

michalkuligowski commented

@ppatel-eng Hi, did you try the flags described by @iboiko-habana? Did they fix the OOM issue?

ppatel-eng (Author) commented

Hi @michalkuligowski, we were able to resolve the OOM issue by upgrading to 1.18 and updating the bucket configuration. Thanks!
