[Usage]: Cannot use max_model_len greater than 8192 Tokens for llama 3.1 70B #510

Closed

ppatel-eng opened this issue Nov 16, 2024 · 3 comments
ppatel-eng commented Nov 16, 2024

Your current environment

Environment Details

Running in a Kubernetes environment with Habana Gaudi2 accelerators:

  • Hardware: Habana Gaudi2 accelerators
  • Deployment: Kubernetes cluster
  • Node Resources:
    • CPU: 160 cores
    • Memory: 734 GB
    • Gaudi2 Accelerators: 8 per node
  • vLLM Version: 1.17
  • Python Version: 3.10

How would you like to use vllm

I want to run the Meta-Llama-3-1-70B-Instruct model with a large context window (ideally up to 132k tokens, but at least 50k) on Habana Gaudi2 accelerators. Currently, I can only achieve a context length of 8192 tokens before encountering OOM issues.

Current Configuration

```yaml
Meta-Llama-3-1-70B-Instruct:
  extraParams:
    --tensor-parallel-size 2
    --max-model-len 16384
    --gpu-memory-utilization 0.90
    --enable-chunked-prefill True
  gpuLimit: 2
  numGPU: 2
```

Issue Description

  1. Without max_model_len:

    • Single HPU: OOM error
    • Two HPUs: Warning about potential OOM during profiling, and requests still fail
  2. With max_model_len > 8192 but less than 132k:

    • Model never completes the warm-up phase, or freezes immediately afterwards
    • Tried various combinations of memory optimization parameters without success
  3. Working Configuration:

    • Only works with max-model-len <= 8192 and 2 HPUs

Attempted Solutions

  1. Increased tensor parallelism to 4 GPUs
  2. Modified memory optimization parameters based on similar issues in the vLLM and TGI repos (see the sketch after this list):
    • --gpu-memory-utilization
    • --swap-space
    • --block-size
  3. Tried various combinations of CPU offloading and swap space
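
For reference, a minimal sketch of one such attempted variant in the same extraParams format as the configuration above; the specific values here are illustrative assumptions, not the exact settings that were tested:

```yaml
# Illustrative sketch only: memory-related flags combined in the extraParams
# format used above. All values below are assumptions, not tested settings.
Meta-Llama-3-1-70B-Instruct:
  extraParams:
    --tensor-parallel-size 4
    --max-model-len 32768
    --gpu-memory-utilization 0.85
    --swap-space 16
    --block-size 128
  gpuLimit: 4
  numGPU: 4
```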

Questions

  1. Is there a recommended configuration for running 70B models with large context windows on Gaudi2?
  2. Are there specific memory optimization parameters we should adjust for large context windows on Gaudi2?

Additional Context

  • No resource quotas or limit ranges are configured in the namespace
  • Node has sufficient free resources (CPU, Memory, and Gaudi accelerators)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
iboiko-habana commented Nov 20, 2024

Llama 3.1 is supported starting from 1.18.0.

Please set the following flags in 1.18.0 to avoid OOM and functional issues:
```
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
VLLM_RPC_TIMEOUT=100000
VLLM_PROMPT_USE_FUSEDSDPA=1
PT_HPU_ENABLE_LAZY_COLLECTIVES=true
```
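
Since the deployment runs on Kubernetes, here is a minimal sketch of how these flags might be wired in as container environment variables; the container name `vllm-server` is a placeholder, not part of the original report:

```yaml
# Sketch only: base flags as env entries on the vLLM server container.
spec:
  containers:
    - name: vllm-server          # placeholder container name
      env:
        - name: VLLM_ENGINE_ITERATION_TIMEOUT_S
          value: "3600"
        - name: VLLM_RPC_TIMEOUT
          value: "100000"
        - name: VLLM_PROMPT_USE_FUSEDSDPA
          value: "1"
        - name: PT_HPU_ENABLE_LAZY_COLLECTIVES
          value: "true"
```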

Other flags depend on the context length. Example flags for a 32K context length (a combined sketch follows the list):

  1. Decrease VLLM_GRAPH_RESERVED_MEM; the value depends on the model and on the long context: VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b, VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
  2. VLLM_PROMPT_BS_BUCKET_MIN=1 # suggested starting point; depends on the model. Can be increased if there is no OOM
  3. VLLM_PROMPT_BS_BUCKET_STEP=16 # suggested starting point; depends on the model. Can be increased while there is no OOM, or decreased if OOM occurs
  4. VLLM_PROMPT_BS_BUCKET_MAX=16 # suggested starting point; depends on the model. Can be increased while there is no OOM, or decreased if OOM occurs
  5. VLLM_PROMPT_SEQ_BUCKET_MIN=24576 # suggested starting point; depends on warmup results
  6. VLLM_PROMPT_SEQ_BUCKET_STEP=2048 # suggested starting point; depends on warmup results
  7. VLLM_PROMPT_SEQ_BUCKET_MAX=32768 # for a 32K context length; use 16384 for 16K
  8. VLLM_DECODE_BLOCK_BUCKET_MIN=1024 # suggested starting point; depends on warmup results
  9. VLLM_DECODE_BLOCK_BUCKET_STEP=1024 # suggested starting point; depends on warmup results
  10. VLLM_DECODE_BLOCK_BUCKET_MAX=max_num_seqs * max_decode_seq // block_size # e.g. 128 * (32 * 1024) / 128 = 32768, or 32 * (32 * 1024) / 128 = 8192
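
Putting the 32K-context values for the 70B model together, a sketch of the corresponding env entries, continuing the container spec sketched above (the bucket values should still be tuned according to warmup results):

```yaml
# Sketch only: 32K-context flags for llama3.1-70b as additional env entries.
        - name: VLLM_GRAPH_RESERVED_MEM
          value: "0.1"              # 0.02 for llama3.1-8b
        - name: VLLM_PROMPT_BS_BUCKET_MIN
          value: "1"
        - name: VLLM_PROMPT_BS_BUCKET_STEP
          value: "16"
        - name: VLLM_PROMPT_BS_BUCKET_MAX
          value: "16"
        - name: VLLM_PROMPT_SEQ_BUCKET_MIN
          value: "24576"
        - name: VLLM_PROMPT_SEQ_BUCKET_STEP
          value: "2048"
        - name: VLLM_PROMPT_SEQ_BUCKET_MAX
          value: "32768"            # 32K context; 16384 for 16K
        - name: VLLM_DECODE_BLOCK_BUCKET_MIN
          value: "1024"
        - name: VLLM_DECODE_BLOCK_BUCKET_STEP
          value: "1024"
        - name: VLLM_DECODE_BLOCK_BUCKET_MAX
          # max_num_seqs * max_decode_seq // block_size = 128 * (32 * 1024) / 128
          value: "32768"
```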

michalkuligowski commented

@ppatel-eng Hi, did you try the flags described by @iboiko-habana? Did they fix the OOM issue?

ppatel-eng (Author) commented

Hi @michalkuligowski, we were able to resolve the OOM issue by upgrading to 1.18 and updating the bucket configuration. Thanks!
