Configure the number of hidden layers to run in one lazy graph (#451)
Some models are hardcoded to run each hidden layer as a separate computation graph in lazy mode when TP = 1. For use cases that are limited by TPOT, we cannot run a higher batch size, so we instead want to run more hidden layers per graph for more efficient computation.
Use `VLLM_CONFIG_HIDDEN_LAYERS` to configure how many hidden layers to run per graph. Defaults to 1.
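
As a usage sketch (the model name below is illustrative, not taken from this PR): the variable is read with os.getenv in each model's __init__, so it only needs to be set before the model is constructed.

import os

# Group hidden layers four at a time per lazy-graph segment (default: 1).
os.environ["VLLM_CONFIG_HIDDEN_LAYERS"] = "4"

from vllm import LLM  # imported after setting the variable

# Illustrative model; this PR wires the setting into llama and gpt_bigcode.
llm = LLM(model="meta-llama/Llama-2-7b-hf")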
libinta authored Nov 14, 2024
1 parent c27899a commit 0548200
Showing 4 changed files with 14 additions and 3 deletions.
3 changes: 2 additions & 1 deletion README_GAUDI.md
@@ -277,7 +277,8 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
 - block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
 - block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
 - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
-- ``VLLM_HANDLE_TOPK_DUPLICATES``: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- `VLLM_HANDLE_TOPK_DUPLICATES`: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- `VLLM_CONFIG_HIDDEN_LAYERS`: configure how many hidden layers to run in an HPUGraph when splitting the model among hidden layers with TP = 1. The default is 1. For some models, this improves throughput when inter-token latency is the limiting factor.
 
 Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
 
1 change: 1 addition & 0 deletions docs/source/getting_started/gaudi-installation.rst
@@ -379,6 +379,7 @@ Environment variables
 - sequence length step (``VLLM_DECODE_BLOCK_BUCKET_STEP``): ``block_size``
 - sequence length max (``VLLM_DECODE_BLOCK_BUCKET_MAX``): ``max(128, (max_num_seqs*max_model_len)/block_size)``
 - ``VLLM_HANDLE_TOPK_DUPLICATES``: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- ``VLLM_CONFIG_HIDDEN_LAYERS``: configure how many hidden layers to run in an HPUGraph when splitting the model among hidden layers with TP = 1. The default is 1. For some models, this improves throughput when inter-token latency is the limiting factor.
 
 Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
 
6 changes: 5 additions & 1 deletion vllm/model_executor/models/gpt_bigcode.py
@@ -222,6 +222,10 @@ def __init__(
         self.make_empty_intermediate_tensors = (
             make_empty_intermediate_tensors_factory(["hidden_states"],
                                                     config.n_embd))
+        if is_hpu:
+            import os
+            self.config_hidden_layers = int(
+                os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))
 
     def forward(
         self,
@@ -246,7 +250,7 @@ def forward(
             hidden_states = layer(hidden_states,
                                   kv_caches[i - self.start_layer],
                                   attn_metadata)
-            if is_hpu:
+            if is_hpu and i % self.config_hidden_layers == 0:
                 htorch.core.mark_step()
         if not get_pp_group().is_last_rank:
             return IntermediateTensors({"hidden_states": hidden_states})
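The llama.py change below follows the same pattern. As a minimal, self-contained sketch of what that pattern does (the toy layers are stand-ins, not vLLM code; running it requires a Gaudi host with the Habana PyTorch bridge installed):

import os

import torch
import habana_frameworks.torch as htorch  # HPU-only dependency

# Hypothetical stand-ins for a model's decoder blocks.
layers = [torch.nn.Linear(16, 16) for _ in range(8)]
hidden_states = torch.randn(1, 16)

# Same parsing and default as the patch.
config_hidden_layers = int(os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))

for i, layer in enumerate(layers):
    hidden_states = layer(hidden_states)
    # Patched condition: with the default of 1, every layer still ends in a
    # graph break; with N > 1, a break occurs after layer 0 and then after
    # every N-th layer, so most segments accumulate N layers into one graph.
    if i % config_hidden_layers == 0:
        htorch.core.mark_step()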
7 changes: 6 additions & 1 deletion vllm/model_executor/models/llama.py
@@ -316,6 +316,11 @@ def __init__(
             make_empty_intermediate_tensors_factory(
                 ["hidden_states", "residual"], config.hidden_size))
 
+        if is_hpu:
+            import os
+            self.config_hidden_layers = int(
+                os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))
+
     def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
         return self.embed_tokens(input_ids)
 
@@ -347,7 +352,7 @@ def forward(
             hidden_states, residual = layer(positions, hidden_states,
                                             kv_caches[i - self.start_layer],
                                             attn_metadata, residual)
-            if is_hpu:
+            if is_hpu and i % self.config_hidden_layers == 0:
                 htorch.core.mark_step()
         if not get_pp_group().is_last_rank:
             return IntermediateTensors({
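A quick sanity check of the new condition (plain arithmetic, no HPU required), assuming a hypothetical 32-layer model with VLLM_CONFIG_HIDDEN_LAYERS=4:

# i % 4 == 0 holds at i = 0, 4, ..., 28: 8 mark_step() calls per forward
# pass instead of 32, i.e. fewer but larger lazy graphs.
breaks = [i for i in range(32) if i % 4 == 0]
print(len(breaks), breaks)  # 8 [0, 4, 8, 12, 16, 20, 24, 28]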
