Configure the number of hidden layers to run in one lazy graph (#451)
Some models are hardcoded to run each hidden layer as a separate computation graph in lazy mode when TP = 1. For use cases that are limited by TPOT, we cannot run a higher batch size, so we instead want to run more hidden layers per graph for more efficient computation.
Use `VLLM_CONFIG_HIDDEN_LAYERS` to configure how many hidden layers to run per graph. Defaults to 1.
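
As a usage sketch (the model name below is illustrative, not taken from this PR): the variable is read with os.getenv in each model's __init__, so it only needs to be set before the model is constructed.

import os

# Group hidden layers four at a time per lazy-graph segment (default: 1).
os.environ["VLLM_CONFIG_HIDDEN_LAYERS"] = "4"

from vllm import LLM  # imported after setting the variable

# Illustrative model; this PR wires the setting into llama and gpt_bigcode.
llm = LLM(model="meta-llama/Llama-2-7b-hf")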
libinta authored Nov 14, 2024
1 parent c27899a commit 0548200
Showing 4 changed files with 14 additions and 3 deletions.
3 changes: 2 additions & 1 deletion README_GAUDI.md
@@ -277,7 +277,8 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
 - block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
 - block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
 - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
-- ``VLLM_HANDLE_TOPK_DUPLICATES``: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- `VLLM_HANDLE_TOPK_DUPLICATES`: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- `VLLM_CONFIG_HIDDEN_LAYERS`: configure how many hidden layers to run in an HPUGraph when splitting the model among hidden layers with TP = 1. The default is 1. For some models, this improves throughput when inter-token latency is the limiting factor.
 
 Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
 
1 change: 1 addition & 0 deletions docs/source/getting_started/gaudi-installation.rst
@@ -379,6 +379,7 @@ Environment variables
 - sequence length step (``VLLM_DECODE_BLOCK_BUCKET_STEP``): ``block_size``
 - sequence length max (``VLLM_DECODE_BLOCK_BUCKET_MAX``): ``max(128, (max_num_seqs*max_model_len)/block_size)``
 - ``VLLM_HANDLE_TOPK_DUPLICATES``: if ``true``, will handle duplicates that are outside of top-k, ``false`` by default
+- ``VLLM_CONFIG_HIDDEN_LAYERS``: configure how many hidden layers to run in an HPUGraph when splitting the model among hidden layers with TP = 1. The default is 1. For some models, this improves throughput when inter-token latency is the limiting factor.
 
 Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
 
6 changes: 5 additions & 1 deletion vllm/model_executor/models/gpt_bigcode.py
@@ -222,6 +222,10 @@ def __init__(
         self.make_empty_intermediate_tensors = (
             make_empty_intermediate_tensors_factory(["hidden_states"],
                                                     config.n_embd))
+        if is_hpu:
+            import os
+            self.config_hidden_layers = int(
+                os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))
 
     def forward(
         self,
@@ -246,7 +250,7 @@ def forward(
             hidden_states = layer(hidden_states,
                                   kv_caches[i - self.start_layer],
                                   attn_metadata)
-            if is_hpu:
+            if is_hpu and i % self.config_hidden_layers == 0:
                 htorch.core.mark_step()
         if not get_pp_group().is_last_rank:
             return IntermediateTensors({"hidden_states": hidden_states})
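The llama.py change below follows the same pattern. As a minimal, self-contained sketch of what that pattern does (the toy layers are stand-ins, not vLLM code; running it requires a Gaudi host with the Habana PyTorch bridge installed):

import os

import torch
import habana_frameworks.torch as htorch  # HPU-only dependency

# Hypothetical stand-ins for a model's decoder blocks.
layers = [torch.nn.Linear(16, 16) for _ in range(8)]
hidden_states = torch.randn(1, 16)

# Same parsing and default as the patch.
config_hidden_layers = int(os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))

for i, layer in enumerate(layers):
    hidden_states = layer(hidden_states)
    # Patched condition: with the default of 1, every layer still ends in a
    # graph break; with N > 1, a break occurs after layer 0 and then after
    # every N-th layer, so most segments accumulate N layers into one graph.
    if i % config_hidden_layers == 0:
        htorch.core.mark_step()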
7 changes: 6 additions & 1 deletion vllm/model_executor/models/llama.py
@@ -316,6 +316,11 @@ def __init__(
             make_empty_intermediate_tensors_factory(
                 ["hidden_states", "residual"], config.hidden_size))
 
+        if is_hpu:
+            import os
+            self.config_hidden_layers = int(
+                os.getenv('VLLM_CONFIG_HIDDEN_LAYERS', '1'))
+
     def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
         return self.embed_tokens(input_ids)
 
@@ -347,7 +352,7 @@ def forward(
             hidden_states, residual = layer(positions, hidden_states,
                                             kv_caches[i - self.start_layer],
                                             attn_metadata, residual)
-            if is_hpu:
+            if is_hpu and i % self.config_hidden_layers == 0:
                 htorch.core.mark_step()
         if not get_pp_group().is_last_rank:
             return IntermediateTensors({
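A quick sanity check of the new condition (plain arithmetic, no HPU required), assuming a hypothetical 32-layer model with VLLM_CONFIG_HIDDEN_LAYERS=4:

# i % 4 == 0 holds at i = 0, 4, ..., 28: 8 mark_step() calls per forward
# pass instead of 32, i.e. fewer but larger lazy graphs.
breaks = [i for i in range(32) if i % 4 == 0]
print(len(breaks), breaks)  # 8 [0, 4, 8, 12, 16, 20, 24, 28]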
