[Usage]: tensor-parallel-size=2 second token latency is higher than tensor_parallel_size=1 #204

Open
Zjq9409 opened this issue Aug 26, 2024 · 1 comment
Zjq9409 commented Aug 26, 2024

Your current environment

vllm                              0.5.3.post1+gaudi117

tensor_parallel_size=1 script

export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_RESERVED_MEM=0.1
export VLLM_GRAPH_PROMPT_RATIO=0.8
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BLOCK_BUCKET_STEP=64
export VLLM_DECODE_BLOCK_BUCKET_MIN=64
export VLLM_PROMPT_SEQ_BUCKET_MAX=1024
python -m vllm.entrypoints.openai.api_server \
  --model Qwen2-7B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --block-size 128 \
  --dtype bfloat16 \
  --max-num-seqs 128 \
  --max-model-len 2048 \
  --num-lookahead-slots 1 \
  --use-v2-block-manager \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8111

tensor_parallel_size=2 script

export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_RESERVED_MEM=0.1
export VLLM_GRAPH_PROMPT_RATIO=0.8
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BLOCK_BUCKET_STEP=64
export VLLM_DECODE_BLOCK_BUCKET_MIN=64
export VLLM_PROMPT_SEQ_BUCKET_MAX=1024
python -m vllm.entrypoints.openai.api_server \
  --model /home/jane/huggingface.model.references/Qwen2-7B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --block-size 128 \
  --dtype bfloat16 \
  --max-num-seqs 128 \
  --max-model-len 2048 \
  --num-lookahead-slots 1 \
  --use-v2-block-manager \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8111
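
A quick sanity check (not part of the original setup) before benchmarking: the sketch below posts a single request to the OpenAI-compatible /v1/completions endpoint on port 8111 and times it end to end. The model field must match the value passed to --model at launch.

import time
import requests

payload = {
    "model": "/home/jane/huggingface.model.references/Qwen2-7B-Instruct/",  # must match --model
    "prompt": "Write a short note about latency.",
    "max_tokens": 64,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post("http://localhost:8111/v1/completions", json=payload, timeout=300)
resp.raise_for_status()
print(f"status={resp.status_code}, wall time={time.time() - start:.2f}s")
print(resp.json()["choices"][0]["text"])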

test_server script

for bsize in 1 ; do
  echo "benchmark serving bs${bsize}"
  python benchmark_serving.py \
    --backend vllm \
    --model  Qwen2-7B-Instruct/ \
    --trust-remote-code \
    --dataset-name sonnet \
    --dataset-path sonnet.txt \
    --sonnet-input-len 1024 \
    --sonnet-output-len 512 \
    --num-prompts ${bsize} \
    --request-rate inf \
    --port 8111
done

tensor_parallel_size=1 summary:

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  6.36      
Total input tokens:                      1014      
Total generated tokens:                  512       
Request throughput (req/s):              0.16      
Input token throughput (tok/s):          159.33    
Output token throughput (tok/s):         80.45     
---------------Time to First Token----------------
Mean TTFT (ms):                          121.46    
Median TTFT (ms):                        121.46    
P99 TTFT (ms):                           121.46    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.22     
Median TPOT (ms):                        12.22     
P99 TPOT (ms):                           12.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.43     
Median ITL (ms):                         12.17     
P99 ITL (ms):                            13.33     
==================================================

tensor_parallel_size=2 summary:

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  48.13     
Total input tokens:                      1014      
Total generated tokens:                  512       
Request throughput (req/s):              0.02      
Input token throughput (tok/s):          21.07     
Output token throughput (tok/s):         10.64     
---------------Time to First Token----------------
Mean TTFT (ms):                          84.02     
Median TTFT (ms):                        84.02     
P99 TTFT (ms):                           84.02     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.03     
Median TPOT (ms):                        94.03     
P99 TPOT (ms):                           94.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.01     
Median ITL (ms):                         94.92     
P99 ITL (ms):                            114.84    
==================================================

Why is the second-token latency (TPOT) with --tensor-parallel-size 2 so much higher than with --tensor-parallel-size 1?
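
As a back-of-the-envelope check on what TPOT measures here (assuming benchmark_serving.py defines per-request TPOT as (latency - TTFT) / (generated tokens - 1), and that with a single request the benchmark duration is roughly the request latency):

def tpot_ms(duration_s: float, ttft_ms: float, output_tokens: int) -> float:
    # TPOT = (end-to-end latency - time to first token) / (tokens after the first)
    return (duration_s * 1000.0 - ttft_ms) / (output_tokens - 1)

# Numbers from the two single-request summaries above.
print(f"tp=1: {tpot_ms(6.36, 121.46, 512):.2f} ms/token")   # ~12.2 ms, matches the report
print(f"tp=2: {tpot_ms(48.13, 84.02, 512):.2f} ms/token")   # ~94.0 ms, matches the report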

Results with --num-prompts 28:

============ Serving Benchmark Result ============
Successful requests:                     28        
Benchmark duration (s):                  53.54     
Total input tokens:                      28482     
Total generated tokens:                  14166     
Request throughput (req/s):              0.52      
Input token throughput (tok/s):          532.01    
Output token throughput (tok/s):         264.60    
---------------Time to First Token----------------
Mean TTFT (ms):                          1389.00   
Median TTFT (ms):                        1415.89   
P99 TTFT (ms):                           2354.09   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          101.90    
Median TPOT (ms):                        101.85    
P99 TPOT (ms):                           104.01    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.47    
Median ITL (ms):                         96.14     
P99 ITL (ms):                            125.47    
==================================================

I'm also not sure why, in the 28-prompt run above, the P99 TTFT is so much higher than the median TTFT.
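
One possible contributor to the TTFT spread (a side note, not a full explanation): with only 28 requests, a numpy-style 99th percentile is interpolated between the two slowest samples, so it is essentially the worst single request, and one or two requests queuing behind the others' prefills will push P99 far above the median. A minimal sketch with made-up TTFT values, assuming percentiles are computed with numpy.percentile:

import numpy as np

# Hypothetical TTFTs (ms) for 28 simultaneous requests: most prefills are scheduled
# early, the last couple wait behind the rest.
ttfts = np.array([1400.0] * 26 + [1800.0, 2360.0])

print(f"median TTFT = {np.median(ttfts):.1f} ms")
print(f"p99 TTFT    = {np.percentile(ttfts, 99):.1f} ms")  # close to the slowest request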

@kzawora-intel kzawora-intel added the intel Issues or PRs submitted by Intel label Aug 29, 2024
github-actions bot commented

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 28, 2024