[Usage]: tensor-parallel-size=2 second token latency is higher than tensor_parallel_size=1 #204

Open
Zjq9409 opened this issue Aug 26, 2024 · 1 comment
Zjq9409 commented Aug 26, 2024

Your current environment

vllm                              0.5.3.post1+gaudi117

tensor_parallel_size=1 script

export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_RESERVED_MEM=0.1
export VLLM_GRAPH_PROMPT_RATIO=0.8
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BLOCK_BUCKET_STEP=64
export VLLM_DECODE_BLOCK_BUCKET_MIN=64
export VLLM_PROMPT_SEQ_BUCKET_MAX=1024
python -m vllm.entrypoints.openai.api_server \
  --model Qwen2-7B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --block-size 128 \
  --dtype bfloat16 \
  --max-num-seqs 128 \
  --max-model-len 2048 \
  --num-lookahead-slots 1 \
  --use-v2-block-manager \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8111

tensor_parallel_size=2 script

export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_RESERVED_MEM=0.1
export VLLM_GRAPH_PROMPT_RATIO=0.8
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BLOCK_BUCKET_STEP=64
export VLLM_DECODE_BLOCK_BUCKET_MIN=64
export VLLM_PROMPT_SEQ_BUCKET_MAX=1024
python -m vllm.entrypoints.openai.api_server \
  --model /home/jane/huggingface.model.references/Qwen2-7B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --block-size 128 \
  --dtype bfloat16 \
  --max-num-seqs 128 \
  --max-model-len 2048 \
  --num-lookahead-slots 1 \
  --use-v2-block-manager \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8111
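
A quick sanity check (not part of the original setup) before benchmarking: the sketch below posts a single request to the OpenAI-compatible /v1/completions endpoint on port 8111 and times it end to end. The model field must match the value passed to --model at launch.

import time
import requests

payload = {
    "model": "/home/jane/huggingface.model.references/Qwen2-7B-Instruct/",  # must match --model
    "prompt": "Write a short note about latency.",
    "max_tokens": 64,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post("http://localhost:8111/v1/completions", json=payload, timeout=300)
resp.raise_for_status()
print(f"status={resp.status_code}, wall time={time.time() - start:.2f}s")
print(resp.json()["choices"][0]["text"])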

test_server script

for bsize in 1 ; do
  echo "benchmark serving bs${bsize}"
  python benchmark_serving.py \
    --backend vllm \
    --model  Qwen2-7B-Instruct/ \
    --trust-remote-code \
    --dataset-name sonnet \
    --dataset-path sonnet.txt \
    --sonnet-input-len 1024 \
    --sonnet-output-len 512 \
    --num-prompts ${bsize} \
    --request-rate inf \
    --port 8111
done

tensor_parallel_size=1 summary:

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  6.36      
Total input tokens:                      1014      
Total generated tokens:                  512       
Request throughput (req/s):              0.16      
Input token throughput (tok/s):          159.33    
Output token throughput (tok/s):         80.45     
---------------Time to First Token----------------
Mean TTFT (ms):                          121.46    
Median TTFT (ms):                        121.46    
P99 TTFT (ms):                           121.46    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.22     
Median TPOT (ms):                        12.22     
P99 TPOT (ms):                           12.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.43     
Median ITL (ms):                         12.17     
P99 ITL (ms):                            13.33     
==================================================

tensor_parallel_size=2 summary:

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  48.13     
Total input tokens:                      1014      
Total generated tokens:                  512       
Request throughput (req/s):              0.02      
Input token throughput (tok/s):          21.07     
Output token throughput (tok/s):         10.64     
---------------Time to First Token----------------
Mean TTFT (ms):                          84.02     
Median TTFT (ms):                        84.02     
P99 TTFT (ms):                           84.02     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.03     
Median TPOT (ms):                        94.03     
P99 TPOT (ms):                           94.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.01     
Median ITL (ms):                         94.92     
P99 ITL (ms):                            114.84    
==================================================

Why is the second-token latency (TPOT) with --tensor-parallel-size 2 so much higher than with --tensor-parallel-size 1?
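
As a back-of-the-envelope check on what TPOT measures here (assuming benchmark_serving.py defines per-request TPOT as (latency - TTFT) / (generated tokens - 1), and that with a single request the benchmark duration is roughly the request latency):

def tpot_ms(duration_s: float, ttft_ms: float, output_tokens: int) -> float:
    # TPOT = (end-to-end latency - time to first token) / (tokens after the first)
    return (duration_s * 1000.0 - ttft_ms) / (output_tokens - 1)

# Numbers from the two single-request summaries above.
print(f"tp=1: {tpot_ms(6.36, 121.46, 512):.2f} ms/token")   # ~12.2 ms, matches the report
print(f"tp=2: {tpot_ms(48.13, 84.02, 512):.2f} ms/token")   # ~94.0 ms, matches the report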

Results with --num-prompts 28:

============ Serving Benchmark Result ============
Successful requests:                     28        
Benchmark duration (s):                  53.54     
Total input tokens:                      28482     
Total generated tokens:                  14166     
Request throughput (req/s):              0.52      
Input token throughput (tok/s):          532.01    
Output token throughput (tok/s):         264.60    
---------------Time to First Token----------------
Mean TTFT (ms):                          1389.00   
Median TTFT (ms):                        1415.89   
P99 TTFT (ms):                           2354.09   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          101.90    
Median TPOT (ms):                        101.85    
P99 TPOT (ms):                           104.01    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.47    
Median ITL (ms):                         96.14     
P99 ITL (ms):                            125.47    
==================================================

I'm also not sure why, in the 28-prompt run above, the P99 TTFT is so much higher than the median TTFT.
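
One possible contributor to the TTFT spread (a side note, not a full explanation): with only 28 requests, a numpy-style 99th percentile is interpolated between the two slowest samples, so it is essentially the worst single request, and one or two requests queuing behind the others' prefills will push P99 far above the median. A minimal sketch with made-up TTFT values, assuming percentiles are computed with numpy.percentile:

import numpy as np

# Hypothetical TTFTs (ms) for 28 simultaneous requests: most prefills are scheduled
# early, the last couple wait behind the rest.
ttfts = np.array([1400.0] * 26 + [1800.0, 2360.0])

print(f"median TTFT = {np.median(ttfts):.1f} ms")
print(f"p99 TTFT    = {np.percentile(ttfts, 99):.1f} ms")  # close to the slowest request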

@kzawora-intel kzawora-intel added the intel Issues or PRs submitted by Intel label Aug 29, 2024
github-actions bot commented

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 28, 2024