split ReplicatedLinear used in MLA prefill computing along hidden_states[0] to save duplicated computing on all devices #3688

ZJLi2013 · 2025-02-19T07:05:00Z

Motivation

In MLA, there are a few ReplicatedLinear ops, e.g. q_a_proj and kv_a_proj_with_mqa, which means the same hidden_states tensor is computed on every device. This duplication can be reduced by splitting hidden_states by tp_size along the batch_size * seqlen (a.k.a. total_num_tokens) dim, which saves the duplicated GEMM computation. Currently this is only useful for prefill.
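As a rough back-of-the-envelope sketch of the duplicated work: with a replicated projection, every rank runs the GEMM over all total_num_tokens rows, while a token-split version runs it over roughly total_num_tokens / tp_size rows per rank. The dimensions below are assumptions for illustration (DeepSeek-V3-like sizes, token count taken as 32 x 1024), the helper names are hypothetical, and the cost of the final all_gather is ignored.

```python
import math


def gemm_flops(num_tokens: int, in_features: int, out_features: int) -> int:
    """FLOPs of a [num_tokens, in] x [in, out] matmul (multiply + add per MAC)."""
    return 2 * num_tokens * in_features * out_features


def per_rank_flops(total_num_tokens: int, in_features: int, out_features: int,
                   tp_size: int, split_tokens: bool) -> int:
    """Per-rank GEMM FLOPs for one projection, replicated vs. token-split."""
    if split_tokens:
        # Each rank handles ceil(total / tp_size) rows (the last shard may be padded).
        rows = math.ceil(total_num_tokens / tp_size)
    else:
        # Replicated: every rank processes every token.
        rows = total_num_tokens
    return gemm_flops(rows, in_features, out_features)


# Illustrative values (assumptions, not taken from the PR's code).
HIDDEN = 7168      # assumed hidden_size
Q_A_OUT = 1536     # assumed q_a_proj output width
KV_A_OUT = 576     # assumed kv_a_proj_with_mqa output width
TOKENS = 32 * 1024
TP = 8

replicated = sum(per_rank_flops(TOKENS, HIDDEN, o, TP, False) for o in (Q_A_OUT, KV_A_OUT))
split = sum(per_rank_flops(TOKENS, HIDDEN, o, TP, True) for o in (Q_A_OUT, KV_A_OUT))

print(f"per-rank GEMM FLOPs, replicated:  {replicated:.3e}")
print(f"per-rank GEMM FLOPs, token-split: {split:.3e}")
print(f"reduction factor: {replicated / split:.1f}x")
```

When the token count divides evenly by tp_size, the per-rank GEMM work for these projections drops by a factor of tp_size. These projections are only part of a layer, so the end-to-end gain would be expected to be much smaller.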

Modifications

Replace ReplicatedLinear with dp_linear in deepseek-v2.py. dp_linear splits the input hidden_states along the total_num_tokens dim and does an all_gather as the last step.
Add test_dp_linear.py for prefill/decoding benchmarks.
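The split-then-gather idea can be sketched in a single process, with plain Python lists standing in for tensors and a loop standing in for the ranks. This is a minimal simulation, not the PR's dp_linear: the names `matmul` and `dp_linear_sim` are hypothetical, and zero-padding the token dim to a multiple of tp_size is an assumed way of handling uneven token counts. On real devices the concatenation step would be a collective such as `torch.distributed.all_gather_into_tensor`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain [n, k] x [k, m] matrix product."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def dp_linear_sim(hidden_states: Matrix, weight: Matrix, tp_size: int) -> Matrix:
    """Simulate a token-split projection over tp_size ranks in one process."""
    total = len(hidden_states)
    hidden = len(weight)  # in_features
    # Pad the token dim so it divides evenly across ranks (assumed handling).
    chunk = -(-total // tp_size)  # ceil division
    pad_rows = chunk * tp_size - total
    padded = hidden_states + [[0.0] * hidden for _ in range(pad_rows)]
    # Each "rank" multiplies only its own slice of rows by the full, replicated weight.
    shards = [matmul(padded[r * chunk:(r + 1) * chunk], weight) for r in range(tp_size)]
    # Stand-in for all_gather: concatenate shards in rank order, then drop padding rows.
    gathered = [row for shard in shards for row in shard]
    return gathered[:total]


# Usage: 5 tokens, hidden size 3, projected to 2 features on 4 simulated ranks.
x = [[float(i + j) for j in range(3)] for i in range(5)]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assert dp_linear_sim(x, w, tp_size=4) == matmul(x, w)
```

Because each output row depends only on its own input row, splitting along the token dim leaves the result unchanged. The trade-off is one extra all_gather per projection, which is consistent with the PR's note that the change mainly helps prefill, where the token count is large.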

MI308 Benchmark Results

python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 1  --model /data/DeepSeek-V3/ --tp 8 --trust-remote-code

Before: 5122.06 tok/s; after: 5518.23 tok/s (about 7.7% higher).

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

ZJLi2013 · 2025-02-20T07:52:42Z

serving bench results update:

baseline without use_dp_linear

# prefill
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                32
Successful requests:                     200
Benchmark duration (s):                  364.37
Total input tokens:                      640000
Total generated tokens:                  200
Total generated tokens (retokenized):    197
Request throughput (req/s):              0.55
Input token throughput (tok/s):          1756.46
Output token throughput (tok/s):         0.55
Total token throughput (tok/s):          1757.01
Concurrency:                             29.91
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   54487.73
Median E2E Latency (ms):                 38634.45
---------------Time to First Token----------------
Mean TTFT (ms):                          53974.92
Median TTFT (ms):                        38597.53
P99 TTFT (ms):                           128695.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================
# e2e decoding
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                32
Successful requests:                     200
Benchmark duration (s):                  728.66
Total input tokens:                      640000
Total generated tokens:                  100000
Total generated tokens (retokenized):    99595
Request throughput (req/s):              0.27
Input token throughput (tok/s):          878.32
Output token throughput (tok/s):         137.24
Total token throughput (tok/s):          1015.56
Concurrency:                             30.29
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   110342.08
Median E2E Latency (ms):                 111139.55
---------------Time to First Token----------------
Mean TTFT (ms):                          35991.87
Median TTFT (ms):                        37487.63
P99 TTFT (ms):                           62302.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          149.00
Median TPOT (ms):                        145.35
P99 TPOT (ms):                           210.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           149.01
Median ITL (ms):                         104.52
P99 ITL (ms):                            168.93
==================================================

with use_dp_linear

# prefill 
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                3
Successful requests:                     200
Benchmark duration (s):                  254.22
Total input tokens:                      640000
Total generated tokens:                  200
Total generated tokens (retokenized):    197
Request throughput (req/s):              0.79
Input token throughput (tok/s):          2517.47
Output token throughput (tok/s):         0.79
Total token throughput (tok/s):          2518.26
Concurrency:                             2.99
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3800.61
Median E2E Latency (ms):                 3420.09
---------------Time to First Token----------------
Mean TTFT (ms):                          3748.08
Median TTFT (ms):                        3418.47
P99 TTFT (ms):                           7227.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================

# e2e decoding
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                32
Successful requests:                     200
Benchmark duration (s):                  690.73
Total input tokens:                      640000
Total generated tokens:                  100000
Total generated tokens (retokenized):    99611
Request throughput (req/s):              0.29
Input token throughput (tok/s):          926.55
Output token throughput (tok/s):         144.77
Total token throughput (tok/s):          1071.33
Concurrency:                             30.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   105443.90
Median E2E Latency (ms):                 105185.21
---------------Time to First Token----------------
Mean TTFT (ms):                          32991.32
Median TTFT (ms):                        34231.23
P99 TTFT (ms):                           61317.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          145.20
Median TPOT (ms):                        143.91
P99 TPOT (ms):                           207.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           145.20
Median ITL (ms):                         104.60
P99 ITL (ms):                            123.57
==================================================
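To make the two sets of serving results easier to compare, the relative change of a few reported metrics can be computed with a small helper. The figures are copied from the logs above; the helper name `pct_change` and the choice of metrics are only illustrative. Note that the two prefill runs report different "Max reqeuest concurrency" values (32 for the baseline, 3 with use_dp_linear), so the prefill pair may not be directly comparable.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (baseline, use_dp_linear) pairs copied from the benchmark logs.
results = {
    "prefill input throughput (tok/s)": (1756.46, 2517.47),
    "e2e total throughput (tok/s)": (1015.56, 1071.33),
    "e2e output throughput (tok/s)": (137.24, 144.77),
    "e2e mean TTFT (ms)": (35991.87, 32991.32),
    "e2e mean TPOT (ms)": (149.00, 145.20),
}

for name, (before, after) in results.items():
    print(f"{name}: {before} -> {after} ({pct_change(before, after):+.1f}%)")
```

For throughput metrics a positive change is an improvement; for the latency metrics (TTFT, TPOT) a negative change is an improvement.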

zhyncs · 2025-02-20T19:36:48Z

Hi, please let me know when it's ready for review. Thanks!

ZJLi2013 · 2025-02-25T01:42:46Z

> Hi, please let me know when it's ready for review. Thanks!

Hi @zhyncs, many thanks for the review. The benchmark tests so far cover only a few ISL/OSL/num_prompts combinations on H20 and MI30x.

ZJLi2013 and others added 8 commits February 18, 2025 17:29

add use_dp_linear to replace ReplicatedLinear in forward_normal() in …

5eba084

…deepseek-v2 model

Merge branch 'sgl-project:main' into main

ce9296a

fix minor issues

2c9534f

Merge branch 'main' of https://github.com/ZJLi2013/sglang

c8417bb

add use_dp_linear unit test

a704ecd

Merge remote-tracking branch 'upstream/main'

e6e7aa4

Add BS Padding

0a1c6bb

Clang Format

b848db2

BruceXcluding force-pushed the main branch from 55869c0 to b848db2 Compare February 19, 2025 12:09

ZJLi2013 added 3 commits February 20, 2025 15:39

fix serving perf regression

2d5a5f8

fix conflict

7b9af49

clean debug print

66c0c6e

zhyncs self-assigned this Feb 20, 2025

ZJLi2013 and others added 4 commits February 21, 2025 14:45

add dp_linear in forward_absorb()

a62460b

Merge branch 'sgl-project:main' into main

b62d9a9

add enable dp linear arg

4ab546f

remove debug tag

3e54b73

ZJLi2013 marked this pull request as ready for review February 25, 2025 01:40

ZJLi2013 requested review from merrymercy, Ying1123, zhyncs, hnyls2002, ispobock and ByronHsu as code owners February 25, 2025 01:40

Merge branch 'sgl-project:main' into main

536e53b

Merge branch 'sgl-project:main' into main

54bc05e
