I have applied FP8 dynamic quantization to two models based on qwen2-72b-Instruct that were fine-tuned for different SFT tasks. I've noticed that the acceleration varies significantly between the two tasks, even though both models share the same base model.
Could you shed some light on why different SFT tasks might influence how much quantization accelerates the first token (TTFT), especially given that both tasks have similar input lengths?
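For context, the quantized serving is launched roughly like this (a minimal sketch; the model path and prompt are placeholders, and the real deployments run the OpenAI-compatible server with equivalent options):

```python
# Minimal sketch of the FP8-quantized setup (path and prompt are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/qwen2-72b-instruct-sft-checkpoint",  # placeholder SFT checkpoint
    tensor_parallel_size=4,   # 4x L40S for the quantized runs (8 for the unquantized ones)
    quantization="fp8",       # vLLM's built-in FP8 quantization path
)

outputs = llm.generate(
    ["<long prompt, roughly 8k tokens>"],
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```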
Below are the deployment details for my two tasks.
| Task | Model | GPUs | vLLM | QPS | Avg. prompt tokens | TTFT (ms) | TPOT (ms) | Latency (ms) |
|------|---------------|--------|-------|-------|--------------------|-----------|-----------|--------------|
| A    | unquantized   | 8×L40S | 0.4.2 | 0.15  | 7962.74            | 3744.11   | 67.43     | 22125.21     |
| A    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 7965.9             | 3358.27   | 50.56     | 17823.79     |
| B    | unquantized   | 8×L40S | 0.4.2 | 0.145 | 8216.11            | 3790.8    | 118.64    | 57087.26     |
| B    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 8042.46            | 3674.77   | 113.39    | 50649.97     |
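To be precise about the metrics: TTFT is the time until the first streamed token, TPOT is the average time per subsequent output token, and latency is the end-to-end request time. Per request they can be measured roughly as in the sketch below against the OpenAI-compatible endpoint (URL, model name, and prompt are placeholders; streamed chunks are counted as a rough proxy for output tokens):

```python
# Rough per-request measurement of TTFT / TPOT / latency against a running
# vLLM OpenAI-compatible server (URL, model name, and prompt are placeholders).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="qwen2-72b-instruct-sft",  # placeholder model name
    messages=[{"role": "user", "content": "<long prompt, roughly 8k tokens>"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_chunks += 1  # each streamed chunk taken as a rough token proxy
end = time.perf_counter()

ttft = (first_token_time - start) * 1000
tpot = (end - first_token_time) * 1000 / max(n_chunks - 1, 1)
latency = (end - start) * 1000
print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.1f} ms, latency {latency:.1f} ms")
```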
As the table shows, the benefit of FP8 quantization for Task B is limited: it halves the GPU count from 8 to 4 L40S, but inference time barely improves. Task A, in contrast, not only saves GPUs but also noticeably reduces TTFT, TPOT, and overall latency.
I also compared the proportion of parameters equal to 0 in the two FP8-quantized models, and the difference is not significant. What other factors could cause this kind of discrepancy?
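For reference, the zero-value check was done roughly along these lines, iterating over a saved checkpoint's weight shards (the directory path is a placeholder; tensors are upcast to float32 so the comparison also works for FP8-typed shards):

```python
# Rough sketch of the zero-valued-parameter check (checkpoint dir is a placeholder).
import glob
from safetensors.torch import load_file

total, zeros = 0, 0
for shard in sorted(glob.glob("/path/to/quantized-qwen2-72b/*.safetensors")):
    for tensor in load_file(shard).values():
        t = tensor.float()             # upcast so the comparison works for FP8 dtypes
        total += t.numel()
        zeros += (t == 0).sum().item()

print(f"fraction of zero-valued parameters: {zeros / total:.4%}")
```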