I have applied FP8 dynamic quantization to two models based on qwen2-72b-Instruct that were fine-tuned for different SFT tasks. I've noticed that the acceleration varies significantly between the two tasks, even though both models share the same base model.
Could you shed some light on why different SFT tasks might influence how much quantization accelerates the first token (TTFT), especially given that both tasks have similar input lengths?
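For context, the quantized serving is launched roughly like this (a minimal sketch; the model path and prompt are placeholders, and the real deployments run the OpenAI-compatible server with equivalent options):

```python
# Minimal sketch of the FP8-quantized setup (path and prompt are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/qwen2-72b-instruct-sft-checkpoint",  # placeholder SFT checkpoint
    tensor_parallel_size=4,   # 4x L40S for the quantized runs (8 for the unquantized ones)
    quantization="fp8",       # vLLM's built-in FP8 quantization path
)

outputs = llm.generate(
    ["<long prompt, roughly 8k tokens>"],
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```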
Below are the deployment details for my two tasks.
| Task | Model | GPUs | vLLM | QPS | Avg. prompt tokens | TTFT (ms) | TPOT (ms) | Latency (ms) |
|------|---------------|--------|-------|-------|--------------------|-----------|-----------|--------------|
| A    | unquantized   | 8×L40S | 0.4.2 | 0.15  | 7962.74            | 3744.11   | 67.43     | 22125.21     |
| A    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 7965.9             | 3358.27   | 50.56     | 17823.79     |
| B    | unquantized   | 8×L40S | 0.4.2 | 0.145 | 8216.11            | 3790.8    | 118.64    | 57087.26     |
| B    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 8042.46            | 3674.77   | 113.39    | 50649.97     |
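To be precise about the metrics: TTFT is the time until the first streamed token, TPOT is the average time per subsequent output token, and latency is the end-to-end request time. Per request they can be measured roughly as in the sketch below against the OpenAI-compatible endpoint (URL, model name, and prompt are placeholders; streamed chunks are counted as a rough proxy for output tokens):

```python
# Rough per-request measurement of TTFT / TPOT / latency against a running
# vLLM OpenAI-compatible server (URL, model name, and prompt are placeholders).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="qwen2-72b-instruct-sft",  # placeholder model name
    messages=[{"role": "user", "content": "<long prompt, roughly 8k tokens>"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_chunks += 1  # each streamed chunk taken as a rough token proxy
end = time.perf_counter()

ttft = (first_token_time - start) * 1000
tpot = (end - first_token_time) * 1000 / max(n_chunks - 1, 1)
latency = (end - start) * 1000
print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.1f} ms, latency {latency:.1f} ms")
```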
As the table shows, the benefit of FP8 quantization for Task B is limited: it halves the GPU count from 8 to 4 L40S, but inference time barely improves. Task A, in contrast, not only saves GPUs but also noticeably reduces TTFT, TPOT, and overall latency.
I also compared the proportion of parameters equal to 0 in the two FP8-quantized models, and the difference is not significant. What other factors could cause this kind of discrepancy?
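For reference, the zero-value check was done roughly along these lines, iterating over a saved checkpoint's weight shards (the directory path is a placeholder; tensors are upcast to float32 so the comparison also works for FP8-typed shards):

```python
# Rough sketch of the zero-valued-parameter check (checkpoint dir is a placeholder).
import glob
from safetensors.torch import load_file

total, zeros = 0, 0
for shard in sorted(glob.glob("/path/to/quantized-qwen2-72b/*.safetensors")):
    for tensor in load_file(shard).values():
        t = tensor.float()             # upcast so the comparison works for FP8 dtypes
        total += t.numel()
        zeros += (t == 0).sum().item()

print(f"fraction of zero-valued parameters: {zeros / total:.4%}")
```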