
Differences in Dynamic Quantization Speedup for Varying SFT Tasks on Qwen2-72b-Instruct Models #40

Open
IPostYellow opened this issue Aug 15, 2024 · 0 comments

Comments


IPostYellow commented Aug 15, 2024

I have applied dynamic quantization to two models fine-tuned from Qwen2-72B-Instruct on different SFT tasks, and I've noticed that the speedup differs significantly between the two tasks, even though both share the same base model.
Could you shed some light on why different SFT tasks might influence how much quantization accelerates the first token (TTFT), especially given that both tasks have similar input lengths?
Below are the deployment details for the two tasks.

| Task | Model | GPUs | vLLM | QPS | Avg. prompt tokens | TTFT (ms) | TPOT (ms) | Latency (ms) |
|------|---------------|--------|-------|-------|--------------------|-----------|-----------|--------------|
| A | unquantized | 8×L40S | 0.4.2 | 0.15 | 7962.74 | 3744.11 | 67.43 | 22125.21 |
| A | FP8 quantized | 4×L40S | 0.4.2 | 0.15 | 7965.9 | 3358.27 | 50.56 | 17823.79 |
| B | unquantized | 8×L40S | 0.4.2 | 0.145 | 8216.11 | 3790.8 | 118.64 | 57087.26 |
| B | FP8 quantized | 4×L40S | 0.4.2 | 0.15 | 8042.46 | 3674.77 | 113.39 | 50649.97 |

As you can see, the benefit of FP8 quantization on Task B is not significant: it only halves the GPU count from 8 to 4 L40S, with little improvement in latency. Task A, by contrast, not only saves GPUs but also reduces inference time noticeably.
I also compared the proportion of parameters equal to 0 in the two FP8-quantized models, and the difference is not significant. What other factors might cause this kind of discrepancy?
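
For reference, a minimal sketch of how the zero-weight fraction can be measured, assuming the quantized checkpoints are safetensors shards with weights stored as `torch.float8_e4m3fn` tensors (the shard filename below is hypothetical):

```python
# Minimal sketch: fraction of zero-valued weights in one FP8 safetensors shard.
# Assumes torch >= 2.1 (float8 dtypes) and that quantized weights use float8_e4m3fn.
import torch
from safetensors.torch import load_file

def zero_fraction(shard_path: str) -> float:
    state = load_file(shard_path)
    zeros, total = 0, 0
    for name, t in state.items():
        if t.dtype == torch.float8_e4m3fn:       # only count quantized weight tensors
            t = t.to(torch.float32)              # cast first; float8 comparison support varies by torch version
            zeros += (t == 0).sum().item()
            total += t.numel()
    return zeros / max(total, 1)

print(zero_fraction("model-00001-of-00037.safetensors"))  # hypothetical shard name
```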
