[Inference] use fp8 cuda core gemm kernel when M<=4 #9423

Merged: 1 commit merged into PaddlePaddle:develop on Nov 26, 2024

Conversation

@zhink (Contributor) commented Nov 14, 2024

PR types

New features

PR changes

Others

Description

This PR adds an FP8 GEMM kernel that runs on CUDA cores to speed up FP8 GEMM when M <= 4. Export `FLAGS_cuda_core_fp8_gemm=1` to enable it.
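The idea is GEMV-style: at such small M the GEMM is memory-bound, so plain CUDA cores with fp32 accumulation can beat a tensor-core GEMM that has to pad tiny tiles. Below is a minimal sketch of the technique, not the PR's actual kernel; the kernel name, the B-transposed layout, the one-warp-per-output-column mapping, and the single dequant scale are all assumptions for illustration:

```cuda
// Illustrative sketch only (NOT the kernel added by this PR): a CUDA-core
// FP8 GEMM for M <= 4. One warp-sized block computes one output column n;
// threads stride over K and reduce with warp shuffles, accumulating in fp32.
#include <cuda_fp8.h>

__global__ void fp8_cuda_core_gemm(const __nv_fp8_e4m3* __restrict__ A,  // [M, K], row-major
                                   const __nv_fp8_e4m3* __restrict__ B,  // [N, K], i.e. B transposed
                                   float* __restrict__ C,                // [M, N]
                                   int M, int N, int K, float scale) {
  int n = blockIdx.x;            // one block per output column
  if (n >= N) return;
  for (int m = 0; m < M; ++m) {  // M <= 4, so this loop is tiny
    float acc = 0.f;
    // Each thread accumulates a strided slice of the K dimension on CUDA cores.
    for (int k = threadIdx.x; k < K; k += blockDim.x) {
      acc += static_cast<float>(A[m * K + k]) * static_cast<float>(B[n * K + k]);
    }
    // Warp reduction (assumes blockDim.x == 32).
    for (int offset = 16; offset > 0; offset >>= 1) {
      acc += __shfl_down_sync(0xffffffff, acc, offset);
    }
    if (threadIdx.x == 0) {
      C[m * N + n] = acc * scale;  // dequantize once on the way out
    }
  }
}
// Example launch: fp8_cuda_core_gemm<<<N, 32>>>(A, B, C, M, N, K, scale);
```

Consistent with a CUDA-core path paying off only at low arithmetic intensity, the gains in the table below shrink and eventually reverse for the largest K x N shapes at M = 3 and 4.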
Timings measured on an NVIDIA L20:

| M | K | N | L20 baseline (s) | L20 this kernel (s) | Speedup (%) |
|---|------|-------|------|------|--------|
| 1 | 4096 | 4096 | 5.43 | 3.85 | 29.05 |
| 1 | 4096 | 12800 | 5.45 | 3.86 | 29.21 |
| 1 | 6144 | 4096 | 5.43 | 3.89 | 28.34 |
| 1 | 2048 | 2048 | 5.43 | 3.91 | 28.13 |
| 1 | 2048 | 5504 | 5.55 | 3.93 | 29.30 |
| 1 | 6144 | 2048 | 5.45 | 3.90 | 28.43 |
| 1 | 5120 | 5120 | 5.45 | 3.90 | 28.56 |
| 1 | 5120 | 13824 | 5.50 | 3.97 | 27.74 |
| 1 | 15360 | 5120 | 5.56 | 3.93 | 29.28 |
| 2 | 4096 | 4096 | 5.51 | 3.97 | 27.97 |
| 2 | 4096 | 12800 | 5.52 | 3.94 | 28.54 |
| 2 | 6144 | 4096 | 5.50 | 3.95 | 28.13 |
| 2 | 2048 | 2048 | 5.51 | 3.93 | 28.57 |
| 2 | 2048 | 5504 | 5.58 | 3.95 | 29.20 |
| 2 | 6144 | 2048 | 5.49 | 3.95 | 28.13 |
| 2 | 5120 | 5120 | 5.50 | 3.95 | 28.19 |
| 2 | 5120 | 13824 | 5.49 | 3.94 | 28.26 |
| 2 | 15360 | 5120 | 5.59 | 4.04 | 27.62 |
| 3 | 4096 | 4096 | 5.55 | 3.96 | 28.59 |
| 3 | 4096 | 12800 | 5.56 | 3.96 | 28.73 |
| 3 | 6144 | 4096 | 5.47 | 3.93 | 28.15 |
| 3 | 2048 | 2048 | 5.49 | 3.95 | 28.19 |
| 3 | 2048 | 5504 | 5.58 | 3.95 | 29.22 |
| 3 | 6144 | 2048 | 5.53 | 3.93 | 28.84 |
| 3 | 5120 | 5120 | 5.46 | 3.93 | 28.04 |
| 3 | 5120 | 13824 | 5.43 | 4.31 | 20.58 |
| 3 | 15360 | 5120 | 5.53 | 5.25 | 4.99 |
| 4 | 4096 | 4096 | 5.48 | 3.89 | 29.05 |
| 4 | 4096 | 12800 | 5.49 | 4.03 | 26.65 |
| 4 | 6144 | 4096 | 5.47 | 3.89 | 28.74 |
| 4 | 2048 | 2048 | 5.46 | 3.89 | 28.69 |
| 4 | 2048 | 5504 | 5.53 | 3.90 | 29.40 |
| 4 | 6144 | 2048 | 5.47 | 3.91 | 28.45 |
| 4 | 5120 | 5120 | 5.48 | 3.91 | 28.67 |
| 4 | 5120 | 13824 | 5.45 | 5.22 | 4.12 |
| 4 | 15360 | 5120 | 5.64 | 6.38 | -13.14 |




paddle-bot bot commented Nov 14, 2024

Thanks for your contribution!


codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.09%. Comparing base (85333aa) to head (945e92d).
Report is 56 commits behind head on develop.

Current head 945e92d differs from pull request most recent head 2f5e5ea

Please upload reports for the commit 2f5e5ea to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9423      +/-   ##
===========================================
+ Coverage    52.95%   53.09%   +0.13%     
===========================================
  Files          682      685       +3     
  Lines       110667   108904    -1763     
===========================================
- Hits         58606    57824     -782     
+ Misses       52061    51080     -981     


@zhink force-pushed the develop branch 2 times, most recently from 5f40ee8 to 945e92d (November 22, 2024 03:25)
@DrownFish19 DrownFish19 changed the title use fp8 cuda core gemm kernel when M<=4 [Inference] use fp8 cuda core gemm kernel when M<=4 Nov 26, 2024
Two review threads on llm/docs/predict/best_practices.md (outdated, resolved)
@DrownFish19 (Collaborator) commented:

  1. Recommend writing this speedup optimization into the documentation;
  2. In the flag's description, add the speedup ratio and the conditions under which it applies.

@DrownFish19 DrownFish19 merged commit 0b4b810 into PaddlePaddle:develop Nov 26, 2024
10 of 12 checks passed