[FEA] Better grid size for H100 GPU with SXM5 #1921

zhipeng93 · 2024-11-06T03:38:30Z

Is your feature request related to a problem? Please describe.
When doing gemm in Hopper, we need to decide the grid size based on problem size, cluster shape and the hopper architectures.

Currently, cutlass only considered GH100 GPU but not H100 GPU with SXM5 [1]. They have different architectures [2]:

GH100 GPU: 8 GPCs, 72 TPCs (9 TPCs per GPC), 144 SMs
H100 GPU with SXM5: 8 GPCs, 66 TPCs (uneven TPC for GPCs?), 132 SMs

In H100 GPU with SXM5, cutlass may fail to find the optimal grid size, leading to sub-optimal performance. For example, when doing gemm mnk=(4096, 4096, 4096) on H100 GPU with SXM5 and using cluster shape (4, 2, 1), the result grid size is (4, 28, 1). However, cublas used grid size (60, 2, 1), and cublas has a better performance by 17%.

cutlass settings:

using TileShape           = Shape<_128,_128,_64>; 
using ClusterShape        = Shape<_4,_2,_1>;
cutlass::gemm::KernelTmaWarpSpecializedCooperative

[1] https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/tile_scheduler_params.h#L249-L259
[2] https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

Describe the solution you'd like
Compute grid size according to the GPU archi.

Describe alternatives you've considered

Additional context

The text was updated successfully, but these errors were encountered:

zhipeng93 · 2024-11-06T03:44:14Z

When I try to hack the cutlass code for better performance on H100 GPU with SXM5, I find that when using cluster shape (4, 2, 1) or (2, 2, 1), only 120 SMs can be used. That is, when I set grid size greater than 120, the waves per SM would be greater than 1.

For example,

cluster shape (4, 2, 1) and grid size (64, 2, 1), waves per SM is 1.07. (128/120~1.07)
cluster shape (2, 2, 1) and grid size (62, 2, 1), waves per SM is 1.03. (124/120~1.03)
cluster shape (4, 2, 1) and grid size (60, 2, 1), waves per SM is 1. (120/120~1)

I also check this setting: cluster shape (1, 1, 1) and grid size (132, 1, 1), waves per SM is 1.

Any ideas on the number of SMs that can be used when using thread block cluster?

zhipeng93 · 2024-11-06T03:52:00Z

cc @hwu36 @Junkai-Wu

hwu36 · 2024-11-06T05:22:58Z

@ANIKET-SHIVAM

zhipeng93 · 2024-11-11T08:46:31Z

@ANIKET-SHIVAM Hi, can you help to explain this? :)

ANIKET-SHIVAM · 2024-11-12T22:08:42Z

@zhipeng93 yes, your observation is correct and is a known issue that for cluster sizes >= 4 optimal grid size is not launched. We should be able to elevate that limitation with cudaOccupancyMaxActiveClusters API. Will try to upstream changes soon.

zhipeng93 · 2024-11-14T12:49:34Z

@ANIKET-SHIVAM Thanks for the reply!

It seems that I can never really get to use more than 120 sms when cluster sizes >=4, even I hack the heuristic for computing the grid size.

For example,

cluster shape (4, 2, 1) and grid size (64, 2, 1), waves per SM is 1.07. (128/120~1.07)
cluster shape (2, 2, 1) and grid size (62, 2, 1), waves per SM is 1.03. (124/120~1.03)

In Hopper SXM5, I should be able to use 132 sms. Can you also help to explain this?

zhipeng93 added ? - Needs Triage feature request New feature or request labels Nov 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Better grid size for H100 GPU with SXM5 #1921

[FEA] Better grid size for H100 GPU with SXM5 #1921

zhipeng93 commented Nov 6, 2024 •

edited

Loading

zhipeng93 commented Nov 6, 2024 •

edited

Loading

zhipeng93 commented Nov 6, 2024

hwu36 commented Nov 6, 2024

zhipeng93 commented Nov 11, 2024

ANIKET-SHIVAM commented Nov 12, 2024

zhipeng93 commented Nov 14, 2024 •

edited

Loading

[FEA] Better grid size for H100 GPU with SXM5 #1921

[FEA] Better grid size for H100 GPU with SXM5 #1921

Comments

zhipeng93 commented Nov 6, 2024 • edited Loading

zhipeng93 commented Nov 6, 2024 • edited Loading

zhipeng93 commented Nov 6, 2024

hwu36 commented Nov 6, 2024

zhipeng93 commented Nov 11, 2024

ANIKET-SHIVAM commented Nov 12, 2024

zhipeng93 commented Nov 14, 2024 • edited Loading

zhipeng93 commented Nov 6, 2024 •

edited

Loading

zhipeng93 commented Nov 6, 2024 •

edited

Loading

zhipeng93 commented Nov 14, 2024 •

edited

Loading