
[BUG] TMA Cooperative GeMM with Stream-K scheduler hangs #1917

Open
NihalPotdar opened this issue Nov 4, 2024 · 12 comments
Labels
? - Needs Triage, bug (Something isn't working)

Comments

@NihalPotdar

Describe the bug
GEMM kernels with the following configuration hang for specific problem shapes.

Type: uint4_t * half_t
Tile: m=16,n=2560,k=8192
Cluster: 1x1x1
Kernel Schedule: KernelTmaWarpSpecializedCooperative
Epilogue Schedule: TmaWarpSpecializedCooperative
Tile Scheduler: Stream-K
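
For reference, the schedule and scheduler policies named above correspond to the following CUTLASS 3.x type tags (a config sketch only; the surrounding CollectiveBuilder/kernel code is omitted, and the alias names on the left are illustrative):

```cpp
// Config fragment (sketch): the policies from the bug report as CUTLASS 3.x type tags.
using KernelSchedule   = cutlass::gemm::KernelTmaWarpSpecializedCooperative;
using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative;
using TileScheduler    = cutlass::gemm::StreamKScheduler;
```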

Expected behavior

With CUTLASS 3.x, this kernel call just hangs, with no code changes on our side. This is not expected.

Based on #1801, this should have been resolved.

@NihalPotdar NihalPotdar added the "? - Needs Triage" and "bug" labels Nov 4, 2024
@thakkarV
Collaborator

thakkarV commented Nov 4, 2024

@jackkosaian

@jackkosaian
Contributor

Hi, @NihalPotdar. Can you please provide a reproducer for this bug?

@NihalPotdar
Author

@jackkosaian set the scheduler in examples/55_hopper_mixed_dtype_gemm.cu to cutlass::gemm::StreamKScheduler and then run ./55_hopper_int4_bf16_gemm --m=16 --n=2560 --k=8192 --mode=1

@thakkarV
Collaborator

thakkarV commented Nov 4, 2024

Please include your build flags and the full steps to reproduce, starting from a checkout of the repo. We find that users often do not use the flags generated by our build system. Please also provide your CUDA toolkit version.

@NihalPotdar
Author

  1. Clone CUTLASS: https://github.com/NVIDIA/cutlass
  2. Follow https://github.com/NVIDIA/cutlass/blob/main/README.md to generate build files with cmake .. -DCUTLASS_NVCC_ARCHS=sm90a
  3. Update 55_hopper_mixed_dtype_gemm.cu to use the stream-K scheduler, e.g.
using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal<
    Shape<int,int,int,int>, // Indicates ProblemShape
    CollectiveMainloopScaleOnly,
    CollectiveEpilogue,
    cutlass::gemm::StreamKScheduler
>;
  4. Run make from build/examples/55_hopper_mixed_dtype_gemm
  5. Call ./55_hopper_int4_bf16_gemm --m=16 --n=2560 --k=8192 --mode=1

@jackkosaian
Contributor

Thanks for the detailed steps.

It looks like that example is not calling gemm.initialize() before each run of the GEMM in the profiling loop here.

This is required for stream-K in order to initialize counters used for coordinating inter-CTA reduction. If these are not properly initialized, stream-K is likely to hang.

Can you please try changing the loop linked above to be the following?

    for (int iter = 0; iter < options.iterations; ++iter) {
      CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); // Added this line
      CUTLASS_CHECK(gemm.run());
    }

@NihalPotdar
Author

@jackkosaian that works, thank you. However, for smaller problem shapes like this one, where the atomic-write overhead becomes significant, using a separate reduction wave would make a lot of sense. I noticed that in the CUTLASS implementation, https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/tile_scheduler_params.h#L1051, this is currently turned off.

Are there any plans to fix this, and what's the ETA? Do you have any suggested workarounds in the meantime?

@jackkosaian
Contributor

The current plan is to improve reduction performance in 3.7.

@NihalPotdar
Author

NihalPotdar commented Nov 8, 2024

@jackkosaian Sounds good. I am also seeing correctness issues with the existing implementation: when compared against an fp8 reference, as in build/examples/55_hopper_mixed_dtype_gemm, the comparison fails only when using stream-K. Any thoughts on why this might be the case?

@jackkosaian
Contributor

How are tensors initialized? Containing random floating point values, or random integer values within some tight range?

How is error checking being performed? Exact match or relative?

Since stream-K involves splitting a GEMM along the K mode, it can accumulate results in a different order than a non-stream-K GEMM. Since floating point addition is not associative, the results can be different.

@NihalPotdar
Author

NihalPotdar commented Nov 24, 2024

@jackkosaian Sorry for the delayed reply. This is directly based on the verify method provided in the mixed-precision example in the CUTLASS repo, https://github.com/NVIDIA/cutlass/blob/main/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu.

For this specific case, matrix B is cutlass::uint4_t and matrix A is cutlass::half_t. Matrix B is initialized randomly to floating-point values between -2 and +2, and matrix A is initialized randomly to integer values between -5 and +5. Here's a pointer to the relevant code segment in the example: https://github.com/NVIDIA/cutlass/blob/main/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu#L458.

The source of truth for the verification is an fp8*fp8 matmul with the same inputs. The error checking is performed with cutlass::reference::device::BlockCompareRelativelyEqual, i.e., comparing each float in the results. The bounds for the verification are very liberal, with the relative error margin (epsilon) set to 1, i.e., if I am comparing tensors A and B then their diff must be within (abs(A) + abs(B)).

I understand that there's non-determinism coming from the hardware, but I think these bounds are very lenient and I would expect this verification to succeed. Otherwise, this makes it really hard to use for any practical applications that need to preserve quality, as with quantized models.

@jackkosaian
Contributor

Thanks for the details. Can you please post a full reproducer so that I can make sure I'm testing the same thing as you?
