[QST] Split-K: Reduce in Shared Memory instead of Global Memory #1421
There's a third option that CUTLASS implements as well: a semaphore-based serial reduction across different CTAs, which doesn't require a separate reduction kernel or a round trip to gmem. (1) is easy to implement and can have great perf depending on the problem size. (2) is more difficult to implement and reduces the arithmetic intensity of the CTA-level GEMM, which is usually highly tuned for a given tile size. It can be done, but the benefits are limited to fewer scenarios.
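For concreteness, here is a minimal sketch of what such a semaphore-gated serial reduction could look like. This is not the actual CUTLASS code (which lives in its epilogue/semaphore machinery); the function name, the per-tile counter, and the accumulator-fragment layout are illustrative assumptions, and a production version would use proper acquire/release semantics rather than a bare spin loop.

```cuda
// Sketch of option (3): each output tile has one integer counter in gmem
// (initialized to 0), and the CTA handling K-slice `split_idx` accumulates
// into the output tile only after slices 0..split_idx-1 have signaled.
// All names and the fragment indexing are hypothetical.
__device__ void serial_splitk_epilogue(float* out_tile,    // this CTA's C tile in gmem
                                       int* semaphore,     // per-tile counter in gmem
                                       const float* acc,   // accumulator fragments (registers)
                                       int tile_elems, int split_idx, int split_count) {
    // One thread spins until it is this split's turn; the rest wait at the barrier.
    if (threadIdx.x == 0)
        while (atomicAdd(semaphore, 0) != split_idx) { /* spin */ }
    __syncthreads();

    int frags = tile_elems / blockDim.x;  // assumes tile_elems % blockDim.x == 0
    for (int j = 0; j < frags; ++j) {
        int i = j * blockDim.x + threadIdx.x;
        float v = acc[j];
        if (split_idx > 0) v += out_tile[i];  // fold in earlier splits' partials
        out_tile[i] = v;                      // serial accumulation into C itself
    }
    __syncthreads();

    // Publish the stores device-wide, then let the next split proceed.
    if (threadIdx.x == 0) {
        __threadfence();
        atomicAdd(semaphore, 1);
    }
}
```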
Thanks for the very informative response! For the 3rd option, do you mind sharing a pointer to its code?
Thanks for the pointer! In my case, we are mostly implementing things ourselves. Do you have any thoughts on the performance and implementability of the 3rd option versus the first two options (especially the first option)?
Implementability-wise, it's between the two in terms of difficulty and complexity. Performance-wise, it depends (on arch, problem size, your kernel schedule, pipelining strategy, fusions, etc.).
Understood, thanks again for the super helpful answers!
I have a (somewhat naive/newbie) question about the Split-K implementation.
In CUTLASS, the Split-K kernel splits the K dimension so that multiple thread blocks compute (in parallel) the partial results of one output tile. These partial results are written to global memory and then reduced by a separate reduction kernel launch. Hence, this implementation introduces a round trip of the partial results between off-chip and on-chip memory, as well as the overhead of one extra kernel launch.
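To make the data flow concrete, a naive sketch of this scheme might look like the following. This is illustrative only, not the CUTLASS kernels; it assumes row-major fp32 matrices, K divisible by split_count, and a gmem workspace sized split_count * M * N.

```cuda
// Method (1), split-K parallel: blockIdx.z picks a K-slice; each slice writes
// its partial C to a gmem workspace, and a second kernel reduces the slices.
__global__ void splitk_partial_gemm(const float* A, const float* B,
                                    float* workspace,  // [split_count][M*N]
                                    int M, int N, int K, int split_count) {
    int split   = blockIdx.z;
    int k_begin =  split      * (K / split_count);
    int k_end   = (split + 1) * (K / split_count);
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.f;
    for (int k = k_begin; k < k_end; ++k)
        acc += A[row * K + k] * B[k * N + col];

    // The round trip starts here: partials go out to global memory.
    workspace[split * M * N + row * N + col] = acc;
}

__global__ void splitk_reduce(const float* workspace, float* C,
                              int M, int N, int split_count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * N) return;
    float sum = 0.f;
    for (int s = 0; s < split_count; ++s)  // read the partials back from gmem
        sum += workspace[s * M * N + idx];
    C[idx] = sum;
}
```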
Here is a potential alternative implementation: split the K dimension so that multiple threads within a single thread block compute (in parallel) the partial results of one output tile. These partial results are written to and reduced through shared memory before the result is written back to global memory.
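A naive sketch of this alternative, again illustrative rather than CUTLASS code: it assumes blockDim.y (the number of K-slices) is a power of two dividing K, and a launch with dynamic shared memory of blockDim.x * blockDim.y floats, e.g. `splitk_smem_gemm<<<grid, dim3(128, 4), 4 * 128 * sizeof(float)>>>(...)`.

```cuda
// Method (2): split K across the y-dimension of one thread block and reduce
// the partials through shared memory, so C is written exactly once and no
// gmem workspace or second kernel is needed. One output element per x-thread
// for clarity.
__global__ void splitk_smem_gemm(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    extern __shared__ float partials[];  // [blockDim.y][blockDim.x]
    int out = blockIdx.x * blockDim.x + threadIdx.x;  // flattened index into C
    int row = out / N, col = out % N;
    int slice   = threadIdx.y;
    int k_begin =  slice      * (K / blockDim.y);
    int k_end   = (slice + 1) * (K / blockDim.y);

    float acc = 0.f;
    if (out < M * N)
        for (int k = k_begin; k < k_end; ++k)
            acc += A[row * K + k] * B[k * N + col];
    partials[slice * blockDim.x + threadIdx.x] = acc;
    __syncthreads();

    // Tree-reduce the K-slices in shared memory; slice 0 ends up with the sum.
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (slice < s)
            partials[slice * blockDim.x + threadIdx.x] +=
                partials[(slice + s) * blockDim.x + threadIdx.x];
        __syncthreads();
    }
    if (slice == 0 && out < M * N)
        C[out] = partials[threadIdx.x];  // single write to gmem, no workspace
}
```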
Compared to method (1), method (2) saves the round trip between off-chip and on-chip memory and the separate kernel launch. Is there any reason why (1) is (overwhelmingly?) preferred over (2)?
This is (loosely) a follow-up question to #1391, which is more like method (2).
Thanks in advance for your help!