
[QST] How are DRAM accesses optimized during Split K - reduction across threadblocks? #1406

Closed
Rya-Sanovar opened this issue Mar 16, 2024 · 18 comments

Comments

@Rya-Sanovar

Assuming partition=16 in a PartitionedK GEMM with split-K reduction at the threadblock level:
All 16 tiles belonging to the same "row" of block tiles of the A matrix run on different SMs of the GPU. During the reduction phase, some of these tiles need to be written to global memory so they can be loaded by the consumer tiles that perform the reduction.
Wouldn't this increase latency due to the DRAM bandwidth bottleneck, since we now have extra gmem accesses compared to the un-partitioned case, even though we've achieved higher occupancy? How does CUTLASS optimize this?

@thakkarV
Collaborator

thakkarV commented Mar 16, 2024

Same as before - we optimize the CTA rasterization and work tile mapping such that they hit in L2 as much as possible. This applies to the partial reduction as well.
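To make the rasterization point concrete, here is a minimal, hypothetical sketch of a grouped CTA-to-tile mapping (made-up function, not CUTLASS's actual ThreadblockSwizzle): CTAs that are scheduled close together in time get neighboring output tiles, so the A and B panels they load are likely still resident in L2.

```cuda
// Hypothetical, simplified CTA-to-output-tile mapping (not CUTLASS's real
// swizzle). Consecutive CTA ids visit output tiles in small groups so that
// they reuse the same A row-panels and B column-panels while those panels
// are still hot in L2.
#include <cstdio>

struct TileCoord { int m, n; };

TileCoord swizzled_tile(int cta_id, int tiles_m, int tiles_n, int group_m = 8) {
  int tiles_per_group = group_m * tiles_n;
  int group           = cta_id / tiles_per_group;
  int first_m         = group * group_m;
  int rows_in_group   = (tiles_m - first_m < group_m) ? tiles_m - first_m : group_m;
  int local           = cta_id % tiles_per_group;
  // Walk column-major inside the group: consecutive CTAs share B column-panels,
  // and the group is narrow enough that the A row-panels also stay in L2.
  return { first_m + local % rows_in_group, local / rows_in_group };
}

int main() {
  for (int id = 0; id < 16 * 16; ++id) {   // visit order for a 16x16 tile grid
    TileCoord t = swizzled_tile(id, 16, 16);
    std::printf("cta %3d -> tile (%2d, %2d)\n", id, t.m, t.n);
  }
}
```

The same idea applies to the CTAs that produce split-K partials: if the slices of one output tile are rasterized near each other, the partials they exchange have a better chance of being served from L2 rather than DRAM.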

@Rya-Sanovar
Author

Okay, doing this maximizes GPU occupancy, but what about latency? Isn't there a tradeoff here? Or, by optimizing the L2 hit rate, does split-K reduction end up with almost the same inference latency as without it?

@thakkarV
Collaborator

I'm not sure what you mean by this being a tradeoff between occupancy and latency. Latency of what? Individual gmem accesses or the time to completion of the kernel?

Split K increases utilization. Therefore the wallclock time of the kernel goes down. Regardless of gmem latencies for individual tiles, this implies that the end-to-end latency of the kernel decreases.

@Rya-Sanovar
Author

By latency I mean the total time it takes to compute the C matrix if we use split-K reduction, so yes, time to completion of the kernel.

So, if the wallclock time goes down, that means the additional parallelism that split-K achieves compensates for the extra gmem accesses, right? Does this hold for all matrix sizes M, N, K? As in, can there be cases where using split-K actually slows down the time it takes to compute C?

@thakkarV
Collaborator

Yes. Split K nets you speedups when MN are small and K is large. If the outer dims are big then using split K will worsen perf, especially if the contraction dim is small.
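A quick back-of-the-envelope example (made-up sizes and a 128×128 CTA tile, purely for illustration) of why small M, N with large K favors split-K:

```cuda
// Back-of-the-envelope CTA counts with and without split-K.
// Sizes and tile shape are assumptions for illustration only.
#include <cstdio>

int main() {
  int M = 128, N = 128, K = 65536;      // tiny output, huge contraction dim
  int tile_m = 128, tile_n = 128;
  int split_k_slices = 16;

  int tiles_m = (M + tile_m - 1) / tile_m;
  int tiles_n = (N + tile_n - 1) / tile_n;

  int ctas_dp      = tiles_m * tiles_n;          // data-parallel: 1 CTA
  int ctas_split_k = ctas_dp * split_k_slices;   // split-K: 16 CTAs
  int k_per_slice  = K / split_k_slices;         // 4096 per slice

  std::printf("data-parallel: %d CTA(s), split-K: %d CTA(s), K per slice: %d\n",
              ctas_dp, ctas_split_k, k_per_slice);
  // One CTA leaves nearly the whole GPU idle; 16 CTAs recover utilization at
  // the cost of writing and re-reading 16 partial tiles through the workspace.
}
```

Flip the shape around (large M and N, small K) and the data-parallel launch already fills the machine, so the extra workspace traffic from split-K is pure overhead.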

@Rya-Sanovar
Author

  1. Can we say that the speedup with split K would be almost (#tiles along the K mode)x the case without split K? Assuming the total number of block tiles doesn't exceed the number of SMs on the GPU.
  2. And how would we know if K is "large" enough to use split K? Does it only have to be relatively bigger than M and N, or is there a certain value of K beyond which split K actually gives a speedup?

@thakkarV
Collaborator

  1. Yes.
  2. It depends on a lot of factors, such as the exact problem size, architecture, etc. You can use the profiler to decide this splitting factor. Mind you, split K is not really a great load-balancing strategy in general.

@Rya-Sanovar
Author

Rya-Sanovar commented Mar 18, 2024

> mind you, split K is not really a great load balancing strategy in general

  1. Could you elaborate on why this is?
  2. How exactly is the split-K reduction executed? Is it something similar to, say, how intra-warp reduction is done by shfl_down_sync()? Attaching an image for reference:
    [image: diagram of an intra-warp shuffle-down reduction]

@thakkarV
Collaborator

  1. I recommend you read the Stream-K paper (linked in our README) to understand the intricacies of load balancing.
  2. There are two ways to do the reduction. The first is what you show: parallel reductions, which need to use atomics. They are generally not tree based, since CTA scheduling can be dynamic. The second method is serial reduction with semaphores (see the sketch below).
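Here is a rough CUDA sketch of the serial-with-semaphore flavor, with made-up names and a heavily simplified structure (not CUTLASS's actual implementation): every CTA that owns a K-slice of the same output tile waits on a per-tile counter for its turn, accumulates its partial into global memory, and then releases the next slice. A real kernel would do this in the epilogue, after the mainloop has finished computing the partial.

```cuda
// Hypothetical sketch of serial split-K reduction ordered by a per-tile
// semaphore. Names, layout, and structure are simplified for illustration.
#include <cuda_runtime.h>

__global__ void serial_split_k_epilogue(
    float*       C_tile,      // start of this output tile in global memory
    const float* partial,     // this CTA's partial tile (already computed)
    int*         semaphores,  // one counter per output tile, initialized to 0
    int tile_id, int slice_id, int tile_elems)
{
  // ... the GEMM mainloop for slice `slice_id` would run before this point ...

  // One thread spins until it is this slice's turn for the output tile.
  if (threadIdx.x == 0) {
    while (atomicAdd(&semaphores[tile_id], 0) != slice_id) { /* spin */ }
  }
  __syncthreads();

  // Slice 0 initializes the tile; later slices accumulate on top of it.
  for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) {
    float v = partial[i];
    C_tile[i] = (slice_id == 0) ? v : C_tile[i] + v;
  }

  __threadfence();   // make the stores visible before releasing the semaphore
  __syncthreads();
  if (threadIdx.x == 0) {
    atomicExch(&semaphores[tile_id], slice_id + 1);  // hand the turn to slice+1
  }
}
```

The parallel flavor instead lets every slice write its partial with no ordering between slices, and either atomics or a separate reduction pass perform the final combine.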

@Rya-Sanovar
Author

Thanks! I can't find it in the README, can you link it here please?

@thakkarV
Collaborator

https://arxiv.org/abs/2301.03598

@Rya-Sanovar
Author

Rya-Sanovar commented Mar 20, 2024

Thanks. Also, why is split-K's work split into two kernels, GemmKernel and ReductionKernel? Wouldn't launching it as one fused kernel be less costly?

@hwu36
Collaborator

hwu36 commented Mar 20, 2024

Parallel split-K uses a separate reduction kernel; serial split-K fuses the reduction inside the GEMM kernel. The former works better when the split slices are big.

@Rya-Sanovar
Author

I see. I have a few questions on how parallel split-K works:

  1. Before the reduction kernel is launched, are all the partial products present in the smem of the SMs? Or are they present in the workspace (which resides in DRAM, if I'm not wrong)?
  2. Why are the CTA shapes in GemmKernel and ReductionKernel different for split-K? And why can't it be one fused kernel instead of launching two separate ones?
  3. How exactly does parallel split-K reduction work if kPartitionsPerStage=4 and split_k_slices=16, for example?

@hwu36
Collaborator

hwu36 commented Mar 26, 2024

Partial products are in global memory (the workspace) no matter whether it is serial or parallel split-K. Serial split-K does not need a separate reduction kernel, but parallel split-K does.

If split_k_slices=16, there will be 16 partitions. kPartitionsPerStage=4 means we reduce 4 partitions per stage.
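Putting that together, here is a hypothetical sketch of what the separate reduction pass could look like for split_k_slices = 16 and kPartitionsPerStage = 4 (made-up kernel and names, not CUTLASS's actual ReductionKernel): the GEMM kernel has already written 16 partial copies of the output into a gmem workspace of roughly split_k_slices * M * N * sizeof(accumulator) bytes, and the reduction walks them 4 at a time.

```cuda
// Hypothetical sketch of the parallel split-K reduction pass. The GEMM kernel
// has already written `split_k_slices` partial outputs into a workspace laid
// out as [slice][M * N]; this kernel folds them into D, kPartitionsPerStage
// partials per stage (4 stages when split_k_slices = 16).
#include <cuda_runtime.h>

constexpr int kPartitionsPerStage = 4;

__global__ void reduce_split_k(
    float*       D,           // final M x N output
    const float* workspace,   // split_k_slices partial outputs, each M * N
    int          split_k_slices,
    long long    elems)       // M * N
{
  long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  if (idx >= elems) return;

  float acc = 0.0f;
  for (int base = 0; base < split_k_slices; base += kPartitionsPerStage) {
    // Accumulate one stage of (up to) kPartitionsPerStage partials.
    for (int p = 0; p < kPartitionsPerStage && base + p < split_k_slices; ++p) {
      acc += workspace[(long long)(base + p) * elems + idx];
    }
  }
  D[idx] = acc;   // a real epilogue (alpha/beta scaling, etc.) would go here
}
```

Because this pass is just an elementwise sum over M*N positions, its launch shape has nothing to do with the GEMM's CTA tile, which is also why the two kernels can use different CTA shapes. The workspace traffic it consumes is exactly the extra DRAM traffic asked about at the top of the thread.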

@Rya-Sanovar
Author

@thakkarV In the streamK paper, they've compared “two-tile Stream-K + data-parallel" against data parallel CUTLASS, but I was wondering what the performance difference (in terms of both latency and occupancy) is for:

  1. “two-tile Stream-K + data-parallel" vs basic streamK
  2. “two-tile Stream-K + data-parallel" vs parallel split-K CUTLASS
  3. Basic streamK vs parallel split-K CUTLASS.

In short, how much do the suboptimal cache hits in basic Stream-K affect performance, and how does Stream-K compare to split-K?

I understand this must be contingent on problem sizes and hardware, but what's the general consensus?

@cloudhan

I don't think there will be a "general consensus" on which one is better. The variants might just be trial-and-error results. What they don't mention (explicitly) in the paper is that they are moving from "data parallel" to "task parallel", and the "tasking" causes some kind of contention that they want to amortize.

So the covert plot behind the scenes might be (educated guess):

  1. GPUs are getting more powerful and the number of SMs keeps going up, but the problem sizes do not change because of old algorithms.
    • Thus the "quantization inefficiency" in the tail part of a DP GEMM, or the reduction overhead of a split-K GEMM, is getting larger and larger.
  2. If m,n are fixed, the compute and bandwidth requirements are basically O(k). So they decide to assign as large a k as possible (instead of a fixed one) and assign the task to specific CTAs.
    • This means fewer CTAs being launched, at best O(num_SMs).
    • This means more of the reduction along the k-axis can happen in registers or shared memory, amortizing the split-K reduction overhead.
  3. They still need to load-balance the SMs (the hardware resource).
    • The k is a little bit larger to achieve the best occupancy.
    • Sometimes a "task" needs a split along k (somewhere in between), thus a global reduction.
  4. Those split parts again cause reduction overhead.
    • They are doing everything in a single kernel, so in-place reduction relies on some kind of atomics / memory consistency and coherency.
  5. They find the hardware guys are lazy xD
    • The L2 cache is in (two) partitions in their cash cows (A100, H100, maybe B?00).
    • For different combinations of problem size and hardware, naive Stream-K might perform worse due to L2 contention.
      • Some problems require excessive global reduction, say when M×N is small (basically degenerating to split-K?).
  6. The partitioned L2 is irritating.
    • DP + one-tile SK and two-tile SK + DP are developed to cover some of these cases.
      • They might just want to distribute the synchronization points across the program execution trace to amortize the latency.

I believe the partitioned far+near L2 cache plays an important role here. They don't mention it in the paper because the GPU in the paper is "hypothetical".

