[QST] How are DRAM accesses optimized during Split K - reduction across threadblocks? #1406
Comments
Same as before - we optimize the CTA rasterization and work tile mapping such that they hit in L2 as much as possible. This applies to the partial reduction as well.
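For intuition, here is a minimal sketch of what a grouped rasterization looks like (a hypothetical helper, not CUTLASS's actual `ThreadblockSwizzle`): instead of walking output tiles in plain row-major order, linear CTA indices are remapped so that CTAs resident at the same time compute neighboring tiles of C and therefore re-read the same A/B tiles through L2.

```cpp
// Hypothetical helper illustrating grouped tile rasterization (not CUTLASS's
// actual ThreadblockSwizzle): consecutive linear CTA ids are packed into
// narrow groups of output-tile rows, so CTAs scheduled together read
// overlapping A/B tiles and are more likely to hit in L2.
#include <algorithm>
#include <utility>

std::pair<int, int> tile_coord(int linear_id, int tiles_m, int tiles_n,
                               int group_m = 8) {
  int group_size    = group_m * tiles_n;                    // CTAs per row-group
  int group         = linear_id / group_size;
  int within        = linear_id % group_size;
  int rows_in_group = std::min(group_m, tiles_m - group * group_m);
  int tile_m = group * group_m + within % rows_in_group;    // walk down the group first
  int tile_n = within / rows_in_group;                      // then move across columns
  return {tile_m, tile_n};  // (row, column) of the C tile this CTA computes
}
```

The same idea presumably carries over to the partials: if the CTAs producing a tile's slices and the reduction work for that tile are mapped close together in time, the partials tend to still be resident in L2.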
Okay, doing this maximizes GPU occupancy, but what about latency? Isn't there a tradeoff here? Or, by optimizing the L2 hit rate, does split-K reduction achieve almost the same inference latency as without it?
I'm not sure what you mean by this being a tradeoff between occupancy and latency. Latency of what? Individual gmem accesses or the time to completion of the kernel? Split-K increases utilization. Therefore the wallclock time of the kernel goes down. Regardless of gmem latencies for individual tiles, this implies that the end-to-end latency of the kernel decreases.
By latency I mean the total time it takes to compute the C matrix when using split-K reduction, so yes, time to completion of the kernel. So if the wallclock time goes down, that means the additional parallelism that split-K achieves compensates for the extra gmem accesses, right? Does this hold for all matrix sizes M, N, K? That is, can there be cases where using split-K actually slows down the time it takes to compute C?
Yes. Split-K nets you speedups when M and N are small and K is large. If the outer dims are big, then using split-K will worsen perf, especially if the contraction dim is small.
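To make that concrete, here is a rough CTA-count estimate (illustrative tile size and SM count, not taken from any particular CUTLASS config):

```cpp
// Back-of-envelope utilization estimate for split-K (illustrative numbers only).
#include <cstdio>

int main() {
  const int num_sms = 108;               // assumed SM count, e.g. an A100-class GPU
  const int tile_m = 128, tile_n = 128;  // assumed CTA output tile

  // Small M,N with a large K: the plain grid cannot fill the machine.
  int m = 256, n = 256;                  // K is large, e.g. 16384
  int ctas = (m / tile_m) * (n / tile_n);
  std::printf("no split-K : %3d CTAs for %d SMs\n", ctas, num_sms);           // 4 CTAs

  int splits = 16;
  std::printf("split-K=16 : %3d CTAs for %d SMs\n", ctas * splits, num_sms);  // 64 CTAs

  // Large M,N: the grid already saturates the GPU, so splitting K only adds
  // workspace traffic and a reduction pass.
  m = 8192; n = 8192;
  std::printf("large M,N  : %d CTAs already oversubscribe the GPU\n",
              (m / tile_m) * (n / tile_n));                                   // 4096 CTAs
  return 0;
}
```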
Thanks! I can't find it in the README; can you link it here, please?
Thanks. Also, why is split-K's work split into two kernels, GemmKernel and ReductionKernel? Wouldn't launching it as one fused kernel be less costly?
Parallel split-K uses a separate reduction kernel; serial split-K fuses the reduction inside the GEMM kernel. The former works better when the split slices are big.
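For intuition, a hedged CUDA sketch of the part that differs between the two schemes (illustrative only; the actual CUTLASS ReductionKernel is tiled and vectorized):

```cuda
// Illustrative CUDA sketch, not the actual CUTLASS ReductionKernel.
// Parallel split-K: the GEMM kernel writes one partial C tile per slice into a
// global-memory workspace; this second kernel then sums the slices into C.
__global__ void splitk_reduce(float const* workspace,  // [num_slices][m * n] partials
                              float* C, int m, int n, int num_slices) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= m * n) return;
  float acc = 0.f;
  for (int s = 0; s < num_slices; ++s)
    acc += workspace[static_cast<size_t>(s) * m * n + idx];
  C[idx] = acc;
}
// Serial split-K keeps this accumulation inside the GEMM kernel instead: CTAs
// that own the same output tile take turns (via a per-tile semaphore) adding
// their partial accumulators, so no second launch is needed.
```

With large slices the extra kernel launch amortizes well, which is consistent with the comment above that parallel split-K works better when the split slices are big.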
I see. I have a few questions on how parallel split-K works:
Partial products are stored in global memory (the workspace) whether it is serial or parallel split-K. Serial split-K does not need a separate reduction kernel, but parallel split-K needs one.
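For a sense of scale, the workspace holding those partials is roughly `splits × M × N` elements of the accumulator type (a sketch under that assumption; the exact CUTLASS layout may differ):

```cpp
// Rough size of the parallel split-K workspace (illustrative; assumes fp32
// accumulators and a dense splits x M x N layout of partial tiles).
#include <cstddef>
#include <cstdio>

int main() {
  const int m = 256, n = 256, splits = 16;
  std::size_t bytes = static_cast<std::size_t>(splits) * m * n * sizeof(float);
  std::printf("workspace: %zu bytes (%.2f MiB)\n",
              bytes, bytes / (1024.0 * 1024.0));   // 4.00 MiB for this problem
  return 0;
}
```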
@thakkarV In the Stream-K paper, they've compared "two-tile Stream-K + data-parallel" against data-parallel CUTLASS, but I was wondering what the performance difference (in terms of both latency and occupancy) is for:
In short, how much do the suboptimal cache hits in basic Stream-K affect performance, and how does Stream-K compare to split-K? I understand this must be contingent on problem sizes and hardware, but what's the general consensus?
I don't think there will be a "general consensus" on which one is better. They might just be trial-and-error results. What they don't mention (explicitly) in the paper is that they are moving from "data parallel" to "task parallel", and the "tasking" causes some kind of contention that they want to amortize. So the covert plot behind the scenes (might be, educated guess):
I believe the partitioned far+near L2 cache plays an important role here. They don't mention it, as the GPU in the paper is "hypothetical".
Assuming partition=16 in a PartitionedK GEMM, during split-K reduction at the threadblock level:
All of the 16 tiles belonging to the same "row" of block tiles in the A matrix run on different SMs of the GPU. During the reduction phase, some of these tiles will need to be written to global memory so they can be loaded by the consumer tiles that perform the reduction.
Wouldn't this increase latency due to the DRAM BW bottleneck, since we now have extra gmem accesses compared to the un-partitioned case, even though we've achieved higher occupancy? How does CUTLASS optimize this?
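For reference, a back-of-envelope estimate of the extra C-side traffic this question is about (illustrative numbers; assumes fp32 accumulators and counts each partial tile as one write by the GEMM kernel plus one read by the reduction kernel):

```cpp
// Extra global-memory traffic introduced by parallel split-K on the C side
// (rough, illustrative estimate).
#include <cstdio>

int main() {
  const long long m = 256, n = 256, k = 16384, splits = 16;
  const long long elem = 4;  // bytes per element, assuming fp32 accumulators

  long long ab_bytes   = (m * k + k * n) * elem;      // A and B reads (ideal, no re-reads)
  long long c_baseline = m * n * elem;                // single write of C without split-K
  long long c_splitk   = splits * m * n * elem * 2    // write + read of each partial tile
                       + m * n * elem;                // final write of C

  std::printf("A+B traffic       : %lld MiB\n", ab_bytes / (1 << 20));    // 32 MiB
  std::printf("C without split-K : %lld KiB\n", c_baseline / (1 << 10));  // 256 KiB
  std::printf("C with split-K=16 : %lld KiB\n", c_splitk / (1 << 10));    // 8448 KiB
  return 0;
}
```

Even in this small-M,N case the extra C traffic is a fraction of the unavoidable A/B traffic, and, per the answer at the top of the thread, the rasterization tries to keep producer and consumer tiles close enough in time that much of it is served from L2 rather than DRAM.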