[QST] Split-K: Reduce in Shared Memory instead of Global Memory #1421
There's a third option that CUTLASS implements as well: a semaphore-based serial reduction across different CTAs, which doesn't require a separate reduction kernel or a round trip to gmem. (1) is easy to implement and can have great perf depending on the problem size. (2) is more difficult to implement and reduces the arithmetic intensity of the CTA-level GEMM, which is usually highly tuned for a given tile size. It can be done, but the benefits are limited to fewer scenarios.
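For concreteness, here is a minimal sketch of what such a semaphore-gated serial reduction could look like. This is not the actual CUTLASS code (which lives in its epilogue/semaphore machinery); the function name, the per-tile counter, and the accumulator-fragment layout are illustrative assumptions, and a production version would use proper acquire/release semantics rather than a bare spin loop.

```cuda
// Sketch of option (3): each output tile has one integer counter in gmem
// (initialized to 0), and the CTA handling K-slice `split_idx` accumulates
// into the output tile only after slices 0..split_idx-1 have signaled.
// All names and the fragment indexing are hypothetical.
__device__ void serial_splitk_epilogue(float* out_tile,    // this CTA's C tile in gmem
                                       int* semaphore,     // per-tile counter in gmem
                                       const float* acc,   // accumulator fragments (registers)
                                       int tile_elems, int split_idx, int split_count) {
    // One thread spins until it is this split's turn; the rest wait at the barrier.
    if (threadIdx.x == 0)
        while (atomicAdd(semaphore, 0) != split_idx) { /* spin */ }
    __syncthreads();

    int frags = tile_elems / blockDim.x;  // assumes tile_elems % blockDim.x == 0
    for (int j = 0; j < frags; ++j) {
        int i = j * blockDim.x + threadIdx.x;
        float v = acc[j];
        if (split_idx > 0) v += out_tile[i];  // fold in earlier splits' partials
        out_tile[i] = v;                      // serial accumulation into C itself
    }
    __syncthreads();

    // Publish the stores device-wide, then let the next split proceed.
    if (threadIdx.x == 0) {
        __threadfence();
        atomicAdd(semaphore, 1);
    }
}
```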
Thanks for the very informative response! For the 3rd option, do you mind sharing a pointer to its code?
Thanks for the pointer! In my case, we are mostly implementing things ourselves. Do you have any thoughts on the performance and implementability of the 3rd option versus the first two options (especially the first option)?
Implementability-wise, it's between the two in terms of difficulty and complexity. Performance-wise, it depends (on arch, problem size, your kernel schedule, pipelining strategy, fusions, etc.).
Understood, thanks again for the super helpful answers!
I have a (somewhat naive/newbie) question about the Split-K implementation.
In CUTLASS, the Split-K kernel splits the K dimension so that multiple thread blocks compute (in parallel) the partial results of one output tile. These partial results are written to global memory and then reduced by a separate reduction kernel launch. Hence, this implementation introduces a round trip of the partial results between off-chip and on-chip memory, as well as the overhead of one extra kernel launch.
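To make the data flow concrete, a naive sketch of this scheme might look like the following. This is illustrative only, not the CUTLASS kernels; it assumes row-major fp32 matrices, K divisible by split_count, and a gmem workspace sized split_count * M * N.

```cuda
// Method (1), split-K parallel: blockIdx.z picks a K-slice; each slice writes
// its partial C to a gmem workspace, and a second kernel reduces the slices.
__global__ void splitk_partial_gemm(const float* A, const float* B,
                                    float* workspace,  // [split_count][M*N]
                                    int M, int N, int K, int split_count) {
    int split   = blockIdx.z;
    int k_begin =  split      * (K / split_count);
    int k_end   = (split + 1) * (K / split_count);
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.f;
    for (int k = k_begin; k < k_end; ++k)
        acc += A[row * K + k] * B[k * N + col];

    // The round trip starts here: partials go out to global memory.
    workspace[split * M * N + row * N + col] = acc;
}

__global__ void splitk_reduce(const float* workspace, float* C,
                              int M, int N, int split_count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * N) return;
    float sum = 0.f;
    for (int s = 0; s < split_count; ++s)  // read the partials back from gmem
        sum += workspace[s * M * N + idx];
    C[idx] = sum;
}
```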
Here is a potential alternative implementation: split the K dimension so that multiple threads within a single thread block compute (in parallel) the partial results of one output tile. These partial results are written to and reduced through shared memory before the result is written back to global memory.
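A naive sketch of this alternative, again illustrative rather than CUTLASS code: it assumes blockDim.y (the number of K-slices) is a power of two dividing K, and a launch with dynamic shared memory of blockDim.x * blockDim.y floats, e.g. `splitk_smem_gemm<<<grid, dim3(128, 4), 4 * 128 * sizeof(float)>>>(...)`.

```cuda
// Method (2): split K across the y-dimension of one thread block and reduce
// the partials through shared memory, so C is written exactly once and no
// gmem workspace or second kernel is needed. One output element per x-thread
// for clarity.
__global__ void splitk_smem_gemm(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    extern __shared__ float partials[];  // [blockDim.y][blockDim.x]
    int out = blockIdx.x * blockDim.x + threadIdx.x;  // flattened index into C
    int row = out / N, col = out % N;
    int slice   = threadIdx.y;
    int k_begin =  slice      * (K / blockDim.y);
    int k_end   = (slice + 1) * (K / blockDim.y);

    float acc = 0.f;
    if (out < M * N)
        for (int k = k_begin; k < k_end; ++k)
            acc += A[row * K + k] * B[k * N + col];
    partials[slice * blockDim.x + threadIdx.x] = acc;
    __syncthreads();

    // Tree-reduce the K-slices in shared memory; slice 0 ends up with the sum.
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (slice < s)
            partials[slice * blockDim.x + threadIdx.x] +=
                partials[(slice + s) * blockDim.x + threadIdx.x];
        __syncthreads();
    }
    if (slice == 0 && out < M * N)
        C[out] = partials[threadIdx.x];  // single write to gmem, no workspace
}
```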
Compared to method (1), method (2) saves the round trip between off-chip and on-chip memory and the separate kernel launch. Is there any reason why (1) is (overwhelmingly?) preferred over (2)?
This is (loosely) a follow-up question to #1391, which is more like method (2).
Thanks in advance for your help!