I am trying to understand the implementation in include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp.
I am confused about a few points in this implementation. These are beginner questions, since I have only just started with CUTLASS.
Why do we need these prologue iterations before the mainloop? https://github.com/NVIDIA/cutlass/blob/637b15906358191cb4238af419d408a65819d7ec/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp#L452C5-L453C61
Why is the FP8 GEMM implemented with warp specialization? Does using TMA for data loading necessarily mean we need warp specialization?
I would appreciate any explanation or a pointer to relevant documentation.