- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
- Exposure of L2 `cache_hint`s in TMA copy atoms.
- Exposure of raster order and tile swizzle extent in CUTLASS library profiler, and example 48.
- TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
- A new `GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-K for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
- CUDA host adapter extensions to support TMA descriptor construction driver APIs.
- Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
- A guide for setting up VSCode to work well with CUTLASS, and an expanded code style guide.
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
- NOTICE:
  - Upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered a beta API.
  - Upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
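To illustrate the serial split-K scheme mentioned for the new `GemmSparseUniversal` API, here is a minimal CPU sketch of the underlying math, not the CUTLASS device implementation: the K dimension is partitioned into slices, each slice produces a partial GEMM, and the partials are reduced into the output. The function name and layout choices are hypothetical and for illustration only.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical reference routine modeling serial split-K GEMM semantics:
// C[m][n] = sum over splits of (partial dot product over that split's K-slice).
// A is M x K row-major, B is K x N row-major.
std::vector<float> split_k_gemm(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int M, int N, int K, int splits) {
    std::vector<float> C(M * N, 0.0f);
    int k_per_split = (K + splits - 1) / splits;   // ceil-divide K across workers
    for (int s = 0; s < splits; ++s) {             // serial split-K: one pass per slice
        int k_begin = s * k_per_split;
        int k_end = std::min(K, k_begin + k_per_split);
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                float partial = 0.0f;              // partial result for this K-slice
                for (int k = k_begin; k < k_end; ++k)
                    partial += A[m * K + k] * B[k * N + n];
                C[m * N + n] += partial;           // reduction across splits
            }
        }
    }
    return C;
}
```

Parallel split-K follows the same decomposition but computes each slice's partials concurrently and performs the reduction in a separate step.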
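The residual add (beta != 0) support in the convolution kernels follows the standard epilogue blend `D = alpha * accumulator + beta * C`, where a nonzero beta folds a pre-existing source tensor C (e.g. a skip connection) into the output. A minimal CPU sketch of these semantics, with a hypothetical function name, assuming elementwise application over flattened tensors:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference epilogue: D = alpha * acc + beta * C, elementwise.
// `acc` holds the convolution accumulators; `C` is the residual source tensor.
std::vector<float> epilogue_residual_add(const std::vector<float>& acc,
                                         const std::vector<float>& C,
                                         float alpha, float beta) {
    std::vector<float> D(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i)
        D[i] = alpha * acc[i] + beta * C[i];   // beta != 0 enables the residual add
    return D;
}
```

With beta == 0 the source tensor is ignored and the epilogue reduces to a plain alpha scaling of the accumulators.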