-
Hi, I'm studying the xor-based permuted shared memory layout described in the slides https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf, and I want to know where the actual implementation lives in the CUTLASS code base. For example, page 45 of the slides visually illustrates a complicated mapping from a thread ID to the shared memory location it loads from, but the actual indexing math is not provided. I tried to find the corresponding implementation in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm80.h, but I'm a bit confused because the layout implementations there don't seem to talk about thread IDs, and the use of the xor function looks less obvious than what is described in https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts. I also want to understand the terminology "Congruous" and "Crosswise" used in the code. For example, is the layout above "Crosswise"?
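To make the question concrete, here is a minimal standalone sketch of how I understand the xor trick from the slides (pages 39 and 45). This is just my own illustration, not the CUTLASS code; the "8 vectors of 8 halfs per 128-byte row" shape is my assumption for the fp16 case.

```cpp
// Standalone sketch of the xor swizzle idea from the slides.
// Assumption: an fp16 tile where each 128-byte shared-memory row holds
// 8 vectors of 8 halfs (16 bytes each), i.e. one row spans all 32 banks.

#include <cstdio>

// Logical position: `row` is the strided index, `vec_col` is which 16-byte
// vector inside the 128-byte row (0..7).
int swizzled_offset_in_vectors(int row, int vec_col) {
  // Permute the vector column by xor-ing with the low bits of the row, so
  // the 8 vectors of a logical column land in 8 different bank groups and
  // column accesses avoid bank conflicts.
  int permuted_col = vec_col ^ (row % 8);
  return row * 8 + permuted_col;  // offset measured in 16-byte vectors
}

int main() {
  // Print where each (row, vec_col) of an 8x8 vector tile lands.
  for (int row = 0; row < 8; ++row) {
    for (int col = 0; col < 8; ++col) {
      std::printf("%3d ", swizzled_offset_in_vectors(row, col));
    }
    std::printf("\n");
  }
  return 0;
}
```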
-
The layout in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm80.h is for fp64 tensor cores. The slides are about fp16 tensor cores, whose layout is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm75.h (the Turing and Ampere fp16 layouts are the same). https://github.com/NVIDIA/cutlass/tree/master/examples/03_visualize_layout can help you visualize it.
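If it helps, here is a small host-side sketch of how those layout classes are used: a layout maps a logical pitch-linear coordinate to a swizzled offset and never sees thread IDs (the iterators supply the per-thread coordinates). The template arguments (16-bit elements, crosswise 64) and the extent below are my assumptions for the fp16 case, so treat this as a sketch rather than the exact GEMM configuration; compile with nvcc and `-I<cutlass>/include`.

```cpp
// Sketch: instantiate one of the sm75 tensor-op layouts and print the
// permuted offsets it produces for a few logical coordinates.

#include <cstdio>

#include "cutlass/layout/pitch_linear.h"
#include "cutlass/layout/tensor_op_multiplicand_sm75.h"

int main() {
  // Assumed fp16 congruous layout parameters (element size 16 bits,
  // crosswise 64) -- check the default GEMM configurations for exact values.
  using Layout = cutlass::layout::TensorOpMultiplicandCongruous<16, 64>;

  // Extent of the shared-memory tile in (contiguous, strided) elements.
  cutlass::layout::PitchLinearCoord extent(64, 8);
  Layout layout = Layout::packed(extent);

  // Each logical coordinate is mapped to a permuted linear offset; no thread
  // IDs are involved at this level.
  for (int s = 0; s < 4; ++s) {
    for (int c = 0; c < 8; ++c) {
      long long offset = layout(cutlass::layout::PitchLinearCoord(c * 8, s));
      std::printf("%4lld ", offset);
    }
    std::printf("\n");
  }
  return 0;
}
```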
-
So, I see two distinct index maps defined in tensor_op_multiplicand_sm75.h: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm75.h#L148-L202. Since I expect that storing to and loading from shmem require different index mappings (slides p39 and p45), I initially thought that the two maps above correspond to one being for storing and the other for loading. But it looks like both maps are used in … What am I missing?
-
One of them is a general map used by the int8 and fp16 tensor core GEMMs; the other is a special one used only by the TF32 NT GEMM.
Your expectation is correct.
-
The mapping of the thread to the data is in the iterators. The shared memory store iterator RegularTileAccessIterator is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h, and the shared memory load iterator MmaTensorOpMultiplicandTileIterator is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h. Check their constructors for the initial mapping. I recommend inserting many printfs if you need to dive into them. The iterators are very general and cover cases you don't care about, so you don't need to figure out every variable or template. https://github.com/NVIDIA…
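To illustrate the printf approach, here is a self-contained toy kernel in the same spirit: it prints one warp's lane-to-offset mapping, using the simple xor swizzle from earlier in the thread as a stand-in for whatever the real iterator constructors compute. Inside CUTLASS you would put a similarly guarded printf directly into those constructors.

```cuda
// Toy kernel that dumps a per-lane initial mapping for one warp.
// The mapping itself is only a stand-in (xor swizzle over 16-byte vectors),
// not the actual constructor math of the CUTLASS iterators.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dump_initial_mapping() {
  // Keep the output readable: only block 0, only the first warp.
  if (blockIdx.x != 0 || threadIdx.x >= 32) return;

  int lane = threadIdx.x;

  // Each lane owns one 16-byte vector; 32 lanes cover the first 4 rows of
  // an 8x8 vector tile laid out with the xor permutation.
  int row = lane / 8;
  int col = lane % 8;
  int vec_offset = row * 8 + (col ^ row);

  printf("lane %2d -> (row %d, col %d) -> vector offset %2d\n",
         lane, row, col, vec_offset);
}

int main() {
  dump_initial_mapping<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}
```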