-
Hi, I'm studying the xor-based permuted shared memory layout described in the slides https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf, and I want to know where the actual implementation lives in the CUTLASS code base. For example, page 45 of the slides visually illustrates a complicated mapping from a thread ID to the shared memory location it loads from, but the actual indexing math is not provided. I tried to find the corresponding implementation in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm80.h, but I'm a bit confused because the layout implementations there don't seem to talk about thread IDs, and the use of the xor function looks less obvious than what is described in https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts. I also want to understand the terminology "Congruous" and "Crosswise" used in the code. For example, is the layout above "Crosswise"?
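To make the question concrete, here is a minimal standalone sketch of how I understand the xor trick from the slides (pages 39 and 45). This is just my own illustration, not the CUTLASS code; the "8 vectors of 8 halfs per 128-byte row" shape is my assumption for the fp16 case.

```cpp
// Standalone sketch of the xor swizzle idea from the slides.
// Assumption: an fp16 tile where each 128-byte shared-memory row holds
// 8 vectors of 8 halfs (16 bytes each), i.e. one row spans all 32 banks.

#include <cstdio>

// Logical position: `row` is the strided index, `vec_col` is which 16-byte
// vector inside the 128-byte row (0..7).
int swizzled_offset_in_vectors(int row, int vec_col) {
  // Permute the vector column by xor-ing with the low bits of the row, so
  // the 8 vectors of a logical column land in 8 different bank groups and
  // column accesses avoid bank conflicts.
  int permuted_col = vec_col ^ (row % 8);
  return row * 8 + permuted_col;  // offset measured in 16-byte vectors
}

int main() {
  // Print where each (row, vec_col) of an 8x8 vector tile lands.
  for (int row = 0; row < 8; ++row) {
    for (int col = 0; col < 8; ++col) {
      std::printf("%3d ", swizzled_offset_in_vectors(row, col));
    }
    std::printf("\n");
  }
  return 0;
}
```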
-
The layout in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm80.h is for fp64 tensor cores. The slides are about fp16 tensor cores, whose layout is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm75.h (the Turing and Ampere fp16 layouts are the same). https://github.com/NVIDIA/cutlass/tree/master/examples/03_visualize_layout can help you visualize it.
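If it helps, here is a small host-side sketch of how those layout classes are used: a layout maps a logical pitch-linear coordinate to a swizzled offset and never sees thread IDs (the iterators supply the per-thread coordinates). The template arguments (16-bit elements, crosswise 64) and the extent below are my assumptions for the fp16 case, so treat this as a sketch rather than the exact GEMM configuration; compile with nvcc and `-I<cutlass>/include`.

```cpp
// Sketch: instantiate one of the sm75 tensor-op layouts and print the
// permuted offsets it produces for a few logical coordinates.

#include <cstdio>

#include "cutlass/layout/pitch_linear.h"
#include "cutlass/layout/tensor_op_multiplicand_sm75.h"

int main() {
  // Assumed fp16 congruous layout parameters (element size 16 bits,
  // crosswise 64) -- check the default GEMM configurations for exact values.
  using Layout = cutlass::layout::TensorOpMultiplicandCongruous<16, 64>;

  // Extent of the shared-memory tile in (contiguous, strided) elements.
  cutlass::layout::PitchLinearCoord extent(64, 8);
  Layout layout = Layout::packed(extent);

  // Each logical coordinate is mapped to a permuted linear offset; no thread
  // IDs are involved at this level.
  for (int s = 0; s < 4; ++s) {
    for (int c = 0; c < 8; ++c) {
      long long offset = layout(cutlass::layout::PitchLinearCoord(c * 8, s));
      std::printf("%4lld ", offset);
    }
    std::printf("\n");
  }
  return 0;
}
```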
-
So, I see two distinct index maps defined in tensor_op_multiplicand_sm75.h: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/layout/tensor_op_multiplicand_sm75.h#L148-L202. Since I expect that storing to and loading from shmem require different index mappings (slides p39 and p45), I initially thought that the two maps above correspond to one being for storing and the other for loading. But it looks like both maps are used in … What am I missing?
-
One of them is a general map used by the int8 and fp16 tensor core GEMMs; the other is a special one used only by the TF32 NT GEMM.
Your expectation is correct.
-
The mapping of the thread to the data is in the iterators. The shared memory store iterator RegularTileAccessIterator is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h, and the shared memory load iterator MmaTensorOpMultiplicandTileIterator is in https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h. Check their constructors for the initial mapping. I recommend inserting many printfs if you need to dive into them. The iterators are very general and cover cases you don't care about, so you don't need to figure out every variable or template. https://github.com/NVIDIA…
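To illustrate the printf approach, here is a self-contained toy kernel in the same spirit: it prints one warp's lane-to-offset mapping, using the simple xor swizzle from earlier in the thread as a stand-in for whatever the real iterator constructors compute. Inside CUTLASS you would put a similarly guarded printf directly into those constructors.

```cuda
// Toy kernel that dumps a per-lane initial mapping for one warp.
// The mapping itself is only a stand-in (xor swizzle over 16-byte vectors),
// not the actual constructor math of the CUTLASS iterators.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dump_initial_mapping() {
  // Keep the output readable: only block 0, only the first warp.
  if (blockIdx.x != 0 || threadIdx.x >= 32) return;

  int lane = threadIdx.x;

  // Each lane owns one 16-byte vector; 32 lanes cover the first 4 rows of
  // an 8x8 vector tile laid out with the xor permutation.
  int row = lane / 8;
  int col = lane % 8;
  int vec_offset = row * 8 + (col ^ row);

  printf("lane %2d -> (row %d, col %d) -> vector offset %2d\n",
         lane, row, col, vec_offset);
}

int main() {
  dump_initial_mapping<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}
```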