
[QST] What is the canonical method to remap the coordinate in CuTe? #1277

Closed
cloudhan opened this issue Dec 22, 2023 · 6 comments

cloudhan commented Dec 22, 2023

Say I want to tile something: thread tiling, warp tiling, and maybe CTA tiling for L2, you name it. Then it comes to coordinate remapping. The only way I can come up with (see the code below) is to map to a linear index and then unmap from it.

#include <iostream>

#include <cute/tensor.hpp>
#include <cute/layout.hpp>

using namespace cute;

int main() {
  auto inner = make_layout(make_shape(_2{}, _3{}));
  auto tiler = make_layout(make_shape(3, 4));
  auto tiled = blocked_product(inner, tiler);   // (_x, _y) -> linear_idx
  auto naive = make_layout(make_shape(6, 12));  // (.x, .y) -> linear_idx, naively linearize blockIdx

  print_layout(naive);
  // c'mon, you don't need a print for this...

  print_layout(tiled);
  // ((_2,3),(_3,4)):((_1,_6),(_2,18))
  //        0    1    2    3    4    5    6    7    8    9   10   11
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  0  |  0 |  2 |  4 | 18 | 20 | 22 | 36 | 38 | 40 | 54 | 56 | 58 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  1  |  1 |  3 |  5 | 19 | 21 | 23 | 37 | 39 | 41 | 55 | 57 | 59 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  2  |  6 |  8 | 10 | 24 | 26 | 28 | 42 | 44 | 46 | 60 | 62 | 64 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  3  |  7 |  9 | 11 | 25 | 27 | 29 | 43 | 45 | 47 | 61 | 63 | 65 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  4  | 12 | 14 | 16 | 30 | 32 | 34 | 48 | 50 | 52 | 66 | 68 | 70 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  5  | 13 | 15 | 17 | 31 | 33 | 35 | 49 | 51 | 53 | 67 | 69 | 71 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+

  dim3 blockIdx;
  for (blockIdx.y = 0; blockIdx.y < size<1>(naive); blockIdx.y++) {
    for (blockIdx.x = 0; blockIdx.x < size<0>(naive); blockIdx.x++) {
      auto linear_idx = naive(blockIdx.x, blockIdx.y);  // map to linear index
      // auto [blockIdx_x_tuple, blockIdx_y_tuple] = tiled[linear_idx];
      auto [blockIdx_x, blockIdx_xx, blockIdx_y, blockIdx_yy] = flatten(tiled)[linear_idx];  // unmap from linear index
      blockIdx_x += blockIdx_xx * inner.shape<0>();
      blockIdx_y += blockIdx_yy * inner.shape<1>();
      std::cout << linear_idx << "\t(.x,.y)=(" << blockIdx.x << "," << blockIdx.y << ")\t(_x,_y)=(" << blockIdx_x << "," << blockIdx_y << ")\n";
      // 0       (.x,.y)=(0,0)   (_x,_y)=(0,0)
      // 1       (.x,.y)=(1,0)   (_x,_y)=(1,0)
      // 2       (.x,.y)=(2,0)   (_x,_y)=(0,1)
      // 3       (.x,.y)=(3,0)   (_x,_y)=(1,1)
      // 4       (.x,.y)=(4,0)   (_x,_y)=(0,2)
      // 5       (.x,.y)=(5,0)   (_x,_y)=(1,2)
      // 6       (.x,.y)=(0,1)   (_x,_y)=(2,0)
      // 7       (.x,.y)=(1,1)   (_x,_y)=(3,0)
      // 8       (.x,.y)=(2,1)   (_x,_y)=(2,1)
      // 9       (.x,.y)=(3,1)   (_x,_y)=(3,1)
      // 10      (.x,.y)=(4,1)   (_x,_y)=(2,2)
      // 11      (.x,.y)=(5,1)   (_x,_y)=(3,2)
      // 12      (.x,.y)=(0,2)   (_x,_y)=(4,0)
      // 13      (.x,.y)=(1,2)   (_x,_y)=(5,0)
      // ...
    }
  }

  return 0;
}

It works, but it seems very fragile because of the hierarchical coordinates involved, and sometimes I may want a 1-D coord for each mode instead. Is there a canonical way to achieve this?

cloudhan (Author) commented:

@ccecka Any suggestion?


ccecka commented Dec 22, 2023

I'm not sure what exactly you're trying to accomplish here. We don't have a need for "coordinate remapping" outside of predication applications:
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0y_predication.md
The canonical philosophy in CuTe is to always retain "logical consistency", so that the only thing we have to care about is accessing coordinates consistently in all of our tensors, which of course can map to any index/offset. In your application above, you appear to care more about consistency of the codomain rather than the domain. You don't care about the coordinates, and somehow want to transform between multiple different coordinates to access the same "linear idx". This is not our approach and I would recommend thinking about what it would mean to work with the inverses of the layouts you're describing above.

That said, the mapper you're interested in can be written as

  auto inner = make_layout(make_shape(_2{}, _3{}));
  auto tiler = make_layout(make_shape(_3{}, _4{}));
  auto tiled = blocked_product(inner, tiler);         // (_x, _y) -> linear_idx
  auto naive = make_layout(make_shape(_6{}, _12{}));  // (.x, .y) -> linear_idx, naively linearize blockIdx
  auto coord = make_identity_layout(shape(naive));    // (.x, .y) -> (.x, .y)

  auto cmap  = coord.compose(right_inverse(tiled)).compose(naive);  // (.x, .y) -> linear_idx -> (_x, _y)
  std::cout << cmap << std::endl;
  for (int c = 0; c < size<1>(cmap); ++c) {
    for (int r = 0; r < size<0>(cmap); ++r) {
      std::cout << naive(r,c) << "\t(.x,.y)=(" << r << "," << c << ")\t(_x,_y)=" << cmap(r,c) << std::endl;
      // 0       (.x,.y)=(0,0)   (_x,_y)=(0,0)
      // 1       (.x,.y)=(1,0)   (_x,_y)=(1,0)
      // 2       (.x,.y)=(2,0)   (_x,_y)=(0,1)
      // 3       (.x,.y)=(3,0)   (_x,_y)=(1,1)
      // 4       (.x,.y)=(4,0)   (_x,_y)=(0,2)
      // ...
    }
  }

where you may need to reference
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0z_tma_tensors.md
(which is going to be updated in a few days) to explain the fancy strides.
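As a quick, untested aside on those "fancy strides" (my own check, not taken from the docs above): make_identity_layout builds a layout whose strides are basis elements, so applying it returns a coordinate rather than a single integer offset.

  auto id = make_identity_layout(make_shape(_6{}, _12{}));  // strides are basis elements, not ints
  print(id);        // prints something like (_6,_12):(_1@0,_1@1)
  print(id(3, 5));  // should print the coordinate (3,5) instead of one linear offset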


ccecka commented Dec 22, 2023

In the case that you're actually interested in transforming blockIdx coordinates for L2 cache efficiency specifically, this is solved in CUTLASS 3.x with the following:

Classic blockIdx swizzling and reordering:
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp

"Stream-K" tile scheduling
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp
https://dl.acm.org/doi/pdf/10.1145/3572848.3577479

These typically don't use CuTe-Layout transforms for a few technical reasons -- the dynamic shapes often cause lots of divmods that need to be carefully optimized, CuTe's admissibility of composition and inverse is weaker when dynamic shapes are involved, etc.
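For intuition only, the kind of remapping those schedulers perform can be sketched with plain divmods. This is a simplified grouped rasterization, not the CUTLASS scheduler itself; grid_m, grid_n, and group_m are made-up names:

  // Launch a 1-D grid of grid_m * grid_n tiles and reorder the linear block index so that
  // runs of group_m consecutive M-tiles revisit the same N-tiles, improving L2 reuse.
  inline void remap_block(int linear, int grid_m, int grid_n, int group_m,
                          int& tile_m, int& tile_n) {
    int group_size = group_m * grid_n;             // tiles per group of M-rows
    int group      = linear / group_size;          // which group this block falls into
    int in_group   = linear % group_size;
    int rows       = grid_m - group * group_m;     // rows left; the last group may be short
    if (rows > group_m) rows = group_m;
    tile_m = group * group_m + in_group % rows;
    tile_n = in_group / rows;
  }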

cloudhan (Author) commented:


Thanks, I actually came up with the coord layout, but used auto coord = make_identity_layout(shape(tiled));, then had to recover the coord hierarchically again 😅, and then discarded the idea immediately.
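For the record, the discarded variant looked roughly like this (untested sketch):

  auto coord_hier = make_identity_layout(shape(tiled));  // shape ((_2,3),(_3,4)) with basis strides
  // Applying it returns the coordinate unchanged, but in hierarchical ((x,xx),(y,yy)) form,
  // so after composing it the same way as above I would still need x + xx*2 and y + yy*3
  // to get back a flat (_x,_y); hence I dropped the idea.
  print(coord_hier(make_coord(make_coord(1,2), make_coord(0,3))));  // should echo ((1,2),(0,3))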

cloudhan (Author) commented:

> The canonical philosophy in CuTe is to always retain "logical consistency", so that the only thing we have to care about is accessing coordinates consistently in all of our tensors

Sometimes it is not that easy, as the tensors don't have the same modes or involve an indirect mapping (just like paged attention). In that case you want to recover the remapped coords, manually slice out the portion of the data that you (the thread or some other logical unit) care about, and restart from there.
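To make "slice the portion and restart from there" concrete, a rough CuTe sketch (untested; the 6x12 shape, the 2x3 tile, and the recovered coordinate (bx, by) are all made-up placeholders):

  auto gA = make_tensor<float>(make_shape(_6{}, _12{}));   // an owning 6x12 column-major tensor
  for (int i = 0; i < size(gA); ++i) { gA(i) = i; }        // fill with linear indices for inspection
  int bx = 1, by = 2;                                      // a recovered (_x,_y) tile coordinate
  auto blk = local_tile(gA, make_shape(_2{}, _3{}), make_coord(bx, by));  // this unit's 2x3 slice
  print_tensor(blk);                                       // work restarts from just this tile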


cloudhan commented Dec 27, 2023

To expand on the "indirect mapping": CuTe somehow reminds me of the Taichi [1] paper. In Taichi, an array can be hierarchical while the indexing stays flattened (very much like CuTe's hierarchical coord); you just index down to the data lying at the very bottom. For example, it can create an array of pointers to arrays, where the pointed-to arrays are arrays of vec3s, and you can index through the hierarchy and access a vec3 directly. It would be very interesting to see CuTe support this kind of "pointer delimited" tensor.
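To illustrate the shape of data I mean, here is a plain C++ sketch (nothing CuTe-specific, all names made up):

  #include <vector>

  struct vec3 { float x, y, z; };

  int main() {
    // An outer array of pointers to arrays: each inner std::vector owns a dense run of vec3s.
    std::vector<std::vector<vec3>> pages(4, std::vector<vec3>(128));
    vec3 v = pages[2][17];  // index straight through the hierarchy down to a single vec3
    return 0;
  }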

[1] Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Trans. Graph. 38, 6, Article 201 (December 2019), 16 pages. https://doi.org/10.1145/3355089.3356506
