
[QST] What is the canonical method to remap the coordinate in CuTe? #1277

Closed
cloudhan opened this issue Dec 22, 2023 · 6 comments

cloudhan commented Dec 22, 2023

Say I want to tile something: thread tiling, warp tiling, and maybe CTA tiling for L2, you name it. Then it comes to coordinate remapping. The only way I can come up with (see the code below) is to map to a linear index and then unmap from it.

#include <iostream>

#include <cute/tensor.hpp>
#include <cute/layout.hpp>

using namespace cute;

int main() {
  auto inner = make_layout(make_shape(_2{}, _3{}));
  auto tiler = make_layout(make_shape(3, 4));
  auto tiled = blocked_product(inner, tiler);   // (_x, _y) -> linear_idx
  auto naive = make_layout(make_shape(6, 12));  // (.x, .y) -> linear_idx, naively linearize blockIdx

  print_layout(naive);
  // c'mon, you don't need a print for this...

  print_layout(tiled);
  // ((_2,3),(_3,4)):((_1,_6),(_2,18))
  //        0    1    2    3    4    5    6    7    8    9   10   11
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  0  |  0 |  2 |  4 | 18 | 20 | 22 | 36 | 38 | 40 | 54 | 56 | 58 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  1  |  1 |  3 |  5 | 19 | 21 | 23 | 37 | 39 | 41 | 55 | 57 | 59 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  2  |  6 |  8 | 10 | 24 | 26 | 28 | 42 | 44 | 46 | 60 | 62 | 64 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  3  |  7 |  9 | 11 | 25 | 27 | 29 | 43 | 45 | 47 | 61 | 63 | 65 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  4  | 12 | 14 | 16 | 30 | 32 | 34 | 48 | 50 | 52 | 66 | 68 | 70 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+
  //  5  | 13 | 15 | 17 | 31 | 33 | 35 | 49 | 51 | 53 | 67 | 69 | 71 |
  //     +----+----+----+----+----+----+----+----+----+----+----+----+

  dim3 blockIdx;
  for (blockIdx.y = 0; blockIdx.y < size<1>(naive); blockIdx.y++) {
    for (blockIdx.x = 0; blockIdx.x < size<0>(naive); blockIdx.x++) {
      auto linear_idx = naive(blockIdx.x, blockIdx.y);  // map to linear index
      // auto [blockIdx_x_tuple, blockIdx_y_tuple] = tiled[linear_idx];
      auto [blockIdx_x, blockIdx_xx, blockIdx_y, blockIdx_yy] = flatten(tiled)[linear_idx];  // unmap from linear index
      blockIdx_x += blockIdx_xx * inner.shape<0>();
      blockIdx_y += blockIdx_yy * inner.shape<1>();
      std::cout << linear_idx << "\t(.x,.y)=(" << blockIdx.x << "," << blockIdx.y << ")\t(_x,_y)=(" << blockIdx_x << "," << blockIdx_y << ")\n";
      // 0       (.x,.y)=(0,0)   (_x,_y)=(0,0)
      // 1       (.x,.y)=(1,0)   (_x,_y)=(1,0)
      // 2       (.x,.y)=(2,0)   (_x,_y)=(0,1)
      // 3       (.x,.y)=(3,0)   (_x,_y)=(1,1)
      // 4       (.x,.y)=(4,0)   (_x,_y)=(0,2)
      // 5       (.x,.y)=(5,0)   (_x,_y)=(1,2)
      // 6       (.x,.y)=(0,1)   (_x,_y)=(2,0)
      // 7       (.x,.y)=(1,1)   (_x,_y)=(3,0)
      // 8       (.x,.y)=(2,1)   (_x,_y)=(2,1)
      // 9       (.x,.y)=(3,1)   (_x,_y)=(3,1)
      // 10      (.x,.y)=(4,1)   (_x,_y)=(2,2)
      // 11      (.x,.y)=(5,1)   (_x,_y)=(3,2)
      // 12      (.x,.y)=(0,2)   (_x,_y)=(4,0)
      // 13      (.x,.y)=(1,2)   (_x,_y)=(5,0)
      // ...
    }
  }

  return 0;
}

It works, but it seems very fragile because of the hierarchical coordinates involved, and sometimes I may want a 1-D coord for each mode instead. Is there a canonical way to achieve this?

cloudhan (Author) commented:

@ccecka Any suggestion?


ccecka commented Dec 22, 2023

I'm not sure what exactly you're trying to accomplish here. We don't have a need for "coordinate remapping" outside of predication applications:
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0y_predication.md
The canonical philosophy in CuTe is to always retain "logical consistency", so that the only thing we have to care about is accessing coordinates consistently in all of our tensors, which of course can map to any index/offset. In your application above, you appear to care more about consistency of the codomain rather than the domain. You don't care about the coordinates, and somehow want to transform between multiple different coordinates to access the same "linear idx". This is not our approach and I would recommend thinking about what it would mean to work with the inverses of the layouts you're describing above.

That said, the mapper you're interested in can be written as

  auto inner = make_layout(make_shape(_2{}, _3{}));
  auto tiler = make_layout(make_shape(_3{}, _4{}));
  auto tiled = blocked_product(inner, tiler);         // (_x, _y) -> linear_idx
  auto naive = make_layout(make_shape(_6{}, _12{}));  // (.x, .y) -> linear_idx, naively linearize blockIdx
  auto coord = make_identity_layout(shape(naive));    // (.x, .y) -> (.x, .y)

  auto cmap  = coord.compose(right_inverse(tiled)).compose(naive);  // (.x, .y) -> linear_idx -> (_x, _y)
  std::cout << cmap << std::endl;
  for (int c = 0; c < size<1>(cmap); ++c) {
    for (int r = 0; r < size<0>(cmap); ++r) {
      std::cout << naive(r,c) << "\t(.x,.y)=(" << r << "," << c << ")\t(_x,_y)=" << cmap(r,c) << std::endl;
      // 0       (.x,.y)=(0,0)   (_x,_y)=(0,0)
      // 1       (.x,.y)=(1,0)   (_x,_y)=(1,0)
      // 2       (.x,.y)=(2,0)   (_x,_y)=(0,1)
      // 3       (.x,.y)=(3,0)   (_x,_y)=(1,1)
      // 4       (.x,.y)=(4,0)   (_x,_y)=(0,2)
      // ...
    }
  }

where you may need to reference
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0z_tma_tensors.md
(which is going to be updated in a few days) to explain the fancy strides.
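As a quick, untested aside on those "fancy strides" (my own check, not taken from the docs above): make_identity_layout builds a layout whose strides are basis elements, so applying it returns a coordinate rather than a single integer offset.

  auto id = make_identity_layout(make_shape(_6{}, _12{}));  // strides are basis elements, not ints
  print(id);        // prints something like (_6,_12):(_1@0,_1@1)
  print(id(3, 5));  // should print the coordinate (3,5) instead of one linear offset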


ccecka commented Dec 22, 2023

In the case that you're actually interested in transforming blockIdx coordinates for L2 cache efficiency specifically, this is solved in CUTLASS 3.x with the following:

Classic blockIdx swizzling and reordering:
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp

"Stream-K" tile scheduling
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp
https://dl.acm.org/doi/pdf/10.1145/3572848.3577479

These typically don't use CuTe-Layout transforms for a few technical reasons -- the dynamic shapes often cause lots of divmods that need to be carefully optimized, CuTe's admissibility of composition and inverse is weaker when dynamic shapes are involved, etc.
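For intuition only, the kind of remapping those schedulers perform can be sketched with plain divmods. This is a simplified grouped rasterization, not the CUTLASS scheduler itself; grid_m, grid_n, and group_m are made-up names:

  // Launch a 1-D grid of grid_m * grid_n tiles and reorder the linear block index so that
  // runs of group_m consecutive M-tiles revisit the same N-tiles, improving L2 reuse.
  inline void remap_block(int linear, int grid_m, int grid_n, int group_m,
                          int& tile_m, int& tile_n) {
    int group_size = group_m * grid_n;             // tiles per group of M-rows
    int group      = linear / group_size;          // which group this block falls into
    int in_group   = linear % group_size;
    int rows       = grid_m - group * group_m;     // rows left; the last group may be short
    if (rows > group_m) rows = group_m;
    tile_m = group * group_m + in_group % rows;
    tile_n = in_group / rows;
  }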

cloudhan (Author) commented:


Thanks, I actually came up with the coord layout, but used auto coord = make_identity_layout(shape(tiled));, then had to recover the coord hierarchically again 😅, and then discarded the idea immediately.
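For the record, the discarded variant looked roughly like this (untested sketch):

  auto coord_hier = make_identity_layout(shape(tiled));  // shape ((_2,3),(_3,4)) with basis strides
  // Applying it returns the coordinate unchanged, but in hierarchical ((x,xx),(y,yy)) form,
  // so after composing it the same way as above I would still need x + xx*2 and y + yy*3
  // to get back a flat (_x,_y); hence I dropped the idea.
  print(coord_hier(make_coord(make_coord(1,2), make_coord(0,3))));  // should echo ((1,2),(0,3))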

cloudhan (Author) commented:

> The canonical philosophy in CuTe is to always retain "logical consistency", so that the only thing we have to care about is accessing coordinates consistently in all of our tensors

Sometimes it is not that easy, as the tensors don't have the same modes or involve an indirect mapping (just like paged attention). In that case you want to recover the remapped coords, manually slice out the portion of the data that you (the thread or some other logical unit) care about, and restart from there.
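To make "slice the portion and restart from there" concrete, a rough CuTe sketch (untested; the 6x12 shape, the 2x3 tile, and the recovered coordinate (bx, by) are all made-up placeholders):

  auto gA = make_tensor<float>(make_shape(_6{}, _12{}));   // an owning 6x12 column-major tensor
  for (int i = 0; i < size(gA); ++i) { gA(i) = i; }        // fill with linear indices for inspection
  int bx = 1, by = 2;                                      // a recovered (_x,_y) tile coordinate
  auto blk = local_tile(gA, make_shape(_2{}, _3{}), make_coord(bx, by));  // this unit's 2x3 slice
  print_tensor(blk);                                       // work restarts from just this tile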


cloudhan commented Dec 27, 2023

To expand on the "indirect mapping": CuTe somehow reminds me of the Taichi [1] paper. In Taichi, an array can be hierarchical while the indexing stays flattened (very much like CuTe's hierarchical coord); you just index down to the data lying at the very bottom. For example, it can create an array of pointers to arrays, where the pointed-to arrays are arrays of vec3s, and you can index through the hierarchy and access a vec3 directly. It would be very interesting to see CuTe support this kind of "pointer delimited" tensor.
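To illustrate the shape of data I mean, here is a plain C++ sketch (nothing CuTe-specific, all names made up):

  #include <vector>

  struct vec3 { float x, y, z; };

  int main() {
    // An outer array of pointers to arrays: each inner std::vector owns a dense run of vec3s.
    std::vector<std::vector<vec3>> pages(4, std::vector<vec3>(128));
    vec3 v = pages[2][17];  // index straight through the hierarchy down to a single vec3
    return 0;
  }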

[1] Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Trans. Graph. 38, 6, Article 201 (December 2019), 16 pages. https://doi.org/10.1145/3355089.3356506
