Questions about the implementation of remsi #54

yuanfz98 · 2023-11-16T01:30:58Z

Hello community,

I have some questions about the recently added implementation of remsi.

From my understanding, if we have an array of ptrs with offsets like:

[0, 1, 2, 3]

then with a remsi [2, 2, 2, 2] applied we will get:

[0, 1, 0, 1]

This leaves us with 2 contiguous memrefs to handle, each of size 2.

1. Does it mean that one remsi may produce many memref.reinterpret_cast (and eventually many memref.copy)?
2. Are "SideBySide" and "Stacked" intended to handle cases where an index overflows? E.g. [0, 1, 2, 3, 4] gives [0, 1, 0, 1, 0], where the last index overflows.
3. Why do we only support 2D cases? See #16.
4. Is it the same behaviour as arith.divOp?

nhat-nguyen · 2023-12-01T16:59:01Z

Maintainer

> Does it mean that one remsi may produce many memref.reinterpret_cast (and eventually many memref.copy)?

Yes. With a remsi, we end up with two memref.reinterpret_cast ops and two corresponding memref.copy ops.

> Are "SideBySide" and "Stacked" intended to handle cases where an index overflows? E.g. [0, 1, 2, 3, 4] gives [0, 1, 0, 1, 0], where the last index overflows.

You're right. The remsi support we recently added is meant to handle code similar in nature to the triton matmul tutorial. Basically, if we distribute a tensor of [size] and each instance of our triton program operates on a fixed BLOCK_SIZE, the last instance may load values that go out of bounds if size % BLOCK_SIZE != 0. To avoid going out of bounds, we take the offsets modulo size: (offset + tl.arange(0, BLOCK_SIZE)) % size.
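To make the wraparound concrete, here is a minimal pure-Python sketch of the offsets each program instance would touch. The helper name `wrapped_offsets` and the sizes are made up for illustration; plain lists stand in for `tl.arange`.

```python
def wrapped_offsets(block_start, block_size, size):
    """Emulate (block_start + tl.arange(0, block_size)) % size with plain lists."""
    return [(block_start + i) % size for i in range(block_size)]


# Hypothetical sizes: a tensor of 10 elements split into blocks of 4.
size, block_size = 10, 4
for pid in range(3):
    print(pid, wrapped_offsets(pid * block_size, block_size, size))
# The last instance (pid 2) wraps to the start instead of reading past the end:
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6, 7]
# 2 [8, 9, 0, 1]
```

Only the last instance wraps, and it wraps once, so its offsets form two contiguous runs: [8, 9] and [0, 1].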

Our implementation is very primitive and only supports the particular use case above. We can mod by row or by column, but not both. For more details, please take a look at these two diagrams from our PtrAnalysis file:

  //////////////////////////////////////////////////////////////////////////////
  //
  // Handling stacked wraparound
  //
  // We do not support cases where the target offset has already overflown the
  // number of rows. See side-by-side wraparound for details.
  //
  //////////////////////////////////////////////////////////////////////////////
  //    We're loading a tensor of dim (rowSize, colSize)
  //    d1 + d2 = rowSize
  //    d2 is the number of rows that overflow
  //
  //                       cols
  //
  //               wrappedAroundOff
  //      --------------*------------*--------
  //      |        d2   |            |       |
  //      |             |------------|       |
  //  rows|                                  |
  //      |                                  |
  //      |           targetOffset           |
  //      |             *------------|       |
  //      |             |            |       |
  //      |         d1  |            |       |
  //      |             | clampedOff |       |
  //      --------------*---------------------
  //                    |  overflow  |
  //                    *-------------
  //                 nextOff
  //
  //    wrappedAroundOff = targetOffset % cols
  //    clampedOff = (rows * strideRows) + wrappedAroundOff
  //
  //          clampedOff - targetOffset
  //    d1 = --------------------
  //              strideRows

  //////////////////////////////////////////////////////////////////////////////
  //
  // Handling side-by-side wraparound
  //
  // Note: We do not support cases where the target has already overflown the
  // number of columns! This is because in PtrAnalysis, the offset has already
  // been collapsed into a single dimension, so it is ambiguous to determine
  // whether the offset actually overflows or just refers to an element on the
  // subsequent rows.
  //
  // Same limitations apply to the stacked wraparound case.
  //
  //////////////////////////////////////////////////////////////////////////////
  //
  //    nextOffset - targetOffset = colSize
  //    d1 + d2 = colSize
  //                          N
  //                                x            clampedOffset
  //      --------------------------*----------------*-----*
  //      |                                          |     nextOffset (might
  //      |                    targetOffset          |             overflow)
  //  y   *-----                    *----------------|
  //      |    |                    |                |
  //  M   |-----                    -----------------|
  //      | d2                              d1       |
  //      --------------------------------------------
  //
  //    x = targetOffset % N
  //    nextOffset = x + colSize
  //    clampedOffset = min(nextOffset, N)
  //    d1 = clampedOffset - x
  //
  //////////////////////////////////////////////////////////////////////////////
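The arithmetic in the two diagrams can be sketched in plain Python. This is only a transcription of the formulas in the comments, not the actual PtrAnalysis code: the function names are made up, `d2` is derived from the stated identities `d1 + d2 = colSize` and `d1 + d2 = rowSize`, and the `min` in the stacked case is an added assumption to cover blocks that don't wrap at all.

```python
def side_by_side_split(target_offset, col_size, n):
    """Split a block of col_size columns at the right edge of an N-wide tensor."""
    x = target_offset % n
    next_offset = x + col_size
    clamped_offset = min(next_offset, n)
    d1 = clamped_offset - x   # columns loaded before the wrap
    d2 = col_size - d1        # columns loaded from the start of the row
    return x, d1, d2


def stacked_split(target_offset, row_size, rows, cols, stride_rows):
    """Split a block of row_size rows at the bottom edge of the tensor."""
    wrapped_around_off = target_offset % cols
    clamped_off = rows * stride_rows + wrapped_around_off
    d1 = (clamped_off - target_offset) // stride_rows
    d1 = min(d1, row_size)    # assumption: not part of the comment block above
    d2 = row_size - d1        # rows that overflow and wrap to the top
    return wrapped_around_off, d1, d2


# Block of 4 columns starting at column 6 of an 8-wide tensor: 2 + 2 columns.
print(side_by_side_split(target_offset=6, col_size=4, n=8))   # (6, 2, 2)
# Block of 4 rows starting at row 2, col 3 of a 4x8 row-major tensor: 2 + 2 rows.
print(stacked_split(target_offset=19, row_size=4, rows=4, cols=8, stride_rows=8))  # (3, 2, 2)
```

In both cases the result is two pieces of sizes d1 and d2, which is where the two memref.reinterpret_cast / memref.copy pairs come from.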

Important assumptions:

- We assume that the modulus equals the size of the dimension we're loading from. If we're loading from a tensor of size 2, the only modulus we expect to work correctly is 2, because of the math above. So tl.arange(0, 4) % 2 on a tensor of size != 2 will give unexpected results.
- We also assume that after wrapping around once, the remaining values don't wrap again, so we only ever have two contiguous blocks to load from.
  - For instance, loading tl.arange(0, 4) % 2 on a tensor of size 2 [77, 88] will work because the first block is [77, 88], and so is the second. Loading (1 + tl.arange(0, 4)) % 2 will not work on tensor [77, 88] because we end up with 3 chunks: the first chunk is [88], the second is [77, 88], and the final one is [77].
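The two-block assumption can be checked by hand with a small sketch like the one below. The helper `contiguous_chunks` is hypothetical and only groups offsets into runs of consecutive values; it is not how the compiler pass does it.

```python
def contiguous_chunks(offsets):
    """Group a non-empty list of offsets into runs of consecutive values."""
    chunks = [[offsets[0]]]
    for off in offsets[1:]:
        if off == chunks[-1][-1] + 1:
            chunks[-1].append(off)   # extends the current contiguous run
        else:
            chunks.append([off])     # a jump (e.g. a wrap) starts a new run
    return chunks


data = [77, 88]
for start in (0, 1):
    offsets = [(start + i) % 2 for i in range(4)]
    chunks = contiguous_chunks(offsets)
    values = [[data[o] for o in chunk] for chunk in chunks]
    status = "ok" if len(chunks) <= 2 else "unsupported"
    print(start, values, status)
# 0 [[77, 88], [77, 88]] ok
# 1 [[88], [77, 88], [77]] unsupported
```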

This feature has a lot of complexity due to its interaction with triton masks, incrementing the offsets during a loop, and so on, so we aim to have only the basic and important cases working. It is hard to statically figure out everything when triton is dynamic by nature. 😄

> Why do we only support 2D cases? See addptr operand produced by an unsupported operation: divsi #16 (comment).

I have added some more support, as we recently discussed in #68. 1D tensors don't work due to an assert, but we can get around that by doing (offset + tl.arange(0, size)[:, None]) % mod.

> Is it the same behaviour as arith.divOp?

Technically we can't support the div op with the current approach because it can generate non-contiguous memory locations. For instance, [0, 1, 2, 3, 4, 5] // 2 is [0, 0, 1, 1, 2, 2], which is essentially the same as loading 6 individual elements. We can't generate a single memref load from these offsets. I just replied to the other issue, #15, where I mention that we could potentially add another fallback mode for these more dynamic cases.
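A quick way to see the difference from the modulo case is to test whether the offsets advance by exactly one element each step. The helper below is a hypothetical illustration, not part of the compiler.

```python
def is_single_contiguous_block(offsets):
    """True if every offset is exactly one past the previous one."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


plain = list(range(6))
divided = [i // 2 for i in range(6)]

print(divided)                              # [0, 0, 1, 1, 2, 2]
print(is_single_contiguous_block(plain))    # True
print(is_single_contiguous_block(divided))  # False: repeated offsets, no single block
```

With a modulo the offsets still form at most two such blocks under the assumptions above, whereas a division repeats elements throughout, so no small fixed number of block copies can express it.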

I hope this helps! Thanks again for your interest in the project.
