Proposal: Support a new data access pattern. #138
-
Thank you @colawithsauce for your detailed write-up. I will get back to you as soon as possible. :)
-
Would you mind giving another explanation of what … means? What if …? Also, if we plug in …, what does this mean? It would be great if you could give a concrete example for a 2D tensor with explicit shapes.
-
I also have another question. All of the diagrams seem to assume that we're first dividing by …
-
Aside from the above questions that I have, one major limitation that triton-shared has at the moment that will make supporting this pattern hard is that we don't support generating multiple loads from a single … We cannot describe this memory access as a memref load with a single offset plus static strides and shapes. This pattern would require 4 loads with offsets {0, 1, 2, 3}.
-
@nhat-nguyen Let me explain.

A simple example

We can construct a simple example:

```python
tmp1 = tl.load(in_ptr + ((xindex // 1) % 4) * 1
             + ((xindex // 4) % 16) * 4)
```

This example shows a linear matrix load (after the pointer arithmetic, the index is still …).

Here is another example:

```python
tmp1 = tl.load(in_ptr + ((xindex // 64) % 1) * 1)
```

This example shows a repeat, and …

```python
x0 = xindex % 64
x2 = (xindex // 2048)
x3 = xindex
tmp0 = tl.load(in_ptr0 + (x0 + (64*x2)), None, eviction_policy='evict_last')
```

This example shows the condition that …

The latter example can be structured by adding a phantom dimension; the load can then be expressed by … We can do the same for the former example: we add a phantom dimension_0, and the original dimension_0 becomes dimension_1; a small numeric check of this is sketched below.
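A minimal NumPy sketch of the phantom-dimension idea, assuming a 4096-element `xindex` range purely for the check (the exact range is not taken from the thread and does not matter): the extra dimension gets stride 0, so the computed offsets stay unchanged while every term now has the `((xindex // num) % size) * stride` shape.

```python
import numpy as np

xindex = np.arange(4096)   # assumed range; any multiple of 2048 behaves the same
x0 = xindex % 64
x2 = xindex // 2048

# offsets exactly as the inductor kernel computes them: in_ptr0 + (x0 + 64*x2)
orig = x0 + 64 * x2

# the same offsets with a phantom dimension of size 2048 // 64 = 32 and stride 0
structured = (((xindex // 1) % 64) * 1        # dim 0:   num=1,    size=64, stride=1
              + ((xindex // 64) % 32) * 0     # phantom: num=64,   size=32, stride=0
              + ((xindex // 2048) % 2) * 64)  # dim 2:   num=2048, size=2,  stride=64

assert np.array_equal(orig, structured)
```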
A complex example

```python
import torch

def fn(x, y):
    return torch.permute(x, (0, 2, 1, 3)) + y

fnc = torch.compile(fn)
bsz = 4
num_head = 32
seq_len = 2048
head_dim = 128
x = torch.randn([bsz, num_head, seq_len, head_dim]).cuda()
y = torch.randn([bsz, seq_len, num_head, head_dim]).cuda()
z = fnc(x, y)
print(z[0,0,0,0])
```

And the triton DSL it generates is:

```python
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 33554432
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x4 = xindex
    x0 = xindex % 128
    x1 = (xindex // 128) % 2048
    x2 = (xindex // 262144) % 32
    x3 = (xindex // 8388608)
    tmp0 = tl.load(in_ptr0 + (x4), None)
    tmp1 = tl.load(in_ptr1 + (x0 + (128*x2) + (4096*x1) + (8388608*x3)), None)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x4), tmp2, None)
```

We expand the operands of the load operation (substituting x0, x1, x2, x3):

```python
# xindex = [0, 1, 2, ....]
tmp1 = tl.load(in_ptr1 + ((xindex // 1) % 128) * 1
             + ((xindex // (128*2048)) % 32) * 128
             + ((xindex // 128) % 2048) * 4096
             + ((xindex // 8388608)) * 8388608)
```

It seems we have violated the rule that `num` must increase across dimensions. However, if we reorder the terms:

```python
# xindex = [0, 1, 2, ....]
tmp1 = tl.load(in_ptr1 + ((xindex // 1) % 128) * 1
             + ((xindex // 128) % 2048) * 4096
             + ((xindex // (128*2048)) % 32) * 128
             + ((xindex // 8388608)) * 8388608)
```

Now it follows the rule, and we can represent this in the form of …

The …
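To sanity-check the reordered form, here is a scaled-down NumPy sketch (the shapes are shrunk from the real [4, 2048, 32, 128] case so it runs instantly; the structure of the expression is the same). The 1D gather is exactly the flattened, permuted strided view of the `in_ptr1` buffer.

```python
import numpy as np

# scaled-down shapes with the same structure as [bsz, seq_len, num_head, head_dim]
bsz, num_head, seq_len, head_dim = 2, 3, 4, 5
y = np.arange(bsz * seq_len * num_head * head_dim)   # stands in for in_ptr1's flat buffer
xindex = np.arange(y.size)

# element strides of y's contiguous [bsz, seq_len, num_head, head_dim] layout
s_head, s_nh, s_sl, s_b = 1, head_dim, num_head * head_dim, seq_len * num_head * head_dim

# the reordered address expression, term by term (num grows monotonically)
offsets = (((xindex // 1) % head_dim) * s_head
           + ((xindex // head_dim) % seq_len) * s_sl
           + ((xindex // (head_dim * seq_len)) % num_head) * s_nh
           + (xindex // (head_dim * seq_len * num_head)) * s_b)

# the 1D load is the flattened view with sizes [bsz, num_head, seq_len, head_dim]
# and strides [s_b, s_nh, s_sl, s_head], i.e. the permuted view of y
view = np.lib.stride_tricks.as_strided(
    y,
    shape=(bsz, num_head, seq_len, head_dim),
    strides=tuple(s * y.itemsize for s in (s_b, s_nh, s_sl, s_head)))
assert np.array_equal(y[offsets], view.reshape(-1))
```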
-
@colawithsauce Thanks for the replies. It will take me some time to digest it all. I started another thread here to ask a different question. In your attachments, specifically the TestBroadcast_TwoAxis file, there are two triton kernels generated by torch inductor. Both of the kernels have …
-
@colawithsauce Hey, sorry for not getting back to you last week. I have not had time to fully digest your formulas, but I think I'm able to understand it at a high level. Let me know if the following is correct. So, the gist of the problem here is that even though the triton IR is loading a 1d tensor, we are able to describe this 1d tensor using a combination of sizes, strides, and offsets from your formula, which would resemble a 2d tensor. Now, from all of your pytorch code, it looks like all of these are pretty basic operations (implicit broadcast, reduce, ...). So we are definitely interested in having support for these cases. You mentioned that your group is working on an implementation already; that is great! We would appreciate the contribution here to make triton-shared more complete and robust. One technical suggestion I have is, because all of the code in …
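For what it's worth, a rough sketch of that high-level description (a hypothetical helper, not triton-shared code): given the `(num, size, stride)` triples read off the permute example's load, sorting by `num` and checking the proposal's rule yields the sizes and strides of the logical multi-dimensional view.

```python
def structure_from_terms(terms):
    """terms: list of (num, size, stride) read off an address expression of the form
        sum_i ((xindex // num_i) % size_i) * stride_i
    Returns (sizes, strides) of the logical view, outermost dimension first."""
    terms = sorted(terms, key=lambda t: t[0])
    for (num_prev, size_prev, _), (num_cur, _, _) in zip(terms, terms[1:]):
        # the proposal's ordering rule; a phantom (stride-0) dimension would be
        # needed whenever num_cur > num_prev * size_prev
        assert num_cur >= num_prev * size_prev
        assert num_cur % (num_prev * size_prev) == 0
    sizes = [size for _, size, _ in reversed(terms)]
    strides = [stride for _, _, stride in reversed(terms)]
    return sizes, strides

# triples from the permute kernel's tmp1 load; the outermost size 4 is xnumel // 8388608
print(structure_from_terms(
    [(1, 128, 1), (128, 2048, 4096), (262144, 32, 128), (8388608, 4, 8388608)]))
# -> ([4, 32, 2048, 128], [8388608, 128, 4096, 1])
```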
-
@colawithsauce torch-inductor is getting some improvements in their codegen and won't generate as many div and mod operations as before. I haven't tried it out yet but thought you might be interested in: pytorch/pytorch#125077. There's a related discussion over at #16 too.
-
Proposal: Support a new data access pattern.
The pattern is a pointer expression of the form `a_ptr + ((xindex // num_0) % size_0) * stride_0 + ((xindex // num_1) % size_1) * stride_1 + ...`.

In `PtrAnalysis`, `addState` does not support the situation where both operands carry a modulo. We think the shapes of the tensor in this case can still be determined statically. When do we end up adding two modulo states? In many cases it happens when the input data is a high-dimensional tensor and the user accesses it with the pattern above.
For each dimension, the pointer arithmetic follows the pattern `((xindex // num) % size) * stride`, where `xindex` must be an arange array (for example `[0, 1, 2, 3, 4, ...]`), `size` is the size of this dimension, `stride` is how many elements must be skipped to fetch the next element in this dimension, and `num` removes the lower-dimensional information, with `num{i} >= num{i-1} * size{i-1}` and `num{i} == size{i-1} * k, k == 0, 1, 2, 3, ...`; `a_ptr` is a pointer or a tensor of pointers (most of the time it is simply a pointer, implicitly broadcast).

If an access pattern follows the rules given above, we call it 'irregular' (just a name for convenience). Here is an example and a counter-example of this pattern; a concrete 2D sketch follows them:
Above is an example of the 'irregular' pattern: every permutation exists exactly once in this case, and we can simply structure this access with a two-dimensional matrix, as we will show later in this chapter.
Above is a counter-example: the picture shows an addState of two modulo states where the second modulo is essentially random, so it is hard to structure.
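A concrete 2D sketch with explicit shapes (a NumPy illustration; the (4, 6) shape is chosen arbitrarily and is not from the attachments): reading the transpose of a row-major (4, 6) tensor through two terms of this form.

```python
import numpy as np

A = np.arange(24).reshape(4, 6)      # row-major 2D tensor, element strides (6, 1)
buf = A.reshape(-1)                  # the flat buffer the pointer arithmetic indexes into
xindex = np.arange(A.size)

# two terms of the form ((xindex // num) % size) * stride:
#   (num=1, size=4, stride=6)  and  (num=4, size=6, stride=1)
offsets = ((xindex // 1) % 4) * 6 + ((xindex // 4) % 6) * 1

# although the load itself is 1D, (sizes, strides) = ([6, 4], [1, 6]) describe it
# as a 2D view of the buffer: here, simply A transposed
assert np.array_equal(buf[offsets], A.T.reshape(-1))
```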
The ttir generated by this data access pattern might look like this:

…

There are two modulo (`arith.remsi`) operations in the ttir, which currently trigger an assertion failure. The data access pattern can be represented by:

…

However, we can change our view of this transformation from the picture above to the following picture.
In this picture, the `ptr` of the load operation is treated logically as a 2D tensor (although it is 1D physically). The second operands of these two `arith.remsi` operations (`%2`, `%3`) indicate the size of each dimension.

Is this pattern common in real-world programming?
We think it is at least common in `torch.compile`-generated triton code (and at least for our use case). We tested on some pytorch code [1] and found that the operands of the load operations in the triton DSL generated by `torch.compile` are highly structured.

Here are some examples:
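One representative inductor-generated load (the same broadcast case quoted in the replies above; the full set of examples is in the attachments [1]):

```python
# the index arithmetic on xindex is pure integer div/mod/mul, and the load
# offset combines several such terms
x0 = xindex % 64
x2 = (xindex // 2048)
tmp0 = tl.load(in_ptr0 + (x0 + (64*x2)), None, eviction_policy='evict_last')
```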
This seems like a counter-example; however, notice the `load` operations: the indices are not used together in a single `load`, but in separate `load` operations instead.

Conclusion
In conclusion, we propose a pattern that is common in real-world programming and is not yet supported by triton-shared. We define this pattern and its meaning, and we give some code examples for it.

In our opinion, this pattern is common and implementable. We are wondering whether your team has explored this idea and found it unimplementable, whether there is something we have not noticed, or whether anything in our post is unclear. We would love to know your opinion on this idea. We are now working on this method and trying to write some code to add support for this pattern.
Thanks for your attention!
attachments
[1]: torch codes
TestBroadcast_TwoAxis.pdf
TestReduce.pdf
TestPermute.pdf
TestBroadcast.pdf