[Triton] Add tl.gather with a naive codegen implementation #5262
base: main
Conversation
// CHECK-LABEL: @gather_op
tt.func @gather_op(%arg0: tensor<128x16xf32>, %arg1: tensor<512x4xi32>) -> tensor<512x4xf32> {
  // CHECK-NEXT: %0 = tt.gather %arg0[%arg1] {axis = 0 : i32} : (tensor<128x16xf32>, tensor<512x4xi32>) -> tensor<512x4xf32>
  %0 = tt.gather %arg0[%arg1] {axis = 0 : i32} : (tensor<128x16xf32>, tensor<512x4xi32>) -> tensor<512x4xf32>
just starting to look at this but shouldn't the index tensor be a 1D tensor if we index only along 1 dimension?
Gather along a single axis d means that each column along dim d in the output is comprised of elements from the corresponding column in the source tensor, e.g. out[i, j] = src[idx[i, j], j] for axis=0.
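For reference, a minimal NumPy sketch of that indexing rule (an illustration only, not the Triton implementation):

import numpy as np

def reference_gather(src, idx, axis):
    # The output has the shape of idx. Along `axis` the coordinate is replaced
    # by the value stored in idx; all other coordinates pass through, so for
    # axis=0 this computes out[i, j] = src[idx[i, j], j].
    out = np.empty(idx.shape, dtype=src.dtype)
    for pos in np.ndindex(*idx.shape):
        src_pos = list(pos)
        src_pos[axis] = idx[pos]
        out[pos] = src[tuple(src_pos)]
    return out

src = np.arange(12).reshape(4, 3)
idx = np.array([[0, 2, 1], [3, 3, 0]])
print(reference_gather(src, idx, axis=0))  # out[i, j] == src[idx[i, j], j]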
ah I see, ok yeah looks like that's what pytorch does. I'll let @apgoucher confirm that it is what he wants, but it makes sense to me.
Looks great! It's probably worth having @apgoucher check that the semantics are what he had in mind, but other than that this looks good to go.
}

// Synchronize the whole CTA.
// TODO(jeff): Should we teach Membar that gather synchronizes?
Membar cannot insert any "internal" synchronization barriers
This isn't super important, but what I mean is we can teach membar that certain ops implicitly act as a synchronization, which causes the analysis to reset pending memory transactions up to those ops.
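To illustrate the idea only, here is a toy model of such an analysis. This is not Triton's actual Membar code; the op representation, the conflicts helper, and the acts_as_barrier hook are all made up for the sketch.

# Toy membar-style analysis: scan ops in program order, keep the set of
# shared-memory accesses not yet covered by a barrier, and record a barrier
# whenever a new op conflicts with a pending access. Ops that are known to
# synchronize the CTA internally reset the pending set instead.
def run_membar(ops, acts_as_barrier):
    pending = []   # shared-memory accesses not yet synchronized
    barriers = []  # ops before which a barrier must be inserted
    for op in ops:
        if acts_as_barrier(op):
            pending.clear()
            continue
        if any(conflicts(op, prev) for prev in pending):
            barriers.append(op["name"])
            pending.clear()
        if op.get("smem") is not None:
            pending.append(op)
    return barriers

def conflicts(op, prev):
    # Conflict iff both ops touch the same shared-memory buffer and at least
    # one of the two accesses is a write.
    return (op.get("smem") is not None and op["smem"] == prev["smem"]
            and "write" in (op.get("kind"), prev.get("kind")))

ops = [
    {"name": "local_store", "smem": "buf0", "kind": "write"},
    {"name": "tt.gather",   "smem": "buf0", "kind": "read"},
]
# By default a barrier is inserted before tt.gather.
print(run_membar(ops, acts_as_barrier=lambda op: False))
# If the analysis is taught that tt.gather synchronizes internally, it isn't.
print(run_membar(ops, acts_as_barrier=lambda op: op["name"] == "tt.gather"))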
assert index.type.shape[d] <= src.type.shape[
    d], f"index dim {axis} cannot be greater than the corresponding source dim"
This is a bit strange. You're allowing the gather op to implicitly slice the src tensor to match the index tensor? If we're going to allow this I think it should be its own operation.
I thought that's what we wanted?
I guess broadcasting could be supported
I thought that's what we wanted?
I 100% expect that the gather axis can be a different shape, but it's not normal for the other dimensions to be allowed to be a different shape. I find it very surprising coming from numpy/pytorch semantics.
Also I don't think this behavior is compatible with broadcasting as it would be ambiguous. If the index has a dimension of size 1 we can't tell if it's supposed to be a slice, or if it should be broadcasted.
I suppose there is an advantage to fusing the gather op with the slice, as in general I think a slice op could have to go through shared memory to transfer redundant data. Perhaps this could be a pattern-matched lowering instead of implicit behavior of tt.gather, though?
Huh, I guess torch.gather actually does have this behavior. Now I'm not sure what to think haha. It feels wrong that there are huge chunks of the input tensor that get completely ignored, and feels to me like two operations.
I'm also a bit confused what the use case for this would be, as there's no way to create a slice that doesn't start at 0.
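For concreteness, a small PyTorch example of the behavior being discussed: the index is smaller than the input along a non-gather dimension, so part of the input is simply never read.

import torch

src = torch.arange(12.0).reshape(4, 3)  # 4x3 source
idx = torch.tensor([[0, 2], [3, 1]])    # 2x2 index, narrower than src

# torch.gather only requires index.size(d) <= input.size(d) for d != dim,
# and the output takes the shape of the index tensor. Here
# out[i, j] = src[idx[i, j], j], so src[:, 2] is never read.
out = torch.gather(src, 0, idx)
print(out)  # tensor([[0., 7.], [9., 4.]])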
I don't really have an opinion here on the semantics of tl.gather, so let me know what you two prefer!
Actually, I read too fast. I agree that I would expect the other dimensions to match the input dimensions. Unless we have a specific use for it, I think we should restrict the dimensions to match.
assert index.dtype.is_int(), "index must be an integer tensor"

rank = len(src.type.shape)
assert len(index.type.shape) == rank, "source and index tensors must have the same rank"
Would be nice to support broadcasting.
Can you elaborate on what the broadcasting semantics would be?
Sorry, misclicked on the approval.
This PR adds a tl.gather builtin that implements a local gather along a single axis, with semantics matching torch.gather. tl.gather generates a tt.gather op, which is piped through the compiler mostly untouched at the moment, since the codegen is very naive.

The tt.gather op is implemented by writing the source tensor into shared memory and then performing a gather out of shared memory, thus it requires scratch space to be allocated. In a follow-up, I will implement an optimized layout rule for the op that ensures the gather axis fits into a single warp, allowing the gather to be implemented using warp shuffles.

There are other avenues for optimization as well: tt.gather(tt.load), where the load only has one use, can be lowered into a DMA from global memory to shared memory, with the gather then reading directly from shared memory.
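As a usage illustration, here is a minimal sketch of calling the builtin from a kernel. This is not taken from the PR; the tl.gather(src, index, axis) call signature and the block shapes are assumptions based on the description above.

import torch
import triton
import triton.language as tl


@triton.jit
def gather_rows_kernel(src_ptr, idx_ptr, out_ptr,
                       M: tl.constexpr, N: tl.constexpr):
    # Load an MxN source block and an MxN index block, gather along axis 0
    # (out[i, j] = src[idx[i, j], j]), then store the result.
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs = offs_m[:, None] * N + offs_n[None, :]
    src = tl.load(src_ptr + offs)
    idx = tl.load(idx_ptr + offs)
    out = tl.gather(src, idx, axis=0)
    tl.store(out_ptr + offs, out)


src = torch.randn(16, 8, device="cuda")
idx = torch.randint(0, 16, (16, 8), device="cuda", dtype=torch.int32)
out = torch.empty_like(src)
gather_rows_kernel[(1,)](src, idx, out, M=16, N=8)
# With torch.gather semantics, the result should match the eager reference.
assert torch.equal(out, torch.gather(src, 0, idx.long()))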