[Codegen] Bubble up Transpose attention V and try fuse with others before attention #19250

Open · wants to merge 4 commits into base: main
Conversation

raikonenfnu (Collaborator) commented Nov 21, 2024

The Flash Attention transpose_V variant is significantly faster than the non-transpose_V variant because many matmul intrinsics are mmtb (matmul with a transposed B operand) by default. Using the FA transpose_V form therefore allows for better, more contiguous reads from shared memory to registers, improving attention performance quite a bit.

This PR exposes the attention_transposeV form by generating a linalg.transpose on V during the bubbling-up of transposes, so that the graph gets opportunities to fuse the transpose-V into its producer. I have also confirmed that if we do not find any producer, the transpose will indeed fuse back into the attentionOp. Hence, in the worst case, we get the same performance as before this PR.

Additionally, we modify elementwise op fusion to try to fuse the transpose with other ops before letting it get fused back into attention.
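
As a rough sketch of the exposed form (shapes and SSA names here are hypothetical, and the attention op itself is elided), the bubbling step materializes a linalg.transpose on the value operand, which downstream fusion can then absorb:

```mlir
// Hypothetical V operand laid out as [batch, k2, n]; the bubbled transpose
// re-lays it out as [batch, n, k2] so the transposed-V attention form can
// consume it.
func.func @expose_transpose_v(%v: tensor<1x64x128xf16>) -> tensor<1x128x64xf16> {
  %init = tensor.empty() : tensor<1x128x64xf16>
  %v_t = linalg.transpose
           ins(%v : tensor<1x64x128xf16>)
           outs(%init : tensor<1x128x64xf16>)
           permutation = [0, 2, 1]
  // %v_t feeds the transposed-V attention op. If %v instead comes from an
  // elementwise linalg.generic, elementwise fusion can fold this transpose
  // into that producer; if not, it folds back into the attention op.
  return %v_t : tensor<1x128x64xf16>
}
```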

raikonenfnu (Collaborator, Author) commented Nov 21, 2024

@MaheshRavishankar @IanWood1 Any thoughts on exposing this transpose during the lowering of tm.tensor? Alternatively, I was thinking we could also add it as a pattern in RaiseSpecialOps.

CC: @Groverkss

IanWood1 (Contributor):

> @MaheshRavishankar @IanWood1 Any thoughts on exposing this transpose during the lowering of tm.tensor? Alternatively, I was thinking we could also add it as a pattern in RaiseSpecialOps.

Maybe it makes sense to do this in preprocessing, similar to TransposeMatmulPass. But doing it during the lowering also makes sense.

I feel like ElementwiseOpFusion isn't a good place to be deciding whether a transpose should be fused with its producer or consumer. It just fuses ops greedily, and any choices like that should be made before this pass. I think we have the same problem with transpose -> transposed matmul (when generalizing matmul ops), which is handled by PropagateLinalgTransposePass. That pass isn't enabled by default, but there is a TODO saying it should be. I think it would fuse the transpose with the surrounding elementwise ops.

Side note: could you accomplish the same thing in compiler/src/iree/compiler/DispatchCreation/ElementwiseOpFusion.cpp by changing the attention fusion pattern's benefit? I haven't really seen benefit used, so I'm not sure whether that would work or would be the proper way to use it.

MaheshRavishankar (Contributor):

So the main question is:

  1. Do we want to always lower to the qk_transpose version (i.e., introduce the transpose and propagate it to producers), or
  2. do we want to fold transposes into attention no matter what?

The best approach is 1, but people usually reach for 2 when the transpose ends up not fusing with anything else. I would vote for 1; however, the current patterns that fold transposes into attention do 2.

Can we try to make sure that we always do 1? That might require dropping the current patterns that fold transposes into attention and seeing what happens.

raikonenfnu (Collaborator, Author):

@MaheshRavishankar Why can't we keep both 1 and 2? Wouldn't it be better to do 2 if we have nothing left to fuse?

raikonenfnu (Collaborator, Author) commented Nov 21, 2024

@MaheshRavishankar @IanWood1 Would it then be a good idea to introduce something like FuseTransposeWithProducerLinalgOp, but specifically for attentionOp, i.e. FuseAttnTransposeVWithProducerLinalgOp:

  1. we detect that the attentionOp does not have a transposed V,
  2. we check the producer of V to see if we can fuse the transpose into it,
  3. if it is fusable, we directly fuse the transpose into the producer without ever explicitly generating the transposeOp (a before/after IR sketch follows below).
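
A minimal before/after IR sketch of step 3, assuming the producer of V is a simple elementwise linalg.generic (the i8-to-f16 producer and all shapes are made up for illustration; the real pattern would operate on whatever producer is actually present):

```mlir
// Iteration space is (d0, d1, d2) = (batch, k2, n). #id keeps the producer's
// original layout; #t_v writes the result directly in the transposed-V
// layout [batch, n, k2].
#id  = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#t_v = affine_map<(d0, d1, d2) -> (d0, d2, d1)>

// Before: the producer yields V in [batch, k2, n]; a separate transpose (or
// a transpose folded into attention) would still be needed.
func.func @before(%v_i8: tensor<1x64x128xi8>) -> tensor<1x64x128xf16> {
  %init = tensor.empty() : tensor<1x64x128xf16>
  %v = linalg.generic
         {indexing_maps = [#id, #id],
          iterator_types = ["parallel", "parallel", "parallel"]}
         ins(%v_i8 : tensor<1x64x128xi8>)
         outs(%init : tensor<1x64x128xf16>) {
  ^bb0(%in: i8, %out: f16):
    %e = arith.sitofp %in : i8 to f16
    linalg.yield %e : f16
  } -> tensor<1x64x128xf16>
  return %v : tensor<1x64x128xf16>
}

// After: the transpose is absorbed into the producer by permuting its output
// indexing map, so V is already produced in the transposed-V layout and no
// explicit transposeOp is ever materialized.
func.func @after(%v_i8: tensor<1x64x128xi8>) -> tensor<1x128x64xf16> {
  %init = tensor.empty() : tensor<1x128x64xf16>
  %v_t = linalg.generic
           {indexing_maps = [#id, #t_v],
            iterator_types = ["parallel", "parallel", "parallel"]}
           ins(%v_i8 : tensor<1x64x128xi8>)
           outs(%init : tensor<1x128x64xf16>) {
  ^bb0(%in: i8, %out: f16):
    %e = arith.sitofp %in : i8 to f16
    linalg.yield %e : f16
  } -> tensor<1x128x64xf16>
  return %v_t : tensor<1x128x64xf16>
}
```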

Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

IanWood1 (Contributor):

> Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

Here's the issue linked in GlobalOptimization/Passes.cpp: #15973. It looks old, so maybe there's no reason to have it off by default.

raikonenfnu (Collaborator, Author):

> Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

> Here's the issue linked in GlobalOptimization/Passes.cpp: #15973. It looks old, so maybe there's no reason to have it off by default.

So, any thoughts on the proposed FuseAttnTransposeVWithProducerLinalgOp pattern? I think this is kind of nice since we can keep lowering attention to "regular" attention, we don't need to explicitly generate a transposeOp, and this should work for the FP8 attention kernel from sharktank as well. The only downside is that I am not sure whether some of the sinking patterns would push this transpose back down to the attentionOp. 😆

IanWood1 (Contributor):

Doing it directly might be difficult because I don't think there is a good interface for transposing an operation, but it would give more control over when you want to apply the pattern. If the only use cases you are looking at are linalg.generic producers, it shouldn't be hard. I think a pattern to convert attention op -> transposed-V attention op + transpose during the "propagating up" phase would also work. Then the transpose on the value operand would get fused with the producer if possible.

IanWood1 added a commit that referenced this pull request Nov 21, 2024
From the discussion on #19250, this
should be made default. Testing locally revealed no regressions and the
benchmarks from the linked issue have been removed
(#15973).

Signed-off-by: Ian Wood <[email protected]>
raikonenfnu changed the title from "[Torch][Input] Transpose attention V by default and try fuse with others before attention" to "[Codegen] Bubble up Transpose attention V and try fuse with others before attention" on Nov 22, 2024
Flash Attention transpose_V variant is significantly faster than the
non-transpose_V variant. This is due to many matmul intrinsics being
mmtb by default. Hence, doing FA transpose_V will allow for better/more
contiguous reads from shared memory to registers, improving attention
performance quite a bit.

This PR exposes the attention_transposeV form by default through the
torch lowering by generating a linalg.transpose on V so that we can give
the graph some opportunities to fuse the transpose-V into its producer. I
have also confirmed that if we do not find any producer, the transpose
will indeed fuse back into the attentionOp. Hence, in the worst case, we
get the same performance as before this PR.

Signed-off-by: Stanley Winata <[email protected]>
ScottTodd removed their request for review November 25, 2024 17:09