[Codegen] Bubble up Transpose attention V and try fuse with others before attention #19250

Open · wants to merge 4 commits into base: main
Conversation

raikonenfnu (Collaborator) commented Nov 21, 2024

The Flash Attention transpose_V variant is significantly faster than the non-transpose_V variant because many matmul intrinsics are mmtb (matmul with a transposed B operand) by default. Using the FA transpose_V form therefore allows for better, more contiguous reads from shared memory to registers, improving attention performance quite a bit.

This PR exposes the attention_transposeV form by generating a linalg.transpose on V during the bubbling-up of transposes, so that the graph gets opportunities to fuse the transpose-V into its producer. I have also confirmed that if we do not find any producer, the transpose will indeed fuse back into the attentionOp. Hence, in the worst case, we get the same performance as before this PR.

Additionally, we modify elementwise op fusion to try to fuse the transpose with other ops before letting it get fused back into attention.
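
As a rough sketch of the exposed form (shapes and SSA names here are hypothetical, and the attention op itself is elided), the bubbling step materializes a linalg.transpose on the value operand, which downstream fusion can then absorb:

```mlir
// Hypothetical V operand laid out as [batch, k2, n]; the bubbled transpose
// re-lays it out as [batch, n, k2] so the transposed-V attention form can
// consume it.
func.func @expose_transpose_v(%v: tensor<1x64x128xf16>) -> tensor<1x128x64xf16> {
  %init = tensor.empty() : tensor<1x128x64xf16>
  %v_t = linalg.transpose
           ins(%v : tensor<1x64x128xf16>)
           outs(%init : tensor<1x128x64xf16>)
           permutation = [0, 2, 1]
  // %v_t feeds the transposed-V attention op. If %v instead comes from an
  // elementwise linalg.generic, elementwise fusion can fold this transpose
  // into that producer; if not, it folds back into the attention op.
  return %v_t : tensor<1x128x64xf16>
}
```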

raikonenfnu (Collaborator, Author) commented Nov 21, 2024

@MaheshRavishankar @IanWood1 Any thoughts on exposing this transpose during the lowering of tm.tensor? Alternatively, I was thinking we could also add it as a pattern in RaiseSpecialOps.

CC: @Groverkss

IanWood1 (Contributor):

> @MaheshRavishankar @IanWood1 Any thoughts on exposing this transpose during the lowering of tm.tensor? Alternatively, I was thinking we could also add it as a pattern in RaiseSpecialOps.

Maybe it makes sense to do this in preprocessing, similar to TransposeMatmulPass. But doing it during the lowering also makes sense.

I feel like ElementwiseOpFusion isn't a good place to be deciding whether a transpose should be fused with its producer or consumer. It just fuses ops greedily, and any choices like that should be made before this pass. I think we have the same problem with transpose -> transposed matmul (when generalizing matmul ops), which is handled by PropagateLinalgTransposePass. That pass isn't enabled by default, but there is a TODO saying it should be. I think it would fuse the transpose with the surrounding elementwise ops.

Side note: could you accomplish the same thing in compiler/src/iree/compiler/DispatchCreation/ElementwiseOpFusion.cpp by changing the attention fusion pattern's benefit? I haven't really seen benefit used, so I'm not sure whether that would work or would be the proper way to use it.

MaheshRavishankar (Contributor):

So the main question is:

  1. Do we want to always lower to the qk_transpose version (i.e., introduce the transpose and propagate it to producers), or
  2. do we want to fold transposes into attention no matter what?

The best approach is 1, but people usually reach for 2 when the transpose ends up not fusing with anything else. I would vote for 1; however, the current patterns that fold transposes into attention do 2.

Can we try to make sure that we always do 1? That might require dropping the current patterns that fold transposes into attention and seeing what happens.

raikonenfnu (Collaborator, Author):

@MaheshRavishankar Why can't we keep both 1 and 2? Wouldn't it be better to do 2 if we have nothing left to fuse?

raikonenfnu (Collaborator, Author) commented Nov 21, 2024

@MaheshRavishankar @IanWood1 Would it then be a good idea to introduce something like FuseTransposeWithProducerLinalgOp, but specifically for attentionOp, i.e. FuseAttnTransposeVWithProducerLinalgOp:

  1. we detect that the attentionOp does not have a transposed V,
  2. we check the producer of V to see if we can fuse the transpose into it,
  3. if it is fusable, we directly fuse the transpose into the producer without ever explicitly generating the transposeOp (a before/after IR sketch follows below).
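
A minimal before/after IR sketch of step 3, assuming the producer of V is a simple elementwise linalg.generic (the i8-to-f16 producer and all shapes are made up for illustration; the real pattern would operate on whatever producer is actually present):

```mlir
// Iteration space is (d0, d1, d2) = (batch, k2, n). #id keeps the producer's
// original layout; #t_v writes the result directly in the transposed-V
// layout [batch, n, k2].
#id  = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#t_v = affine_map<(d0, d1, d2) -> (d0, d2, d1)>

// Before: the producer yields V in [batch, k2, n]; a separate transpose (or
// a transpose folded into attention) would still be needed.
func.func @before(%v_i8: tensor<1x64x128xi8>) -> tensor<1x64x128xf16> {
  %init = tensor.empty() : tensor<1x64x128xf16>
  %v = linalg.generic
         {indexing_maps = [#id, #id],
          iterator_types = ["parallel", "parallel", "parallel"]}
         ins(%v_i8 : tensor<1x64x128xi8>)
         outs(%init : tensor<1x64x128xf16>) {
  ^bb0(%in: i8, %out: f16):
    %e = arith.sitofp %in : i8 to f16
    linalg.yield %e : f16
  } -> tensor<1x64x128xf16>
  return %v : tensor<1x64x128xf16>
}

// After: the transpose is absorbed into the producer by permuting its output
// indexing map, so V is already produced in the transposed-V layout and no
// explicit transposeOp is ever materialized.
func.func @after(%v_i8: tensor<1x64x128xi8>) -> tensor<1x128x64xf16> {
  %init = tensor.empty() : tensor<1x128x64xf16>
  %v_t = linalg.generic
           {indexing_maps = [#id, #t_v],
            iterator_types = ["parallel", "parallel", "parallel"]}
           ins(%v_i8 : tensor<1x64x128xi8>)
           outs(%init : tensor<1x128x64xf16>) {
  ^bb0(%in: i8, %out: f16):
    %e = arith.sitofp %in : i8 to f16
    linalg.yield %e : f16
  } -> tensor<1x128x64xf16>
  return %v_t : tensor<1x128x64xf16>
}
```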

Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

IanWood1 (Contributor):

> Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

Here's the issue linked in GlobalOptimization/Passes.cpp: #15973. It looks old, so maybe there's no reason to have it off by default.

raikonenfnu (Collaborator, Author):

> Also, @IanWood1, why is PropagateLinalgTranspose.cpp not on by default right now, and when do you think we can make it the default?

> Here's the issue linked in GlobalOptimization/Passes.cpp: #15973. It looks old, so maybe there's no reason to have it off by default.

So, any thoughts on the proposed FuseAttnTransposeVWithProducerLinalgOp pattern? I think this is kind of nice since we can keep lowering attention to "regular" attention, we don't need to explicitly generate a transposeOp, and this should work for the FP8 attention kernel from sharktank as well. The only downside is that I am not sure whether some of the sinking patterns would push this transpose back down to the attentionOp. 😆

IanWood1 (Contributor):

Doing it directly might be difficult because I don't think there is a good interface for transposing an operation, but it would give more control over when you want to apply the pattern. If the only use cases you are looking at are linalg.generic producers, it shouldn't be hard. I think a pattern to convert attention op -> transposed-V attention op + transpose during the "propagating up" phase would also work. Then the transpose on the value operand would get fused with the producer if possible.

IanWood1 added a commit that referenced this pull request Nov 21, 2024
From the discussion on #19250, this
should be made default. Testing locally revealed no regressions and the
benchmarks from the linked issue have been removed
(#15973).

Signed-off-by: Ian Wood <[email protected]>
raikonenfnu changed the title from "[Torch][Input] Transpose attention V by default and try fuse with others before attention" to "[Codegen] Bubble up Transpose attention V and try fuse with others before attention" on Nov 22, 2024
Flash Attention transpose_V variant is significantly faster than the
non-transpose_V variant. This is due to many matmul intrinsics being
mmtb by default. Hence, doing FA transpose_V will allow for better/more
contiguous reads from shared memory to registers, improving attention
performance quite a bit.

This PR exposes the attention_transposeV form by default through the
torch lowering by generating a linalg.transpose on V so that we can give
the graph some opportunities to fuse the transpose-V into its producer. I
have also confirmed that if we do not find any producer, the transpose
will indeed fuse back into the attentionOp. Hence, in the worst case, we
get the same performance as before this PR.

Signed-off-by: Stanley Winata <[email protected]>
ScottTodd removed their request for review November 25, 2024 17:09