[Codegen] Bubble up Transpose attention V and try fuse with others before attention #19250
base: main
Conversation
@MaheshRavishankar @IanWood1 Any thoughts on exposing this transpose during the lowering of tm.tensor? Alternatively, I was thinking we could also add it as a pattern in RaiseSpecialOps. CC: @Groverkss
Maybe it makes sense to do this in preprocessing similar to
I feel like Side note: could you accomplish the same thing in
So the main question is
The best approach is 1, but people usually reach for 2 when the transpose ends up not fusing with others. I would vote for 1, but the current patterns to fold transposes into attention do 2. Can we try to make sure that we always do 1? That might require dropping the current patterns that fold transposes into attention and seeing what happens.
@MaheshRavishankar Why can't we keep both 1 and 2? Wouldn't it be better to do 2 if we have nothing left to fuse?
@MaheshRavishankar @IanWood1 Would it then be a good idea to introduce something like FuseTransposeWithProducerLinalgOp, but specifically for attentionOp, i.e.
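As a rough illustration of what such an attention-specific pattern could gate on (hypothetical C++ sketch, not code from this PR or from IREE; all names are illustrative): only materialize/fuse the transpose on V when V actually has an elementwise linalg producer the transpose can fold into.

```cpp
// Hypothetical helper, not from this PR: decide whether a transpose on the
// attention V operand is worth materializing, i.e. whether V is produced by
// an elementwise linalg.generic that the transpose could fuse with.
#include "mlir/Dialect/Linalg/IR/Linalg.h"

using namespace mlir;

static bool transposeOnVCanFuseWithProducer(Value v) {
  // Require a single-use linalg.generic producer so fusion does not
  // duplicate computation.
  auto producer = v.getDefiningOp<linalg::GenericOp>();
  if (!producer || !producer->hasOneUse())
    return false;
  // All-parallel iterators means the producer is elementwise-like, so
  // permuting its result indexing map (absorbing the transpose) is legal.
  return producer.getNumParallelLoops() == producer.getNumLoops();
}
```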
Also @IanWood1 why is
So any thoughts on the proposed
Doing it directly might be difficult because I don't think there is a good interface for transposing an operation, but it would give more control over when you want to apply the pattern. If the only use cases you are looking at are
From the discussion on #19250, this should be made the default. Testing locally revealed no regressions, and the benchmarks from the linked issue have been removed (#15973). Signed-off-by: Ian Wood <[email protected]>
Flash Attention transpose_V variant is significantly faster than the non-transpose_V variant. This is due to many matmul intrinsics being mmtb by default. Hence, doing FA transpose_V will allow for better/more contiguous reads from shared memory to register, improving the attention performance quite a bit. This PR exposes the attention_transposeV form by default through the torch lowering by generating a linalg.transpose on the V such that we can give the graph some opportunities to fuse the transpose-V to its producer. I have also confirmed that if we do not find any producer, the transpose will indeed fuse back with the attentionOp. Hence, in the worst case we will get the same perf as before this PR. Signed-off-by: Stanley Winata <[email protected]>
Flash Attention's transpose_V variant is significantly faster than the non-transpose_V variant. This is due to many matmul intrinsics being mmtb by default. Hence, doing FA transpose_V allows for better/more contiguous reads from shared memory to register, improving attention performance quite a bit.
This PR exposes the attention_transposeV form by generating a linalg.transpose on the V during bubbling up of transposes, such that we give the graph some opportunities to fuse the transpose-V to its producer. I have also confirmed that if we do not find any producer, the transpose will indeed fuse back with the attentionOp. Hence, in the worst case we get the same perf as before this PR.
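For readers unfamiliar with the mechanics, the sketch below shows roughly what materializing the explicit transpose on V looks like at the C++ level. This is an illustration under assumptions (3-D, statically shaped V with the last two dims swapped), not this PR's actual code; the caller is expected to rewire the attention op's V operand and indexing map.

```cpp
// Illustrative sketch only (not this PR's code): materialize an explicit
// linalg.transpose of the attention V operand so later fusion gets a chance
// to fold it into V's producer. Assumes a 3-D, statically shaped V.
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"

using namespace mlir;

static Value materializeTransposedV(OpBuilder &b, Location loc, Value v) {
  auto vType = cast<RankedTensorType>(v.getType());
  // Swap the last two dimensions, e.g. [B, K2, N] -> [B, N, K2].
  SmallVector<int64_t> perm = {0, 2, 1};
  SmallVector<int64_t> transposedShape;
  for (int64_t p : perm)
    transposedShape.push_back(vType.getDimSize(p));
  Value init = b.create<tensor::EmptyOp>(loc, transposedShape,
                                         vType.getElementType());
  auto transposeOp = b.create<linalg::TransposeOp>(loc, v, init, perm);
  // The caller replaces the attention op's V operand with this value and
  // adjusts the corresponding indexing map by the inverse permutation.
  return transposeOp->getResult(0);
}
```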
Additionally, we modify elementwise op fusion to try to fuse the transpose with other ops before letting it get fused back into attention.
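A minimal sketch of that ordering idea, assuming upstream MLIR's elementwise fusion entry points (this is not the PR's actual pass code, and the attention-folding stage is only a placeholder): run generic elementwise fusion first so a bubbled-up transpose can find another home, and only afterwards apply the patterns that fold a leftover transpose back into attention.

```cpp
// Sketch under assumptions (not this PR's code): stage 1 gives the
// bubbled-up transpose a chance to fuse with ordinary linalg producers/
// consumers; stage 2 would fold any transpose still feeding V back into the
// attention op, so the worst case matches the pre-PR behavior.
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static void runTransposeFirstFusion(Operation *root, MLIRContext *ctx) {
  {
    RewritePatternSet patterns(ctx);
    // Allow all elementwise fusion; a real control function would be more
    // selective (e.g. skip fusions that duplicate heavy producers).
    linalg::ControlFusionFn fuseAll = [](OpOperand *) { return true; };
    linalg::populateElementwiseOpsFusionPatterns(patterns, fuseAll);
    (void)applyPatternsAndFoldGreedily(root, std::move(patterns));
  }
  {
    RewritePatternSet patterns(ctx);
    // Placeholder: the existing patterns that fold a remaining transpose on
    // V into the attention op would be added here as a fallback.
    (void)applyPatternsAndFoldGreedily(root, std::move(patterns));
  }
}
```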