Support for transformer / attention #589
pommedeterresautee started this conversation in General

Hi, I have seen a now-closed PR in this repo adding support for the attention mechanism with CUTLASS, as well as a discussion on this subject in the flash-attention repo.
Do you plan to offer composable templates covering the key parts of the transformer computation graph in CUTLASS (in a more performant way than PyTorch, for instance)? Or do you see this library as targeting lower-level computation, leaving it up to the end user to build this kind of high-level layer?
Obviously some pieces already exist, such as batched and/or grouped GEMM. But it is not clear whether grouped GEMM can be combined with back-to-back (B2B) GEMM fusion to limit the global-memory bottleneck, whether different epilogues can easily be applied within a grouped GEMM to perform several projections in parallel (to limit the number of kernels launched), whether a residual connection can be modeled performantly with CUTLASS, how hard it would be to add layernorm, and so on.
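For readers unfamiliar with the epilogue composition the question refers to, here is a minimal sketch using the public CUTLASS 2.x device-level API; it is an illustration, not an answer from the maintainers. A single GEMM is instantiated with an epilogue functor that fuses D = ReLU(alpha * A @ B + beta * C) into one kernel, so the intermediate product never round-trips through global memory. The Sm80 target, the tile shapes, the vector width, and the helper function `run_fused_gemm_relu` are illustrative assumptions, not tuned or official values.

```cpp
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination_relu.h"

using Element = cutlass::half_t;
using Layout  = cutlass::layout::RowMajor;

// Epilogue functor applied to each accumulator tile before it is written out.
// Swapping this type is how a different fused activation/projection is chosen.
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationRelu<
    Element,   // output element type
    8,         // elements per vectorized access (assumption: 128 bits / 16 bits)
    float,     // accumulator type
    float>;    // compute type for alpha / beta

using Gemm = cutlass::gemm::device::Gemm<
    Element, Layout,                         // A
    Element, Layout,                         // B
    Element, Layout,                         // C and D
    float,                                   // accumulate in fp32
    cutlass::arch::OpClassTensorOp,          // use Tensor Cores
    cutlass::arch::Sm80,                     // target architecture (assumption)
    cutlass::gemm::GemmShape<128, 128, 32>,  // threadblock tile (assumption)
    cutlass::gemm::GemmShape<64, 64, 32>,    // warp tile (assumption)
    cutlass::gemm::GemmShape<16, 8, 16>,     // Sm80 fp16 tensor-op shape
    EpilogueOp>;

// Hypothetical helper showing the launch pattern; pointers are assumed to be
// valid device allocations with leading dimensions lda/ldb/ldc/ldd.
cutlass::Status run_fused_gemm_relu(int M, int N, int K,
                                    Element const* A, int lda,
                                    Element const* B, int ldb,
                                    Element const* C, int ldc,
                                    Element* D, int ldd,
                                    float alpha, float beta) {
  Gemm gemm_op;
  typename Gemm::Arguments args({M, N, K},
                                {A, lda}, {B, ldb},
                                {C, ldc}, {D, ldd},
                                {alpha, beta});
  return gemm_op(args);  // configures and launches the fused kernel
}
```

The grouped-GEMM device API (cutlass::gemm::device::GemmGrouped) accepts an epilogue functor in the same template slot, though as a single type shared across all groups; whether per-group epilogues and composition with B2B GEMM fusion are feasible is exactly what the question above is probing.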
Replies: 1 comment

Yes.
Sorry, no English; maybe Google Translate can help.