Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Triton] Add
tl.gather
with a naive codegen implementation #5262[Triton] Add
tl.gather
with a naive codegen implementation #5262Changes from 11 commits
6b5374d
f9bfec3
c091f31
44dabb4
d4a32c8
02a2f2b
56d1279
6a2f788
8f1358e
63d0de2
39a35ce
508b981
741d0a8
4189c09
9fee613
5b47ed6
b670068
dad8ca5
309a380
64630e6
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be left as a TODO, but we should do a masked store with
getRedundantDataMask
triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp
Line 32 in dbebe10
(TBH I think there are quite a few places we need to cleanup redundant operations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a masked store is better, in general there isn't performance penalty for storing multiple times to the same address in shared memory. Using a store without mask allows better code generation in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would I be right to say this is only true if there are no bank conflicts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, as far as I know it is orthogonal to bank conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. Are you saying that
@p st.shared
wherep
is false on all threads in the warp can still take multiple cycles?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah no that's not what I meant. I meant when doing a
st.shared
with duplicated addresses there isn't a penalty for storing multiple time. The HW will pick one of them and store only once.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree that on a single instruction level redundancy is okay. I guess that could be expressed by ignoring redundancy within a warp.
The problematic cases are:
st.shared
more times than necessary.st.shared
to transfer the same data.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah fair point.
I do wonder if it is always better to predicate as redundant data should not be the common case and there are some downsides to using predicates. The main downside is that it prevents the backend from using a larger element bitwidth.
Might be worth measuring at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I'll note that for thread level redundancy we shouldn't actually need a masked load though. We could just not emit the instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true. Also another problem I remember running into with predicates (although not this exact case) is that it tends to throw off ptxas a lot in term of scheduling and liveranges. This applies more to
load
than tostore
(the liveness problem doesn't exist with loads, the scheduling I'm not sure) but that's one thing to keep in mind.