[ESIMD] Optimize the simd stride constructor #12553
simd(base, stride) calls previously were lowered into a long sequence of INSERT and ADD operations. That sequence is replaced with a vector equivalent:
vbase = broadcast base
vstride = broadcast stride
vstride_coef = {0, 1, 2, 3, ... N-1}
vec_result = vbase + vstride * vstride_coef;
Signed-off-by: Klochkov, Vyacheslav N <[email protected]>
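The broadcast-and-multiply formulation described in the commit message can be sketched in portable C++17. This is only an illustration under stated assumptions: it uses `std::array` in place of the ESIMD vector type, and the names `make_strided_scalar`, `make_strided_vector`, and `make_strided` are hypothetical, not part of the PR. It contrasts the per-element pack expansion with the vector form `vbase + vstride * vstride_coef`.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Per-element shape: each lane is its own scalar expression base + I * stride.
// (Hypothetical helper; std::array stands in for the ESIMD vector type.)
template <typename T, std::size_t... Is>
constexpr std::array<T, sizeof...(Is)>
make_strided_scalar(T base, T stride, std::index_sequence<Is...>) {
  return {static_cast<T>(base + static_cast<T>(Is) * stride)...};
}

// Vector shape: broadcast base and stride, build the constant coefficient
// vector {0, 1, ..., N-1}, then apply one elementwise multiply-add.
template <typename T, std::size_t... Is>
constexpr std::array<T, sizeof...(Is)>
make_strided_vector(T base, T stride, std::index_sequence<Is...>) {
  constexpr std::size_t N = sizeof...(Is);
  std::array<T, N> vbase{((void)Is, base)...};      // vbase = broadcast base
  std::array<T, N> vstride{((void)Is, stride)...};  // vstride = broadcast stride
  std::array<T, N> vcoef{static_cast<T>(Is)...};    // vstride_coef = {0..N-1}
  std::array<T, N> result{};
  for (std::size_t i = 0; i < N; ++i)
    result[i] = static_cast<T>(vbase[i] + vstride[i] * vcoef[i]);
  return result;
}

// Convenience wrapper (hypothetical) that picks the vector shape.
template <std::size_t N, typename T>
constexpr std::array<T, N> make_strided(T base, T stride) {
  return make_strided_vector(base, stride, std::make_index_sequence<N>{});
}
```

Both helpers compute the same values; the difference that matters for the PR is the shape of the expression the compiler sees, which in this sketch is only mimicked by the elementwise loop.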
std::index_sequence<Is...>) {
  return vector_type_t<T, N>{(T)(Base + ((T)Is) * Stride)...};
constexpr auto make_vector_impl(T Base, T Stride, std::index_sequence<Is...>) {
  using CppT = typename element_type_traits<T>::EnclosingCppT;
I remember you considering optimizing this for low values of N, did that end up not being worth it?
I did the initial research for float types and found such tuning worthless.
To answer your question here and to show the IR, I used the int type this time and found 2 cases where the old code is 1 instruction shorter:
| Type (* = old code is shorter) | old {num math ops : ops} | new {num math ops : ops} |
|---|---|---|
| simd<int, 1> | 0 | 0 |
| simd<int, 2> * | 1: 1xADD | 2: 1xADD, 1xMUL |
| simd<int, 3> * | 3: 2xADD, 1xSHL | 4: 2xADD, 2xMUL (the 3-elem vector was split into a 2-elem vector + a 1-elem vector) |
| simd<int, 4> | 5: 3xADD, 1xSHL, 1xMUL | 2: 1xADD, 1xMUL |
| simd<float, 1> | 0 | 0 |
| simd<float, 2> | 1: 1xADD | 1: 1xMAD |
| simd<float, 3> | 2: 2xADD | 2: 2xMAD (3-elem vector ops were split into 2-elem + 1-elem) |
| simd<float, 4> | 3: 3xADD | 1: 1xMAD |
I added a few lines of code to tune for integral types and N <= 3: ea002b5
The old sequence produces 1 fewer instruction in the final GPU code.
Signed-off-by: Klochkov, Vyacheslav N <[email protected]>
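The tuning described here (keep the old per-element sequence for integral types with N <= 3, use the vector form otherwise) could be expressed as a compile-time dispatch. The sketch below is an assumption about the shape of such a dispatch, not the contents of ea002b5: it uses C++17 `if constexpr` with `std::array` standing in for the ESIMD vector type, and `prefers_per_element`, `strided_per_element`, `strided_vectorized`, and `make_strided_tuned` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Old shape: one scalar expression per lane (hypothetical helper).
template <typename T, std::size_t... Is>
constexpr std::array<T, sizeof...(Is)>
strided_per_element(T base, T stride, std::index_sequence<Is...>) {
  return {static_cast<T>(base + static_cast<T>(Is) * stride)...};
}

// New shape: constant coefficients {0..N-1} and an elementwise multiply-add.
template <typename T, std::size_t... Is>
constexpr std::array<T, sizeof...(Is)>
strided_vectorized(T base, T stride, std::index_sequence<Is...>) {
  constexpr std::size_t N = sizeof...(Is);
  std::array<T, N> coef{static_cast<T>(Is)...};
  std::array<T, N> result{};
  for (std::size_t i = 0; i < N; ++i)
    result[i] = static_cast<T>(base + stride * coef[i]);
  return result;
}

// Assumed selection rule, taken from the comment: integral types and N <= 3.
template <typename T, std::size_t N>
inline constexpr bool prefers_per_element = std::is_integral_v<T> && (N <= 3);

template <std::size_t N, typename T>
constexpr std::array<T, N> make_strided_tuned(T base, T stride) {
  if constexpr (prefers_per_element<T, N>)
    return strided_per_element(base, stride, std::make_index_sequence<N>{});
  else
    return strided_vectorized(base, stride, std::make_index_sequence<N>{});
}
```

Because the choice is made at compile time, the dispatch itself adds no runtime cost; only the selected path is instantiated for a given `T` and `N`.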