ext/bcmath: In the arm processor environment, NEON is used to use SIMD. #18130
Fixed a typo in the comment.
Ready for review.
If it creates a slowdown on short inputs, do you know at what input length SIMD on NEON becomes faster than the regular loop?
@nielsdos
NEON does not have an instruction that corresponds to
# define bc_simd_add_8x16(a, b) vaddq_s8(a, b)
# define bc_simd_cmpeq_8x16(a, b) (vreinterpretq_s8_u8(vceqq_s8(a, b)))
# define bc_simd_cmplt_8x16(a, b) (vreinterpretq_s8_u8(vcltq_s8(a, b)))
static inline int bc_simd_movemask_8x16(int8x16_t vec)
Would be interesting to know if there's a way to get the byte position without something like movemask. Anyway this is likely good enough for now.
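For readers following the movemask discussion, here is a minimal sketch of one common way to emulate a byte movemask on AArch64 NEON. It is not taken from this PR: every `sketch_*` name is hypothetical, and the NEON path assumes AArch64 (it relies on `vaddv_u8`). The idea is to spread each byte's sign bit across the byte, AND it with a per-lane bit weight, and sum each half horizontally. A scalar reference is included so the block also builds on non-Arm targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
# include <arm_neon.h>
# define SKETCH_HAVE_NEON 1
#endif

/* Scalar reference: bit i of the result is the top bit of byte i. */
static inline int sketch_movemask_scalar(const int8_t bytes[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= (int) (((uint8_t) bytes[i]) >> 7) << i;
    }
    return mask;
}

#ifdef SKETCH_HAVE_NEON
/* Hypothetical NEON emulation (AArch64 only, because of vaddv_u8). */
static inline int sketch_movemask_neon(int8x16_t vec)
{
    static const uint8_t bit_weights[16] = {
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128
    };
    /* Arithmetic shift turns each byte into 0xFF (negative) or 0x00. */
    uint8x16_t sign = vreinterpretq_u8_s8(vshrq_n_s8(vec, 7));
    /* Keep only the bit that encodes the lane position within its half. */
    uint8x16_t weighted = vandq_u8(sign, vld1q_u8(bit_weights));
    /* The weights in each half are distinct powers of two, so the sum fits in 8 bits. */
    int low = vaddv_u8(vget_low_u8(weighted));
    int high = vaddv_u8(vget_high_u8(weighted));
    return low | (high << 8);
}
#endif

/* Uses NEON when available and cross-checks it against the scalar reference. */
static inline int sketch_movemask(const int8_t bytes[16])
{
#ifdef SKETCH_HAVE_NEON
    int mask = sketch_movemask_neon(vld1q_s8(bytes));
    assert(mask == sketch_movemask_scalar(bytes));
    return mask;
#else
    return sketch_movemask_scalar(bytes);
#endif
}

/* Index of the lowest set bit, i.e. the first flagged byte; 16 if none. */
static inline int sketch_first_set_byte(int mask)
{
    int index = 0;
    if (mask == 0) {
        return 16;
    }
    while (!(mask & 1)) {
        mask >>= 1;
        index++;
    }
    return index;
}
```

With a mask like this, the position of the first byte that matched a comparison is the index of the lowest set bit. Whether this is cheaper than other ways of finding the byte position on a given core would need measuring.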
Looking at the assembly code, it appears that the original bc_count_digits() was inlined by the compiler on Arm because its body was small. This change made the function larger, so the compiler stopped inlining it, which made it a bit slower. I have therefore marked it as inline explicitly.
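To illustrate the kind of annotation being described, here is a minimal sketch of forcing a small helper inline with GCC/Clang. The macro `SKETCH_ALWAYS_INLINE` and both functions are hypothetical stand-ins, not the actual bc_count_digits() or the annotation used in this PR; the scalar digit loop is only an assumption about what such a helper does.

```c
#include <stddef.h>

/* Hypothetical macro: ask GCC/Clang to inline regardless of size heuristics. */
#if defined(__GNUC__) || defined(__clang__)
# define SKETCH_ALWAYS_INLINE static inline __attribute__((always_inline))
#else
# define SKETCH_ALWAYS_INLINE static inline
#endif

/* Hypothetical helper: returns a pointer to the first non-digit in [str, end). */
SKETCH_ALWAYS_INLINE const char *sketch_count_digits(const char *str, const char *end)
{
    while (str < end && *str >= '0' && *str <= '9') {
        str++;
    }
    return str;
}

/* Caller: with the attribute, the loop above is expanded here rather than called. */
static size_t sketch_digit_run_length(const char *str, size_t len)
{
    return (size_t) (sketch_count_digits(str, str + len) - str);
}
```

A plain `inline` is only a hint, so the compiler may still emit a call once the body grows. Comparing the generated assembly before and after, as done above, is the way to confirm what actually happened.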
Benchmarks
When bc_count_digits() is made an inline function:
1:
2:
3:
If bc_count_digits() is not an inline function:
1:
2:
3: