ext/bcmath: In the arm processor environment, NEON is used to use SIMD. #18130

Merged 4 commits on Mar 25, 2025

@SakiTakamachi SakiTakamachi commented Mar 22, 2025

Judging from the generated assembly, the compiler inlined the original bc_count_digits() on Arm because the function was small.
This change makes the function larger, so the compiler stopped inlining it, which made it slightly slower. I therefore marked it as inline explicitly.

Benchmarks

When bc_count_digits() is made an inline function

1:

for ($i = 0; $i < 4000000; $i++) {
    bcadd('1.23456789', '-2.12345678', 10);
}
Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/1.php
  Time (mean ± σ):     236.6 ms ±   1.2 ms    [User: 231.8 ms, System: 2.8 ms]
  Range (min … max):   234.6 ms … 239.2 ms    12 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/1.php
  Time (mean ± σ):     235.5 ms ±   1.9 ms    [User: 230.0 ms, System: 3.5 ms]
  Range (min … max):   233.2 ms … 241.0 ms    12 runs
 
Summary
  '/master/sapi/cli/php /mount/bc/1.php' ran
    1.00 ± 0.01 times faster than '/php-dev2/sapi/cli/php /mount/bc/1.php'

2:

for ($i = 0; $i < 4000000; $i++) {
    bcadd('12345678901234567890.12345678901234567890', '-212345678901234567890.12345678901234567890', 20);
}
Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/2.php
  Time (mean ± σ):     321.0 ms ±   0.7 ms    [User: 315.9 ms, System: 3.1 ms]
  Range (min … max):   319.8 ms … 322.3 ms    10 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/2.php
  Time (mean ± σ):     423.4 ms ±   2.1 ms    [User: 418.1 ms, System: 3.2 ms]
  Range (min … max):   421.9 ms … 429.3 ms    10 runs
 
Summary
  '/php-dev2/sapi/cli/php /mount/bc/2.php' ran
    1.32 ± 0.01 times faster than '/master/sapi/cli/php /mount/bc/2.php'

3:

for ($i = 0; $i < 400000; $i++) {
    bcadd(str_repeat('12345678', 100), str_repeat('12345678', 100), 0);
}
Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/3.php
  Time (mean ± σ):     178.0 ms ±   1.9 ms    [User: 173.1 ms, System: 3.0 ms]
  Range (min … max):   175.0 ms … 182.5 ms    16 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/3.php
  Time (mean ± σ):     572.6 ms ±   4.5 ms    [User: 566.7 ms, System: 3.6 ms]
  Range (min … max):   566.8 ms … 581.0 ms    10 runs
 
Summary
  '/php-dev2/sapi/cli/php /mount/bc/3.php' ran
    3.22 ± 0.04 times faster than '/master/sapi/cli/php /mount/bc/3.php'

If bc_count_digits() is not an inline function

1:

Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/1.php
  Time (mean ± σ):     249.9 ms ±   1.0 ms    [User: 244.0 ms, System: 3.8 ms]
  Range (min … max):   248.6 ms … 251.2 ms    11 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/1.php
  Time (mean ± σ):     233.5 ms ±   1.0 ms    [User: 228.1 ms, System: 3.3 ms]
  Range (min … max):   231.1 ms … 234.6 ms    12 runs
 
Summary
  '/master/sapi/cli/php /mount/bc/1.php' ran
    1.07 ± 0.01 times faster than '/php-dev2/sapi/cli/php /mount/bc/1.php'

2:

Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/2.php
  Time (mean ± σ):     332.8 ms ±   1.7 ms    [User: 327.7 ms, System: 3.0 ms]
  Range (min … max):   329.8 ms … 335.1 ms    10 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/2.php
  Time (mean ± σ):     423.5 ms ±   1.2 ms    [User: 418.5 ms, System: 2.8 ms]
  Range (min … max):   421.2 ms … 425.3 ms    10 runs
 
Summary
  '/php-dev2/sapi/cli/php /mount/bc/2.php' ran
    1.27 ± 0.01 times faster than '/master/sapi/cli/php /mount/bc/2.php'

3:

Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/3.php
  Time (mean ± σ):     178.1 ms ±   0.4 ms    [User: 172.9 ms, System: 3.2 ms]
  Range (min … max):   177.5 ms … 178.7 ms    16 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/3.php
  Time (mean ± σ):     569.2 ms ±   2.3 ms    [User: 563.6 ms, System: 3.2 ms]
  Range (min … max):   566.3 ms … 573.4 ms    10 runs
 
Summary
  '/php-dev2/sapi/cli/php /mount/bc/3.php' ran
    3.20 ± 0.02 times faster than '/master/sapi/cli/php /mount/bc/3.php'

@SakiTakamachi SakiTakamachi force-pushed the bcmath/neon branch 4 times, most recently from 218ed4b to 000ae40 on March 23, 2025 00:08
@SakiTakamachi SakiTakamachi marked this pull request as ready for review March 23, 2025 05:23
@SakiTakamachi

Fixed a typo in the comment

@SakiTakamachi SakiTakamachi marked this pull request as draft March 23, 2025 10:08
@SakiTakamachi SakiTakamachi marked this pull request as ready for review March 23, 2025 10:57
@SakiTakamachi

Ready for review

@nielsdos

If it creates a slowdown on short inputs, do you know at what input length SIMD on NEON becomes faster than the regular loop?
I remember hearing that NEON has a large "startup overhead".

@SakiTakamachi

SakiTakamachi commented Mar 25, 2025

@nielsdos
These measurements were taken with inputs that take the SIMD branch.
In this implementation, whenever the SIMD path can be used, it appears to be fast enough even with the minimum number of digits.

for ($i = 0; $i < 8000000; $i++) {
    $a = new BcMath\Number('1234567890123456');
}
Benchmark 1: /php-dev2/sapi/cli/php /mount/bc/0.php
  Time (mean ± σ):     301.5 ms ±   7.2 ms    [User: 295.1 ms, System: 4.2 ms]
  Range (min … max):   296.4 ms … 318.7 ms    10 runs
 
Benchmark 2: /master/sapi/cli/php /mount/bc/0.php
  Time (mean ± σ):     356.5 ms ±   2.0 ms    [User: 350.2 ms, System: 4.0 ms]
  Range (min … max):   353.8 ms … 360.6 ms    10 runs
 
Summary
  '/php-dev2/sapi/cli/php /mount/bc/0.php' ran
    1.18 ± 0.03 times faster than '/master/sapi/cli/php /mount/bc/0.php'
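
For readers unfamiliar with the "SIMD branch" mentioned above, here is a minimal sketch of the general pattern: a 16-byte vector loop followed by a scalar tail. It is not the PR's bc_count_digits(). The name sketch_count_digits and the SKETCH_HAVE_NEON macro are hypothetical, and the NEON path assumes an AArch64 compiler (vminvq_u8 is AArch64-only). On other targets only the scalar loop is compiled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
# include <arm_neon.h>
# define SKETCH_HAVE_NEON 1 /* hypothetical macro, not from php-src */
#endif

/* Hypothetical helper: returns a pointer to the first byte in [str, end)
 * that is not an ASCII digit, or end if every byte is a digit. */
static inline const char *sketch_count_digits(const char *str, const char *end)
{
#ifdef SKETCH_HAVE_NEON
    const uint8x16_t zero = vdupq_n_u8('0');
    const uint8x16_t ten = vdupq_n_u8(10);

    /* SIMD branch: only taken while at least 16 bytes remain. */
    while ((size_t)(end - str) >= 16) {
        uint8x16_t bytes = vld1q_u8((const uint8_t *)str);
        /* byte - '0' wraps around for bytes below '0', so a single
         * unsigned "< 10" comparison checks both bounds at once. */
        uint8x16_t is_digit = vcltq_u8(vsubq_u8(bytes, zero), ten);
        /* Lanes are 0xFF for digits and 0x00 otherwise; a minimum below
         * 0xFF means this block contains at least one non-digit. */
        if (vminvq_u8(is_digit) != 0xFF) {
            break; /* let the scalar loop find the exact position */
        }
        str += 16;
    }
#endif

    /* Scalar tail: short inputs and the final partial block. */
    while (str < end && *str >= '0' && *str <= '9') {
        str++;
    }
    return str;
}
```

With this shape, an input shorter than 16 bytes never enters the vector loop, which is why the benchmark above uses a 16-digit number to exercise the SIMD path at its minimum length.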

NEON has no instruction that corresponds to SSE2's _mm_movemask_epi8.
When I tried this previously, the function emulating _mm_movemask_epi8 was probably implemented poorly, which caused the overhead.
(In fact, I confirmed this time as well that the choice of implementation can make execution up to three times slower.)
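
To illustrate what emulating _mm_movemask_epi8 involves, here is one common way to do it on AArch64. This is a sketch under assumptions, not necessarily the implementation merged in this PR. The name sketch_movemask_16 is hypothetical, and the function takes a pointer so the same code can fall back to a scalar loop on non-Arm targets.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
# include <arm_neon.h>
#endif

/* Hypothetical helper: packs the most significant bit of each of the
 * 16 bytes at p into bits 0..15 of the result, like _mm_movemask_epi8. */
static inline int sketch_movemask_16(const int8_t *p)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Per-lane left-shift amounts: lane i ends up owning bit (i % 8). */
    static const int8_t lane_shift[16] = {
        0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
    };
    /* Move each byte's top bit down to bit 0, giving 0 or 1 per lane. */
    uint8x16_t msb = vshrq_n_u8(vreinterpretq_u8_s8(vld1q_s8(p)), 7);
    /* Shift each lane to its own bit position within its 8-lane half. */
    uint8x16_t bits = vshlq_u8(msb, vld1q_s8(lane_shift));
    /* Sum each half horizontally; the bits are disjoint, so the sum
     * acts as a bitwise OR and cannot overflow 8 bits. */
    int low = (int)vaddv_u8(vget_low_u8(bits));
    int high = (int)vaddv_u8(vget_high_u8(bits));
    return low | (high << 8);
#else
    /* Scalar reference with the same semantics. */
    int mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= ((uint8_t)p[i] >> 7) << i;
    }
    return mask;
#endif
}
```

The emulation takes several instructions (shift, variable shift, two horizontal adds) where x86 needs one, which is consistent with the observation that the choice of implementation matters for performance.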

# define bc_simd_add_8x16(a, b) vaddq_s8(a, b)
# define bc_simd_cmpeq_8x16(a, b) (vreinterpretq_s8_u8(vceqq_s8(a, b)))
# define bc_simd_cmplt_8x16(a, b) (vreinterpretq_s8_u8(vcltq_s8(a, b)))
static inline int bc_simd_movemask_8x16(int8x16_t vec)

Would be interesting to know if there's a way to get the byte position without something like movemask. Anyway this is likely good enough for now.
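
One technique sometimes used for this is a narrowing shift (vshrn_n_u16) that turns a 16-byte comparison result into a 64-bit value with four bits per lane, so a count-trailing-zeros gives the byte position directly. The sketch below is only an illustration of that idea and was not benchmarked against this PR. The name sketch_first_nondigit_16 is hypothetical, and the NEON path assumes a little-endian AArch64 target with a GCC- or Clang-style __builtin_ctzll.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
# include <arm_neon.h>
#endif

/* Hypothetical helper: index (0..15) of the first byte among the 16 bytes
 * at p that is not an ASCII digit, or 16 if all of them are digits. */
static inline int sketch_first_nondigit_16(const char *p)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    uint8x16_t bytes = vld1q_u8((const uint8_t *)p);
    /* 0xFF in every lane that holds a non-digit, 0x00 elsewhere. */
    uint8x16_t bad = vcgtq_u8(vsubq_u8(bytes, vdupq_n_u8('0')),
                              vdupq_n_u8(9));
    /* Shift every 16-bit pair right by 4 and narrow to 8 bits: each
     * input byte contributes exactly one nibble to a 64-bit value. */
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(bad), 4);
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
    /* Four mask bits per lane, so divide the bit index by 4. */
    return mask ? (__builtin_ctzll(mask) >> 2) : 16;
#else
    /* Scalar reference with the same result. */
    for (int i = 0; i < 16; i++) {
        if (p[i] < '0' || p[i] > '9') {
            return i;
        }
    }
    return 16;
#endif
}
```

Compared with a movemask-style emulation, this replaces the per-lane shifts and horizontal adds with a single narrowing shift, at the cost of a 64-bit mask instead of a 16-bit one.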

@SakiTakamachi SakiTakamachi merged commit 1ce79eb into php:master Mar 25, 2025
9 checks passed
@SakiTakamachi SakiTakamachi deleted the bcmath/neon branch March 25, 2025 22:48