Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up is_ascii_hex_digit #479

Closed
wants to merge 1 commit into from

Conversation

ttsugriy
Copy link
Contributor

It's possible to reduce the number of required comparisons by converting c into lowercase, so that instead of checking both lowercase and uppercase ranges, only lowercase range needs to be checked.

This allows clang to generate branchless x86_64 asm (https://compiler-explorer.com/z/P6f7c8e4T). Instead of

is_ascii_hex_digit(char):                # @is_ascii_hex_digit(char)
        lea     ecx, [rdi - 48]
        mov     al, 1
        cmp     cl, 10
        jb      .LBB0_3
        lea     ecx, [rdi - 65]
        cmp     cl, 6
        jb      .LBB0_3
        add     dil, -97
        cmp     dil, 6
        setb    al
.LBB0_3:
        ret

clang generates

is_ascii_hex_digit(char):                # @is_ascii_hex_digit(char)
        lea     eax, [rdi - 48]
        cmp     al, 10
        setb    cl
        or      dil, 32
        add     dil, -97
        cmp     dil, 6
        setb    al
        or      al, cl
        ret

that is 2 instructions shorter and contains no branches.

It's possible to reduce the number of required comparisons by converting `c` into lowercase, so that instead of checking both lowercase and uppercase ranges, only lowercase range needs to be checked.

This allows clang to generate branchless x86_64 asm (https://compiler-explorer.com/z/P6f7c8e4T). Instead of
```
is_ascii_hex_digit(char):                # @is_ascii_hex_digit(char)
        lea     ecx, [rdi - 48]
        mov     al, 1
        cmp     cl, 10
        jb      .LBB0_3
        lea     ecx, [rdi - 65]
        cmp     cl, 6
        jb      .LBB0_3
        add     dil, -97
        cmp     dil, 6
        setb    al
.LBB0_3:
        ret
```
clang generates
```
is_ascii_hex_digit(char):                # @is_ascii_hex_digit(char)
        lea     eax, [rdi - 48]
        cmp     al, 10
        setb    cl
        or      dil, 32
        add     dil, -97
        cmp     dil, 6
        setb    al
        or      al, cl
        ret
```
that is 2 instructions shorter and contains no branches.
@anonrig anonrig requested review from lemire and anonrig August 22, 2023 02:25
@anonrig
Copy link
Member

anonrig commented Aug 22, 2023

Can you run and share the results of the benchmarks?

@lemire
Copy link
Member

lemire commented Aug 22, 2023

I am using GCC 11 on an Ice Lake-based server.

benchdata

Main branch...

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              37300975 ns     37258253 ns           19 GHz=3.18743 cycle/byte=13.644 cycles/url=1.1851k instructions/byte=37.2856 instructions/cycle=2.73275 instructions/ns=8.71046 instructions/url=3.2386k ns/url=371.806 speed=233.186M/s time/byte=4.28843ns time/url=372.489ns url/s=2.68464M/s
BasicBench_AdaURL_aggregator_href   23117267 ns     23089445 ns           30 GHz=3.1885 cycle/byte=8.47386 cycles/url=736.033 instructions/byte=25.2896 instructions/cycle=2.98443 instructions/ns=9.51583 instructions/url=2.19663k ns/url=230.84 speed=376.28M/s time/byte=2.6576ns time/url=230.837ns url/s=4.33207M/s

This PR:

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              37287069 ns     37245642 ns           19 GHz=3.18801 cycle/byte=13.6441 cycles/url=1.18512k instructions/byte=37.2856 instructions/cycle=2.73272 instructions/ns=8.71193 instructions/url=3.23859k ns/url=371.742 speed=233.265M/s time/byte=4.28698ns time/url=372.363ns url/s=2.68555M/s
BasicBench_AdaURL_aggregator_href   23225765 ns     23195167 ns           30 GHz=3.1879 cycle/byte=8.50118 cycles/url=738.406 instructions/byte=25.2641 instructions/cycle=2.97184 instructions/ns=9.47392 instructions/url=2.19442k ns/url=231.628 speed=374.565M/s time/byte=2.66977ns time/url=231.894ns url/s=4.31232M/s

bench_search_params

Main branch...


-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      55819 ns        55726 ns        10737 GHz=3.19652 cycle/byte=11.596 cycles/url=32.4862k instructions/byte=40.3667 instructions/cycle=3.48108 instructions/ns=11.1273 instructions/url=113.087k ns/url=10.163k speed=201.092M/s time/byte=4.97284ns time/url=13.9314us url/s=71.7803k/s

This PR:

-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      60966 ns        60869 ns        11080 GHz=3.19684 cycle/byte=10.7248 cycles/url=30.0455k instructions/byte=40.6287 instructions/cycle=3.7883 instructions/ns=12.1106 instructions/url=113.821k ns/url=9.3985k speed=184.099M/s time/byte=5.43185ns time/url=15.2173us url/s=65.7146k/s

Conclusion

It does not appear that this PR improves the performance, at least according to these results.

However, the PR does suggest that we can do better.

@lemire
Copy link
Member

lemire commented Aug 22, 2023

@ttsugriy Can you try...

cmake -B build -D ADA_BENCHMARKS=ON
cmake --build build
./build/benchmarks/bench_search_params

@ttsugriy
Copy link
Contributor Author

@lemire the reason you're not seeing any improvement is because you're using gcc 11 that produces a very different assembly (https://compiler-explorer.com/z/16eEbPezb) from the one I've shared above using clang 16.

@ttsugriy
Copy link
Contributor Author

ttsugriy commented Aug 22, 2023

what's even more interesting is that GCC 13 generates an even better assembly for the original version (https://compiler-explorer.com/z/1j9oMTrbc)

is_ascii_hex_digit(char):
        sub     edi, 48
        xor     eax, eax
        cmp     dil, 54
        ja      .L1
        movabs  rax, 35465847073801215
        bt      rax, rdi
        setc    al
.L1:
        ret

Basically it's using a lookup table approach where all hex chars are packed into an integer value.

@anonrig
Copy link
Member

anonrig commented Aug 22, 2023

@lemire the reason you're not seeing any improvement is because you're using gcc 11 that produces a very different assembly (https://compiler-explorer.com/z/16eEbPezb) from the one I've shared above using clang 16.

@ttsugriy Can you share your benchmark results with clang 16?

@ttsugriy
Copy link
Contributor Author

@lemire the reason you're not seeing any improvement is because you're using gcc 11 that produces a very different assembly (https://compiler-explorer.com/z/16eEbPezb) from the one I've shared above using clang 16.

@ttsugriy Can you share your benchmark results with clang 16?

@anonrig I have only clang 15 on my machine and the results are not very promising :)
main:

Running ./build/benchmarks/bench_search_params
Run on (8 X 24.0427 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 128 KiB (x8)
  L2 Unified 4096 KiB (x2)
Load Average: 3.73, 3.89, 3.53
bytes/URL: 2801.500000
input bytes: 11206
number of URLs: 4
performance counters: No privileged access (sudo may help).
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      60745 ns        60725 ns         9986 speed=184.538M/s time/byte=5.41895ns time/url=15.1812us url/s=65.871k/s

this PR

Running ./build/benchmarks/bench_search_params
Run on (8 X 24.1212 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 128 KiB (x8)
  L2 Unified 4096 KiB (x2)
Load Average: 3.18, 3.66, 3.47
bytes/URL: 2801.500000
input bytes: 11206
number of URLs: 4
performance counters: No privileged access (sudo may help).
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      63980 ns        63980 ns         9582 speed=175.147M/s time/byte=5.70948ns time/url=15.9951us url/s=62.5192k/s

clang info:

% clang --version
Apple clang version 15.0.0 (clang-1500.0.40.1)

As such this PR should probably be closed and it's probably best to try to write some sort of lookup table implementation so that clang generates the same super efficient assembly as GCC.

@ttsugriy
Copy link
Contributor Author

The closest I could get with clang is by using

bool is_ascii_hex_digit(const char c) noexcept { 
    const unsigned ch = static_cast<unsigned>(c) - '0';
    if (ch > '6') return false;
    constexpr static bool table[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                         0, 0, 0, 0, 0, 0, 0, 1, 1, 
                                         1, 1, 1, 1, 0, 0, 0, 0, 
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                          0, 0, 1, 1, 1, 1, 1, 1};
    return table[ch];
}

that generates

is_ascii_hex_digit(char):                # @is_ascii_hex_digit(char)
        add     edi, -48
        cmp     edi, 55
        setb    cl
        movabs  rax, 35465847073801215
        bt      rax, rdi
        setb    al
        and     al, cl
        ret

https://compiler-explorer.com/z/b93bYhn4j

but it's still much slower on the benchmark running on M1 ARM.

@ttsugriy ttsugriy closed this Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants