Improve dec_digits performance and benchmarking methodology #250

bnbajwa · 2024-10-07T15:32:45Z

Note to reviewers: This pull request consists of 2 commits, so view separately for easier reviewing.

The first commit changes how dec_digits is benchmarked (i.e. using randomly generated numbers).

In this commit, you may notice K=7 for the benchmarks. This means that 1 to 7 digit random numbers will be used for benchmarking i.e. [1, 10^7). The reasoning for that is because I think very large numbers (>7 digits) are much less common in production.

BEFORE commit (baseline results):

NAME                                                   COUNT       MIN       %50       %95       %99     %99.9    %99.99
------------------------------------------------------------------------------------------------------------------------
dec_digits_lib                                    3516840000         0         0         0         0         0         0
dec_digits_div                                    1205850000     0.002     0.002     0.002     0.003     0.004     0.005
dec_digits_math                                    152790000     0.018     0.019      0.02     0.021     0.027     0.034
dec_digits_stdio                                    90960000     0.028     0.032     0.033     0.034     0.046     0.062
dec_digits_less                                   3694740000         0         0         0         0         0         0

Note that the results indicate dec_digits_less is faster than than the dec_digits_lib by 1.05x (i.e. 5%)

AFTER commit (K=7):

NAME                                                   COUNT       MIN       %50       %95       %99     %99.9    %99.99
------------------------------------------------------------------------------------------------------------------------
dec_digits_lib                                     478000000     0.005     0.006     0.006     0.006     0.007     0.007
dec_digits_div                                     266000000     0.010     0.011     0.011     0.012     0.012     0.012
dec_digits_math                                    170000000     0.017     0.017     0.017     0.018     0.018     0.018
dec_digits_stdio                                    73000000     0.040     0.041     0.041     0.043     0.043     0.043
dec_digits_less                                    367000000     0.008     0.008     0.008     0.008     0.009     0.009

Now, the benchmark shows dec_digits_lib is faster than dec_digits_less by 1.3x when dealing with integers up to 10^7.

AFTER commit (K=19).

Adding K=19 benchmark results here if anyone is curious how dec_digits performs when dealing with integers of all digit sizes possible (for int64).

NAME                                                   COUNT       MIN       %50       %95       %99     %99.9    %99.99
------------------------------------------------------------------------------------------------------------------------
dec_digits_lib                                     291000000     0.010     0.010     0.010     0.010     0.010     0.010
dec_digits_div                                     137000000     0.020     0.021     0.023     0.024     0.024     0.024
dec_digits_math                                    150000000     0.018     0.019     0.022     0.028     0.032     0.032
dec_digits_stdio                                    60000000     0.049     0.050     0.052     0.055     0.055     0.055
dec_digits_less                                    322000000     0.008     0.009     0.009     0.009     0.010     0.010

But when dealing all integers up to 10^19, dec_digits_less takes the lead again -- 10% faster than it's rival.

The second commit provides a faster implementation of dec_digits along with support for negative integers.

Results with new implementation (K=7):

NAME                                                   COUNT       MIN       %50       %95       %99     %99.9    %99.99
------------------------------------------------------------------------------------------------------------------------
dec_digits_lib_unsigned                           4258000000         0         0         0         0         0         0
dec_digits_lib_signed                             3370000000         0         0         0         0         0         0
dec_digits_branch                                  441000000     0.006     0.006     0.006     0.006     0.007     0.007
dec_digits_div                                     257000000     0.011     0.011     0.011     0.012     0.012     0.012
dec_digits_math                                    164000000     0.018     0.018     0.018     0.018     0.018     0.018
dec_digits_stdio                                    70000000     0.042     0.043     0.044     0.045     0.045     0.045
dec_digits_less                                    386000000     0.007     0.007     0.007     0.008     0.008     0.008

The previous dec_digits_lib is now dec_digits_branch.

New dec_digits unsigned overload is about 9.6x faster than previous.
And the signed overload is about 7.6x faster.

Results with new implementation (K=19):

NAME                                                   COUNT       MIN       %50       %95       %99     %99.9    %99.99
------------------------------------------------------------------------------------------------------------------------
dec_digits_lib_unsigned                           4286000000         0         0         0         0         0         0
dec_digits_lib_signed                             3364000000         0         0         0         0         0         0
dec_digits_branch                                  263000000      0.01     0.011     0.011     0.012     0.013     0.013
dec_digits_div                                     130000000     0.022     0.023     0.023     0.023     0.024     0.024
dec_digits_math                                    150000000     0.018      0.02      0.02      0.02      0.02      0.02
dec_digits_stdio                                    57000000     0.052     0.053     0.055     0.055     0.055     0.055
dec_digits_less                                    308000000     0.009     0.009      0.01      0.01      0.01      0.01

Unsigned overload about 16x faster.
Signed overload about 12.8x faster

Previously, dec_digits was invoked with inputs starting from 0 and increasing by 1. This is a very predictable pattern, and easy for the branch predictor to learn. Thus, the benchmarks show unrepresentative performance and favour branchy implementations of dec_digits. Now, dec_digits is invoked with random numbers of mixed digits. pow10 was previously in reactive-cpp but moved as it's now required in toolbox

The implementation of dec_digits is improved for better performance and added support for negative numbers.

bnbajwa requested a review from a team as a code owner October 7, 2024 15:32

perf(util): improve dec_digits performance

eb420f0

The implementation of dec_digits is improved for better performance and added support for negative numbers.

bnbajwa force-pushed the perf branch from cd83344 to eb420f0 Compare October 7, 2024 16:39

markaylett approved these changes Oct 8, 2024

View reviewed changes

bnbajwa merged commit 74bc557 into reactivemarkets:master Oct 8, 2024
5 checks passed

bnbajwa deleted the perf branch October 8, 2024 11:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve dec_digits performance and benchmarking methodology #250

Improve dec_digits performance and benchmarking methodology #250

bnbajwa commented Oct 7, 2024

Improve dec_digits performance and benchmarking methodology #250

Improve dec_digits performance and benchmarking methodology #250

Conversation

bnbajwa commented Oct 7, 2024

The first commit changes how dec_digits is benchmarked (i.e. using randomly generated numbers).

BEFORE commit (baseline results):

AFTER commit (K=7):

AFTER commit (K=19).

The second commit provides a faster implementation of dec_digits along with support for negative integers.

Results with new implementation (K=7):

Results with new implementation (K=19):