
Add x86_64 asm codegen for PrimeField mul and square #176

Merged
48 commits merged on May 4, 2020

Conversation

jon-chuang
Contributor

@jon-chuang jon-chuang commented Apr 8, 2020

I was able to achieve a 1.7x speedup on 6-limb mul_assign over my previous pull request #163, bringing the total speedup to about 1.84x. This is achieved with a generic code-generation script, driven by build.rs, that emits x86-64 assembly using the mulxq, adoxq and adcxq instructions for 2 up to 8 limbs.
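For reference, each mulxq/adcxq step in the generated chain is a 64x64 -> 128-bit multiply-accumulate with carry propagation. A portable sketch of that single step (illustrative names, not code from this PR):

```rust
// Portable stand-in for one mulxq/adcxq step of the generated assembly:
// accumulate a * b plus the incoming carry into one limb, and pass the
// high half along as the new carry.
fn mac_with_carry(acc: u64, a: u64, b: u64, carry: &mut u64) -> u64 {
    let wide = (acc as u128) + (a as u128) * (b as u128) + (*carry as u128);
    *carry = (wide >> 64) as u64; // high limb propagates down the chain
    wide as u64
}

fn main() {
    let mut carry = 0u64;
    // u64::MAX * 2 + 1 = 2^65 - 1: low limb is all-ones, carry is 1.
    let lo = mac_with_carry(1, u64::MAX, 2, &mut carry);
    assert_eq!(lo, u64::MAX);
    assert_eq!(carry, 1);
}
```

The assembly wins over this portable form mainly because adcxq and adoxq use separate carry flags, allowing two independent carry chains to run interleaved.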

Update: I was able to extend this work to 18 limbs, achieving a 1.34x speedup on Fp832. It may be worth looking into translating the old square_in_place into assembly, but that is a significant amount of work.

Update: Put behind a feature. Build with:

RUSTFLAGS="--emit=asm -C target-feature=+bmi2,+adx" cargo +nightly bench/test/build --features asm

Next steps: support more limbs with less data movement using kilic's strategy; extend that strategy to square_in_place.

Details

I implemented the assembly version of square_in_place as applying mul_assign to itself. That's because the goff squaring formula is not actually better, while the original squaring formula requires twice the number of registers and hence goes beyond the scope of this PR; it may in fact end up slower, since this PR takes heavy advantage of the add-and-carry chains present in mul_assign.
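A minimal sketch of that approach: square_in_place simply calls mul_assign with a copy of self. `Fp` below is a toy one-limb stand-in (arithmetic mod 2^64), not the PR's field type:

```rust
// Toy single-limb "field" for illustration; the real type carries the
// Montgomery-form limbs and the generated assembly mul_assign.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Fp(u64);

impl Fp {
    fn mul_assign(&mut self, other: &Fp) {
        // placeholder for the generated Montgomery-multiplication assembly
        self.0 = self.0.wrapping_mul(other.0);
    }

    fn square_in_place(&mut self) {
        let copy = *self;
        self.mul_assign(&copy); // squaring = multiplying by a copy of oneself
    }
}

fn main() {
    let mut x = Fp(12345);
    x.square_in_place();
    assert_eq!(x, Fp(12345 * 12345));
}
```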

Update: I cleaned up the messy build.rs by creating a standalone subcrate and importing the generate function.

Future work: currently, the maximum number of limbs supported is 8, and this could possibly be extended by the current methods to about 10 for mul and 11 for squaring. However, going beyond that may require additional work to optimally move data between registers and memory (i.e. L1 cache). One should study how a compiler reasons about this in order to write a reasonable code-generation procedure.

Edit: our data movement is extremely suboptimal. https://raw.githubusercontent.com/kilic/fp/master/generic/x86_arithmetic.s achieves 149ns on 13 limbs, whereas we achieve 190ns. There's a lot to learn here.

We are also somehow still trailing behind MCL, probably the gold standard for pairings. Here are its benchmarks, run on an i7-7700 similar to mine. Although our Fp performance totally outclasses it, everything else lags behind. It manages a staggering ~0.7ms BLS12_381 pairing (?). Its WebAssembly target achieves a sickening 3.5ms in my Chrome browser.

I suspect there could be bottlenecks in adc, sbb and negate; negate is much slower than sub_assign. This could likely be fixed with a little more inline assembly, since it's unclear whether the compiler knows how to emit adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave off 3ns per add/sub, we can probably improve performance by another 5%. (Given this, it may indeed be worthwhile to write a procedural macro and do a cleanup. This may turn out not to be possible, but it is worth trying.)
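A hedged sketch of the hard-coded-modulus negate suggested above: compute modulus - self with an explicit borrow chain (the role sbb plays in assembly). The modulus and limb count below are illustrative, not one of the library's fields:

```rust
// Illustrative 2-limb modulus, little-endian limbs; not a real field prime
// from this library.
const MODULUS: [u64; 2] = [0xffff_ffff_0000_0001, 0x0000_0000_ffff_ffff];

// Negate a (assumed reduced, i.e. a < MODULUS) in place: a <- MODULUS - a,
// propagating borrows limb to limb like a chain of sbb instructions.
fn negate(a: &mut [u64; 2]) {
    if *a == [0, 0] {
        return; // -0 = 0, no subtraction needed
    }
    let mut borrow = false;
    for i in 0..2 {
        let (d1, b1) = MODULUS[i].overflowing_sub(a[i]);
        let (d2, b2) = d1.overflowing_sub(borrow as u64);
        a[i] = d2;
        borrow = b1 || b2;
    }
}

fn main() {
    let mut a = [1u64, 0u64];
    negate(&mut a);
    // MODULUS - 1
    assert_eq!(a, [0xffff_ffff_0000_0000, 0x0000_0000_ffff_ffff]);
}
```

Because the modulus limbs are compile-time constants here, a codegen or inline-asm version could fold them into immediates instead of loading them from memory.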

Areas for Improvement

This, however, does not explain the discrepancy. Looking at the benchmarks more carefully, for BLS12_381, our Fq12 mul is ~6800ns, whereas MCL's is ~3000ns. We're better on the Miller loop, 413us vs 527us (but not the precomputed Miller loop with fixed Q - 405us - which are we using?). Nonetheless, those timings were for the BN curve, so we really need to get the Miller loop (7500Mq) down to about 350us.

But this makes it apparent that the significant weakness is in the final exponentiation (9000Mq). We must get timings down from 850us to 414us, a 2x speedup. Part of this is probably due to our slow Fq inverse timings. This is also clear from examining pairing formulas: the number of field ops for the final exponentiation should be roughly similar to the Miller loop, yet for us the former takes twice as long as the latter. I estimate fixing this could give another 35% boost to our pairings.

G2 should also be fixed - a 25% slowdown - 2.7us vs 2.1us for add, 1.85us vs 1.49us for double. Our mul_assign is apparently ~3x slower (930us vs 360us).

G1 add is 529us vs MCL's 472us, suggesting, as suspected, some inefficiency (probably from the very slow Fq doubling). Our G1 doubling beats theirs, however; we should investigate why. Mul is 25% slower (vs. 154us), but should be fixable with a standalone Pippenger implementation.

MCL also implements GLV.

I will try to study MCL and obtain more detailed and up-to-date benchmarks for it.

Main results:
384-bit mul_assign

bls12_377::fq::bench_fq_mul_assign                               51,780       30,340            -21,440  -41.41%   x 1.71
bls12_381::fq::bench_fq_mul_assign                               52,143       30,280            -21,863  -41.93%   x 1.72
sw6::fr::bench_fr_mul_assign                                     50,664       30,005            -20,659  -40.78%   x 1.69

384-bit square_in_place:

bls12_377::fq::bench_fq_square                                   43,704       30,089            -13,615  -31.15%   x 1.45
bls12_381::fq::bench_fq_square                                   44,295       31,291            -13,004  -29.36%   x 1.42
sw6::fr::bench_fr_square                                         42,764       30,561            -12,203  -28.54%   x 1.40

256-bit mul_assign:

bls12_377::fr::bench_fr_mul_assign                               25,583       16,173             -9,410  -36.78%   x 1.58
bls12_381::fr::bench_fr_mul_assign                               27,166       17,113            -10,053  -37.01%   x 1.59

256-bit square_in_place:

bls12_377::fr::bench_fr_square                                   21,929       15,620             -6,309  -28.77%   x 1.40
bls12_381::fr::bench_fr_square                                   24,908       16,834             -8,074  -32.42%   x 1.48

384-bit G1:

 bls12_377::ec::g1::bench_g1_add_assign                           1,048,201    755,061          -293,140  -27.97%   x 1.39 
 bls12_377::ec::g1::bench_g1_add_assign_mixed                     749,362      529,950          -219,412  -29.28%   x 1.41 
 bls12_377::ec::g1::bench_g1_double                               535,695      418,700          -116,995  -21.84%   x 1.28 
 bls12_377::ec::g1::bench_g1_mul_assign                           269,918      202,057           -67,861  -25.14%   x 1.34
 bls12_381::ec::g1::bench_g1_add_assign                           1,059,324    769,470          -289,854  -27.36%   x 1.38 
 bls12_381::ec::g1::bench_g1_add_assign_mixed                     758,426      534,781          -223,645  -29.49%   x 1.42 
 bls12_381::ec::g1::bench_g1_double                               539,424      419,918          -119,506  -22.15%   x 1.28 
 bls12_381::ec::g1::bench_g1_mul_assign                           274,129      206,160           -67,969  -24.79%   x 1.33 
 bls12_381::ec::g1::bench_g1_rand                                 274,444      204,898           -69,546  -25.34%   x 1.34

384-bit pairings:

 bls12_377::pairing::pairing::bench_pairing_final_exponentiation  1,292,296    1,087,830        -204,466  -15.82%   x 1.19 
 bls12_377::pairing::pairing::bench_pairing_full                  2,235,430    1,845,299        -390,131  -17.45%   x 1.21 
 bls12_377::pairing::pairing::bench_pairing_miller_loop           631,072      510,977          -120,095  -19.03%   x 1.24
 bls12_381::pairing::pairing::bench_pairing_final_exponentiation  1,072,725    847,028          -225,697  -21.04%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_full                  1,871,067    1,472,014        -399,053  -21.33%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_miller_loop           546,691      413,627          -133,064  -24.34%   x 1.32 

SW6 (updated)

sw6::fq::bench_fq_mul_assign                               253,632               189,117               -64,515  -25.44%   x 1.34
sw6::fq::bench_fq_square                                   208,612               192,834               -15,778   -7.56%   x 1.08
sw6::ec::g1::bench_g1_add_assign                           4,087,676             3,464,448            -623,228  -15.25%   x 1.18 
sw6::ec::g1::bench_g1_add_assign_mixed                     2,888,104             2,429,125            -458,979  -15.89%   x 1.19 
sw6::ec::g1::bench_g1_double                               2,608,558             2,417,105            -191,453   -7.34%   x 1.08 
sw6::ec::g1::bench_g1_mul_assign                           1,758,755             1,576,919            -181,836  -10.34%   x 1.12
sw6::pairing::pairing::bench_pairing_final_exponentiation  8,648,924             7,737,242            -911,682  -10.54%   x 1.12 
sw6::pairing::pairing::bench_pairing_full                  95,098,636            85,313,158         -9,785,478  -10.29%   x 1.11 
sw6::pairing::pairing::bench_pairing_miller_loop           86,559,010            77,266,953         -9,292,057  -10.73%   x 1.12
Full results
 bls12_377::ec::g1::bench_g1_add_assign                           1,048,201    755,061          -293,140  -27.97%   x 1.39 
 bls12_377::ec::g1::bench_g1_add_assign_mixed                     749,362      529,950          -219,412  -29.28%   x 1.41 
 bls12_377::ec::g1::bench_g1_double                               535,695      418,700          -116,995  -21.84%   x 1.28 
 bls12_377::ec::g1::bench_g1_mul_assign                           269,918      202,057           -67,861  -25.14%   x 1.34 
 bls12_377::ec::g1::bench_g1_rand                                 267,739      202,269           -65,470  -24.45%   x 1.32 
 bls12_377::ec::g2::bench_g2_add_assign                           4,612        3,798                -814  -17.65%   x 1.21 
 bls12_377::ec::g2::bench_g2_add_assign_mixed                     3,232        2,703                -529  -16.37%   x 1.20 
 bls12_377::ec::g2::bench_g2_double                               2,158        1,877                -281  -13.02%   x 1.15 
 bls12_377::ec::g2::bench_g2_mul_assign                           1,102,020    937,104          -164,916  -14.96%   x 1.18 
 bls12_377::ec::g2::bench_g2_rand                                 1,080,859    922,946          -157,913  -14.61%   x 1.17 
 bls12_377::fq12::bench_fq12_add_assign                           133          131                    -2   -1.50%   x 1.02 
 bls12_377::fq12::bench_fq12_double                               126          114                   -12   -9.52%   x 1.11 
 bls12_377::fq12::bench_fq12_inverse                              25,742       28,005              2,263    8.79%   x 0.92 
 bls12_377::fq12::bench_fq12_mul_assign                           6,874        5,566              -1,308  -19.03%   x 1.23 
 bls12_377::fq12::bench_fq12_square                               4,723        3,882                -841  -17.81%   x 1.22 
 bls12_377::fq12::bench_fq12_sub_assign                           119          116                    -3   -2.52%   x 1.03 
 bls12_377::fq2::bench_fq2_add_assign                             17           16                     -1   -5.88%   x 1.06 
 bls12_377::fq2::bench_fq2_double                                 17           20                      3   17.65%   x 0.85 
 bls12_377::fq2::bench_fq2_inverse                                13,930       18,062              4,132   29.66%   x 0.77 
 bls12_377::fq2::bench_fq2_mul_assign                             249          191                   -58  -23.29%   x 1.30 
 bls12_377::fq2::bench_fq2_sqrt                                   127,995      96,823            -31,172  -24.35%   x 1.32 
 bls12_377::fq2::bench_fq2_square                                 231          227                    -4   -1.73%   x 1.02 
 bls12_377::fq2::bench_fq2_sub_assign                             19           19                      0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_add_assign                               10           10                      0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_double                                   9,108        9,111                   3    0.03%   x 1.00 
 bls12_377::fq::bench_fq_from_repr                                59           38                    -21  -35.59%   x 1.55 
 bls12_377::fq::bench_fq_into_repr                                32           31                     -1   -3.12%   x 1.03 
 bls12_377::fq::bench_fq_inverse                                  13,708       17,631              3,923   28.62%   x 0.78 
 bls12_377::fq::bench_fq_mul_assign                               51,780       30,340            -21,440  -41.41%   x 1.71 
 bls12_377::fq::bench_fq_negate                                   13           14                      1    7.69%   x 0.93 
 bls12_377::fq::bench_fq_repr_add_nocarry                         8            10                      2   25.00%   x 0.80 
 bls12_377::fq::bench_fq_repr_div2                                6            6                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_mul2                                5            13                      8  160.00%   x 0.38 
 bls12_377::fq::bench_fq_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_377::fq::bench_fq_repr_sub_noborrow                        9            11                      2   22.22%   x 0.82 
 bls12_377::fq::bench_fq_sqrt                                     77,501       54,993            -22,508  -29.04%   x 1.41 
 bls12_377::fq::bench_fq_square                                   43,704       30,089            -13,615  -31.15%   x 1.45 
 bls12_377::fq::bench_fq_sub_assign                               11           11                      0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_add_assign                               7            7                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_double                                   10,316       5,832              -4,484  -43.47%   x 1.77 
 bls12_377::fr::bench_fr_from_repr                                31           22                     -9  -29.03%   x 1.41 
 bls12_377::fr::bench_fr_into_repr                                20           19                     -1   -5.00%   x 1.05 
 bls12_377::fr::bench_fr_inverse                                  7,516        5,451              -2,065  -27.47%   x 1.38 
 bls12_377::fr::bench_fr_mul_assign                               25,583       16,173             -9,410  -36.78%   x 1.58 
 bls12_377::fr::bench_fr_negate                                   10           7                      -3  -30.00%   x 1.43 
 bls12_377::fr::bench_fr_repr_add_nocarry                         6            7                       1   16.67%   x 0.86 
 bls12_377::fr::bench_fr_repr_div2                                4            4                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_mul2                                4            7                       3   75.00%   x 0.57 
 bls12_377::fr::bench_fr_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_377::fr::bench_fr_repr_sub_noborrow                        6            8                       2   33.33%   x 0.75 
 bls12_377::fr::bench_fr_sqrt                                     30,712       22,193             -8,519  -27.74%   x 1.38 
 bls12_377::fr::bench_fr_square                                   21,929       15,620             -6,309  -28.77%   x 1.40 
 bls12_377::fr::bench_fr_sub_assign                               8            7                      -1  -12.50%   x 1.14 
 bls12_377::pairing::pairing::bench_pairing_final_exponentiation  1,292,296    1,087,830        -204,466  -15.82%   x 1.19 
 bls12_377::pairing::pairing::bench_pairing_full                  2,235,430    1,845,299        -390,131  -17.45%   x 1.21 
 bls12_377::pairing::pairing::bench_pairing_miller_loop           631,072      510,977          -120,095  -19.03%   x 1.24 
 bls12_381::ec::g1::bench_g1_add_assign                           1,059,324    769,470          -289,854  -27.36%   x 1.38 
 bls12_381::ec::g1::bench_g1_add_assign_mixed                     758,426      534,781          -223,645  -29.49%   x 1.42 
 bls12_381::ec::g1::bench_g1_double                               539,424      419,918          -119,506  -22.15%   x 1.28 
 bls12_381::ec::g1::bench_g1_mul_assign                           274,129      206,160           -67,969  -24.79%   x 1.33 
 bls12_381::ec::g1::bench_g1_rand                                 274,444      204,898           -69,546  -25.34%   x 1.34 
 bls12_381::ec::g2::bench_g2_add_assign                           4,660        3,785                -875  -18.78%   x 1.23 
 bls12_381::ec::g2::bench_g2_add_assign_mixed                     3,267        2,691                -576  -17.63%   x 1.21 
 bls12_381::ec::g2::bench_g2_double                               2,175        1,871                -304  -13.98%   x 1.16 
 bls12_381::ec::g2::bench_g2_mul_assign                           1,116,571    936,037          -180,534  -16.17%   x 1.19 
 bls12_381::ec::g2::bench_g2_rand                                 1,084,845    919,647          -165,198  -15.23%   x 1.18 
 bls12_381::fq12::bench_fq12_add_assign                           130          125                    -5   -3.85%   x 1.04 
 bls12_381::fq12::bench_fq12_double                               126          112                   -14  -11.11%   x 1.12 
 bls12_381::fq12::bench_fq12_inverse                              24,139       25,376              1,237    5.12%   x 0.95 
 bls12_381::fq12::bench_fq12_mul_assign                           6,199        4,687              -1,512  -24.39%   x 1.32 
 bls12_381::fq12::bench_fq12_square                               4,118        3,150                -968  -23.51%   x 1.31 
 bls12_381::fq12::bench_fq12_sub_assign                           119          120                     1    0.84%   x 0.99 
 bls12_381::fq2::bench_fq2_add_assign                             18           20                      2   11.11%   x 0.90 
 bls12_381::fq2::bench_fq2_double                                 17           19                      2   11.76%   x 0.89 
 bls12_381::fq2::bench_fq2_inverse                                14,351       18,160              3,809   26.54%   x 0.79 
 bls12_381::fq2::bench_fq2_mul_assign                             225          162                   -63  -28.00%   x 1.39 
 bls12_381::fq2::bench_fq2_sqrt                                   124,951      83,086            -41,865  -33.51%   x 1.50 
 bls12_381::fq2::bench_fq2_square                                 177          142                   -35  -19.77%   x 1.25 
 bls12_381::fq2::bench_fq2_sub_assign                             19           19                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_add_assign                               10           10                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_double                                   9,240        9,062                -178   -1.93%   x 1.02 
 bls12_381::fq::bench_fq_from_repr                                59           39                    -20  -33.90%   x 1.51 
 bls12_381::fq::bench_fq_into_repr                                32           31                     -1   -3.12%   x 1.03 
 bls12_381::fq::bench_fq_inverse                                  14,949       17,811              2,862   19.15%   x 0.84 
 bls12_381::fq::bench_fq_mul_assign                               52,143       30,280            -21,863  -41.93%   x 1.72 
 bls12_381::fq::bench_fq_negate                                   14           14                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_add_nocarry                         8            10                      2   25.00%   x 0.80 
 bls12_381::fq::bench_fq_repr_div2                                6            6                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_mul2                                6            13                      7  116.67%   x 0.46 
 bls12_381::fq::bench_fq_repr_num_bits                            4            4                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_sub_noborrow                        9            11                      2   22.22%   x 0.82 
 bls12_381::fq::bench_fq_sqrt                                     61,857       38,010            -23,847  -38.55%   x 1.63 
 bls12_381::fq::bench_fq_square                                   44,295       31,291            -13,004  -29.36%   x 1.42 
 bls12_381::fq::bench_fq_sub_assign                               11           10                     -1   -9.09%   x 1.10 
 bls12_381::fr::bench_fr_add_assign                               7            7                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_double                                   10,284       5,845              -4,439  -43.16%   x 1.76 
 bls12_381::fr::bench_fr_from_repr                                31           22                     -9  -29.03%   x 1.41 
 bls12_381::fr::bench_fr_into_repr                                20           19                     -1   -5.00%   x 1.05 
 bls12_381::fr::bench_fr_inverse                                  7,600        5,440              -2,160  -28.42%   x 1.40 
 bls12_381::fr::bench_fr_mul_assign                               27,166       17,113            -10,053  -37.01%   x 1.59 
 bls12_381::fr::bench_fr_negate                                   10           8                      -2  -20.00%   x 1.25 
 bls12_381::fr::bench_fr_repr_add_nocarry                         6            7                       1   16.67%   x 0.86 
 bls12_381::fr::bench_fr_repr_div2                                4            5                       1   25.00%   x 0.80 
 bls12_381::fr::bench_fr_repr_mul2                                4            7                       3   75.00%   x 0.57 
 bls12_381::fr::bench_fr_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_381::fr::bench_fr_repr_sub_noborrow                        7            8                       1   14.29%   x 0.88 
 bls12_381::fr::bench_fr_sqrt                                     27,321       19,871             -7,450  -27.27%   x 1.37 
 bls12_381::fr::bench_fr_square                                   24,908       16,834             -8,074  -32.42%   x 1.48 
 bls12_381::fr::bench_fr_sub_assign                               8            7                      -1  -12.50%   x 1.14 
 bls12_381::pairing::pairing::bench_pairing_final_exponentiation  1,072,725    847,028          -225,697  -21.04%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_full                  1,871,067    1,472,014        -399,053  -21.33%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_miller_loop           546,691      413,627          -133,064  -24.34%   x 1.32 
 sw6::ec::g1::bench_g1_add_assign                                 4,185,546    4,048,507        -137,039   -3.27%   x 1.03 
 sw6::ec::g1::bench_g1_add_assign_mixed                           2,963,674    2,836,688        -126,986   -4.28%   x 1.04 
 sw6::ec::g1::bench_g1_double                                     2,624,621    2,602,247         -22,374   -0.85%   x 1.01 
 sw6::ec::g1::bench_g1_mul_assign                                 1,795,032    1,730,592         -64,440   -3.59%   x 1.04 
 sw6::ec::g1::bench_g1_rand                                       1,798,136    1,731,877         -66,259   -3.68%   x 1.04 
 sw6::ec::g2::bench_g2_add_assign                                 4,600        3,771                -829  -18.02%   x 1.22 
 sw6::ec::g2::bench_g2_add_assign_mixed                           3,206        2,663                -543  -16.94%   x 1.20 
 sw6::ec::g2::bench_g2_double                                     2,140        1,863                -277  -12.94%   x 1.15 
 sw6::ec::g2::bench_g2_mul_assign                                 1,098,494    935,481          -163,013  -14.84%   x 1.17 
 sw6::ec::g2::bench_g2_rand                                       1,076,092    915,453          -160,639  -14.93%   x 1.18 
 sw6::fq3::bench_fq3_add_assign                                   75           62                    -13  -17.33%   x 1.21 
 sw6::fq3::bench_fq3_double                                       50           55                      5   10.00%   x 0.91 
 sw6::fq3::bench_fq3_inverse                                      50,737       42,375             -8,362  -16.48%   x 1.20 
 sw6::fq3::bench_fq3_mul_assign                                   2,121        2,046                 -75   -3.54%   x 1.04 
 sw6::fq3::bench_fq3_sqrt                                         3,321,726    3,201,614        -120,112   -3.62%   x 1.04 
 sw6::fq3::bench_fq3_square                                       1,696        1,645                 -51   -3.01%   x 1.03 
 sw6::fq3::bench_fq3_sub_assign                                   67           66                     -1   -1.49%   x 1.02 
 sw6::fq6::bench_fq6_add_assign                                   144          114                   -30  -20.83%   x 1.26 
 sw6::fq6::bench_fq6_double                                       96           96                      0    0.00%   x 1.00 
 sw6::fq6::bench_fq6_inverse                                      58,863       50,288             -8,575  -14.57%   x 1.17 
 sw6::fq6::bench_fq6_mul_assign                                   7,118        6,819                -299   -4.20%   x 1.04 
 sw6::fq6::bench_fq6_square                                       5,248        4,976                -272   -5.18%   x 1.05 
 sw6::fq6::bench_fq6_sub_assign                                   131          127                    -4   -3.05%   x 1.03 
 sw6::fq::bench_fq_add_assign                                     22           19                     -3  -13.64%   x 1.16 
 sw6::fq::bench_fq_double                                         16,932       17,126                194    1.15%   x 0.99 
 sw6::fq::bench_fq_from_repr                                      245          248                     3    1.22%   x 0.99 
 sw6::fq::bench_fq_into_repr                                      138          132                    -6   -4.35%   x 1.05 
 sw6::fq::bench_fq_inverse                                        50,218       38,487            -11,731  -23.36%   x 1.30 
 sw6::fq::bench_fq_mul_assign                                     242,531      240,596            -1,935   -0.80%   x 1.01 
 sw6::fq::bench_fq_negate                                         24           22                     -2   -8.33%   x 1.09 
 sw6::fq::bench_fq_repr_add_nocarry                               14           14                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_repr_div2                                      15           7                      -8  -53.33%   x 2.14 
 sw6::fq::bench_fq_repr_mul2                                      9            13                      4   44.44%   x 0.69 
 sw6::fq::bench_fq_repr_num_bits                                  4            3                      -1  -25.00%   x 1.33 
 sw6::fq::bench_fq_repr_sub_noborrow                              17           17                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_sqrt                                           526,690      518,031            -8,659   -1.64%   x 1.02 
 sw6::fq::bench_fq_square                                         208,952      210,092             1,140    0.55%   x 0.99 
 sw6::fq::bench_fq_sub_assign                                     19           19                      0    0.00%   x 1.00 
 sw6::fr::bench_fr_add_assign                                     10           10                      0    0.00%   x 1.00 
 sw6::fr::bench_fr_double                                         9,174        9,065                -109   -1.19%   x 1.01 
 sw6::fr::bench_fr_from_repr                                      58           38                    -20  -34.48%   x 1.53 
 sw6::fr::bench_fr_into_repr                                      32           31                     -1   -3.12%   x 1.03 
 sw6::fr::bench_fr_inverse                                        13,712       17,668              3,956   28.85%   x 0.78 
 sw6::fr::bench_fr_mul_assign                                     50,664       30,005            -20,659  -40.78%   x 1.69 
 sw6::fr::bench_fr_negate                                         13           14                      1    7.69%   x 0.93 
 sw6::fr::bench_fr_repr_add_nocarry                               8            10                      2   25.00%   x 0.80 
 sw6::fr::bench_fr_repr_div2                                      6            6                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_repr_mul2                                      5            13                      8  160.00%   x 0.38 
 sw6::fr::bench_fr_repr_num_bits                                  4            3                      -1  -25.00%   x 1.33 
 sw6::fr::bench_fr_repr_sub_noborrow                              9            11                      2   22.22%   x 0.82 
 sw6::fr::bench_fr_sqrt                                           77,926       54,407            -23,519  -30.18%   x 1.43 
 sw6::fr::bench_fr_square                                         42,764       30,561            -12,203  -28.54%   x 1.40 
 sw6::fr::bench_fr_sub_assign                                     11           10                     -1   -9.09%   x 1.10 
 sw6::pairing::pairing::bench_pairing_final_exponentiation        8,881,217    8,317,250        -563,967   -6.35%   x 1.07 
 sw6::pairing::pairing::bench_pairing_full                        97,824,713   86,015,907    -11,808,806  -12.07%   x 1.14 
 sw6::pairing::pairing::bench_pairing_miller_loop                 88,591,942   77,475,292    -11,116,650  -12.55%   x 1.14 

@Pratyush
Member

Pratyush commented Apr 8, 2020

This speedup is amazing! Thanks for your work on this!

Couple of quick points:

  • Let's add an off-by-default asm feature to Cargo.toml, both in algebra-core and in algebra. The idea is to gate the use of assembly behind this feature, so that by default, all arithmetic (even for fields with fewer than 8 limbs) is safe. Then, downstream users can opt into using assembly by enabling the feature. This way we can also ensure that the library compiles on stable by default, because inline assembly requires nightly AFAIK.

  • Along with the asm feature, we should also use target_arch and target_feature to enable this assembly only on x86-64 architectures that support the assembly instructions we use.

  • Do you think replacing build.rs with a custom proc-macro that directly generated the assembly would be a cleaner approach? (Currently we have a build.rs that generates a macro that creates the assembly.)

  • Is it possible to run the code through a sanitizer or fuzzer? It might catch some safety issues. This can be done after we merge the PR, because it'll be behind a feature-gate.
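A rough sketch of how the gating in the first two points might look (the feature name `asm` is from the discussion above; the function and module names are assumptions, not code from this PR):

```rust
// Cargo.toml (off by default):
//
//   [features]
//   asm = []
//
// Field code: take the assembly path only when the feature is enabled and
// the target actually supports the required instructions; otherwise use
// the existing portable path, which also keeps stable-channel builds working.
#[cfg(all(feature = "asm", target_arch = "x86_64",
          target_feature = "bmi2", target_feature = "adx"))]
fn mul_assign(a: &mut [u64; 6], b: &[u64; 6]) {
    unsafe { asm_mul_6(a, b) } // hypothetical generated assembly routine
}

#[cfg(not(all(feature = "asm", target_arch = "x86_64",
              target_feature = "bmi2", target_feature = "adx")))]
fn mul_assign(a: &mut [u64; 6], b: &[u64; 6]) {
    portable_montgomery_mul(a, b) // existing pure-Rust path (hypothetical name)
}
```

Note that `target_feature` checks are resolved at compile time, which matches the PR's requirement of building with `-C target-feature=+bmi2,+adx`.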

@Pratyush
Member

Pratyush commented Apr 8, 2020

I suspect the major bottlenecks are in adc, sbb, negate. Negate is much slower than sub_assign. This I suspect could be fixed with a little more inline assembly, since it's unclear whether the compiler knows how to emit adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave off 3ns per add/sub, we can probably improve performance by another 5%.

It seems that now field addition is only ~1.6x faster than field multiplication, which is surprising. Do you think a similar speedup is possible for addition too?

@Pratyush
Member

Pratyush commented Apr 9, 2020

Oh, and it would also be nice to get some more comments on the structure of the code, and maybe some references to any prior work that this might be based on (to make auditing and comparison easier).

@jon-chuang
Copy link
Contributor Author

Link: examples of fuzzing with cargo-fuzz in wasmer (wasmer also does a lot of inline assembly/unsafe code)

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 16, 2020

Hi @Pratyush, my apologies for not getting back to you sooner; things were a work in progress. I have made progress on feature-gating and forcing explicit target declarations.

I have also created abstractions that make the code look cleaner, but I am still using a build script for codegen, as it is simpler. I did, however, resort to using procedural macros to obtain something close to a DSL, similar to xbyak (although that is far more sophisticated). Apart from a few macro calls to define arrays of &str tokens, one simply needs to call #[assemble] over the function signature to invoke the DSL:

// Clear RCX and, as a side effect, the CF/OF carry flags.
xorq(RCX, RCX);
for i in 0..limbs {
    if i == 0 {
        // First outer iteration: compute a[0] * b into the register window R.
        movq(a1[0], RDX);
        mulxq(b1[0], R[0], R[1]);
        for j in 1..limbs - 1 {
            mulxq(b1[j], RAX, R[((j + 1) % limbs)]);
            adcxq(RAX, R[j]);
        }
        mulxq(b1[limbs - 1], RAX, RCX);
        movq(zero, RBX);
        adcxq(RAX, R[limbs - 1]);
        adcxq(RBX, RCX);
    } else {
        // Later iterations: multiply-accumulate a[i] * b into R, using the
        // two independent carry chains of adcxq (CF) and adoxq (OF).
        movq(a1[i], RDX);
        for j in 0..limbs - 1 {
            mulxq(b1[j], RAX, RBX);
            adcxq(RAX, R[(j + i) % limbs]);
            adoxq(RBX, R[(j + i + 1) % limbs]);
        }
        mulxq(b1[limbs - 1], RAX, RCX);
        movq(zero, RBX);
        adcxq(RAX, R[(i + limbs - 1) % limbs]);
        adoxq(RBX, RCX);
        adcxq(RBX, RCX);
    }
    // Interleaved Montgomery reduction: m = R[i] * inv mod 2^64, then add
    // m * modulus so the low limb cancels and the window shifts by one.
    movq(inverse, RDX);
    mulxq(R[i], RDX, RAX);
    mulxq(m1[0], RAX, RBX);
    adcxq(R[i % limbs], RAX);
    adoxq(RBX, R[(i + 1) % limbs]);
    for j in 1..limbs - 1 {
        mulxq(m1[j], RAX, RBX);
        adcxq(RAX, R[(j + i) % limbs]);
        adoxq(RBX, R[(j + i + 1) % limbs]);
    }
    mulxq(m1[limbs - 1], RAX, R[i % limbs]);
    movq(zero, RBX);
    adcxq(RAX, R[(i + limbs - 1) % limbs]);
    adoxq(RCX, R[i % limbs]);
    adcxq(RBX, R[i % limbs]);
}
// Write the accumulated result back out to a.
for i in 0..limbs {
    movq(R[i], a1[i]);
}

With a bit more toying around, it may be possible to have a function-like procedural macro, similar to ordinary macros, that generates the code directly. However, there is an advantage to the current way of doing things: one can easily inspect the generated assembly code.

There are also limitations with my current method, such as requiring everything to be in the context of the current function body, which could lead to code bloat. Hence, I am exploring user-defined TokenStreams that are consumed by a procedural macro to inline the macro routines as required. Other extensions include switching from AT&T syntax to Intel syntax based on an attribute, e.g. #[assemble(Intel)]. I have separated the procedural macro portion of the code into a new crate, called mince. A somewhat grotesque take on "assemblers".

Remaining work, stated above, is to add support for more routines, as well as extend efficiently to more limbs as per kilic's work.

Comment on lines 52 to 56
full_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "sw6", "mnt4_298", "mnt4_753", "mnt6_298", "mnt6_753", "edwards_bls12", "edwards_sw6", "jubjub" ]
small_asm = ["algebra-core/asm", "mnt4_298", "mnt6_298" ]
mid_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "edwards_bls12"]
big_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "mnt6_753", "edwards_sw6" ]
mix_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "bls12_381", "mnt6_298" ]
Copy link
Member


I don't think we need these features; because features are additive, if you enable asm and any of these other features, the underlying arithmetic will automatically use the asm routines.

Copy link
Contributor Author

@jon-chuang jon-chuang Apr 22, 2020


I don't think that is the case, at least not with the current cfg logic. If I try to run

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly test --features asm bls12_381

none of the tests run.

OK, it looks like you need to quote all the features together in a single string. This should be noted in the README.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 22, 2020

@Pratyush To avoid code drift, I manually merged my suggested changes from #188 into this PR, and also added an n_fold bench cfg and cleaned up the benches with code-reusing macros (especially unifying all fields) to make them much more ergonomic. As a result, implementing benchmarks for all the MNT curves took less than a minute.

However, there was already some drift, and I may not have handled the merge of some serialisation/repr changes correctly.

I suggest that the above changes I proposed be saved for a future PR, as I have to focus on my schoolwork for the next 6+ weeks.

@paberr I added benchmarking support for MNT curves, let me know if there are any issues.

@Pratyush
Copy link
Member

@jon-chuang, is the following an accurate summary of the architecture?

There are two crates:

  • mince
  • field_assembly

mince defines a kind of DSL for writing assembly programs, and uses this DSL to define three higher-level chunks: mul_1, mul_add_1, and mul_add_shift_1. These are used by field_assembly to define the multiplication routines.
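Judging from the names alone (so this is an assumption about their semantics, not taken from mince itself), chunks like mul_add_1 chain a scalar multiply-accumulate step across the limbs; that step, expressed in plain Rust rather than the DSL, looks roughly like:

```rust
// Hypothetical scalar step corresponding to what a mul_add_1-style chunk
// chains across limbs: returns (low limb, carry) of a + b * c + carry.
// The name and signature are assumptions, not mince's actual API.
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

fn main() {
    // Even the worst case fits in u128: the result is exactly 2^128 - 1.
    let (lo, hi) = mac(u64::MAX, u64::MAX, u64::MAX, u64::MAX);
    println!("{} {}", lo, hi);
}
```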

field_assembly defines a macro asm_mul roughly of the following form:

asm_mul ($limbs) {
	2 => {{...}},
	...
}

algebra-core now has a build.rs which uses field_assembly to generate a file containing this macro, and then field/arithmetic.rs includes this file depending on the asm feature flag.

If this understanding is correct, it seems to me that it might be possible to eliminate the build.rs and simplify the architecture as follows:

Let s be the String that is the output of the i-th iteration of the body of the loop in generate_matches (i.e. s contains the contents of one match arm in the above asm_mul macro). If one now calls syn::parse(s), one obtains a TokenStream which contains the assembly implementation of the multiplication routine for fields with i limbs. If we define a function that outputs this TokenStream, then we have obtained a proc-macro that takes as input the number of limbs, and outputs the optimized assembly implementation. (We can define it as a function-like macro to make usage particularly pleasant)

There are also ergonomic benefits to this: we can throw compile errors when the proc macro receives an unsupported number of limbs as input, instead of run-time errors as is the case right now.
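As a rough sketch of that error-reporting benefit (names here are hypothetical; a real proc macro would surface the error arm as a compile_error! at the call site rather than return a Result), the dispatch at the heart of such a macro might look like:

```rust
// Hypothetical core of the proposed function-like proc macro: map a limb
// count to generated source, rejecting unsupported counts up front. In a
// real proc macro the Err arm would become a compile-time error at the
// call site instead of a run-time failure.
fn generate_mul_asm(limbs: usize) -> Result<String, String> {
    const MAX_LIMBS: usize = 8; // current upper bound for asm mul
    if !(2..=MAX_LIMBS).contains(&limbs) {
        return Err(format!("no asm mul routine for {} limbs", limbs));
    }
    // Stand-in for the i-th match-arm body produced by generate_matches,
    // which syn::parse_str would turn into a TokenStream.
    Ok(format!("llvm_asm!(/* {}-limb Montgomery mul */)", limbs))
}

fn main() {
    println!("{:?}", generate_mul_asm(4));
    println!("{:?}", generate_mul_asm(9));
}
```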

Let me know what you think of this idea. I think it's worth pursuing in a follow-up PR, but if you've tried this approach and run into problems I'd like to hear about that as well =)

@jon-chuang
Copy link
Contributor Author

@Pratyush You're right on the money with the code organisation. While what you suggest has its pros, as mentioned previously, one of the upsides of the current strategy is that one can inspect the generated code easily. Nonetheless, I will take your suggestions into consideration for a future PR.

@Pratyush
Copy link
Member

Pratyush commented Apr 23, 2020

Hmm, cargo expand does allow you to view the post-macro-expansion code, but you're right that it's not as nice as having a generated Rust file. It's a good point to think about. (There are other potential benefits to using proc macros; for instance, in the future we might want to migrate to something like this to help reduce compile times. However, let's leave it to a future PR.)

@Pratyush
Copy link
Member

By the way, the latest version of Rust allows using #[cfg(...)] on if-statements, so maybe that will eliminate the issue with needing to wrap the if-statements in blocks:

https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1430-2020-04-23

@jon-chuang
Copy link
Contributor Author

I gather this is stable?

@Pratyush
Copy link
Member

Yes, it’s the latest stable release

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 24, 2020

I'm still getting the 04-20 build with rustup update. The compiler warning is not due to these extra brackets (I can't identify the actual problem), but errors can result from their absence. For backwards compatibility with people building on older stable, I suggest we keep things the way they are.

(sorry, accidental close)

@jon-chuang jon-chuang closed this Apr 24, 2020
@jon-chuang jon-chuang reopened this Apr 24, 2020
@Pratyush
Copy link
Member

Pratyush commented May 4, 2020

@jon-chuang I'd like to merge this as is to prevent the PR from getting out of sync with master. I've done a review, and the performance and generated code seems great. Because it's gated behind an off-by-default feature, and because it's available only on nightly, I think merging is a low-risk option.

@Pratyush Pratyush changed the title Added asm code generation for montgomery multiplication and squaring + Field and Bench cleanup/refactoring Add x86_64 asm codegen for PrimeField mul and square May 4, 2020
@Pratyush Pratyush merged commit 4369be9 into arkworks-rs:master May 4, 2020
@jon-chuang
Copy link
Contributor Author

Awesome! Thanks for helping cleanup the code.
