Add x86_64 asm codegen for PrimeField mul and square #176

jon-chuang · 2020-04-08T12:27:42Z

I was able to achieve a 1.7x speedup on 6-limb mul_assign against my previous pull request #163, bringing the total speedup to about 1.84x. This is achieved with a generic code-generation script utilising build.rs that generates x64 assembly utilising the mulxq, adoxq and adcxq instructions for 2 up to 8 limbs.

Update: was able to extend work to 18 limbs, achieving 1.34x speedup on Fp832. May be worth looking into translating the old square_in_place into assembly, but this is a significant amount of work.

Update: Put behind feature. Build with:

RUSTFLAGS="--emit=asm -C target-feature=+bmi2,+adx" cargo +nightly bench/test/build --features asm

Next steps: more limbs with less data movement using kilic's strategy. Extend strategy to square_in_place.

Details

I implemented the assembly version of square_in_place as applying mul_assign to itself. That's because the goff squaring formula is not really good, while the original squaring formula requires twice the number of registers and hence goes beyond the scope of this PR, and may in fact be slower in the end, as this PR takes massive advantage of the add and carry chains present in mul_assign.

Update: I cleaned up the messy build.rs by creating a standalone subcrate and importing the generate function.

Future work

Currently, the maximum number of limbs supported is 8, and could be possibly be extended by current methods to about 10 for mul and 11 for squaring. However, going beyond that may require additional work to optimally move data to and from registers into memory (i.e. L1 cache). One should study how a compiler would reason about this to write a reasonable code generation procedure script to do this.

Edit: our data movement is extremely suboptimal. https://raw.githubusercontent.com/kilic/fp/master/generic/x86_arithmetic.s is able to achieve 149ns on 13 limbs, whereas we achieve 190ns. There's a lot to learn here.

We are also somehow still trailing behind MCL, probably the gold-standard for pairings. Here are the benchmarks, run on a similar i7-7700 to mine. Although our Fp performance totally outclasses it, everything else is lagging behind. It manages to achieve a staggering ~0.7ms BLS12_381 pairing (?). Their webassembly target achieves a sickening 3.5ms on my Chrome browser.

I suspect there could be bottlenecks in adc, sbb, negate. Negate is much slower than sub_assign. This I suspect could be fixed with a little more inline assembly, since it's unclear the compiler knows how to do adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave of 3ns per add/sub, we can probably improve performance by another 5%. (Given this, it may indeed be worthwhile to write a procedural macro and do a cleanup. This may turn out not to be possible. But it is worth trying.)

Areas for Improvement

This however does not explain the discrepancy. Looking at the benchmarks more carefully, for BLS12_381, Fq12 mul is ~6800ns, whereas for MCL it is ~3000ns. We're better for Miller loop, 413us vs 527us (but not the precomputed Miller loop (fixed Q in G12) - 405us (which are we using?)). Nonetheless, those timings were for the BN curve, so really we need to improve the Miller loop (7500Mq) down to about 350us.

But this makes it apparent that the significant weakness is in the final exponentiation (9000Mq). We must get timings down from 850us down to 414us, a 2x speedup. Part of this is probably due to our slow Fq inverse timings. This is clear also by examining pairing formulas, where the number of field ops for the final exponentiation should be roughly similar to the Miller loop, but for us the former has a timing double the latter. I estimate fixing this could give another 35% boost to our pairings.

G2 should also be fixed - 25% slowdown - 2.7us vs 2.1us. for add, 1.85us vs 1.49us for double. Our mul assign is apparently ~3x slower (930us vs 360us).

G1 add 529us vs MCL's 472us. Suggests, as suspected, some inefficiency (probably from the very slow Fq doubling). Our G1 doubling beats it however. Should investigate why. Mul is 25% slower (v.s. 154us), but should be fixable with standalone pippenger.

MCL also implements GLV.

I will try to study MCL and obtain more detailed and up to date benchmarks for it.

Main results:
384-bit mul_assign

bls12_377::fq::bench_fq_mul_assign                               51,780       30,340            -21,440  -41.41%   x 1.71
bls12_381::fq::bench_fq_mul_assign                               52,143       30,280            -21,863  -41.93%   x 1.72
sw6::fr::bench_fr_mul_assign                                     50,664       30,005            -20,659  -40.78%   x 1.69

384-bit square_in_place:

bls12_377::fq::bench_fq_square                                   43,704       30,089            -13,615  -31.15%   x 1.45
bls12_381::fq::bench_fq_square                                   44,295       31,291            -13,004  -29.36%   x 1.42
sw6::fr::bench_fr_square                                         42,764       30,561            -12,203  -28.54%   x 1.40

256-bit mul_assign:

bls12_377::fr::bench_fr_mul_assign                               25,583       16,173             -9,410  -36.78%   x 1.58
bls12_381::fr::bench_fr_mul_assign                               27,166       17,113            -10,053  -37.01%   x 1.59

256-bit square_in_place:

bls12_377::fr::bench_fr_square                                   21,929       15,620             -6,309  -28.77%   x 1.40
bls12_381::fr::bench_fr_square                                   24,908       16,834             -8,074  -32.42%   x 1.48

384-bit G1:

 bls12_377::ec::g1::bench_g1_add_assign                           1,048,201    755,061          -293,140  -27.97%   x 1.39 
 bls12_377::ec::g1::bench_g1_add_assign_mixed                     749,362      529,950          -219,412  -29.28%   x 1.41 
 bls12_377::ec::g1::bench_g1_double                               535,695      418,700          -116,995  -21.84%   x 1.28 
 bls12_377::ec::g1::bench_g1_mul_assign                           269,918      202,057           -67,861  -25.14%   x 1.34
 bls12_381::ec::g1::bench_g1_add_assign                           1,059,324    769,470          -289,854  -27.36%   x 1.38 
 bls12_381::ec::g1::bench_g1_add_assign_mixed                     758,426      534,781          -223,645  -29.49%   x 1.42 
 bls12_381::ec::g1::bench_g1_double                               539,424      419,918          -119,506  -22.15%   x 1.28 
 bls12_381::ec::g1::bench_g1_mul_assign                           274,129      206,160           -67,969  -24.79%   x 1.33 
 bls12_381::ec::g1::bench_g1_rand                                 274,444      204,898           -69,546  -25.34%   x 1.34

384-bit pairings:

 bls12_377::pairing::pairing::bench_pairing_final_exponentiation  1,292,296    1,087,830        -204,466  -15.82%   x 1.19 
 bls12_377::pairing::pairing::bench_pairing_full                  2,235,430    1,845,299        -390,131  -17.45%   x 1.21 
 bls12_377::pairing::pairing::bench_pairing_miller_loop           631,072      510,977          -120,095  -19.03%   x 1.24
 bls12_381::pairing::pairing::bench_pairing_final_exponentiation  1,072,725    847,028          -225,697  -21.04%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_full                  1,871,067    1,472,014        -399,053  -21.33%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_miller_loop           546,691      413,627          -133,064  -24.34%   x 1.32

SW6 (updated)

sw6::fq::bench_fq_mul_assign                               253,632               189,117               -64,515  -25.44%   x 1.34
sw6::fq::bench_fq_square                                   208,612               192,834               -15,778   -7.56%   x 1.08
sw6::ec::g1::bench_g1_add_assign                           4,087,676             3,464,448            -623,228  -15.25%   x 1.18 
sw6::ec::g1::bench_g1_add_assign_mixed                     2,888,104             2,429,125            -458,979  -15.89%   x 1.19 
sw6::ec::g1::bench_g1_double                               2,608,558             2,417,105            -191,453   -7.34%   x 1.08 
sw6::ec::g1::bench_g1_mul_assign                           1,758,755             1,576,919            -181,836  -10.34%   x 1.12
sw6::pairing::pairing::bench_pairing_final_exponentiation  8,648,924             7,737,242            -911,682  -10.54%   x 1.12 
sw6::pairing::pairing::bench_pairing_full                  95,098,636            85,313,158         -9,785,478  -10.29%   x 1.11 
sw6::pairing::pairing::bench_pairing_miller_loop           86,559,010            77,266,953         -9,292,057  -10.73%   x 1.12

Full results

 bls12_377::ec::g1::bench_g1_add_assign                           1,048,201    755,061          -293,140  -27.97%   x 1.39 
 bls12_377::ec::g1::bench_g1_add_assign_mixed                     749,362      529,950          -219,412  -29.28%   x 1.41 
 bls12_377::ec::g1::bench_g1_double                               535,695      418,700          -116,995  -21.84%   x 1.28 
 bls12_377::ec::g1::bench_g1_mul_assign                           269,918      202,057           -67,861  -25.14%   x 1.34 
 bls12_377::ec::g1::bench_g1_rand                                 267,739      202,269           -65,470  -24.45%   x 1.32 
 bls12_377::ec::g2::bench_g2_add_assign                           4,612        3,798                -814  -17.65%   x 1.21 
 bls12_377::ec::g2::bench_g2_add_assign_mixed                     3,232        2,703                -529  -16.37%   x 1.20 
 bls12_377::ec::g2::bench_g2_double                               2,158        1,877                -281  -13.02%   x 1.15 
 bls12_377::ec::g2::bench_g2_mul_assign                           1,102,020    937,104          -164,916  -14.96%   x 1.18 
 bls12_377::ec::g2::bench_g2_rand                                 1,080,859    922,946          -157,913  -14.61%   x 1.17 
 bls12_377::fq12::bench_fq12_add_assign                           133          131                    -2   -1.50%   x 1.02 
 bls12_377::fq12::bench_fq12_double                               126          114                   -12   -9.52%   x 1.11 
 bls12_377::fq12::bench_fq12_inverse                              25,742       28,005              2,263    8.79%   x 0.92 
 bls12_377::fq12::bench_fq12_mul_assign                           6,874        5,566              -1,308  -19.03%   x 1.23 
 bls12_377::fq12::bench_fq12_square                               4,723        3,882                -841  -17.81%   x 1.22 
 bls12_377::fq12::bench_fq12_sub_assign                           119          116                    -3   -2.52%   x 1.03 
 bls12_377::fq2::bench_fq2_add_assign                             17           16                     -1   -5.88%   x 1.06 
 bls12_377::fq2::bench_fq2_double                                 17           20                      3   17.65%   x 0.85 
 bls12_377::fq2::bench_fq2_inverse                                13,930       18,062              4,132   29.66%   x 0.77 
 bls12_377::fq2::bench_fq2_mul_assign                             249          191                   -58  -23.29%   x 1.30 
 bls12_377::fq2::bench_fq2_sqrt                                   127,995      96,823            -31,172  -24.35%   x 1.32 
 bls12_377::fq2::bench_fq2_square                                 231          227                    -4   -1.73%   x 1.02 
 bls12_377::fq2::bench_fq2_sub_assign                             19           19                      0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_add_assign                               10           10                      0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_double                                   9,108        9,111                   3    0.03%   x 1.00 
 bls12_377::fq::bench_fq_from_repr                                59           38                    -21  -35.59%   x 1.55 
 bls12_377::fq::bench_fq_into_repr                                32           31                     -1   -3.12%   x 1.03 
 bls12_377::fq::bench_fq_inverse                                  13,708       17,631              3,923   28.62%   x 0.78 
 bls12_377::fq::bench_fq_mul_assign                               51,780       30,340            -21,440  -41.41%   x 1.71 
 bls12_377::fq::bench_fq_negate                                   13           14                      1    7.69%   x 0.93 
 bls12_377::fq::bench_fq_repr_add_nocarry                         8            10                      2   25.00%   x 0.80 
 bls12_377::fq::bench_fq_repr_div2                                6            6                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_mul2                                5            13                      8  160.00%   x 0.38 
 bls12_377::fq::bench_fq_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_377::fq::bench_fq_repr_sub_noborrow                        9            11                      2   22.22%   x 0.82 
 bls12_377::fq::bench_fq_sqrt                                     77,501       54,993            -22,508  -29.04%   x 1.41 
 bls12_377::fq::bench_fq_square                                   43,704       30,089            -13,615  -31.15%   x 1.45 
 bls12_377::fq::bench_fq_sub_assign                               11           11                      0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_add_assign                               7            7                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_double                                   10,316       5,832              -4,484  -43.47%   x 1.77 
 bls12_377::fr::bench_fr_from_repr                                31           22                     -9  -29.03%   x 1.41 
 bls12_377::fr::bench_fr_into_repr                                20           19                     -1   -5.00%   x 1.05 
 bls12_377::fr::bench_fr_inverse                                  7,516        5,451              -2,065  -27.47%   x 1.38 
 bls12_377::fr::bench_fr_mul_assign                               25,583       16,173             -9,410  -36.78%   x 1.58 
 bls12_377::fr::bench_fr_negate                                   10           7                      -3  -30.00%   x 1.43 
 bls12_377::fr::bench_fr_repr_add_nocarry                         6            7                       1   16.67%   x 0.86 
 bls12_377::fr::bench_fr_repr_div2                                4            4                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_mul2                                4            7                       3   75.00%   x 0.57 
 bls12_377::fr::bench_fr_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_377::fr::bench_fr_repr_sub_noborrow                        6            8                       2   33.33%   x 0.75 
 bls12_377::fr::bench_fr_sqrt                                     30,712       22,193             -8,519  -27.74%   x 1.38 
 bls12_377::fr::bench_fr_square                                   21,929       15,620             -6,309  -28.77%   x 1.40 
 bls12_377::fr::bench_fr_sub_assign                               8            7                      -1  -12.50%   x 1.14 
 bls12_377::pairing::pairing::bench_pairing_final_exponentiation  1,292,296    1,087,830        -204,466  -15.82%   x 1.19 
 bls12_377::pairing::pairing::bench_pairing_full                  2,235,430    1,845,299        -390,131  -17.45%   x 1.21 
 bls12_377::pairing::pairing::bench_pairing_miller_loop           631,072      510,977          -120,095  -19.03%   x 1.24 
 bls12_381::ec::g1::bench_g1_add_assign                           1,059,324    769,470          -289,854  -27.36%   x 1.38 
 bls12_381::ec::g1::bench_g1_add_assign_mixed                     758,426      534,781          -223,645  -29.49%   x 1.42 
 bls12_381::ec::g1::bench_g1_double                               539,424      419,918          -119,506  -22.15%   x 1.28 
 bls12_381::ec::g1::bench_g1_mul_assign                           274,129      206,160           -67,969  -24.79%   x 1.33 
 bls12_381::ec::g1::bench_g1_rand                                 274,444      204,898           -69,546  -25.34%   x 1.34 
 bls12_381::ec::g2::bench_g2_add_assign                           4,660        3,785                -875  -18.78%   x 1.23 
 bls12_381::ec::g2::bench_g2_add_assign_mixed                     3,267        2,691                -576  -17.63%   x 1.21 
 bls12_381::ec::g2::bench_g2_double                               2,175        1,871                -304  -13.98%   x 1.16 
 bls12_381::ec::g2::bench_g2_mul_assign                           1,116,571    936,037          -180,534  -16.17%   x 1.19 
 bls12_381::ec::g2::bench_g2_rand                                 1,084,845    919,647          -165,198  -15.23%   x 1.18 
 bls12_381::fq12::bench_fq12_add_assign                           130          125                    -5   -3.85%   x 1.04 
 bls12_381::fq12::bench_fq12_double                               126          112                   -14  -11.11%   x 1.12 
 bls12_381::fq12::bench_fq12_inverse                              24,139       25,376              1,237    5.12%   x 0.95 
 bls12_381::fq12::bench_fq12_mul_assign                           6,199        4,687              -1,512  -24.39%   x 1.32 
 bls12_381::fq12::bench_fq12_square                               4,118        3,150                -968  -23.51%   x 1.31 
 bls12_381::fq12::bench_fq12_sub_assign                           119          120                     1    0.84%   x 0.99 
 bls12_381::fq2::bench_fq2_add_assign                             18           20                      2   11.11%   x 0.90 
 bls12_381::fq2::bench_fq2_double                                 17           19                      2   11.76%   x 0.89 
 bls12_381::fq2::bench_fq2_inverse                                14,351       18,160              3,809   26.54%   x 0.79 
 bls12_381::fq2::bench_fq2_mul_assign                             225          162                   -63  -28.00%   x 1.39 
 bls12_381::fq2::bench_fq2_sqrt                                   124,951      83,086            -41,865  -33.51%   x 1.50 
 bls12_381::fq2::bench_fq2_square                                 177          142                   -35  -19.77%   x 1.25 
 bls12_381::fq2::bench_fq2_sub_assign                             19           19                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_add_assign                               10           10                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_double                                   9,240        9,062                -178   -1.93%   x 1.02 
 bls12_381::fq::bench_fq_from_repr                                59           39                    -20  -33.90%   x 1.51 
 bls12_381::fq::bench_fq_into_repr                                32           31                     -1   -3.12%   x 1.03 
 bls12_381::fq::bench_fq_inverse                                  14,949       17,811              2,862   19.15%   x 0.84 
 bls12_381::fq::bench_fq_mul_assign                               52,143       30,280            -21,863  -41.93%   x 1.72 
 bls12_381::fq::bench_fq_negate                                   14           14                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_add_nocarry                         8            10                      2   25.00%   x 0.80 
 bls12_381::fq::bench_fq_repr_div2                                6            6                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_mul2                                6            13                      7  116.67%   x 0.46 
 bls12_381::fq::bench_fq_repr_num_bits                            4            4                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_sub_noborrow                        9            11                      2   22.22%   x 0.82 
 bls12_381::fq::bench_fq_sqrt                                     61,857       38,010            -23,847  -38.55%   x 1.63 
 bls12_381::fq::bench_fq_square                                   44,295       31,291            -13,004  -29.36%   x 1.42 
 bls12_381::fq::bench_fq_sub_assign                               11           10                     -1   -9.09%   x 1.10 
 bls12_381::fr::bench_fr_add_assign                               7            7                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_double                                   10,284       5,845              -4,439  -43.16%   x 1.76 
 bls12_381::fr::bench_fr_from_repr                                31           22                     -9  -29.03%   x 1.41 
 bls12_381::fr::bench_fr_into_repr                                20           19                     -1   -5.00%   x 1.05 
 bls12_381::fr::bench_fr_inverse                                  7,600        5,440              -2,160  -28.42%   x 1.40 
 bls12_381::fr::bench_fr_mul_assign                               27,166       17,113            -10,053  -37.01%   x 1.59 
 bls12_381::fr::bench_fr_negate                                   10           8                      -2  -20.00%   x 1.25 
 bls12_381::fr::bench_fr_repr_add_nocarry                         6            7                       1   16.67%   x 0.86 
 bls12_381::fr::bench_fr_repr_div2                                4            5                       1   25.00%   x 0.80 
 bls12_381::fr::bench_fr_repr_mul2                                4            7                       3   75.00%   x 0.57 
 bls12_381::fr::bench_fr_repr_num_bits                            4            3                      -1  -25.00%   x 1.33 
 bls12_381::fr::bench_fr_repr_sub_noborrow                        7            8                       1   14.29%   x 0.88 
 bls12_381::fr::bench_fr_sqrt                                     27,321       19,871             -7,450  -27.27%   x 1.37 
 bls12_381::fr::bench_fr_square                                   24,908       16,834             -8,074  -32.42%   x 1.48 
 bls12_381::fr::bench_fr_sub_assign                               8            7                      -1  -12.50%   x 1.14 
 bls12_381::pairing::pairing::bench_pairing_final_exponentiation  1,072,725    847,028          -225,697  -21.04%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_full                  1,871,067    1,472,014        -399,053  -21.33%   x 1.27 
 bls12_381::pairing::pairing::bench_pairing_miller_loop           546,691      413,627          -133,064  -24.34%   x 1.32 
 sw6::ec::g1::bench_g1_add_assign                                 4,185,546    4,048,507        -137,039   -3.27%   x 1.03 
 sw6::ec::g1::bench_g1_add_assign_mixed                           2,963,674    2,836,688        -126,986   -4.28%   x 1.04 
 sw6::ec::g1::bench_g1_double                                     2,624,621    2,602,247         -22,374   -0.85%   x 1.01 
 sw6::ec::g1::bench_g1_mul_assign                                 1,795,032    1,730,592         -64,440   -3.59%   x 1.04 
 sw6::ec::g1::bench_g1_rand                                       1,798,136    1,731,877         -66,259   -3.68%   x 1.04 
 sw6::ec::g2::bench_g2_add_assign                                 4,600        3,771                -829  -18.02%   x 1.22 
 sw6::ec::g2::bench_g2_add_assign_mixed                           3,206        2,663                -543  -16.94%   x 1.20 
 sw6::ec::g2::bench_g2_double                                     2,140        1,863                -277  -12.94%   x 1.15 
 sw6::ec::g2::bench_g2_mul_assign                                 1,098,494    935,481          -163,013  -14.84%   x 1.17 
 sw6::ec::g2::bench_g2_rand                                       1,076,092    915,453          -160,639  -14.93%   x 1.18 
 sw6::fq3::bench_fq3_add_assign                                   75           62                    -13  -17.33%   x 1.21 
 sw6::fq3::bench_fq3_double                                       50           55                      5   10.00%   x 0.91 
 sw6::fq3::bench_fq3_inverse                                      50,737       42,375             -8,362  -16.48%   x 1.20 
 sw6::fq3::bench_fq3_mul_assign                                   2,121        2,046                 -75   -3.54%   x 1.04 
 sw6::fq3::bench_fq3_sqrt                                         3,321,726    3,201,614        -120,112   -3.62%   x 1.04 
 sw6::fq3::bench_fq3_square                                       1,696        1,645                 -51   -3.01%   x 1.03 
 sw6::fq3::bench_fq3_sub_assign                                   67           66                     -1   -1.49%   x 1.02 
 sw6::fq6::bench_fq6_add_assign                                   144          114                   -30  -20.83%   x 1.26 
 sw6::fq6::bench_fq6_double                                       96           96                      0    0.00%   x 1.00 
 sw6::fq6::bench_fq6_inverse                                      58,863       50,288             -8,575  -14.57%   x 1.17 
 sw6::fq6::bench_fq6_mul_assign                                   7,118        6,819                -299   -4.20%   x 1.04 
 sw6::fq6::bench_fq6_square                                       5,248        4,976                -272   -5.18%   x 1.05 
 sw6::fq6::bench_fq6_sub_assign                                   131          127                    -4   -3.05%   x 1.03 
 sw6::fq::bench_fq_add_assign                                     22           19                     -3  -13.64%   x 1.16 
 sw6::fq::bench_fq_double                                         16,932       17,126                194    1.15%   x 0.99 
 sw6::fq::bench_fq_from_repr                                      245          248                     3    1.22%   x 0.99 
 sw6::fq::bench_fq_into_repr                                      138          132                    -6   -4.35%   x 1.05 
 sw6::fq::bench_fq_inverse                                        50,218       38,487            -11,731  -23.36%   x 1.30 
 sw6::fq::bench_fq_mul_assign                                     242,531      240,596            -1,935   -0.80%   x 1.01 
 sw6::fq::bench_fq_negate                                         24           22                     -2   -8.33%   x 1.09 
 sw6::fq::bench_fq_repr_add_nocarry                               14           14                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_repr_div2                                      15           7                      -8  -53.33%   x 2.14 
 sw6::fq::bench_fq_repr_mul2                                      9            13                      4   44.44%   x 0.69 
 sw6::fq::bench_fq_repr_num_bits                                  4            3                      -1  -25.00%   x 1.33 
 sw6::fq::bench_fq_repr_sub_noborrow                              17           17                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_sqrt                                           526,690      518,031            -8,659   -1.64%   x 1.02 
 sw6::fq::bench_fq_square                                         208,952      210,092             1,140    0.55%   x 0.99 
 sw6::fq::bench_fq_sub_assign                                     19           19                      0    0.00%   x 1.00 
 sw6::fr::bench_fr_add_assign                                     10           10                      0    0.00%   x 1.00 
 sw6::fr::bench_fr_double                                         9,174        9,065                -109   -1.19%   x 1.01 
 sw6::fr::bench_fr_from_repr                                      58           38                    -20  -34.48%   x 1.53 
 sw6::fr::bench_fr_into_repr                                      32           31                     -1   -3.12%   x 1.03 
 sw6::fr::bench_fr_inverse                                        13,712       17,668              3,956   28.85%   x 0.78 
 sw6::fr::bench_fr_mul_assign                                     50,664       30,005            -20,659  -40.78%   x 1.69 
 sw6::fr::bench_fr_negate                                         13           14                      1    7.69%   x 0.93 
 sw6::fr::bench_fr_repr_add_nocarry                               8            10                      2   25.00%   x 0.80 
 sw6::fr::bench_fr_repr_div2                                      6            6                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_repr_mul2                                      5            13                      8  160.00%   x 0.38 
 sw6::fr::bench_fr_repr_num_bits                                  4            3                      -1  -25.00%   x 1.33 
 sw6::fr::bench_fr_repr_sub_noborrow                              9            11                      2   22.22%   x 0.82 
 sw6::fr::bench_fr_sqrt                                           77,926       54,407            -23,519  -30.18%   x 1.43 
 sw6::fr::bench_fr_square                                         42,764       30,561            -12,203  -28.54%   x 1.40 
 sw6::fr::bench_fr_sub_assign                                     11           10                     -1   -9.09%   x 1.10 
 sw6::pairing::pairing::bench_pairing_final_exponentiation        8,881,217    8,317,250        -563,967   -6.35%   x 1.07 
 sw6::pairing::pairing::bench_pairing_full                        97,824,713   86,015,907    -11,808,806  -12.07%   x 1.14 
 sw6::pairing::pairing::bench_pairing_miller_loop                 88,591,942   77,475,292    -11,116,650  -12.55%   x 1.14

Pratyush · 2020-04-08T17:50:01Z

This speedup is amazing! Thanks for your work on this!

Couple of quick points:

Let's add an off-by-default asm feature to Cargo.toml, both in algebra-core and in algebra. The idea is to gate the use of assembly behind this feature, so that by default, all arithmetic (even for fields with fewer than 8 limbs) is safe. Then, downstream users can opt into using assembly by enabling the feature. This way we can also ensure that the library compiles on stable by default, because inline assembly requires nightly AFAIK.
Along with the asm feature, we should also use target_arch and target_feature to enable this assembly only on x86-64 architectures that support the assembly instructions we use.
Do you think replacing build.rs with a custom proc-macro that directly generated the assembly would be a cleaner approach? (Currently we have a build.rs that generates a macro that creates the assembly.)
Is it possible to run the code through a sanitizer or fuzzer? It might catch some safety issues. This can be done after we merge the PR, because it'll be behind a feature-gate.

Pratyush · 2020-04-08T17:52:28Z

I suspect the major bottlenecks are in adc, sbb, negate. Negate is much slower than sub_assign. This I suspect could be fixed with a little more inline assembly, since it's unclear the compiler knows how to do adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave of 3ns per add/sub, we can probably improve performance by another 5%.

It seems that now field addition is only ~1.6x faster than field multiplication, which is surprising. Do you think a similar speedup is possible also for addition

Pratyush · 2020-04-09T01:56:30Z

Oh and it would also be nice to get some more comments on the structure of the code, and maybe some references to any prior work that this is might be based on (to make auditing and comparison easier).

…inline-asm

jon-chuang · 2020-04-14T01:07:52Z

Link: examples of fuzzing with cargo-fuzz in wasmer (wasmer also does a lot of inline assembly/unsafe code)

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly bench --features asm

jon-chuang · 2020-04-16T10:49:48Z

Hi @Pratyush , my apologies for not getting back to you sooner. Things were a work in progress. I have made progress on feature-gating and forcing explicit target declarations.

I have also created abstractions that make the code look cleaner, but I am still utilising a build script for codegen as it is simpler. I did however resort to using procedural macros to obtain something close to a DSL, similar to xbyak (although that is far more sophisticated). Apart from a few macro calls to define arrays of &str tokens, one simply need call #[assemble] over the function signature to invoke the DSL.

xorq(RCX, RCX);
for i in 0..limbs {
    if i == 0 {
        movq(a1[0], RDX);
        mulxq(b1[0], R[0], R[1]);
        for j in 1..limbs - 1 {
            mulxq(b1[j], RAX, R[((j + 1) % limbs)]);
            adcxq(RAX, R[j]);
        }
        mulxq(b1[limbs - 1], RAX, RCX);
        movq(zero, RBX);
        adcxq(RAX, R[limbs - 1]);
        adcxq(RBX, RCX);
    } else {
        movq(a1[i], RDX);
        for j in 0..limbs - 1 {
            mulxq(b1[j], RAX, RBX);
            adcxq(RAX, R[(j + i) % limbs]);
            adoxq(RBX, R[(j + i + 1) % limbs]);
        }
        mulxq(b1[limbs - 1], RAX, RCX);
        movq(zero, RBX);
        adcxq(RAX, R[(i + limbs - 1) % limbs]);
        adoxq(RBX, RCX);
        adcxq(RBX, RCX);
    }
    movq(inverse, RDX);
    mulxq(R[i], RDX, RAX);
    mulxq(m1[0], RAX, RBX);
    adcxq(R[i % limbs], RAX);
    adoxq(RBX, R[(i + 1) % limbs]);
    for j in 1..limbs - 1 {
        mulxq(m1[j], RAX, RBX);
        adcxq(RAX, R[(j + i) % limbs]);
        adoxq(RBX, R[(j + i + 1) % limbs]);
    }
    mulxq(m1[limbs - 1], RAX, R[i % limbs]);
    movq(zero, RBX);
    adcxq(RAX, R[(i + limbs - 1) % limbs]);
    adoxq(RCX, R[i % limbs]);
    adcxq(RBX, R[i % limbs]);
}
for i in 0..limbs {
    movq(R[i], a1[i]);
}

With a bit more toying around, it may be possible to have a function-like procedural macro similar to ordinary macros that generates the code. However, there are advantages to the current way of doing things which is that you can easily inspect the generated assembly code.

There are also limitations with my current method such as requiring everything to be in the context of the current function body. This could lead to code bloat. Hence, I am exploring user-defined TokenStreams that are consumed by a procedural macro to inline the macro routines as required. Other extensions include changing from AT&T syntax to Intel syntax based on an attribute e.g. #[assemble(Intel)]. I have separated the procedural macro portion of the code into a new crate, called mince. A somewhat grotesque take on assemblers.

Remaining work, stated above, is to add support for more routines, as well as extend efficiently to more limbs as per kilic's work.

Pratyush · 2020-04-22T04:11:23Z

algebra/Cargo.toml

+full_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "sw6", "mnt4_298", "mnt4_753", "mnt6_298", "mnt6_753", "edwards_bls12", "edwards_sw6", "jubjub" ]
+small_asm = ["algebra-core/asm", "mnt4_298", "mnt6_298" ]
+mid_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "edwards_bls12"]
+big_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "mnt6_753", "edwards_sw6" ]
+mix_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "bls12_381", "mnt6_298" ]


I don't think we need these features; because features are additive, if you enable asm and any of these other features, the underlying arithmetic will automatically use the asm routines.

~~I don't think that is the case. At least not with the current cfg logic. If I try to run~~

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly test --features asm bls12_381

~~none of the tests run.~~

Ok, looks like you need to double quote all the features together. Should be put in the readme.

jon-chuang · 2020-04-22T14:33:57Z

@Pratyush to avoid code drift, I manually merged my suggested changes in #188 into this PR, and also added a n_fold bench cfg, as well as clean up the benches with code-reusing macros, especially unifying all fields, to make it much more ergonomic. As a result, implementing benchmarks for all the MNT curves took less than 1 minute.

However, there was already some drift and I may not have dealt with the merge with some serialisation/repr stuff correctly.

I suggest that the above changes I proposed be saved for a future PR, as I have to focus on my schoolwork for the next 6+ weeks.

@paberr I added benchmarking support for MNT curves, let me know if there are any issues.

algebra-core/field-assembly/src/utils.rs

Pratyush · 2020-04-23T07:49:06Z

@jon-chuang, is the following an accurate summary of the architecture?.

There are two crates:

mince
field_assembly

mince defines a kind of DSL for writing assembly programs, and uses this DSL to define three higher-level chunks: mul_1, mul_add_1, and mul_add_shift_1. These are used by field_assembly to define the multiplication routines.

field_assembly defines a macro asm_mul roughly of the following form:

asm_mul ($limbs) {
	2 => {{...}},
	...
}

algebra-core now has a build.rs which uses field_assembly to generate a file containing this macro, and then field/arithmetic.rs includes this file depending on the asm feature flag.

If this understanding is correct, it seems to me that it might be possible to eliminate the build.rs and simplify the architecture as follows:

Let s be the String that is the output of the i-th iteration of the body of the loop in generate_matches (i.e. s contains the contents of one match arm in the above asm_mul macro). If one now calls syn::parse(s), one obtains a TokenStream which contains the assembly implementation of the multiplication routine for fields with i limbs. If we define a function that outputs this TokenStream, then we have obtained a proc-macro that takes as input the number of limbs, and outputs the optimized assembly implementation. (We can define it as a function-like macro to make usage particularly pleasant)

There are also ergonomic benefits to this: we can throw compile errors when the proc macro receives an unsupported number of limbs as input, instead of run-time errors as is the case right now.

Let me know what you think of this idea. I think it's worth pursuing in a follow-up PR, but if you've tried this approach and run into problems I'd like to hear about that as well =)

jon-chuang · 2020-04-23T09:34:50Z

@Pratyush You're right on the money with the code organisation. While what you suggest its pros, as mentioned previously, one of the upsides of the current strategy is that one can inspect the generated code easily. Nonetheless, I will take your suggestions into consideration for a future PR.

Pratyush · 2020-04-23T09:50:59Z

Hmm cargo expand does allow you to view the post-macro-expansion code, but you're right that it's not as nice as having a generated rust file. It's a good point to think about. (There are other potential benefits to using proc-macros; for instance in the future we might want to migrate to something like this to help reduce compile times. However let's leave it to a future PR)

Pratyush · 2020-04-23T16:53:17Z

By the way, the latest version of Rust allows using #[cfg(...)] on of statements, so maybe that will eliminate the the issue with needing to wrap the if-statements in blocks:

https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1430-2020-04-23

jon-chuang · 2020-04-23T16:57:52Z

I gather this is stable?

Pratyush · 2020-04-23T16:59:31Z

Yes, it’s the latest stable release

jon-chuang · 2020-04-24T03:38:50Z

I'm still getting 04-20 with rustup update. The compiler warning is not due to these extra brackets (I can't identify the actual problem). But errors can result from lack of them. For backwards compatibility for people building with older stable, I suggest we keep things the way they are.

(sorry, accidental close)

Pratyush · 2020-05-04T12:44:20Z

@jon-chuang I'd like to merge this as is to prevent the PR from getting out of sync with master. I've done a review, and the performance and generated code seems great. Because it's gated behind an off-by-default feature, and because it's available only on nightly, I think merging is a low-risk option.

jon-chuang · 2020-05-04T13:07:49Z

Awesome! Thanks for helping cleanup the code.

jon-chuang added 2 commits April 8, 2020 20:10

Added asm code generation for montgomery multiplication and squaring

e5f1dbf

Merge branch 'master' into inline-asm

8116a25

jon-chuang added 6 commits April 10, 2020 15:04

add data movement

cc3d14f

Failed attempt at data movement through swaps and solving corner cases

53e78e7

data movement milestone - max 12 limbs

1b4ed94

data movement milestone - max 12 limbs

d3c1414

Merge branch 'inline-asm' of https://github.com/jon-chuang/zexe into …

dc109c8

…inline-asm

18 limbs

a4dfc3b

jon-chuang mentioned this pull request Apr 15, 2020

Benchmarks Consensys/goff#12

Closed

jon-chuang added 5 commits April 15, 2020 19:24

break up into: assembler and arithmetic

80e7d6a

Add: context.

c46481b

ASM DSL with procedural macros

822fb30

conditional compilation

4a12185

target_arch and target_feature. Compile with:

d6f3ba5

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly bench --features asm

minor changes in config and naming

feff30d

Pratyush reviewed Apr 22, 2020

View reviewed changes

jon-chuang added 6 commits April 22, 2020 17:05

readme asm instructions

aab30ce

More detailed readme instructions

d71eac6

more readme edits

dce1003

cleaned up benches with macros, increasing code reuse

44516e2

feature = n_fold

c6265d5

cleanup/refactor fields

cfc3975

jon-chuang mentioned this pull request Apr 22, 2020

cleanup/refactor fields #188

Closed

added benchmarking support for all MNT curves

f3c8ee9

remove unnecessary clone()s

a865b2e

Pratyush reviewed Apr 23, 2020

View reviewed changes

algebra-core/field-assembly/src/utils.rs Show resolved Hide resolved

jon-chuang added 2 commits April 23, 2020 15:29

more readable string manipulation

7f80032

fmt...

1f8852c

Merge branch 'master' into inline-asm

128bc1b

jon-chuang force-pushed the inline-asm branch from ff52b14 to 128bc1b Compare April 23, 2020 14:01

fixed omitted argument

e0f5aaf

jon-chuang closed this Apr 24, 2020

jon-chuang reopened this Apr 24, 2020

jon-chuang and others added 4 commits April 24, 2020 11:56

fmt

00b59da

Clean up features, and make nightly detection robust

86ec169

Small clean up of code

1cec7d2

Merge remote-tracking branch 'upstream/master' into inline-asm

c4df769

Pratyush added 2 commits May 4, 2020 05:48

Formatting

27e9c4d

Fix imports and features

a52841d

Pratyush changed the title ~~Added asm code generation for montgomery multiplication and squaring + Field and Bench cleanup/refactoring~~ Add x86_64 asm codegen for PrimeField mul and square May 4, 2020

Pratyush merged commit 4369be9 into arkworks-rs:master May 4, 2020

tarcieri mentioned this pull request May 4, 2020

Expose field arithmetic RustCrypto/elliptic-curves#29

Closed

DanieleDiBenedetto added a commit to HorizenOfficial/ginger-lib that referenced this pull request Sep 17, 2020

Import arkworks-rs#176 and arkworks-rs#200

2cc0896

Pratyush added a commit that referenced this pull request Nov 27, 2020

Add x86_64 asm codegen for PrimeField mul and square (#176)

02f6976

jon-chuang mentioned this pull request Jan 21, 2021

Preliminary commit for min const generics arkworks-rs/algebra#182

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add x86_64 asm codegen for PrimeField mul and square #176

Add x86_64 asm codegen for PrimeField mul and square #176

jon-chuang commented Apr 8, 2020 •

edited

Loading

Pratyush commented Apr 8, 2020 •

edited

Loading

Pratyush commented Apr 8, 2020

Pratyush commented Apr 9, 2020

jon-chuang commented Apr 14, 2020

jon-chuang commented Apr 16, 2020 •

edited

Loading

Pratyush Apr 22, 2020

jon-chuang Apr 22, 2020 •

edited

Loading

jon-chuang commented Apr 22, 2020 •

edited

Loading

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 23, 2020

Pratyush commented Apr 23, 2020 •

edited

Loading

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 23, 2020

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 24, 2020 •

edited

Loading

Pratyush commented May 4, 2020

jon-chuang commented May 4, 2020

Add x86_64 asm codegen for PrimeField mul and square #176

Add x86_64 asm codegen for PrimeField mul and square #176

Conversation

jon-chuang commented Apr 8, 2020 • edited Loading

Pratyush commented Apr 8, 2020 • edited Loading

Pratyush commented Apr 8, 2020

Pratyush commented Apr 9, 2020

jon-chuang commented Apr 14, 2020

jon-chuang commented Apr 16, 2020 • edited Loading

Pratyush Apr 22, 2020

Choose a reason for hiding this comment

jon-chuang Apr 22, 2020 • edited Loading

Choose a reason for hiding this comment

jon-chuang commented Apr 22, 2020 • edited Loading

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 23, 2020

Pratyush commented Apr 23, 2020 • edited Loading

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 23, 2020

Pratyush commented Apr 23, 2020

jon-chuang commented Apr 24, 2020 • edited Loading

Pratyush commented May 4, 2020

jon-chuang commented May 4, 2020

jon-chuang commented Apr 8, 2020 •

edited

Loading

Pratyush commented Apr 8, 2020 •

edited

Loading

jon-chuang commented Apr 16, 2020 •

edited

Loading

jon-chuang Apr 22, 2020 •

edited

Loading

jon-chuang commented Apr 22, 2020 •

edited

Loading

Pratyush commented Apr 23, 2020 •

edited

Loading

jon-chuang commented Apr 24, 2020 •

edited

Loading