Add x86_64 asm codegen for PrimeField mul and square #176
Conversation
This speedup is amazing! Thanks for your work on this! A couple of quick points:
It seems that field addition is now only ~1.6x faster than field multiplication, which is surprising. Do you think a similar speedup is also possible for addition?
Oh, and it would also be nice to get some more comments on the structure of the code, and maybe some references to any prior work this might be based on (to make auditing and comparison easier).
Link: examples of fuzzing with cargo-fuzz in wasmer (wasmer also does a lot of inline assembly/unsafe code)
RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly bench --features asm
Hi @Pratyush, my apologies for not getting back to you sooner. Things were a work in progress. I have made progress on feature-gating and forcing explicit target declarations. I have also created abstractions that make the code look cleaner, but I am still utilising a build script for codegen, as it is simpler. I did, however, resort to using procedural macros to obtain something close to a DSL, similar to xbyak (although that is far more sophisticated). Apart from a few macro calls to define arrays of …

With a bit more toying around, it may be possible to have a function-like procedural macro, usable like an ordinary macro, that generates the code. However, an advantage of the current way of doing things is that you can easily inspect the generated assembly code. There are also limitations with my current method, such as requiring everything to be in the context of the current function body, which could lead to code bloat. Hence, I am exploring user-defined …

Remaining work, stated above, is to add support for more routines, as well as to extend efficiently to more limbs, as per kilic's work.
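As a rough sketch of the build-script codegen flow described here (the function names and stub body are hypothetical, not the PR's actual script), the generator side might look like:

```rust
// build.rs — illustrative sketch only. A real generator would emit
// the mulx/adox/adcx instruction sequences; a stub stands in here.
use std::{env, fs, path::Path};

// Hypothetical codegen routine: returns Rust source containing the
// assembly multiplication for the given number of limbs.
fn generate_mul_asm(limbs: usize) -> String {
    format!("// generated {}-limb mul would go here\n", limbs)
}

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();
    let dest = Path::new(&out_dir).join("asm_mul.rs");
    // Generate one routine per supported limb count, then
    // include!(concat!(env!("OUT_DIR"), "/asm_mul.rs")) in the crate.
    // Writing the source out like this is what makes the generated
    // assembly easy to inspect.
    let code: String = (2..=8).map(generate_mul_asm).collect();
    fs::write(dest, code).unwrap();
}
```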
algebra/Cargo.toml
full_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "sw6", "mnt4_298", "mnt4_753", "mnt6_298", "mnt6_753", "edwards_bls12", "edwards_sw6", "jubjub" ]
small_asm = [ "algebra-core/asm", "mnt4_298", "mnt6_298" ]
mid_asm = [ "algebra-core/asm", "bls12_377", "bls12_381", "edwards_bls12" ]
big_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "mnt6_753", "edwards_sw6" ]
mix_asm = [ "algebra-core/asm", "sw6", "mnt4_753", "bls12_381", "mnt6_298" ]
I don't think we need these features; because features are additive, if you enable `asm` and any of these other features, the underlying arithmetic will automatically use the asm routines.
I don't think that is the case, at least not with the current cfg logic. If I try to run
RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly test --features asm bls12_381
none of the tests run.
Ok, looks like you need to double-quote all the features together. This should be noted in the README.
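That is, something along the lines of:

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly test --features "asm bls12_381"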
@Pratyush, to avoid code drift, I manually merged my suggested changes from #188 into this PR, and also added an n_fold bench. However, there was already some drift, and I may not have handled the merge of some serialisation/repr stuff correctly. I suggest that the above changes I proposed be saved for a future PR, as I have to focus on my schoolwork for the next 6+ weeks. @paberr I added benchmarking support for MNT curves; let me know if there are any issues.
@jon-chuang, is the following an accurate summary of the architecture? There are two crates: …
asm_mul ($limbs) {
2 => {{...}},
...
}
If this understanding is correct, it seems to me that it might be possible to eliminate the build script and generate the code directly from the proc macro. There are also ergonomic benefits to this: we can throw compile errors when the proc macro receives an unsupported number of limbs as input, instead of run-time errors as is the case right now. Let me know what you think of this idea. I think it's worth pursuing in a follow-up PR, but if you've tried this approach and run into problems I'd like to hear about that as well =)
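As a minimal sketch of the distinction (illustrative names only, not the PR's actual macro): today's expansion boils down to a run-time match over the limb count, like the one below, whereas a function-like proc macro could refuse an unsupported limb count during expansion, i.e. at compile time.

```rust
// Illustrative sketch of run-time dispatch over the limb count.
macro_rules! asm_mul {
    ($limbs:expr) => {
        match $limbs {
            2 => { /* expanded 2-limb mulx/adox/adcx sequence */ }
            // ... one arm per supported limb count, up to 8 ...
            n => panic!("asm_mul!: unsupported limb count {}", n),
        }
    };
}

fn main() {
    asm_mul!(2); // dispatches to the 2-limb arm at run time
    // asm_mul!(9) would still compile and only panic when executed;
    // a proc macro could instead reject 9 during macro expansion.
}
```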
@Pratyush You're right on the money with the code organisation. While what you suggest has its pros, as mentioned previously, one of the upsides of the current strategy is that one can inspect the generated code easily. Nonetheless, I will take your suggestions into consideration for a future PR.
Hmm
By the way, the latest version of Rust allows using … (see https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1430-2020-04-23).
I gather this is stable?
Yes, it’s the latest stable release.
I'm still getting the 04-20 release with rustup update. The compiler warning is not due to these extra brackets (I can't identify the actual problem), but errors can result from the lack of them. For backwards compatibility for people building with older stable, I suggest we keep things the way they are. (Sorry, accidental close.)
@jon-chuang I'd like to merge this as is, to prevent the PR from getting out of sync with master.
Awesome! Thanks for helping clean up the code.
I was able to achieve a 1.7x speedup on 6-limb `mul_assign` against my previous pull request #163, bringing the total speedup to about 1.84x. This is achieved with a generic code-generation script, utilising `build.rs`, that generates x64 assembly using the `mulxq`, `adoxq` and `adcxq` instructions for 2 up to 8 limbs.

Update: was able to extend the work to 13 limbs, achieving a 1.34x speedup on Fp832. It may be worth looking into translating the old `square_in_place` into assembly, but this is a significant amount of work.

Update: Put behind a feature. Build with:

RUSTFLAGS="--emit=asm -C target-cpu=native -C target-feature=+bmi2,+adx" cargo +nightly bench --features asm
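To make the structure concrete, here is a hedged, generic-Rust sketch (not the generated code, and ignoring the surrounding Montgomery reduction) of the schoolbook product at the heart of such a routine; each widening multiply-accumulate in the inner loop corresponds to a `mulxq` feeding the `adoxq`/`adcxq` carry chains in the emitted assembly:

```rust
/// Schoolbook multiplication of two N-limb integers into a 2N-limb
/// result — an illustration of the work the generated asm performs.
fn mul_wide<const N: usize>(a: &[u64; N], b: &[u64; N]) -> Vec<u64> {
    let mut r = vec![0u64; 2 * N];
    for i in 0..N {
        let mut carry = 0u64;
        for j in 0..N {
            // Widening multiply-accumulate: the role of mulxq plus
            // the dual adoxq/adcxq carry chains, without flag spills.
            let t = (a[i] as u128) * (b[j] as u128)
                + r[i + j] as u128
                + carry as u128;
            r[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        r[i + N] = carry; // top limb of this row
    }
    r
}
```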
Next steps: more limbs with less data movement using kilic's strategy. Extend strategy to `square_in_place`.

Details

I implemented the assembly version of `square_in_place` as applying `mul_assign` to itself. That's because the goff squaring formula is not really good, while the original squaring formula requires twice the number of registers and hence goes beyond the scope of this PR, and may in fact be slower in the end, as this PR takes massive advantage of the add and carry chains present in `mul_assign`.
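A minimal sketch of that squaring strategy (`F` stands in for a concrete field type such as the 384-bit field; this is an illustration, not the crate's actual code):

```rust
use core::ops::MulAssign;

/// Square in place by multiplying the element by a copy of itself,
/// so the optimised assembly `mul_assign` does all the work instead
/// of a dedicated squaring formula.
fn square_in_place<F: MulAssign + Copy>(a: &mut F) {
    let tmp = *a;
    *a *= tmp; // square = self * self
}

fn main() {
    let mut x = 7u64; // u64 only to make the sketch runnable
    square_in_place(&mut x);
    assert_eq!(x, 49);
}
```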
Update: I cleaned up the messy `build.rs` by creating a standalone subcrate and importing the generate function.

Future work
Currently, the maximum number of limbs supported is 8; this could possibly be extended by current methods to about 10 for mul and 11 for squaring. However, going beyond that may require additional work to optimally move data between registers and memory (i.e. L1 cache). One should study how a compiler would reason about this to write a reasonable code-generation procedure for it.

Edit: our data movement is extremely suboptimal. https://raw.githubusercontent.com/kilic/fp/master/generic/x86_arithmetic.s is able to achieve 149ns on 13 limbs, whereas we achieve 190ns. There's a lot to learn here.
We are also somehow still trailing behind MCL, probably the gold standard for pairings. Here are the benchmarks, run on an i7-7700 similar to mine. Although our Fp performance totally outclasses it, everything else is lagging behind. It manages to achieve a staggering ~0.7ms BLS12_381 pairing (?). Their WebAssembly target achieves a sickening 3.5ms in my Chrome browser.
I suspect there could be bottlenecks in adc, sbb and negate. Negate is much slower than sub_assign; I suspect this could be fixed with a little more inline assembly, since it's unclear whether the compiler knows how to use adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave off 3ns per add/sub, we can probably improve performance by another 5%. (Given this, it may indeed be worthwhile to write a procedural macro and do a cleanup. This may turn out not to be possible, but it is worth trying.)
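As a hedged illustration of the kind of carry chain in question (not code from this PR; `add_384` is a made-up name), Rust's `_addcarry_u64` intrinsic should lower to exactly such an `adc` chain:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_addcarry_u64;

/// 6-limb (384-bit) addition where each step compiles to an `adc`,
/// threading the CPU carry flag through all six limbs instead of
/// recomputing carries in scalar code. Returns true on overflow.
#[cfg(target_arch = "x86_64")]
fn add_384(a: &mut [u64; 6], b: &[u64; 6]) -> bool {
    let mut carry = 0u8;
    for i in 0..6 {
        let (x, y) = (a[i], b[i]);
        // SAFETY: the intrinsic has no preconditions beyond x86_64.
        unsafe {
            carry = _addcarry_u64(carry, x, y, &mut a[i]);
        }
    }
    carry != 0 // carry out of the top limb
}
```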
Areas for Improvement
This however does not explain the discrepancy. Looking at the benchmarks more carefully, for BLS12_381, Fq12 mul is ~6800ns for us, whereas for MCL it is ~3000ns. We're better on the Miller loop, 413us vs 527us, but not on the precomputed Miller loop (fixed Q in G2), at 405us (which are we using?). Nonetheless, those timings were for the BN curve, so really we need to improve the Miller loop (7500 Mq) down to about 350us.
But this makes it apparent that the significant weakness is in the final exponentiation (9000 Mq). We must get timings down from 850us to 414us, a 2x speedup. Part of this is probably due to our slow Fq inverse timings. This is also clear from examining pairing formulas, where the number of field ops for the final exponentiation should be roughly similar to that of the Miller loop, yet for us the former takes double the time of the latter. I estimate fixing this could give another 35% boost to our pairings.
G2 should also be fixed: a 25% slowdown, 2.7us vs 2.1us for add, and 1.85us vs 1.49us for double. Our mul_assign is apparently ~3x slower (930us vs 360us).
G1 add is 529us vs MCL's 472us. This suggests, as suspected, some inefficiency (probably from the very slow Fq doubling). Our G1 doubling beats theirs, however; we should investigate why. Mul is 25% slower (vs. 154us), but should be fixable with a standalone Pippenger implementation.
MCL also implements GLV.
I will try to study MCL and obtain more detailed and up-to-date benchmarks for it.
Main results:

- 384-bit `mul_assign`
- 384-bit `square_in_place`
- 256-bit `mul_assign`
- 256-bit `square_in_place`
- 384-bit G1
- 384-bit pairings
- SW6 (updated)

Full results