Benchmarks #12
By the way, the timings for Barretenberg look wrong. The last time I checked, it was running at 22 ns mul and 20 ns squaring; ours look like 25.5 ns and 22.5 ns.
Also, @mratsim brought up MCL, which is frighteningly fast. Once Barretenberg's benchmarking is back up to speed, it would be nice to do a proper benchmark and set the record straight. The next step for us is to use macros to generate ASM strings with ADX (MULX, ADCX, ADOX) instructions for Broadwell and newer CPUs.
Thanks for that.
I see. I would have loved to use intrinsics for ADX/MULX, but unfortunately LLVM, and hence the Rust compiler, doesn't emit them, so we have to resort to manipulating strings. Out of curiosity, is there intrinsics support in the Go language/compiler?
Yes, and that's why we didn't need to write assembly so far: https://dave.cheney.net/2019/08/20/go-compiler-intrinsics. However, the current intrinsics don't generate ADX/MULX, so to go that route (or the crazier VPMADD52LUQ) we will likely resort to assembly for the Mul.
VPMADD52LUQ (IFMA) is for unsaturated representations, isn't it? Are there any benchmarks showing it is more performant for general primes? I know such representations are good for special primes (pseudo-Mersenne/generalised Mersenne), but as far as I know no such primes are used in pairing curves.
Yes, they are for unsaturated representations. The best would be to benchmark Milagro/AMCL/MIRACL.
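For readers unfamiliar with the term: "unsaturated" here means storing, say, a 256-bit element as five radix-2^52 limbs in 64-bit lanes, leaving 12 spare bits per lane to absorb carries, which is the layout IFMA's VPMADD52LUQ/VPMADD52HUQ operate on. A minimal Go sketch of the limb split (toRadix52 is a hypothetical helper, not taken from any of the libraries mentioned):

```go
package main

import "fmt"

// toRadix52 splits a 256-bit value, given as four saturated 64-bit limbs
// (little-endian), into five unsaturated 52-bit limbs. Each output limb
// occupies the low 52 bits of a 64-bit word; the top 12 bits are free to
// accumulate carries between reductions.
func toRadix52(x [4]uint64) [5]uint64 {
	const mask = uint64(1)<<52 - 1
	var y [5]uint64
	y[0] = x[0] & mask                  // bits   0..51
	y[1] = (x[0]>>52 | x[1]<<12) & mask // bits  52..103
	y[2] = (x[1]>>40 | x[2]<<24) & mask // bits 104..155
	y[3] = (x[2]>>28 | x[3]<<36) & mask // bits 156..207
	y[4] = x[3] >> 16                   // bits 208..255
	return y
}

func main() {
	// Low 64 bits all set: limb 0 holds 52 of them, limb 1 the remaining 12.
	y := toRadix52([4]uint64{^uint64(0), 0, 0, 0})
	fmt.Println(y[0] == uint64(1)<<52-1 && y[1] == 0xFFF) // true
}
```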
I can provide benchmarks for BLS12-381 specifically, since I have to create some anyway and this is what we are using as a backend for Eth2 at https://github.com/status-im/nim-blscurve

I explored using lazy reduction and lazy carry for BN254 and BLS12-381, but ultimately I believe it's a dead end; I explain my analysis in depth here: mratsim/constantine#15 (comment)

Otherwise, I believe there is enough interest and parallel effort that a common Discord/Matrix/Telegram chat for all implementers of optimized pairing libraries might be worthwhile, with perhaps a first outcome being a shootout in the style of:
Hi @mratsim, this seems like a great idea. Would you be willing to set up the benchmarking infrastructure? It seems there are some issues getting around the different languages and benchmarking suites, however. I was also thinking that we could pool some resources, like generated assembly, and benchmark those on different platforms.

Does Milagro use the unsaturated representation for general primes? I have not seen it used in that way. Can an unsaturated representation perform Montgomery multiplication for general primes?

I am starting to think it could be much better to copy some hand-written assembly than to generate/write one's own: it is rather complex and error-prone, and there are many difficult things to handle, like data movement and register allocation. One can then do a bit of cleanup. Another option is to get the GCC compiler to emit the necessary intrinsics for us.
@mratsim thanks for the links, very informative!. |
@jon-chuang copying existing hand-written assembly is not an option for us. Will soon merge, some cleaning up to do: #13
@gbotrel That looks good! It seems one could generate such assembly with inlined moduli movs (saving one register and also memory fetches) by passing the moduli to a macro, if one is looking for slightly more flexibility. However, generalising to an arbitrary number of limbs is probably not possible, because the way one uses registers differs for each case. For a higher number of limbs, my guess is the compiler would start using pushq and popq to save some values, or perhaps just movq to spill to memory; I notice that for the second inner loop you are already using two movs in each block. I think an interesting project could be to extend fiat-crypto's approach to assembly.

May I know how you created the assembly? Was it hand-written? Also, did you get any timings for that piece of assembly code? Would you mind if I adapted it just to benchmark the speed?
Another option, of course, is to movq the moduli from a memory location, wasting one register. I wonder whether that could still perform well compared to using constants.
@jon-chuang #13 is now merged. You can benchmark in
As mentioned in the PR, on my machine, it is faster than
I'll have a look over the weekend. I remember, when trying to benchmark Milagro/MIRACL, RELIC, and Barretenberg, struggling quite a bit to set up the benchmarks: the first two because you need to generate C code and then compile the generated code, the last one because the benchmark is hidden quite deep and mixes Google/benchmark reports and manual cycle counting in the same stdout.

In any case, I'll set up an organization and repo and vendor some of the libraries I found interesting so that we can get started. In terms of numbers, I think each library should ideally report throughput (ops/s) and time (ns/op), and, as a stretch goal, cycle count on x86.

I merged a benchmark PR on our Milagro codebase used for Eth2 yesterday, and the report looks like this: status-im/nim-blscurve#46
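As an illustration of that reporting format, Go's testing package can produce both numbers from one benchmark. This is only a sketch of the harness shape; fpMul is a hypothetical stand-in for whichever field operation is under test, not an actual goff function:

```go
package main

import (
	"fmt"
	"testing"
)

// fpMul is a hypothetical stand-in for the field multiplication under test.
func fpMul(a, b uint64) uint64 { return a * b }

// BenchmarkFpMul is a standard Go benchmark; the harness picks b.N so the
// loop runs long enough to time reliably.
func BenchmarkFpMul(b *testing.B) {
	x, y := uint64(3), uint64(5)
	for i := 0; i < b.N; i++ {
		x = fpMul(x, y)
	}
	_ = x // keep the result live so the loop isn't optimised away
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`
	// and derive both ns/op and ops/s from the same measurement.
	r := testing.Benchmark(BenchmarkFpMul)
	nsPerOp := float64(r.T.Nanoseconds()) / float64(r.N)
	fmt.Printf("%d iterations, %.2f ns/op, %.0f ops/s\n", r.N, nsPerOp, 1e9/nsPerOp)
}
```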
FYI, mcl is 8x faster on scalar multiplication on G2 and 3x faster on pairing: status-im/nim-blscurve#47
@mratsim awesome! I think the key contenders are MCL and Barretenberg; I am not sure why Milagro is still in the picture, as it doesn't seem to be under active development.
It's interesting for me and many Ethereum2 implementers because everyone used Milagro at one point. |
Hi all, I have managed to generate ADX and non-ADX backends for an arbitrary modulus up to 1024 bits, with great benchmark results. Please take a look: https://github.com/kilic/fp. I am also considering adding the new multiplication trick used in goff.
@kilic
@kilic I've had good results with ASM for limbs <= 7 with the no-carry trick: arkworks-rs/snark#176 |
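The no-carry trick applies to full multi-limb Montgomery multiplication, which is too long to sketch here, but the building block it optimises is the per-limb Montgomery step. As a self-contained illustration, here is single-limb Montgomery multiplication in Go using math/bits. This is my own sketch, not goff's or kilic/fp's code; montMul and negInv are made-up names, and it assumes an odd modulus q < 2^63 so one conditional subtraction suffices:

```go
package main

import (
	"fmt"
	"math/bits"
)

// negInv returns -q^-1 mod 2^64 for odd q, via Newton iteration:
// each step doubles the number of correct low bits.
func negInv(q uint64) uint64 {
	inv := q // q*q ≡ 1 (mod 8) for odd q: 3 correct bits to start
	for i := 0; i < 5; i++ {
		inv *= 2 - q*inv // 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits
	}
	return -inv
}

// montMul returns a*b*2^-64 mod q (the Montgomery product) for a, b < q.
// Assumes q is odd and q < 2^63, so the result before the final step is
// below 2q and a single conditional subtraction suffices.
func montMul(a, b, q, qInvNeg uint64) uint64 {
	hi, lo := bits.Mul64(a, b)
	m := lo * qInvNeg // chosen so that lo + m*q ≡ 0 (mod 2^64)
	mqHi, mqLo := bits.Mul64(m, q)
	_, carry := bits.Add64(lo, mqLo, 0) // low word cancels to zero
	t, _ := bits.Add64(hi, mqHi, carry) // t = (a*b + m*q) / 2^64 < 2q
	if t >= q {
		t -= q
	}
	return t
}

func main() {
	const q = uint64(1)<<61 - 1 // 2^61-1, an odd prime below 2^63
	qInvNeg := negInv(q)
	a, b, c := uint64(123456789), uint64(987654321), uint64(555555555)
	// Montgomery products compose: both orders give a*b*c * 2^-128 mod q.
	left := montMul(montMul(a, b, q, qInvNeg), c, q, qInvNeg)
	right := montMul(a, montMul(b, c, q, qInvNeg), q, qInvNeg)
	fmt.Println(left == right, q*qInvNeg == ^uint64(0)) // true true
}
```

The no-carry optimisation referenced above removes the carry propagation into a scratch word in the multi-limb version of this inner loop when the modulus's top limb leaves enough headroom; the sketch only shows the unoptimised single-limb step.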
Hi, I would like to invite the goff team to update their benchmark timings against zexe's new macro-based finite-field implementation. We used goff's no-carry optimisation. I reference a relevant issue here: https://github.com/scipr-lab/zexe/issues/162.
My benchmarks show that zexe's original squaring implementation is faster than goff's (42.5 ns vs 46.5 ns) and on par for multiplication (50.2 ns vs 50.3 ns for goff on my machine).
To verify, please compile my fork, with the appropriately modified benchmarks, in the algebra-benches folder with RUSTFLAGS="--emit=asm" cargo +nightly bench.