
Arm64 assembly #513

Merged: mratsim merged 12 commits into master from arm64-asm on Jan 8, 2025
Conversation

@mratsim (Owner) commented on Jan 8, 2025

This implements assembly for ARM64. Tested on a Mac M4 Max.

Performance vs BLST assembly
[benchmark screenshot]

  • 27% perf improvement for BLS signatures, outperforming BLST
  • 18.4% perf improvement for BLS verification, outperforming BLST

Internal perf improvement
[benchmark screenshot]

  • 32% perf improvement on scalar mul G1
  • 27.6% for scalar mul G2
  • 14% for pairings

@mratsim (Owner, Author) left a comment

First iteration of ARM64 assembly; some refactorings are expected:

  • a dedicated final-subtraction proc (see the sketch after this list)
  • evaluation of prefetching/interleaving loads with compute, as those complicate the algorithm
  • more assembly routines (negation, lazily-reduced field elements)
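
As a rough illustration of the first bullet, here is a minimal sketch in C (not the PR's Nim assembler DSL) of what a dedicated final-subtraction proc computes; final_subtract, N, t and M are hypothetical names for illustration, not Constantine's actual API:

#include <stdint.h>

#define N 4  /* e.g. 4 x 64-bit limbs for a ~256-bit field */

/* t <- t - M if t >= M, else t unchanged, selected in constant time
 * from the borrow of a trial subtraction. */
static void final_subtract(uint64_t t[N], const uint64_t M[N]) {
    uint64_t diff[N];
    uint64_t borrow = 0;
    for (int i = 0; i < N; i++) {
        uint64_t d0 = t[i] - M[i];
        uint64_t b0 = t[i] < M[i];   /* borrow out of t[i] - M[i] */
        uint64_t d1 = d0 - borrow;
        uint64_t b1 = d0 < borrow;   /* borrow out of d0 - borrow */
        diff[i] = d1;
        borrow = b0 | b1;
    }
    uint64_t mask = borrow - 1;      /* all-ones iff t >= M */
    for (int i = 0; i < N; i++)
        t[i] = (diff[i] & mask) | (t[i] & ~mask);
}

The trial difference is always computed and the result is selected by mask, so the routine takes the same time whether or not the subtraction is needed.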

swap(v0, v1)
if i+2 < N:
  # Load the operand limbs two iterations ahead to hide load latency
  ctx.ldr u1, a[i+2]
  ctx.ldr v1, b[i+2]
@mratsim (Owner, Author) commented:

For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.
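
For readers unfamiliar with the pattern, here is a minimal C sketch of the rotate-and-load-ahead idea from the snippet above (sum_pipelined, a and N are illustrative names, not from this PR):

#include <stdint.h>

/* Software pipelining: rotate two "register" variables and issue the
 * load for iteration i+2 while iteration i is being processed, so load
 * latency is hidden behind the arithmetic. */
static uint64_t sum_pipelined(const uint64_t *a, int N) {
    uint64_t acc = 0;
    uint64_t u0 = (N > 0) ? a[0] : 0;
    uint64_t u1 = (N > 1) ? a[1] : 0;
    for (int i = 0; i < N; i++) {
        acc += u0;              /* compute with the current limb */
        u0 = u1;                /* rotate, like swap(v0, v1) above */
        if (i + 2 < N)
            u1 = a[i + 2];      /* load two iterations ahead */
    }
    return acc;
}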


  # This can only occur if N == 1, for example in t_multilinear_extensions
  ctx.ldr v[0], M[0]
else:
  # Load the next modulus limb ahead of its use
  ctx.ldr v[i-(N-2)], M[i-(N-2)]
@mratsim (Owner, Author) commented:

For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.


# Next iteration
if i+2 < N:
  ctx.ldr v[i+2], M[i+2]
@mratsim (Owner, Author) commented:

This interleaving of prefetching might also be unnecessary.

ctx.ldr v[i-(N-2)], M[i-(N-2)]

# M[0], M[1] are loaded into v[0], v[1]
# With at least one spare bit in the modulus, a+b cannot overflow N limbs,
# so the final subtraction needs no extra carry word.
if spareBits >= 1:
@mratsim (Owner, Author) commented:

The final subtraction should probably be a separate function so it can be reused by the Montgomery multiplication and reduction routines.

    ctx.str t[i], r[i]
else:
  # Final subtraction
  # we reuse the aa buffer
@mratsim (Owner, Author) commented:

Final subtraction to refactor into a dedicated proc shared with field addition and Montgomery reduction.

  for i in 0 ..< N:
    ctx.str t[i], r[i]
else:
  # Final subtraction
@mratsim (Owner, Author) commented:

Final subtraction to refactor into a dedicated proc shared with field addition and Montgomery reduction.

# Add with carry, then rotate the carry temporaries for the next limb
ctx.adcs u[i], u[i], t0
swap(t0, t1)

if spareBits >= 2 and lazyReduce:
@mratsim (Owner, Author) commented:

Final subtraction to refactor.

@mratsim (Owner, Author) commented on Jan 8, 2025:

For info, here is a comparison versus an AMD Ryzen 9950X (overclocked +200MHz):

[benchmark screenshot]

  • Ethereum BLS signature 161us (9950X with SHA256 intrinsics) vs 173us (M4 Max, no SHA256 accel)
  • Ethereum BLS verification 435us (9950X with SHA256 intrinsics) vs 488us (M4 Max, no SHA256 accel)

[benchmark screenshot]

  • Scalar Mul G1, 31.9us (9950X) vs 30.7us (M4 Max)
  • Scalar Mul G2, 64.3us (9950X) vs 69us (M4 Max)
  • Pairing, 271.8us (9950X) vs 309.6us (M4 Max)

I'm probably missing some towering improvements on ARM64, such as assembly for lazily-reduced field elements, or parameter passing has more overhead on ARM64 (?!).

In any case, despite some key x86 advantages for big-integer arithmetic, namely:

  • a 64x64 -> 128-bit widening multiply in one instruction (with the disadvantage of being tied to very specific registers)
  • ADOX and ADCX for dual carry chains

ARM64 (or at least Apple's implementation) is very competitive, likely thanks to its ability to issue MUL/UMULH twice per cycle, from any pair of registers (instead of being forced into RDX, or worse, RAX+RDX).
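
As a concrete illustration of the difference (C, with a compiler providing __int128; mul64x64 is a hypothetical helper, not Constantine code): the full product below lowers to MUL (forced into RDX:RAX) or MULX (RDX as implicit source) on x86-64, while on ARM64 it lowers to a MUL/UMULH pair that can read from and write to any general-purpose registers.

#include <stdint.h>

static inline void mul64x64(uint64_t a, uint64_t b,
                            uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;  /* 64x64 -> 128 */
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}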

@mratsim merged commit b8ab0f9 into master on Jan 8, 2025. 12 checks passed.
@mratsim deleted the arm64-asm branch on January 8, 2025, 19:51.