
Arm64 assembly #513

Merged: mratsim merged 12 commits into master from arm64-asm on Jan 8, 2025
Conversation

@mratsim (Owner) commented on Jan 8, 2025

This implements assembly for ARM64. Tested on a Mac M4 Max.

Performance vs BLST assembly
[benchmark screenshot]

  • 27% perf improvement for BLS signatures, outperforming BLST
  • 18.4% perf improvement for BLS verification, outperforming BLST

Internal perf improvement
[benchmark screenshot]

  • 32% perf improvement on scalar mul G1
  • 27.6% for scalar mul G2
  • 14% for pairings

@mratsim (Owner, Author) left a comment

First iteration of ARM64 assembly; some refactorings are expected:

  • a dedicated final-subtraction proc (see the sketch after this list)
  • evaluation of prefetching/interleaving loads with compute, as those complicate the algorithm
  • more assembly routines (negation, lazily-reduced field elements)
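
As a rough illustration of the first bullet, here is a minimal sketch in C (not the PR's Nim assembler DSL) of what a dedicated final-subtraction proc computes; final_subtract, N, t and M are hypothetical names for illustration, not Constantine's actual API:

#include <stdint.h>

#define N 4  /* e.g. 4 x 64-bit limbs for a ~256-bit field */

/* t <- t - M if t >= M, else t unchanged, selected in constant time
 * from the borrow of a trial subtraction. */
static void final_subtract(uint64_t t[N], const uint64_t M[N]) {
    uint64_t diff[N];
    uint64_t borrow = 0;
    for (int i = 0; i < N; i++) {
        uint64_t d0 = t[i] - M[i];
        uint64_t b0 = t[i] < M[i];   /* borrow out of t[i] - M[i] */
        uint64_t d1 = d0 - borrow;
        uint64_t b1 = d0 < borrow;   /* borrow out of d0 - borrow */
        diff[i] = d1;
        borrow = b0 | b1;
    }
    uint64_t mask = borrow - 1;      /* all-ones iff t >= M */
    for (int i = 0; i < N; i++)
        t[i] = (diff[i] & mask) | (t[i] & ~mask);
}

The trial difference is always computed and the result is selected by mask, so the routine takes the same time whether or not the subtraction is needed.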

swap(v0, v1)
if i+2 < N:
  # Load the operand limbs two iterations ahead to hide load latency
  ctx.ldr u1, a[i+2]
  ctx.ldr v1, b[i+2]
@mratsim (Owner, Author) commented:

For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.
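
For readers unfamiliar with the pattern, here is a minimal C sketch of the rotate-and-load-ahead idea from the snippet above (sum_pipelined, a and N are illustrative names, not from this PR):

#include <stdint.h>

/* Software pipelining: rotate two "register" variables and issue the
 * load for iteration i+2 while iteration i is being processed, so load
 * latency is hidden behind the arithmetic. */
static uint64_t sum_pipelined(const uint64_t *a, int N) {
    uint64_t acc = 0;
    uint64_t u0 = (N > 0) ? a[0] : 0;
    uint64_t u1 = (N > 1) ? a[1] : 0;
    for (int i = 0; i < N; i++) {
        acc += u0;              /* compute with the current limb */
        u0 = u1;                /* rotate, like swap(v0, v1) above */
        if (i + 2 < N)
            u1 = a[i + 2];      /* load two iterations ahead */
    }
    return acc;
}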


  # This can only occur if N == 1, for example in t_multilinear_extensions
  ctx.ldr v[0], M[0]
else:
  # Load the next modulus limb ahead of its use
  ctx.ldr v[i-(N-2)], M[i-(N-2)]
@mratsim (Owner, Author) commented:

For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.


# Next iteration
if i+2 < N:
  ctx.ldr v[i+2], M[i+2]
@mratsim (Owner, Author) commented:

This interleaving of prefetching might also be unnecessary.

ctx.ldr v[i-(N-2)], M[i-(N-2)]

# M[0], M[1] are loaded into v[0], v[1]
# With at least one spare bit in the modulus, a+b cannot overflow N limbs,
# so the final subtraction needs no extra carry word.
if spareBits >= 1:
@mratsim (Owner, Author) commented:

The final subtraction should probably be a separate function so it can be reused by the Montgomery multiplication and reduction routines.

    ctx.str t[i], r[i]
else:
  # Final subtraction
  # we reuse the aa buffer
@mratsim (Owner, Author) commented:

Final subtraction to refactor into a dedicated proc shared with field addition and Montgomery reduction.

  for i in 0 ..< N:
    ctx.str t[i], r[i]
else:
  # Final subtraction
@mratsim (Owner, Author) commented:

Final subtraction to refactor into a dedicated proc shared with field addition and Montgomery reduction.

# Add with carry, then rotate the carry temporaries for the next limb
ctx.adcs u[i], u[i], t0
swap(t0, t1)

if spareBits >= 2 and lazyReduce:
@mratsim (Owner, Author) commented:

Final subtraction to refactor.

@mratsim (Owner, Author) commented on Jan 8, 2025:

For info, here is a comparison versus an AMD Ryzen 9950X (overclocked +200MHz):

[benchmark screenshot]

  • Ethereum BLS signature 161us (9950X with SHA256 intrinsics) vs 173us (M4 Max, no SHA256 accel)
  • Ethereum BLS verification 435us (9950X with SHA256 intrinsics) vs 488us (M4 Max, no SHA256 accel)

[benchmark screenshot]

  • Scalar Mul G1, 31.9us (9950X) vs 30.7us (M4 Max)
  • Scalar Mul G2, 64.3us (9950X) vs 69us (M4 Max)
  • Pairing, 271.8us (9950X) vs 309.6us (M4 Max)

I'm probably missing some towering improvements on ARM64, such as assembly for lazily-reduced field elements, or parameter passing has more overhead on ARM64 (?!).

In any case, despite some key x86 advantages for big-integer arithmetic, namely:

  • a 64x64 -> 128-bit widening multiply in one instruction (with the disadvantage of being tied to very specific registers)
  • ADOX and ADCX for dual carry chains

ARM64 (or at least Apple's implementation) is very competitive, likely thanks to its ability to issue MUL/UMULH twice per cycle, from any pair of registers (instead of being forced into RDX, or worse, RAX+RDX).
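
As a concrete illustration of the difference (C, with a compiler providing __int128; mul64x64 is a hypothetical helper, not Constantine code): the full product below lowers to MUL (forced into RDX:RAX) or MULX (RDX as implicit source) on x86-64, while on ARM64 it lowers to a MUL/UMULH pair that can read from and write to any general-purpose registers.

#include <stdint.h>

static inline void mul64x64(uint64_t a, uint64_t b,
                            uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;  /* 64x64 -> 128 */
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}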

@mratsim merged commit b8ab0f9 into master on Jan 8, 2025. 12 checks passed.
@mratsim deleted the arm64-asm branch on January 8, 2025, 19:51.