Arm64 assembly #513
Conversation
First iteration of ARM64 assembly; some refactorings are expected:
- a dedicated final subtraction proc
- evaluation of prefetching/interleaving loads with compute, as those complicate the algorithm
- more assembly routines (negation, lazily reduced field elements)
  swap(v0, v1)
  if i+2 < N:
    ctx.ldr u1, a[i+2]
    ctx.ldr v1, b[i+2]
For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.
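For comparison, a variant without lookahead would simply load the current limbs at the top of each iteration. A minimal sketch in the same assembler-DSL style (loop body elided; u0/v0 and the ctx helpers are assumed to match the snippet above):

  # Sketch only: plain per-iteration loads, no prefetch of a[i+2]/b[i+2]
  for i in 0 ..< N:
    ctx.ldr u0, a[i]   # load the current limb of a
    ctx.ldr v0, b[i]   # load the current limb of b
    # ... addition/carry chain on u0/v0 goes here ...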
    # This can only occur if N == 1, for example in t_multilinear_extensions
    ctx.ldr v[0], M[0]
  else:
    ctx.ldr v[i-(N-2)], M[i-(N-2)]
For now this uses fancy prefetching, but it's unclear whether it is beneficial on Apple Silicon (which can fetch/decode up to 8 instructions per cycle) or on a Raspberry Pi 5.
  # Next iteration
  if i+2 < N:
    ctx.ldr v[i+2], M[i+2]
This interleaving of prefetching might also be unnecessary
    ctx.ldr v[i-(N-2)], M[i-(N-2)]

  # M[0], M[1] is loaded into v[0], v[1]
  if spareBits >= 1:
The final subtraction should probably be a separate function so it can be reused by the Montgomery multiplication and reduction routines.
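A shared helper could look roughly like the sketch below. This is only an illustration: the proc name, the Assembler_arm64/OperandArray types, and the subs/sbcs/csel helpers are assumed here and may not match the actual DSL.

  # Hypothetical sketch: constant-time conditional final subtraction, t <- t - M if t >= M.
  # Assumes t and scratch are N-limb operand arrays, M is the modulus, and there is no
  # carry out of the top limb (spareBits >= 1).
  proc finalSub_noCarry(ctx: var Assembler_arm64, t, scratch, M: OperandArray, N: int) =
    ctx.subs scratch[0], t[0], M[0]          # scratch = t - M, sets the borrow flag
    for i in 1 ..< N:
      ctx.sbcs scratch[i], t[i], M[i]
    for i in 0 ..< N:
      ctx.csel t[i], t[i], scratch[i], lo    # keep t if the subtraction borrowed (t < M)

Field addition, Montgomery multiplication, and Montgomery reduction could then all end with a call to such a proc instead of open-coding the subtraction.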
      ctx.str t[i], r[i]
  else:
    # Final subtraction
    # we reuse the aa buffer
Final subtraction to refactor into a dedicated proc shared with field addition and Montgomery reduction.
    ctx.adcs u[i], u[i], t0
    swap(t0, t1)

  if spareBits >= 2 and lazyReduce:
Final subtraction to refactor
For info, versus an AMD Ryzen 9950X (overclocked +200 MHz):
I'm probably missing some field-towering improvements on ARM64, like assembly for lazily reduced field elements, or parameter passing has more overhead on ARM64 (?!). In any case, despite x86's key advantages for big-integer arithmetic, ARM64 (or at least Apple's implementation) is very competitive, likely thanks to its ability to issue MUL/UMULH twice per cycle and from any pair of registers (instead of being forced into RDX or, worse, RAX:RDX).
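To make the register-constraint point concrete, here is what one 64x64 -> 128-bit limb product looks like in the DSL style used above (sketch only; the mul/umulh helper names are assumed):

  # Sketch: a[i] * b[j] -> (hi, lo). On ARM64 both source operands and both
  # destinations can be arbitrary general-purpose registers, so independent
  # products can be issued back-to-back without shuffling values through RDX/RAX.
  ctx.mul   lo, a[i], b[j]    # low  64 bits of the product
  ctx.umulh hi, a[i], b[j]    # high 64 bits of the product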
This implements assembly for ARM64. Tested on Mac M4 Max.
Performance vs BLST assembly
Internal perf improvement