Implement ABGLSV-Pornin multiplication #323

Open
wants to merge 5 commits into
base: main

str4d
Contributor

@str4d str4d commented May 4, 2020

Adds a backend for computing δ(aA + bB - C) in variable time, where:

  • B is the Ed25519 basepoint;
  • δ is a value invertible mod $\ell$, which is selected internally to the function.

This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses Algorithm 4 from Pornin 2020 to find a suitable short vector, and then windowed non-adjacent form Straus for the resulting multiscalar multiplication.

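References:

  • Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
  • Pornin 2020: https://eprint.iacr.org/2020/454
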
@hdevalence
Contributor

This is really cool! A few comments / questions based on a quick read of the source code:

  1. I'm a little uncertain about the type signature of vartime_triple_scalar_mul_basepoint. Specifically, it returns an EdwardsPoint with the value of [δa]A + [δb]B - [δ]C, but since δ is selected internally to the function and isn't returned, this value isn't useful to the caller. All that they can usefully do is check whether the result is the identity (maybe after multiplication by the cofactor).

    So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check:

    /// Checks whether \\([a]A + [b]B = C\\) in variable time.
    pub fn vartime_check_double_scalar_mul_basepoint(
            a: &Scalar,
            A: &EdwardsPoint,
            b: &Scalar,
            C: &EdwardsPoint,
        ) -> bool { /* ... */ }

    This can use a triple-base function like the one you already have internally, but the advantage is that in the exposed API surface of the crate, we only commit to the functionality, and not the method. The EdwardsPoint version of the API can do a cofactor multiplication on the result of the internal function, but the RistrettoPoint version doesn't have to.

  2. If we conceptualize it that way, an obvious second question is: what is the purpose of vartime_double_scalar_mul_basepoint? Well, it was exposed for the sole purpose of providing the functionality of vartime_check_double_scalar_mul_basepoint, but because it didn't describe the problem at a high-enough level (it gives an intermediate value that the user is supposed to compare), we can't transparently drop in a better implementation. (Avoiding exactly this kind of problem is the motivation for the change suggested above).

    However, a code search of all of GitHub reveals that with one exception, every single user of this function could be using the faster check function instead, so I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version. The exception is a Bulletproofs implementation that would be faster if it used the generic multiscalar multiplication implementation. This would require a new major version, but because the breaking change is locally scoped and in the direction of existing code, I don't think this is a big deal provided someone (e.g., me) is willing to do the work of submitting patches to all downstream crates.

  3. If vartime_double_scalar_mul_basepoint isn't there (or won't be in the future), a third question is: what parts of its implementation are carried over to this one, which parts should be, and which parts can be dropped from the codebase? One thing that sticks out as a carryover is the width-8 NAF lookup tables. These were added to the previous implementation as an easy way to get a slight speedup without much algorithmic work, at the cost of additional memory usage. This cost isn't just a problem for the binary size, but also because larger tables are more expensive to access at runtime, which is why they don't work well for very large multiscalar multiplications. Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory? It would be good to have some empirical numbers, but because they're microbenchmarks it's hard to judge the right decision based on those numbers alone (e.g., filling the entire cache with lookup tables is great for microbenchmarks, but real applications do other work with other data that those tables evict).

  4. The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using u128s and rely on the compiler to lower it to machine code. This may be slightly less efficient, but it has a huge saving in code complexity and maintainability. So I don't think it's necessary to implement two versions of the lattice reduction code for different architectures. It would also be good to fit the types a little better to the problem than just BigInt, which I'm guessing is what you meant by refining the type to ShrinkingInt?

    I'm not exactly sure how the code would end up being factored across the different backends, but my guess would be that it would look something like:

    • backend-agnostic code to do the lattice reduction, maybe in the scalar subtree;
    • an implementation of the triple-base mul in the backend::serial subtree;
    • an implementation of the triple-base mul in the backend::vector subtree;
      Does that seem right?

@str4d
Contributor Author

str4d commented May 4, 2020

> So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check:
>
> The EdwardsPoint version of the API can do a cofactor multiplication on the result of the internal function, but the RistrettoPoint version doesn't have to.

I did consider this API, but wasn't sure whether there were any cases where we would want to not use cofactor multiplication on the result for EdwardsPoint. If we're happy making that an internal consideration (or maybe a boolean flag or parallel API), then yes this is definitely simpler.

> I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version.

Yep, this is also what I'd like. I kept the prior API initially so I had something to benchmark against 😄

> Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory?

It looks like in Curve9767, @pornin uses width-4 for runtime-calculated tables and width-5 for pre-computed tables. IDK if he has relevant benchmarks, but it's another data point towards dropping width-8. I think it would make sense to examine this in a subsequent PR, separate from this change.

> The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using u128s and rely on the compiler to lower it to machine code.

The input preparation is performance-critical, in that the pre-Pornin algorithms were slow enough that the reduction in doublings could not offset it (which led to the Ed25519 paper dismissing ABGLSV and using double-base scalar mult instead). That said, a u128-based impl may still be performant enough to retain the overall benefit on platforms that would normally use the u32 backend.

I had originally started writing BigInt64 using u128s, but was not sure whether that was acceptable inside the u64 backend. I'll rework it as a backend-independent implementation instead.

> It would also be good to fit the types a little better to the problem than just BigInt, which I'm guessing is what you meant by refining the type to ShrinkingInt?

Yes, this is what I mean. We want to leverage the fact that bit lengths are strictly-decreasing, to avoid operating on higher limbs that are guaranteed to not contain the MSB.

@str4d str4d force-pushed the abglsv-pornin-mul branch 3 times, most recently from 9b8b93d to b401033 Compare May 5, 2020 06:57
@str4d
Contributor Author

str4d commented May 5, 2020

I've reworked the PR following @hdevalence's comments, and added an AVX2 backend.

abglsv_pornin::mul is passing the fixed test case for both serial and AVX2, but is occasionally failing the random values tests (often enough that the tests now fail consistently due to trying 100 sets of random values).

@pornin
Contributor

pornin commented May 5, 2020

About window sizes: there are several parameters in play, not all of which apply to the present case; notably, my default implementations strive to work on very small systems, and that means using very little RAM. For Curve9767, each point in a window uses 80 bytes (affine coordinates, each coordinate is a polynomial of degree less than 19, coefficients on 16 bits each, two dummy slots for alignment); if the windows collectively contain 16 points (for instance), then that's 1280 bytes of stack space, and for the very low-end of microcontrollers, that's too much (I must leave a few hundred bytes for the temporaries used in field element operations, and the calling application may also have needs). ROM/Flash size is also a constraint (though usually less severe), again encouraging using relatively small windows.

With a window of n bits, $2^{n-1}$ points must be stored (e.g. for a 4-bit window, this stores points P, 2P,... 8P, from which we can also dynamically obtain -P, -2P,... -8P). If using wNAF, we only need the odd multiples of these points (i.e. P, 3P, 5P and 7P for a 4-bit window), lowering the storage cost to $2^{n-2}$ points. In the signature verification, I have two dynamic windows to store: computing uA+vB+wC, with B being the generator but A and C dynamically obtained, I need one window for A and another for C. Therefore, if I want to use only 8 points (640 stack bytes), then I must stick to 4-bit windows. Static windows are in ROM, and there's more space there, but there's exponential growth; each 5-bit window is 1280 bytes, and there are two of them, so 2560 bytes of ROM for these.
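As a rough illustration of the table-size argument above, here is a minimal width-n NAF recoding sketch in Rust. The names `table_points` and `wnaf` are hypothetical and are not taken from curve25519-dalek or Curve9767; the input is assumed to be below 2^127 so the carry cannot overflow. Every non-zero digit comes out odd with absolute value below 2^(n-1), which is why only the odd multiples need to be stored.

```rust
/// Number of points in a lookup table for an `n`-bit window.
/// Plain signed windows store P, 2P, ..., 2^(n-1)*P; wNAF keeps only odd multiples.
/// (Hypothetical helper for illustration.)
pub fn table_points(n: u32, wnaf: bool) -> usize {
    if wnaf {
        1 << (n - 2)
    } else {
        1 << (n - 1)
    }
}

/// Width-`n` NAF of `x`, least significant digit first.
/// Assumes `x < 2^127` and `2 <= n <= 8`.
pub fn wnaf(mut x: u128, n: u32) -> Vec<i64> {
    assert!((2..=8).contains(&n));
    assert!(x >> 127 == 0, "input must be below 2^127");
    let modulus = 1i64 << n;
    let half = modulus >> 1;
    let mut digits = Vec::new();
    while x != 0 {
        if x & 1 == 1 {
            // Centered residue of x modulo 2^n: odd, and in (-2^(n-1), 2^(n-1)).
            let mut d = (x & (modulus as u128 - 1)) as i64;
            if d >= half {
                d -= modulus;
            }
            // After removing d, x is divisible by 2^n, so the next n-1 digits are zero.
            if d >= 0 {
                x -= d as u128;
            } else {
                x += (-d) as u128;
            }
            digits.push(d);
        } else {
            digits.push(0);
        }
        x >>= 1;
    }
    digits
}
```

With 80-byte points (the Curve9767 figure quoted above), two 4-bit wNAF windows come to `2 * table_points(4, true) * 80` bytes, matching the 640-byte figure, and one plain 5-bit window comes to `table_points(5, false) * 80`, matching 1280 bytes.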

In the x86/AVX2 implementation, for signature verification, I use 5-bit dynamic windows, and 7-bit static windows; for generic point multiplication (non-NAF, thus with also the even multiples), I have both static and dynamic 5-bit windows (four static windows for the base point). The static windows add up to 10240 bytes, which I think is a reasonable figure for big x86, since there will typically be about 32 kB of L1 cache: again, we must think that the caller also has data in cache, and if we use up all the L1 cache for the signature verification, this may look good on benchmarks, but in practical situations this will induce cache misses down the line. We should therefore strive to use only a minority of the total L1 cache.

Note that Ed25519 points are somewhat larger than Curve9767 points: AffineNielsPoint is three field elements (so at least 96 bytes, possibly more depending on internal representation), while ProjectiveNielsPoint is four field elements (at least 128 bytes). Dynamic windows will use the latter.

About CPU cost: this is a matter of trade-offs. In wNAF, with n-bit windows, building the window will require $2^{n-2}-1$ point additions, and will induce on average one point addition every n+1 bits. With 127-bit multipliers, this means that 4-bit windows need 28.4 point additions on average (for each window, not counting the 126 doublings), while 5-bit windows need about 28.2. With Curve9767, the latter is better (if you have the RAM) for another reason which is not applicable to Ed25519: long sequences of point doublings are slightly more efficient, and longer windows increase the average length of runs of doublings. Thus, for dynamic windows and Ed25519, I'd say that 4-bit and 5-bit wNAF windows should be about equivalent (5-bit windows would be better if using 252-bit multipliers).
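The arithmetic behind these figures can be reproduced with a small sketch. The function name `expected_additions` is hypothetical; the model is only the one stated above (table construction cost plus one addition per n+1 multiplier bits on average), not a measurement.

```rust
/// Expected point additions for one dynamic wNAF window of `n` bits and a
/// `bits`-bit multiplier: 2^(n-2) - 1 additions to build the table, plus on
/// average one addition every n + 1 bits. (Hypothetical helper.)
pub fn expected_additions(n: u32, bits: u32) -> f64 {
    let build = ((1u32 << (n - 2)) - 1) as f64;
    build + bits as f64 / (n as f64 + 1.0)
}
```

Under this model `expected_additions(4, 127)` is 28.4 and `expected_additions(5, 127)` is about 28.17, while for 252-bit multipliers the 5-bit window comes out ahead (49.0 versus 53.4), consistent with the comparison in the text.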

With static windows, there is no CPU cost in building windows, and larger windows are better, but there are diminishing returns. Going from 7-bit to 8-bit windows would save less than two point additions, possibly not worth the effort unless you are aiming at breaking the record in a microbenchmark context which will be meaningless in real situations.

@str4d str4d force-pushed the abglsv-pornin-mul branch from b401033 to 2194715 Compare May 8, 2020 22:23
@str4d
Contributor Author

str4d commented May 8, 2020

Force-pushed to fix the serial and vector Straus impls, which were not correctly checking for the first non-zero d_1 bit. The tests now pass.

Base automatically changed from master to main March 25, 2021 03:33
@str4d str4d force-pushed the abglsv-pornin-mul branch from 2194715 to 93fd6ea Compare December 12, 2022 11:04
@str4d str4d changed the base branch from main to release/4.0 December 12, 2022 11:04
@str4d
Contributor Author

str4d commented Dec 12, 2022

This PR was previously based on release 2.0.0. I've rebased it onto release/4.0 as that's where development work is currently directed, but I can also rebase instead onto release/3.2 if there is a desire to get this out for existing users (as the API introduced by this PR is a pure addition).

Assuming this PR is merged, I plan to make a separate PR to release/4.0 removing vartime_double_scalar_mul_basepoint and naming this API as the replacement per the above discussion (as the API introduced by this PR is around 13% faster).

@str4d str4d force-pushed the abglsv-pornin-mul branch 2 times, most recently from 8b6c6c1 to e267f48 Compare December 12, 2022 11:51
@str4d
Contributor Author

str4d commented Dec 12, 2022

Fixed the simple bugs, but CI is still failing because between 2.0.0 and release/4.0 a bunch more backends were added to CI, and now I need to generate [2^128] B tables for them all.

@str4d str4d force-pushed the abglsv-pornin-mul branch 6 times, most recently from 81534c1 to f6ec510 Compare December 12, 2022 13:16
])
.unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
Contributor Author

I added this test to match the one for BASEPOINT_ODD_LOOKUP_TABLE, and used it to regenerate the AVX2 B_SHL_128_ODD_LOOKUP_TABLE table (the contents of which apparently get generated differently after over 2 years of crate development, but both the old and new lookup tables pass tests).


let basepoint_odd_table =
NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
println!("basepoint_odd_lookup_table = {:?}", basepoint_odd_table);
Contributor Author

I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table, but I don't have a suitable device to test this.

])
.unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
Contributor Author

I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table. Someone with a suitable device needs to run this test and extract the output of this println so we can update ifma::constants with the correct table.

@@ -2060,3 +2060,2031 @@ pub(crate) static BASEPOINT_ODD_LOOKUP_TABLE: NafLookupTable8<CachedPoint> = Naf
),
])),
]);

/// Odd multiples of `[2^128]B`.
// TODO: generate real constants using test in `super::edwards`.
Contributor Author

This is currently just a duplicate of BASEPOINT_ODD_LOOKUP_TABLE to get the build CI checks to pass.

@str4d
Contributor Author

str4d commented Dec 12, 2022

Moving the todo list out of the top post:

  • Rework external APIs to check an equality instead of exposing the output of abglsv_pornin::mul.
  • Add RistrettoElement API.
  • Add AVX2 backend
  • Replace u64 BigInt with a u128-based implementation.
  • Convert BigInt into ShrinkingInt to take advantage of the strictly-decreasing bit lengths within abglsv_pornin::mul.

The last two items are not blockers for this PR.

@str4d str4d force-pushed the abglsv-pornin-mul branch 6 times, most recently from d785f10 to d448edd Compare March 29, 2024 13:29
Uses Algorithm 4 from Pornin 2020 to find a suitable short vector.

References:
- Pornin 2020: https://eprint.iacr.org/2020/454
@str4d str4d force-pushed the abglsv-pornin-mul branch from d448edd to b13b3a6 Compare March 29, 2024 13:37
@str4d
Contributor Author

str4d commented Mar 29, 2024

Force-pushed to fix post-rebase bugs and get CI passing.

@str4d str4d force-pushed the abglsv-pornin-mul branch 2 times, most recently from a3524dc to 20e355e Compare March 29, 2024 14:02
@str4d
Contributor Author

str4d commented Mar 29, 2024

Force-pushed to add changelog entries and fix documentation.

Comment on lines +910 to +916
/// Checks whether \\([8a]A + [8b]B = [8]C\\) in variable time.
///
/// This can be used to implement [RFC 8032]-compatible Ed25519 signature validation.
/// Note that it includes a multiplication by the cofactor.
///
/// [RFC 8032]: https://tools.ietf.org/html/rfc8032
pub fn vartime_check_double_scalar_mul_basepoint(
Contributor Author

@str4d str4d Mar 29, 2024

ed25519-dalek is now in the same workspace as curve25519-dalek, so I can make changes to it in this PR, but I think the next question is how we use this method.

I opened this PR in May 2020. Originally I just returned the scalar mul output directly, but @hdevalence suggested this "check" API instead, where the EdwardsPoint version would multiply by the cofactor. I migrated to that, noting that we might want to make the cofactor multiplication configurable.

In October 2020 @hdevalence published his survey of Ed25519 validation criteria. Some time in the intervening 3.5 years, ed25519-dalek has gained several separate signature verification methods, that all use this helper function internally:

// Helper function for verification. Computes the _expected_ R component of the signature. The
// caller compares this to the real R component. If `context.is_some()`, this does the
// prehashed variant of the computation using its contents.
// Note that this returns the compressed form of R and the caller does a byte comparison. This
// means that all our verification functions do not accept non-canonically encoded R values.
// See the validation criteria blog post for more details:
// https://hdevalence.ca/blog/2020-10-04-its-25519am
#[allow(non_snake_case)]
fn recompute_R<CtxDigest>(

These helpers are therefore either checking "ad-hoc" or "strict" equality of R, neither of which multiplies by the cofactor. Meanwhile the ed25519-zebra crate implements the ZIP 215 signature validation rules, which are the "expansive" rules (R is not required to be a canonical encoding, and multiplication by cofactor is required).

So I think we do want some kind of configurability here over the cofactor multiplication. What should this look like? A boolean argument, or two separate APIs?

Note also that the scalar mul optimization implemented in this PR actually checks [δa]A + [δb]B = [δ]C, where δ is a value invertible mod $\ell$. As @pornin notes in Section 3 of ePrint 2020/454:

> If there is on the curve a non-trivial point T of order h, then replacing R with R+T will make the standard verification equation fail, but the second one will still accept the signature if it so happens that the value δ (obtained from the lattice basis reduction algorithm) turns out to be a multiple of h.

Is there a way we can avoid this by adjusting the lattice basis reduction algorithm to filter out these δ values? If not, then we cannot use this optimisation for the "strict" verification methods, and it is debatable whether we should even use it for the "ad-hoc" methods (as doing so would change the ill-defined set of valid signatures; not that there aren't already wide inconsistencies between implementations here, but this would be a difference between two versions of curve25519-dalek, and IDK what the maintainers' policy here is).

Regardless, we definitely should offer a "mul-by-cofactor" version of this in the API, as ed25519-zebra (and anyone else using the cofactor check equation) will benefit from it (as will anyone using RistrettoPoint in a signature scheme, which fortunately does not suffer from this problem).

Contributor

I have looked a bit at the problem, I think leveraging the optimization while being strictly equivalent to the cofactorless equation is doable, but it is a bit unpleasant.

We have a public key A, generator is B, signature is (R, s), and during the verification, the challenge k is computed as a SHA-512 output, which is then interpreted as an integer. The curve has order 8*L. Points A and R are on the curve, but not necessarily in the subgroup of order L. The cofactorless verification equation is:

s*B - k*A = R

First, we should note that while k is nominally a 512-bit integer, the implementation in curve25519-dalek represents k as a Scalar, which implies reduction modulo L. This already deviates from the cofactorless equation in RFC 8032, where there is no such reduction. This matters if A is not in the subgroup of order L; for instance, it may happen that k is, as an integer, a multiple of 8, while k mod L is an odd integer, in which case the cofactorless equation would report a success, while the dalek implementation would reject it. The reverse is also possible (signature accepted by dalek but rejected by the RFC). All these variants are still within the scope of the signature algorithm, i.e. the discrepancies between verifier behaviours do not allow actual signature forgeries by attackers not knowing the private key. There is some extra discussion in the Taming the many EdDSAs paper (page 11). Here I am discussing reproducing the exact behaviour of the current dalek implementation, and therefore I call k the reduction of the SHA-512 output modulo L.

Given k, one can compute k8 = k mod 8 (the three low bits of k). The cofactorless equation is then equivalent to:

s*B - ((k - k8)/8)*(8*A) - (R + k8*A) = 0

Thus, by replacing k, A and R with, respectively, (k >> 3), 8*A and R + (k & 7)*A, I have a completely equivalent equation (thus with the same behaviour), but I have also guaranteed that the A point is in the proper subgroup of order L. Thus we can now assume that A is in that subgroup. This is important: when multiplying A by an integer x, we can now reduce x modulo L without any loss of information.
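The substitution can be sanity-checked on a toy model. The sketch below is an assumption-laden illustration, not curve code: the curve group is replaced by the integers modulo 8*L under addition, with a small prime `L = 13` standing in for the real group order, and `B = 8` playing the role of a generator of the order-L subgroup. All names are hypothetical.

```rust
/// Toy prime standing in for the subgroup order (NOT the Ed25519 value).
pub const L: i64 = 13;
/// Order of the toy group: cofactor 8 times L.
pub const ORDER: i64 = 8 * L;

/// Cofactorless check s*B - k*A - R == 0 in the toy group.
pub fn check_plain(s: i64, k: i64, a: i64, r: i64, b: i64) -> bool {
    (s * b - k * a - r).rem_euclid(ORDER) == 0
}

/// The same check after replacing (k, A, R) with (k >> 3, 8*A, R + (k & 7)*A).
/// Assumes k is non-negative.
pub fn check_split(s: i64, k: i64, a: i64, r: i64, b: i64) -> bool {
    let k_hi = k >> 3;
    let a8 = (8 * a).rem_euclid(ORDER);
    let r_adj = (r + (k & 7) * a).rem_euclid(ORDER);
    (s * b - k_hi * a8 - r_adj).rem_euclid(ORDER) == 0
}

/// True if `p` lies in the subgroup of order L, i.e. L*p is the identity.
pub fn in_prime_subgroup(p: i64) -> bool {
    (L * p).rem_euclid(ORDER) == 0
}
```

In this model the two checks agree for every choice of inputs, and `8*A` always lands in the order-L subgroup even when `A` itself does not, which is the property the rest of the argument relies on.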

When we apply Lagrange's algorithm on the lattice basis ((k, 1), (L, 0)), we get a new basis ((u0, u1), (v0, v1)) for the same lattice. In algorithm 4 in my paper, we stop as soon as the smaller of these two vectors is "small enough", but we can also reuse the stopping condition from algorithm 3, i.e. we can change this:

if len(N_v) <= t:
    return (v, u)

into:

if len(N_v) <= t:
    if 2*abs(p) <= N_v:
        return (v, u)

This would, on average, add maybe one or two iterations to the algorithm, i.e. the extra cost would likely be negligible. By using this test, we ensure not only that v is truly the smallest non-zero vector in the lattice, but also that u is the second smallest non-zero vector among those which are not collinear with v (this kind of assertion breaks down at higher lattice dimensions, but in dimension 2 it works).

Now, Lagrange's algorithm starts here with u = (k, 1), and 1 is odd. Moreover, each step either adds a multiple of v to u, or a multiple of u to v. The consequence is that u1 and v1 can never both be even; at least one of them is odd. The important point here is that if v1 is an odd integer, and less than L (by construction), then it is invertible modulo L (since L is prime) but also modulo 8 (since it is odd). Thus, v1 is invertible modulo 8*L. If v1 is invertible modulo 8*L, which is the whole curve order, then we can multiply the verification equation by v1 in a reversible way, i.e. without changing the behaviour. We thus get:

(v1*s mod L)*B - v0*A - v1*R = 0

which is the Antipa et al optimization. Note that the equivalence relies on two properties: that A is in the right subgroup (so that we can replace k*v1 with v0), and that v1 is odd.

The unpleasantness is that v1 might be even. As explained above, if v1 is even, then u1 must be odd, hence we can use (u0, u1) instead of (v0, v1). However, the smallest non-zero vector in the lattice is v, not u. Heuristically, u is not much bigger than v, but there are some degenerate cases. For instance, if k = (L - 1)/2, then the output of Lagrange's algorithm is v = (1, -2) (very small, but -2 is even), and u = (2*(L+1)/5, (L-4)/5) (denominator u1 = (L-4)/5 is odd, but both u0 and u1 are almost as large as L).
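The degenerate case can be reproduced with a textbook Lagrange reduction on small numbers. This is a minimal sketch using plain `i64` arithmetic and a toy prime; it is not Algorithm 3 or Algorithm 4 from the paper (there is no bit-length threshold t and no truncated arithmetic), and `lagrange` and `pick_odd` are hypothetical names. It stops only when `2*|p| <= N_v`, the stricter condition mentioned above.

```rust
/// A lattice vector (first coordinate, second coordinate).
pub type V = (i64, i64);

fn norm(a: V) -> i64 {
    a.0 * a.0 + a.1 * a.1
}

fn dot(a: V, b: V) -> i64 {
    a.0 * b.0 + a.1 * b.1
}

/// Lagrange reduction of the basis ((k, 1), (l, 0)).
/// Returns (v, u) with norm(v) <= norm(u) and 2*|<u, v>| <= norm(v).
/// Intended for small toy inputs only (no overflow protection).
pub fn lagrange(k: i64, l: i64) -> (V, V) {
    let (mut u, mut v) = ((k, 1), (l, 0));
    loop {
        // Keep v as the shorter of the two vectors.
        if norm(u) < norm(v) {
            std::mem::swap(&mut u, &mut v);
        }
        let (p, nv) = (dot(u, v), norm(v));
        if 2 * p.abs() <= nv {
            return (v, u);
        }
        // q = round(p / nv); it is non-zero here, so norm(u) strictly decreases.
        let q = (2 * p + nv).div_euclid(2 * nv);
        u = (u.0 - q * v.0, u.1 - q * v.1);
    }
}

/// Use the shortest vector if its second coordinate is odd, otherwise the other one.
pub fn pick_odd(v: V, u: V) -> V {
    if v.1 % 2 != 0 {
        v
    } else {
        u
    }
}
```

For the toy prime 1009 and k = (1009 - 1)/2 = 504, `lagrange(504, 1009)` returns v = (1, -2) and u = (404, 201), which matches the closed form u = (2*(L+1)/5, (L-4)/5) given above, and `pick_odd` falls back to the much larger u.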

In the verification algorithm, k is an output of SHA-512, and thus attackers would have trouble crafting signatures that leverage the most degenerate cases, and we can heuristically consider that u won't be a very large vector, but the lattice reduction algorithm must still perform updates on u and v with their full 254-bit size (including the sign bit); the nice trick of computing them only over 128 bits is no longer applicable. This may conceivably increase the cost, and thus decrease the usefulness of the optimization.

Summary: the behaviour of the current implementation (with the cofactorless equation) can be maintained while applying the Antipa et al optimization, provided that the following process is applied:

  1. Compute k as previously, with a SHA-512 output and with reduction modulo L (to maintain backward compatibility).
  2. Replace k, A and R with k >> 3, 8*A and R + (k & 7)*A, respectively.
  3. Compute Lagrange's algorithm over ((k, 1), (L, 0)) (optionally with the extra ending test so that a truly size-reduced basis is obtained, to make both basis vectors as small as possible). Updates to coordinates of u and v must be maintained over their full size (254 bits).
  4. Given the output ((v0, v1), (u0, u1)) of Lagrange's algorithm (with (v0, v1) being the smallest non-zero vector in the lattice), use (v0, v1) if v1 is odd; but if v1 turns out to be even, use (u0, u1) instead (in that case, u1 is odd). Since u is not the smallest vector, its coordinates can be larger than sqrt(1.16*L), so the combined Straus algorithm must be able to handle large coefficients (even if these are improbable in practice).

WARNING: I wrote all this without actually implementing it. It seems to make sense on paper, but until it is implemented and tested, there's no guarantee I did not make a mistake.

Contributor Author

Thanks @pornin for looking into this! I think the proposed changes are complex enough that they should be made and tested in a separate PR.

To avoid blocking this PR further, I propose that we rename EdwardsPoint::vartime_check_double_scalar_mul_basepoint to something like EdwardsPoint::vartime_check_double_scalar_mul_basepoint_cofactor, and then in a subsequent PR we can attempt to expose an EdwardsPoint::vartime_check_double_scalar_mul_basepoint that is cofactor-less.

@str4d str4d force-pushed the abglsv-pornin-mul branch from 20e355e to fd8952c Compare March 30, 2024 16:40
@str4d
Contributor Author

str4d commented Mar 30, 2024

Force-pushed to move the new generated serial tables into separate submodules, and added cfg-flagged tests to generate them, and a CI job that verifies them. If this works, I'll attempt to replicate this for the vector tables.

@str4d str4d force-pushed the abglsv-pornin-mul branch 3 times, most recently from 0d8eea8 to a66efb2 Compare March 30, 2024 20:40
str4d added 3 commits March 30, 2024 20:48
This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses windowed non-adjacent form Straus for the multiscalar multiplication.

References:
- Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf

Checks whether [8a]A + [8b]B = [8]C in variable time.

This can be used to implement RFC 8032-compatible Ed25519 signature validation. Note that it includes a multiplication by the cofactor.

Checks whether [a]A + [b]B = C in variable time.

@str4d str4d force-pushed the abglsv-pornin-mul branch from a66efb2 to 5e03d5c Compare March 30, 2024 20:48
@str4d
Contributor Author

str4d commented Mar 30, 2024

Force-pushed to fix the Fiat backends, and adjust the new CI check to fail if the table generators do nothing (as they generate output that is incorrectly formatted, and thus detectable).

@str4d str4d force-pushed the abglsv-pornin-mul branch from 5e03d5c to c96c810 Compare March 30, 2024 21:22
@str4d
Contributor Author

str4d commented Mar 30, 2024

Force-pushed to implement a similar kind of generator approach for the AVX2 vector table. It doesn't currently work because the Debug impl for u32x8 doesn't print out values that are suitable for use in its constructor; additional post-processing will be required.

@str4d str4d force-pushed the abglsv-pornin-mul branch from c96c810 to a77e13b Compare April 15, 2024 07:11
@str4d
Contributor Author

str4d commented Apr 15, 2024

Force-pushed to fix the AVX2 table generator. The generated constant is concretely different from before (I presume something changed about the wNAF implementation in the intervening four years), but tests pass before and after the change (and I checked that mutating either version of the constant causes a test to fail).

@str4d str4d force-pushed the abglsv-pornin-mul branch from a77e13b to 01a9e9e Compare April 15, 2024 07:52
@str4d
Contributor Author

str4d commented Apr 15, 2024

Force-pushed to implement a generator for the IFMA vector table, based on the working AVX2 generator. It should work, but I don't have the hardware to run it, and so the IFMA constants remain invalid. Someone with compatible hardware needs to run the following commands on this branch:

$ RUSTFLAGS="--cfg curve25519_dalek_generate_tables" cargo +nightly test --all-features table_generators
$ cargo fmt

and then provide the resulting diff to the IFMA table.

4 participants