
Speedup Uint::shr #753

Closed
wants to merge 21 commits
Conversation

erik-3milabs
Contributor

@erik-3milabs erik-3milabs commented Jan 24, 2025

Refactor Uint::shr, achieving a >60% speed up for U2048.

Benchmarks (run on commit b2dcb20)

right shift/shr, U2048  time:   [292.04 ns 292.16 ns 292.30 ns]
                        change: [-0.2073% -0.0666% +0.0669%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
right shift/new shr, U2048
                        time:   [114.34 ns 114.48 ns 114.62 ns]
                        change: [+0.8375% +1.0410% +1.2401%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 24, 2025

@tarcieri, can you think of ways to better integrate the (helper) functions I wrote? This might improve the quality of the code base and extend the speedup to other shr operations.

Also, once you're more or less on board with this PR, I'd be happy to code up the same improvement for shl.

@erik-3milabs erik-3milabs marked this pull request as draft January 24, 2025 16:12
@erik-3milabs erik-3milabs force-pushed the speedup_shr branch 3 times, most recently from 201733e to 1579e18 Compare January 24, 2025 16:33
@erik-3milabs
Contributor Author

erik-3milabs commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

We could rename the current implementation as shr_vartime and construct a new shr as follows:

pub const fn shr(self, shift: u32) -> Self {
    assert!(shift < Self::BITS);
    self.carrying_shr(shift).0
}

This would not change the interface of shr, meaning it could fit in 0.6.1
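For context, a carrying right shift returns both the shifted value and the bits shifted out (aligned to the top of the word), so that multi-limb shifts can propagate them. A minimal sketch on a bare u64 (illustrative only; the actual Limb::carrying_shr signature may differ), handling the shift == 0 edge case with a mask rather than a branch:

```rust
/// Shift `x` right by `shift` bits (0 <= shift < 64), returning the shifted
/// value and the bits shifted out, aligned to the top of a word.
fn carrying_shr(x: u64, shift: u32) -> (u64, u64) {
    assert!(shift < u64::BITS);
    // For shift == 0 the carry expression would shift by 64, which panics in
    // debug builds; wrapping_shl masks the amount, and the all-zero mask then
    // discards the bogus result without branching on `shift`.
    let carry = x.wrapping_shl(u64::BITS - shift);
    let nonzero_mask = 0u64.wrapping_sub((shift != 0) as u64);
    (x >> shift, carry & nonzero_mask)
}

fn main() {
    assert_eq!(carrying_shr(0b1011, 1), (0b101, 1u64 << 63));
    assert_eq!(carrying_shr(u64::MAX, 0), (u64::MAX, 0));
    println!("ok");
}
```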

@erik-3milabs erik-3milabs marked this pull request as ready for review January 24, 2025 16:39
@tarcieri
Member

it becomes painfully apparent that Limb::shr is not constant time

@erik-3milabs that's unfortunate to hear, can you open a separate issue? It sounds like a potential security vulnerability

src/uint/shr.rs Outdated
let mut result = *self;
let mut i = 0;

// Speed up shifting up to 8 limbs.
Contributor

How much does this part contribute to the overall speedup?

Contributor Author

Hmm, good question. I'd expect this to improve the performance of shr for large Uint instances. However, my benchmarks fail to provide clear evidence of this effect.

I generalized full_limb_shr over WINDOW_SIZE (hardcoded to 4 in this PR); I made sure to completely remove the speed-up for WINDOW_SIZE=0.

Below are the (condensed) raw results for running shr with various WINDOW_SIZE values. There does seem to be some merit to the idea: for U2048, U8192 and U16384 the running time drops significantly. But the drop is not consistent (e.g., WINDOW_SIZE=2 is slower than WINDOW_SIZE=1 for U2048, U8192 and U16384, and there is no impact on U4096).

I understand that I need more consistent performance to make this speedup actually stick 🤔

right shift/U256/
WINDOW_SIZE: 0      time:   [15.822 ns 15.831 ns 15.840 ns]
WINDOW_SIZE: 1      time:   [15.534 ns 15.544 ns 15.554 ns]
WINDOW_SIZE: 2      time:   [15.888 ns 15.894 ns 15.900 ns]
WINDOW_SIZE: 3      time:   [15.886 ns 15.893 ns 15.900 ns]
WINDOW_SIZE: 4      time:   [15.890 ns 15.897 ns 15.904 ns]
WINDOW_SIZE: 5      time:   [15.936 ns 15.946 ns 15.956 ns]

right shift/U512/
WINDOW_SIZE: 0      time:   [32.916 ns 32.938 ns 32.961 ns]
WINDOW_SIZE: 1      time:   [30.961 ns 30.974 ns 30.988 ns]
WINDOW_SIZE: 2      time:   [32.949 ns 32.962 ns 32.974 ns]
WINDOW_SIZE: 3      time:   [34.661 ns 35.130 ns 35.783 ns]
WINDOW_SIZE: 4      time:   [34.262 ns 34.350 ns 34.443 ns]
WINDOW_SIZE: 5      time:   [35.019 ns 35.121 ns 35.231 ns]

right shift/U1024/
WINDOW_SIZE: 0     time:   [51.829 ns 51.847 ns 51.864 ns]
WINDOW_SIZE: 1     time:   [51.862 ns 51.885 ns 51.910 ns]
WINDOW_SIZE: 2     time:   [51.422 ns 51.435 ns 51.448 ns]
WINDOW_SIZE: 3     time:   [50.870 ns 50.891 ns 50.914 ns]
WINDOW_SIZE: 4     time:   [50.867 ns 50.883 ns 50.898 ns]
WINDOW_SIZE: 5     time:   [50.820 ns 50.833 ns 50.846 ns]

right shift/U2048/
WINDOW_SIZE: 0     time:   [167.11 ns 167.17 ns 167.23 ns]
WINDOW_SIZE: 1     time:   [147.72 ns 147.78 ns 147.82 ns]
WINDOW_SIZE: 2     time:   [173.59 ns 173.68 ns 173.77 ns]
WINDOW_SIZE: 3     time:   [114.72 ns 114.77 ns 114.83 ns]
WINDOW_SIZE: 4     time:   [112.92 ns 113.12 ns 113.34 ns]
WINDOW_SIZE: 5     time:   [114.96 ns 115.03 ns 115.10 ns]

right shift/U4096/
WINDOW_SIZE: 0     time:   [236.76 ns 237.06 ns 237.33 ns]
WINDOW_SIZE: 1     time:   [237.82 ns 238.08 ns 238.32 ns]
WINDOW_SIZE: 2     time:   [234.11 ns 234.41 ns 234.68 ns]
WINDOW_SIZE: 3     time:   [232.93 ns 233.20 ns 233.45 ns]
WINDOW_SIZE: 4     time:   [234.26 ns 234.50 ns 234.72 ns]
WINDOW_SIZE: 5     time:   [230.70 ns 231.10 ns 231.51 ns]

right shift/U8192/
WINDOW_SIZE: 0     time:   [831.95 ns 833.59 ns 835.47 ns]
WINDOW_SIZE: 1     time:   [788.52 ns 789.06 ns 789.62 ns]
WINDOW_SIZE: 2     time:   [847.16 ns 847.81 ns 848.59 ns]
WINDOW_SIZE: 3     time:   [777.58 ns 778.05 ns 778.49 ns]
WINDOW_SIZE: 4     time:   [698.27 ns 698.73 ns 699.21 ns]
WINDOW_SIZE: 5     time:   [623.64 ns 624.28 ns 624.90 ns]

right shift/U16384/
WINDOW_SIZE: 0    time:   [1.6641 µs 1.6650 µs 1.6659 µs]
WINDOW_SIZE: 1    time:   [1.5565 µs 1.5608 µs 1.5657 µs]
WINDOW_SIZE: 2    time:   [1.4930 µs 1.4952 µs 1.4975 µs]
WINDOW_SIZE: 3    time:   [1.6327 µs 1.6339 µs 1.6350 µs]
WINDOW_SIZE: 4    time:   [1.5341 µs 1.5356 µs 1.5374 µs]
WINDOW_SIZE: 5    time:   [1.4262 µs 1.4278 µs 1.4296 µs]

@fjarri
Contributor

fjarri commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

@tarcieri
Member

tarcieri commented Jan 24, 2025

The only branch I'm aware of is the one for the potential panic condition. However, we don't consider conditions that cause panics to be "variable time", even though there is a branch: the only time the branch is taken, the program crashes, so the non-crashing case still runs in constant time. Also, that branch can be elided by the compiler in many cases, especially as the code is annotated #[inline(always)].

Edit: also, notably it's so aggressively inlined by LLVM that I'm having trouble making Godbolt even generate code for it

src/limb/shr.rs Outdated
@@ -72,6 +102,60 @@ mod tests {
assert_eq!(Limb(16) >> 2, Limb(4));
}

#[cfg(target_pointer_width = "64")]
Contributor

I wonder if it's more reliable to use proptest in such situations instead of a number of hand-written tests.

Contributor Author

Could you point me to a good source on proptests? I'm unfamiliar with the concept but would love to learn about it!

Member

You can read about it here: https://proptest-rs.github.io/proptest/proptest/index.html

If you look under the toplevel tests/ directory there are several files named .proptest-regressions. You can look at any of the correspondingly named tests for some proptest examples, e.g. https://github.com/RustCrypto/crypto-bigint/blob/master/tests/boxed_monty_form.rs#L37-L155
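The core proptest idea: generate many random inputs, check an invariant against each, and shrink any failure to a minimal counterexample. A dependency-free sketch of just the generate-and-check loop for a shift invariant (a hand-rolled xorshift PRNG stands in for proptest's strategies; real proptest additionally shrinks failures and records them in .proptest-regressions files):

```rust
/// A tiny xorshift64 PRNG, standing in for proptest's input generation.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut state = 0x9e37_79b9_7f4a_7c15u64;
    for _ in 0..10_000 {
        let x = xorshift64(&mut state);
        let shift = (xorshift64(&mut state) % 64) as u32;
        // Invariant: a right shift matches division by a power of two.
        assert_eq!(x >> shift, x / (1u64 << shift));
    }
    println!("ok");
}
```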

Contributor Author

I added them. Let me know what you think!

@erik-3milabs
Contributor Author

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

My apologies; perhaps I was too quick to jump to a conclusion here. Taking a step back, a better statement is:

Either Limb::shr is vartime, or it is possible to speed up Uint::shr even further.

Why I thought Limb::shr is vartime...

Allow me to elaborate. I jumped to my earlier conclusion because of how Uint::shr has been implemented until now. The "old" implementation makes successive calls to Uint::shr_vartime(1 << i) for i = 0, 1, 2, ..., and then uses Uint::select to pick the result if the ith bit of shift was set. The fact that Uint::shr leverages Uint::shr_vartime as a subroutine strongly suggests that shr_vartime is a thoroughly non-constant-time algorithm, with not a single part of it running in constant time.
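The select-based pattern described here can be sketched on a bare u64 as follows (illustrative names; the real code works limb-wise with Uint::shr_vartime and Uint::select): for each bit i of shift, the power-of-two shift is computed unconditionally and selected with a mask, so the same work runs regardless of the shift value.

```rust
/// Constant-time-style right shift: for each bit of `shift`, always compute
/// the shifted candidate and select it with a mask, mirroring the structure
/// of the old Uint::shr (shr_vartime(1 << i) followed by a select).
fn ct_shr(mut x: u64, shift: u32) -> u64 {
    assert!(shift < u64::BITS);
    let mut i = 0;
    while (1u32 << i) < u64::BITS {
        let shifted = x >> (1u32 << i); // the shr_vartime(1 << i) step
        // All-ones mask iff bit i of `shift` is set; stands in for select.
        let mask = 0u64.wrapping_sub(((shift >> i) & 1) as u64);
        x = (shifted & mask) | (x & !mask);
        i += 1;
    }
    x
}

fn main() {
    for shift in 0..64 {
        assert_eq!(ct_shr(0xdead_beef_cafe_f00d, shift), 0xdead_beef_cafe_f00d >> shift);
    }
    println!("ok");
}
```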

Moreover, the idea to

fn shr(&self, shift: u32) -> Self {
    let (bit_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
    self.intra_limb_shr(bit_shift).limb_shr(limb_shift)
}

was so apparent to me that I assumed there must be a good reason it had not been implemented 🙈 After all, first ideas are usually terrible ideas, especially when working on a large repo like this, with much more intelligent brains than mine working on it.
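The split idea can be sketched on a little-endian [u64; N] as follows (illustrative, variable-time code; the intra_limb_shr/limb_shr helpers are rendered here as two passes over the limbs):

```rust
/// Right-shift a little-endian multi-word integer by splitting `shift`
/// into a whole-limb part and an intra-limb bit part.
fn shr<const N: usize>(x: &[u64; N], shift: u32) -> [u64; N] {
    assert!(shift < 64 * N as u32);
    let (limb_shift, bit_shift) = ((shift / 64) as usize, shift % 64);
    let mut out = [0u64; N];
    // Whole-limb shift: limb i takes the value of limb i + limb_shift.
    for i in 0..N - limb_shift {
        out[i] = x[i + limb_shift];
    }
    // Intra-limb shift: each limb also receives the low bits of its
    // (still unshifted) higher neighbor as a carry.
    if bit_shift != 0 {
        for i in 0..N {
            out[i] >>= bit_shift;
            if i + 1 < N {
                out[i] |= out[i + 1] << (64 - bit_shift);
            }
        }
    }
    out
}

fn main() {
    // (MAX + 0b1011 * 2^64) >> 65 drops all of limb 0 plus bit 0 of limb 1.
    let x = [u64::MAX, 0b1011, 0, 0];
    assert_eq!(shr(&x, 65), [0b101, 0, 0, 0]);
    println!("ok");
}
```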

... and how we might be able to speed up Uint::shr even further.

I did some benchmarking of my own (I'm not familiar enough with tools such as Godbolt to use them for this) and got very stable results when using Limb::shr. This suggests that it might indeed be constant time.

When we use it to replace the Limb::carrying_shr call in Uint::intra_limb_carrying_shr, we save another 44ns when benchmarking on my machine, dropping U2048::shr to ~70ns. Not bad if you ask me! 😄

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 27, 2025

@tarcieri @fjarri
I updated the code and improved the benchmark set. Below are my local results.

The table lists

  • vartime: the vartime performance,
  • old: the old shr implementation,
  • split: splitting shift into its "intra-limb" and full-limb components, using the faster Limb::carrying_shr for the former, and
  • fast split: splitting shift as well as speeding up full limb shifting.

@fjarri, you asked

How much does this part contribute to the overall speedup?

To answer that question: in the "fast split" variant I now use this technique for all limb shifts (not just the eight-limb window we had before). Your question is answered by the difference between the last two columns.

I added a footnote to highlight one result I don't understand... 🤔

limbs   vartime      old        split        fast split
4       2.5372 ns    85.502 ns  8.8319 ns    8.8421 ns
8       4.3408 ns    116.84 ns  13.628 ns    14.009 ns
16      8.0908 ns    164.06 ns  29.176 ns    29.036 ns
32      16.188 ns    290.44 ns  195.14 ns ¹  70.491 ns
64      42.217 ns    576.61 ns  141.09 ns    136.71 ns
128     72.712 ns    1.5387 µs  660.17 ns    281.96 ns
256     182.35 ns    2.8131 µs  1.2757 µs    665.62 ns

And the raw data, in case you'd like to look at the variances:

right shift/overflowing_shr_vartime/4
                        time:   [2.3705 ns 2.5372 ns 2.6880 ns]
                        change: [-2.6951% +4.9023% +13.295%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
right shift/overflowing_shr/4
                        time:   [85.384 ns 85.502 ns 85.674 ns]
                        change: [-0.0424% +0.2264% +0.6129%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
right shift/split_overflowing_shr/4
                        time:   [8.8258 ns 8.8319 ns 8.8377 ns]
                        change: [-1.7688% -1.5376% -1.3147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/fast_split_overflowing_shr/4
                        time:   [8.8354 ns 8.8421 ns 8.8487 ns]
                        change: [-0.5675% -0.1341% +0.2004%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/8
                        time:   [4.1154 ns 4.3408 ns 4.5350 ns]
                        change: [-4.0490% +2.0512% +8.4408%] (p = 0.52 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/8
                        time:   [116.79 ns 116.84 ns 116.90 ns]
                        change: [+0.0388% +0.1941% +0.3382%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe
right shift/split_overflowing_shr/8
                        time:   [13.600 ns 13.628 ns 13.654 ns]
                        change: [+1.2040% +1.6944% +2.0893%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/fast_split_overflowing_shr/8
                        time:   [13.873 ns 14.009 ns 14.155 ns]
                        change: [-2.5360% -1.9567% -1.3958%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/overflowing_shr_vartime/16
                        time:   [7.6717 ns 8.0908 ns 8.4453 ns]
                        change: [-6.9666% -1.1241% +5.3446%] (p = 0.72 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/16
                        time:   [164.02 ns 164.06 ns 164.12 ns]
                        change: [-0.5762% -0.4384% -0.3061%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/16
                        time:   [29.108 ns 29.176 ns 29.265 ns]
                        change: [-0.2187% +0.2073% +0.6774%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
right shift/fast_split_overflowing_shr/16
                        time:   [29.011 ns 29.036 ns 29.060 ns]
                        change: [-1.0360% -0.6637% -0.3492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/32
                        time:   [15.419 ns 16.188 ns 16.845 ns]
                        change: [-7.6622% -2.9743% +1.9597%] (p = 0.22 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/32
                        time:   [290.37 ns 290.44 ns 290.53 ns]
                        change: [-2.0446% -1.9147% -1.8026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/32
                        time:   [195.10 ns 195.14 ns 195.19 ns]
                        change: [-2.6424% -2.4363% -2.2503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/fast_split_overflowing_shr/32
                        time:   [70.229 ns 70.491 ns 70.755 ns]
                        change: [-9.4592% -9.2123% -8.9452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  11 (11.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/64
                        time:   [40.703 ns 42.217 ns 43.531 ns]
                        change: [-5.7245% -2.3106% +1.2632%] (p = 0.22 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/overflowing_shr/64
                        time:   [576.30 ns 576.61 ns 577.01 ns]
                        change: [+0.1246% +0.2759% +0.4082%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
right shift/split_overflowing_shr/64
                        time:   [140.89 ns 141.09 ns 141.27 ns]
                        change: [+0.0042% +0.2691% +0.4997%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
right shift/fast_split_overflowing_shr/64
                        time:   [136.50 ns 136.71 ns 136.89 ns]
                        change: [+0.0655% +0.2786% +0.4818%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
right shift/overflowing_shr_vartime/128
                        time:   [69.208 ns 72.712 ns 75.757 ns]
                        change: [-5.4828% -0.2478% +4.8264%] (p = 0.93 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/128
                        time:   [1.5383 µs 1.5387 µs 1.5391 µs]
                        change: [+5.4762% +6.1564% +6.6780%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/split_overflowing_shr/128
                        time:   [659.83 ns 660.17 ns 660.51 ns]
                        change: [+4.7803% +4.9904% +5.1604%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
right shift/fast_split_overflowing_shr/128
                        time:   [279.45 ns 281.96 ns 284.17 ns]
                        change: [-1.7589% -0.9949% -0.1996%] (p = 0.02 < 0.05)
                        Change within noise threshold.
right shift/overflowing_shr_vartime/256
                        time:   [176.04 ns 182.35 ns 187.73 ns]
                        change: [-1.6892% +1.9530% +5.5595%] (p = 0.31 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/256
                        time:   [2.8119 µs 2.8131 µs 2.8144 µs]
                        change: [+1.1322% +1.3257% +1.4943%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/split_overflowing_shr/256
                        time:   [1.2748 µs 1.2757 µs 1.2766 µs]
                        change: [-0.9445% -0.7247% -0.5119%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
right shift/fast_split_overflowing_shr/256
                        time:   [662.83 ns 665.62 ns 668.21 ns]
                        change: [+0.2058% +0.5884% +0.9747%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

Footnotes

  1. this value concerns me...

@@ -165,7 +165,7 @@ impl<const LIMBS: usize> Uint<LIMBS> {
}

/// Borrow the limbs of this [`Uint`] mutably.
pub fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
pub const fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
Member

Nice!

src/uint/shr.rs Outdated
///
/// Returns `None` if `shift >= Self::BITS`.
pub const fn split_overflowing_shr(&self, shift: u32) -> ConstCtOption<Self> {
let (intra_limb_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
Member

FYI: clippy has an integer_division_remainder_used lint it would be nice to eventually turn on in this crate, if you can find a solution which doesn't use these operators

Contributor Author

done.

@tarcieri
Member

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

@erik-3milabs
Contributor Author

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

That was exactly what I was thinking 😅 @fjarri do you have a better proposal?

src/uint/shr.rs Outdated
///
/// Panics if `shift >= Limb::BITS`.
#[inline(always)]
const fn intra_limb_carrying_shr_internal(&self, shift: u32) -> Self {
Member

Nit: not being pub makes this implicitly internal

Contributor Author

Fixed.

@tarcieri
Copy link
Member

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

Comment on lines +121 to +125
/// Split `shift` into `shift % Limb::BITS` (its intra-limb-shift component), and
/// `shift / Limb::BITS` (its limb-shift component).
///
/// This function achieves this without using a division/remainder operation.
pub(crate) const fn decompose_shift(shift: u32) -> (u32, u32) {
Contributor Author

Given that this function is used in both shl.rs and shr.rs, is there perhaps a better spot to put this?

Contributor

It's literally 3 lines of code. Maybe just copy it around?
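Since Limb::BITS is a power of two (8 × the pointer width), the remainder and quotient reduce to a mask and a shift, which sidesteps the integer_division_remainder_used lint. A sketch matching the signature quoted above, with LIMB_BITS as a stand-in constant:

```rust
const LIMB_BITS: u32 = 64; // stand-in for Limb::BITS; assumed a power of two

/// Split `shift` into (shift % LIMB_BITS, shift / LIMB_BITS) without using
/// the division/remainder operators, relying on LIMB_BITS being a power of
/// two: the remainder is a mask, the quotient a right shift.
const fn decompose_shift(shift: u32) -> (u32, u32) {
    (shift & (LIMB_BITS - 1), shift >> LIMB_BITS.trailing_zeros())
}

fn main() {
    assert_eq!(decompose_shift(0), (0, 0));
    assert_eq!(decompose_shift(63), (63, 0));
    assert_eq!(decompose_shift(130), (2, 2));
    println!("ok");
}
```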

@erik-3milabs
Contributor Author

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

I still can't think of anything better than intra_limb_*. I made it private for now.

I went ahead and "mirrored" the implementation to replace the existing Uint::shl implementation.

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

@tarcieri One thing before we merge this:

On my machine, the benchmarks are fluctuating a teeny-tiny bit: it looks like

Uint::<LIMBS>::ONE.shr(Uint::<LIMBS>::BITS/2 + 10)

is consistently slower than

Uint::<LIMBS>::ONE.shr(10)

The difference is almost nothing (~0.5%) but it is there.
What is the tolerance this library works with?

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

Hmm, the compiler (partially) optimizes out my zero shift 🤔:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [55.765 ns 56.493 ns 57.146 ns]
                        change: [-1.1124% +0.0880% +1.3286%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
left shift/intra_limb_carrying_shl(1)/64
                        time:   [56.976 ns 57.773 ns 58.464 ns]
                        change: [-1.0465% +0.2521% +1.5104%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [56.803 ns 57.610 ns 58.300 ns]
                        change: [-1.2401% -0.0472% +1.2422%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(0)/128
                        time:   [103.58 ns 106.12 ns 108.38 ns]
                        change: [-3.7659% -0.9938% +1.7795%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(1)/128
                        time:   [143.28 ns 145.11 ns 146.72 ns]
                        change: [-1.4626% -0.2726% +0.9269%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [143.67 ns 145.68 ns 147.50 ns]
                        change: [-1.1970% +0.0733% +1.4213%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(0)/256
                        time:   [211.32 ns 216.86 ns 221.59 ns]
                        change: [-3.1180% -0.3510% +2.4557%] (p = 0.79 > 0.05)
                        No change in performance detected.
left shift/intra_limb_carrying_shl(1)/256
                        time:   [278.80 ns 282.88 ns 286.51 ns]
                        change: [-1.9677% -0.3737% +1.2130%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [279.81 ns 283.84 ns 287.33 ns]
                        change: [-1.3217% +0.2101% +1.8418%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
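One possible way to keep the compiler from constant-folding a shift it can prove is zero is to launder the shift amount through std::hint::black_box, rather than deriving an opaque zero by hand. A minimal stdlib-only sketch (the timing loop is illustrative, not a criterion benchmark):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Without black_box, a literal 0 lets the compiler fold `i >> 0` away
    // entirely; black_box keeps the value opaque, so the measured code path
    // is the same one that runs for nonzero shifts.
    let shift: u32 = black_box(0);
    let mut acc = 0u64;
    let start = Instant::now();
    for i in 0..1_000_000u64 {
        acc = acc.wrapping_add(black_box(i >> shift));
    }
    println!("acc = {acc}, took {:?}", start.elapsed());
}
```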

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

full_limb_overflowing_shl might also not be as constant-time as we were hoping: full_limb_overflowing_shl(LIMBS-1) is consistently a (significant) outlier. How does that work, though? 🤔

left shift/full_limb_overflowing_shl(0)/64
                        time:   [133.76 ns 134.28 ns 134.75 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/64
                        time:   [132.82 ns 133.30 ns 133.74 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/full_limb_overflowing_shl(5)/64
                        time:   [133.07 ns 133.52 ns 133.96 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/64
                        time:   [134.69 ns 135.09 ns 135.48 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(0)/128
                        time:   [269.71 ns 271.83 ns 273.78 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(1)/128
                        time:   [269.01 ns 270.96 ns 272.70 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/128
                        time:   [269.63 ns 271.79 ns 273.74 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/128
                        time:   [275.54 ns 277.45 ns 279.22 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(0)/256
                        time:   [636.37 ns 638.88 ns 641.05 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/256
                        time:   [635.93 ns 638.91 ns 641.63 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/256
                        time:   [636.87 ns 639.65 ns 642.13 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(LIMBS-1)/256
                        time:   [647.82 ns 650.69 ns 653.35 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

@erik-3milabs
Contributor Author

I think this idea just went bust.

I benchmarked intra_limb_carrying_shl some more. It has become evident that this function does not execute in constant time. Perhaps it is the compiler, perhaps the CPU. Either way: it makes this code unusable.

I'm out of ideas for now. Unless one of you has an idea on preventing this, I propose we close this PR for now.

@erik-3milabs erik-3milabs marked this pull request as draft January 31, 2025 11:30
@erik-3milabs erik-3milabs mentioned this pull request Jan 31, 2025
@fjarri
Contributor

fjarri commented Jan 31, 2025

I don't really see anything that would account for variable-time execution. How are you benchmarking it, exactly?

@erik-3milabs
Contributor Author

@fjarri let's return to this.

I don't really see anything that would account for variable-time execution. How are you benchmarking it, exactly?

fn shl_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>) {
    // A value that is always zero, but that the compiler cannot prove to be
    // zero: a random value below 2^10, shifted right by 11 bits.
    let zero = Uint::<LIMBS>::random_mod(&mut OsRng, &Uint::ONE.shl(10).to_nz().unwrap()).shr(11);

    // `shift` is always zero, but opaque to the compiler.
    let shift = zero.as_limbs()[0].0 as u32;
    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(0)", LIMBS), |b| {
        b.iter_batched(
            || Uint::<LIMBS>::ONE,
            |x| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(Limb::BITS-1)", LIMBS), |b| {
        b.iter_batched(
            || (Uint::<LIMBS>::ONE, Limb::BITS-1),
            |(x, shift)| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
}

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");
    shl_benchmarks::<{ U4096::LIMBS }>(&mut group);
    shl_benchmarks::<{ U8192::LIMBS }>(&mut group);
    shl_benchmarks::<{ U16384::LIMBS }>(&mut group);
    shl_benchmarks::<{ U32768::LIMBS }>(&mut group);

    group.finish();
}

will give this output on my machine:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [63.015 ns 64.204 ns 65.263 ns]
                        change: [+1.4139% +3.7152% +5.9009%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [69.013 ns 69.628 ns 70.243 ns]
                        change: [-10.035% -8.3799% -6.8060%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
left shift/intra_limb_carrying_shl(0)/128
                        time:   [113.22 ns 116.19 ns 119.08 ns]
                        change: [-10.609% -5.6487% -1.1644%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [149.33 ns 152.70 ns 156.34 ns]
                        change: [-6.8919% -4.3440% -1.5910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
left shift/intra_limb_carrying_shl(0)/256
                        time:   [223.04 ns 231.61 ns 240.23 ns]
                        change: [-4.8209% -2.3135% +0.2843%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) low mild
  8 (8.00%) high mild
  8 (8.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [288.18 ns 292.85 ns 297.30 ns]
                        change: [-12.275% -8.1139% -4.3964%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) low mild
  10 (10.00%) high mild
  4 (4.00%) high severe
left shift/intra_limb_carrying_shl(0)/512
                        time:   [506.98 ns 516.80 ns 526.50 ns]
                        change: [-2.7622% -1.0253% +0.7468%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 18 outliers among 100 measurements (18.00%)
  5 (5.00%) low severe
  6 (6.00%) high mild
  7 (7.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/512
                        time:   [634.03 ns 643.39 ns 652.63 ns]
                        change: [-9.6863% -7.9263% -6.0933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  8 (8.00%) high mild
  5 (5.00%) high severe

Based on these results, I'm led to believe that either

  1. there is something wrong with the way I benchmarked this code, or
  2. the compiler somehow understands that my non-constant zero is actually zero and optimizes these operations away, or
  3. the CPU executes this algorithm in non-constant time

What am I missing?
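Hypothesis 2 can be probed directly: if the compiler can prove the shift is zero at the call site, it is free to specialize the whole loop away, and `black_box` on the *output* alone does not prevent that. A minimal sketch (illustrative only, using a simplified fixed 4-limb shift rather than the PR's code) of making the shift amount itself opaque to the optimizer:

```rust
use core::hint::black_box;

// Simplified 4-limb carrying left shift, for illustration only.
fn shl_words(limbs: &[u64; 4], shift: u32) -> [u64; 4] {
    debug_assert!(shift < 64);
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for (o, &l) in out.iter_mut().zip(limbs) {
        *o = (l << shift) | carry;
        // High bits displaced into the next limb; the double shift avoids an
        // invalid shift amount of 64 when `shift == 0`.
        carry = (l >> 1) >> (63 - shift);
    }
    out
}

fn main() {
    let x = [1u64, 2, 3, 4];
    // Transparent: the optimizer can see `0` and may fold the shift away.
    let transparent = shl_words(&x, 0);
    // Opaque: `black_box` forces the optimizer to treat the shift as an
    // arbitrary runtime value, so the generated code cannot specialize on it.
    let opaque = shl_words(&x, black_box(0));
    assert_eq!(transparent, opaque); // identical results; only codegen differs
}
```

If the timing gap disappears once the shift amount is routed through `black_box` in the benchmark setup rather than only around the result, that would point at hypothesis 2 rather than the CPU.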

@fjarri
Copy link
Contributor

fjarri commented Mar 12, 2025

I don't think this is conclusive; this could really just be variation caused by OS scheduling. https://crates.io/crates/dudect-bencher gives more accurate results, but it's not perfect either. I remember there was some tool mentioned during the audit that analyzes the actual generated assembly, but I cannot recall at the moment what it was called.

@dvdplm
Copy link
Contributor

dvdplm commented Mar 12, 2025

@erik-3milabs Maybe you can use a seedable RNG so that each run uses the exact same numbers: if you still see variability you can't account for, then that'll be the OS scheduling and similar shenanigans. Once you know how large this "unaccountable variability" is you can then vary the seed and check that the performance is still stable between measurements (or at least never larger than the "unaccountable variability").

@tarcieri
Copy link
Member

I remember there was some tool mentioned during the audit that analyzes the actual generated assembly, but I cannot recall at the moment what it was called.

I started looking into BINSEC and cargo checkct for these purposes but ran into some build problems

If this is really the problem, for example:

the compiler somehow understands that my non-constant zero is actually zero and optimizes these operations away,

...it caught a similar issue in curve25519-dalek where the compiler spotted an opportunity to insert a branch to optimize away an all-zero mask

@erik-3milabs
Copy link
Contributor Author

@fjarri @dvdplm please let me know what you think of the following.

I re-did the experiment, this time using a seedable RNG to obtain my non-const zero. The benchmark code is as follows:

#[inline]
fn shl_zero_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>, seed: u64) {
    let mut rng = ChaCha8Rng::seed_from_u64(seed);

    // note: is zero
    let zero = Uint::<LIMBS>::random_mod(&mut rng, &Uint::ONE.shl(10).to_nz().unwrap()).shr(11);

    // note: always zero
    let shift = zero.as_limbs()[0].0 as u32;

    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(0)", LIMBS*100 + seed as usize), |b| {
        b.iter_batched(
            || Uint::<LIMBS>::ONE,
            |x| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
}

#[inline]
fn shl_max_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>) {
    group.bench_function(
        BenchmarkId::new("intra_limb_carrying_shl(MAX)", LIMBS),
        |b| {
            b.iter_batched(
                || (Uint::<LIMBS>::ONE, Limb::BITS - 1),
                |(x, shift)| black_box(x.intra_limb_carrying_shl(shift)),
                BatchSize::SmallInput,
            )
        },
    );
}

With these benchmark functions, I first ran the following experiment 7 times (I tried 10, but I interrupted my machine, garbling the results of the eighth round, so I decided to discard them):

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");
    
    shl_zero_benchmarks::<{ U256::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U512::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U1024::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U2048::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U4096::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U8192::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U16384::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U32768::LIMBS }>(&mut group, 0);
    shl_max_benchmarks::<{ U256::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U512::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U1024::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U2048::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U4096::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U8192::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U16384::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U32768::LIMBS }>(&mut group);
}

Dividing the difference between the minimum and maximum by the number of limbs yields the following table:

| LIMBS | ZERO | MAX  |
|------:|-----:|-----:|
|     4 | 0.02 | 0.02 |
|     8 | 0.11 | 0.02 |
|    16 | 0.09 | 0.11 |
|    32 | 0.02 | 0.03 |
|    64 | 0.04 | 0.03 |
|   128 | 0.02 | 0.02 |
|   256 | 0.03 | 0.01 |
|   512 | 0.04 | 0.02 |

From this, I conclude that the "variability" due to CPU scheduling on my machine never exceeds 0.12 ns per limb.

Then, I performed the second experiment:

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");

    for i in 0..10 {
        shl_zero_benchmarks::<{ U256::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U512::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U1024::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U2048::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U4096::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U8192::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U16384::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U32768::LIMBS }>(&mut group, i);
    }
}

Again, I aggregated the results by integer size, and computed (max() - min())/#limbs. I've included the results below.

| LIMBS | ZERO |
|------:|-----:|
|     4 | 0.15 |
|     8 | 0.04 |
|    16 | 0.01 |
|    32 | 0.05 |
|    64 | 0.04 |
|   128 | 0.03 |
|   256 | 0.01 |
|   512 | 0.03 |

The highest value of "0.15" for four limbs exceeds the previously found bound. As such, I think it is safe to conclude that the variability due to the RNG + CPU scheduling is less than 0.16 ns per limb.

We now look at the maxima obtained for the zero-shift and the minima obtained for the max-shift during the first experiment. These are as follows:

| #LIMBS | MAX ZERO-SHIFT | MIN MAX-SHIFT |   DIFF | DIFF/LIMBS |
|-------:|---------------:|--------------:|-------:|-----------:|
|      4 |          3.355 |         3.353 |   0.00 |       0.00 |
|      8 |         15.068 |        16.827 |   1.76 |       0.22 |
|     16 |         23.633 |        21.433 |  -2.20 |      -0.14 |
|     32 |         34.907 |        47.844 |  12.94 |       0.40 |
|     64 |         58.580 |        69.848 |  11.27 |       0.18 |
|    128 |        119.660 |       150.580 |  30.92 |       0.24 |
|    256 |        236.170 |       295.680 |  59.51 |       0.23 |
|    512 |        545.000 |       652.750 | 107.75 |       0.21 |

For 32 or more limbs, the difference between these two measurements clearly exceeds the aforementioned variability due to RNG or CPU scheduling. As such, I'm confident that this indicates non-constant-time behavior (possibly due to caching?).

WDYT?

@dvdplm
Copy link
Contributor

dvdplm commented Mar 20, 2025

First of all, I am by no means a constant-time expert and I am definitely out of my depth here. The only way to actually know what is going on is to check the assembly.
Doing CT in Rust is a "best effort" thing (true for any other language that isn't assembly, probably).

If we posit that your measurements prove actual non-CT, the question becomes "how much non-CT does an adversary really need to be able to pull off an attack?". I really don't know the answer to that, beyond saying "the less we have the better".

If your new impl is "more CT" than the old one, and also faster, I think that is all we need to know and that we should merge the PR.

As a tangential note I read Pornin's paper on "Optimized Binary GCD for Modular Inversion" yesterday and found this note on page 7:

The last Intel CPU on which shifts were not constant-time with regard to the shift count was the Pentium IV, in the early 2000s, but that CPU does not offer lzcnt.

If that holds for Arm and AMD CPUs as well, wouldn't that indicate that Limb-level shifts are guaranteed to be CT?
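For what it's worth, even when the hardware shift instruction is constant-time, the compiler can still insert a value-dependent branch around it. One defense (a sketch for illustration, not what crypto-bigint does) is to make the word-level variable shift branchless by construction, selecting each partial shift with an all-ones/all-zeros mask:

```rust
/// Branchless variable left shift of a 64-bit word: process the shift amount
/// one bit at a time, selecting between `r` and `r << (1 << i)` with a mask
/// derived from bit `i` of `shift`. There is no value-dependent branch for
/// the compiler to reintroduce. (A sketch only, not crypto-bigint's code.)
fn ct_shl_word(x: u64, shift: u32) -> u64 {
    debug_assert!(shift < 64);
    let mut r = x;
    for i in 0..6 {
        // mask = u64::MAX if bit `i` of `shift` is set, else 0
        let mask = (((shift >> i) & 1) as u64).wrapping_neg();
        // Both arms are always computed; the mask selects one of them.
        r = (r & !mask) | ((r << (1u32 << i)) & mask);
    }
    r
}

fn main() {
    assert_eq!(ct_shl_word(1, 5), 32);
    assert_eq!(ct_shl_word(3, 63), 3u64 << 63);
}
```

The cost is six mask-select steps per word instead of one shift, so this only pays off if the plain shift really is being compiled into a branch.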

@erik-3milabs
Copy link
Contributor Author

Whoops, wrong button 🙈

If your new impl is "more CT" than the old one, and also faster, I think that is all we need to know and that we should merge the PR.

Perhaps this got lost in communication, but I'm only claiming that the implementation introduced in this PR exhibits non-constant-time behavior; running the same benchmark suite on the master branch shows no discernible difference in runtime between shifting by 0 or by Uint::<LIMBS>::BITS - 1.

As a tangential note I read Pornin's paper on "Optimized Binary GCD for Modular Inversion" yesterday and found this note on page 7:

The last Intel CPU on which shifts were not constant-time with regard to the shift count was the Pentium IV, in the early 2000s, but that CPU does not offer lzcnt.

If that holds for Arm and AMD CPUs as well, wouldn't that indicate that Limb-level shifts are guaranteed to be CT?

This is one of the reasons why I started this discussion. I, too, was under the impression that shifts were constant-time with respect to the shift amount, so I was surprised to find that this code suggests the contrary. Perhaps the compiler discards several operations based on the value of the shift?

I don't know how to rewrite the code such that it both (i) preserves the performance improvement and (ii) runs in constant time. Unless anybody else has ideas, I propose to close this PR.
