
Speedup Uint::shr #753

Closed
wants to merge 21 commits
Conversation

erik-3milabs
Contributor

@erik-3milabs erik-3milabs commented Jan 24, 2025

Refactor Uint::shr, achieving a >60% speed up for U2048.

Benchmarks (run on commit b2dcb20)

right shift/shr, U2048  time:   [292.04 ns 292.16 ns 292.30 ns]
                        change: [-0.2073% -0.0666% +0.0669%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
right shift/new shr, U2048
                        time:   [114.34 ns 114.48 ns 114.62 ns]
                        change: [+0.8375% +1.0410% +1.2401%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 24, 2025

@tarcieri, can you think of ways to better integrate the (helper) functions I wrote? This might improve the quality of the code base and extend the speedup to other shr operations.

Also, once you're more or less on board with this PR, I'd be happy to code up the same improvement for shl.

@erik-3milabs erik-3milabs marked this pull request as draft January 24, 2025 16:12
@erik-3milabs erik-3milabs force-pushed the speedup_shr branch 3 times, most recently from 201733e to 1579e18 Compare January 24, 2025 16:33
@erik-3milabs
Contributor Author

erik-3milabs commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

We could rename the current implementation as shr_vartime and construct a new shr as follows:

pub const fn shr(self, shift: u32) -> Self {
    assert!(shift < Self::BITS);
    self.carrying_shr(shift).0
}

This would not change the interface of shr, meaning it could fit in 0.6.1
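For context, a carrying right shift returns both the shifted value and the bits shifted out (aligned to the top of the word), so that multi-limb shifts can propagate them. A minimal sketch on a bare u64 (illustrative only; the actual Limb::carrying_shr signature may differ), handling the shift == 0 edge case with a mask rather than a branch:

```rust
/// Shift `x` right by `shift` bits (0 <= shift < 64), returning the shifted
/// value and the bits shifted out, aligned to the top of a word.
fn carrying_shr(x: u64, shift: u32) -> (u64, u64) {
    assert!(shift < u64::BITS);
    // For shift == 0 the carry expression would shift by 64, which panics in
    // debug builds; wrapping_shl masks the amount, and the all-zero mask then
    // discards the bogus result without branching on `shift`.
    let carry = x.wrapping_shl(u64::BITS - shift);
    let nonzero_mask = 0u64.wrapping_sub((shift != 0) as u64);
    (x >> shift, carry & nonzero_mask)
}

fn main() {
    assert_eq!(carrying_shr(0b1011, 1), (0b101, 1u64 << 63));
    assert_eq!(carrying_shr(u64::MAX, 0), (u64::MAX, 0));
    println!("ok");
}
```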

@erik-3milabs erik-3milabs marked this pull request as ready for review January 24, 2025 16:39
@tarcieri
Member

it becomes painfully apparent that Limb::shr is not constant time

@erik-3milabs that's unfortunate to hear, can you open a separate issue? It sounds like a potential security vulnerability

src/uint/shr.rs Outdated
let mut result = *self;
let mut i = 0;

// Speed up shifting up to 8 limbs.
Contributor

How much does this part contribute to the overall speedup?

Contributor Author

Hmm, good question. I'd expect this to improve the performance of shr for large Uint instances. However, my benchmarks fail to provide clear evidence of this effect.

I generalized full_limb_shr over WINDOW_SIZE (hardcoded to 4 in this PR); I made sure to completely remove the speed-up for WINDOW_SIZE=0.

Below are the (condensed) raw results for running shr with various WINDOW_SIZE values. There does seem to be some merit to the idea: for U2048, U8192 and U16384 the running time drops significantly. But the drop is not consistent (e.g., WINDOW_SIZE=2 is slower than WINDOW_SIZE=1 for U2048, U8192 and U16384, and there is no impact on U4096).

I understand that I need more consistent performance to make this speedup actually stick 🤔

right shift/U256/
WINDOW_SIZE: 0      time:   [15.822 ns 15.831 ns 15.840 ns]
WINDOW_SIZE: 1      time:   [15.534 ns 15.544 ns 15.554 ns]
WINDOW_SIZE: 2      time:   [15.888 ns 15.894 ns 15.900 ns]
WINDOW_SIZE: 3      time:   [15.886 ns 15.893 ns 15.900 ns]
WINDOW_SIZE: 4      time:   [15.890 ns 15.897 ns 15.904 ns]
WINDOW_SIZE: 5      time:   [15.936 ns 15.946 ns 15.956 ns]

right shift/U512/
WINDOW_SIZE: 0      time:   [32.916 ns 32.938 ns 32.961 ns]
WINDOW_SIZE: 1      time:   [30.961 ns 30.974 ns 30.988 ns]
WINDOW_SIZE: 2      time:   [32.949 ns 32.962 ns 32.974 ns]
WINDOW_SIZE: 3      time:   [34.661 ns 35.130 ns 35.783 ns]
WINDOW_SIZE: 4      time:   [34.262 ns 34.350 ns 34.443 ns]
WINDOW_SIZE: 5      time:   [35.019 ns 35.121 ns 35.231 ns]

right shift/U1024/
WINDOW_SIZE: 0     time:   [51.829 ns 51.847 ns 51.864 ns]
WINDOW_SIZE: 1     time:   [51.862 ns 51.885 ns 51.910 ns]
WINDOW_SIZE: 2     time:   [51.422 ns 51.435 ns 51.448 ns]
WINDOW_SIZE: 3     time:   [50.870 ns 50.891 ns 50.914 ns]
WINDOW_SIZE: 4     time:   [50.867 ns 50.883 ns 50.898 ns]
WINDOW_SIZE: 5     time:   [50.820 ns 50.833 ns 50.846 ns]

right shift/U2048/
WINDOW_SIZE: 0     time:   [167.11 ns 167.17 ns 167.23 ns]
WINDOW_SIZE: 1     time:   [147.72 ns 147.78 ns 147.82 ns]
WINDOW_SIZE: 2     time:   [173.59 ns 173.68 ns 173.77 ns]
WINDOW_SIZE: 3     time:   [114.72 ns 114.77 ns 114.83 ns]
WINDOW_SIZE: 4     time:   [112.92 ns 113.12 ns 113.34 ns]
WINDOW_SIZE: 5     time:   [114.96 ns 115.03 ns 115.10 ns]

right shift/U4096/
WINDOW_SIZE: 0     time:   [236.76 ns 237.06 ns 237.33 ns]
WINDOW_SIZE: 1     time:   [237.82 ns 238.08 ns 238.32 ns]
WINDOW_SIZE: 2     time:   [234.11 ns 234.41 ns 234.68 ns]
WINDOW_SIZE: 3     time:   [232.93 ns 233.20 ns 233.45 ns]
WINDOW_SIZE: 4     time:   [234.26 ns 234.50 ns 234.72 ns]
WINDOW_SIZE: 5     time:   [230.70 ns 231.10 ns 231.51 ns]

right shift/U8192/
WINDOW_SIZE: 0     time:   [831.95 ns 833.59 ns 835.47 ns]
WINDOW_SIZE: 1     time:   [788.52 ns 789.06 ns 789.62 ns]
WINDOW_SIZE: 2     time:   [847.16 ns 847.81 ns 848.59 ns]
WINDOW_SIZE: 3     time:   [777.58 ns 778.05 ns 778.49 ns]
WINDOW_SIZE: 4     time:   [698.27 ns 698.73 ns 699.21 ns]
WINDOW_SIZE: 5     time:   [623.64 ns 624.28 ns 624.90 ns]

right shift/U16384/
WINDOW_SIZE: 0    time:   [1.6641 µs 1.6650 µs 1.6659 µs]
WINDOW_SIZE: 1    time:   [1.5565 µs 1.5608 µs 1.5657 µs]
WINDOW_SIZE: 2    time:   [1.4930 µs 1.4952 µs 1.4975 µs]
WINDOW_SIZE: 3    time:   [1.6327 µs 1.6339 µs 1.6350 µs]
WINDOW_SIZE: 4    time:   [1.5341 µs 1.5356 µs 1.5374 µs]
WINDOW_SIZE: 5    time:   [1.4262 µs 1.4278 µs 1.4296 µs]

@fjarri
Contributor

fjarri commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

@tarcieri
Member

tarcieri commented Jan 24, 2025

The only branch I'm aware of is the one for the potential panic condition. However, we don't consider conditions that cause panics to be "variable time", even though there is a branch: the only time the branch is taken, the program crashes, so the non-crashing case still runs in constant time. Also, that branch can be elided by the compiler in many cases, especially as the code is annotated #[inline(always)].

Edit: also, notably it's so aggressively inlined by LLVM that I'm having trouble making Godbolt even generate code for it

src/limb/shr.rs Outdated
@@ -72,6 +102,60 @@ mod tests {
assert_eq!(Limb(16) >> 2, Limb(4));
}

#[cfg(target_pointer_width = "64")]
Contributor

I wonder if it's more reliable to use proptest in such situations instead of a number of hand-written tests.

Contributor Author

Could you point me to a good source on proptests? I'm unfamiliar with the concept but would love to learn about it!

Member

You can read about it here: https://proptest-rs.github.io/proptest/proptest/index.html

If you look under the toplevel tests/ directory there are several files named .proptest-regressions. You can look at any of the correspondingly named tests for some proptest examples, e.g. https://github.com/RustCrypto/crypto-bigint/blob/master/tests/boxed_monty_form.rs#L37-L155
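The core proptest idea: generate many random inputs, check an invariant against each, and shrink any failure to a minimal counterexample. A dependency-free sketch of just the generate-and-check loop for a shift invariant (a hand-rolled xorshift PRNG stands in for proptest's strategies; real proptest additionally shrinks failures and records them in .proptest-regressions files):

```rust
/// A tiny xorshift64 PRNG, standing in for proptest's input generation.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut state = 0x9e37_79b9_7f4a_7c15u64;
    for _ in 0..10_000 {
        let x = xorshift64(&mut state);
        let shift = (xorshift64(&mut state) % 64) as u32;
        // Invariant: a right shift matches division by a power of two.
        assert_eq!(x >> shift, x / (1u64 << shift));
    }
    println!("ok");
}
```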

Contributor Author

I added them. Let me know what you think!

@erik-3milabs
Contributor Author

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

My apologies; perhaps I was too quick to jump to a conclusion here. Taking a step back, a better statement is:

Either Limb::shr is vartime, or it is possible to speed up Uint::shr even further.

Why I thought Limb::shr is vartime...

Allow me to elaborate. I jumped to my earlier conclusion because of how Uint::shr has been implemented until now. The "old" implementation makes successive calls to Uint::shr_vartime(1 << i) for i = 0, 1, 2, ..., and then uses Uint::select to pick the result if the ith bit of shift was set. The fact that Uint::shr leverages Uint::shr_vartime as a subroutine strongly suggests that shr_vartime is a thoroughly non-constant-time algorithm, with not a single part of it running in constant time.
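The select-based pattern described here can be sketched on a bare u64 as follows (illustrative names; the real code works limb-wise with Uint::shr_vartime and Uint::select): for each bit i of shift, the power-of-two shift is computed unconditionally and selected with a mask, so the same work runs regardless of the shift value.

```rust
/// Constant-time-style right shift: for each bit of `shift`, always compute
/// the shifted candidate and select it with a mask, mirroring the structure
/// of the old Uint::shr (shr_vartime(1 << i) followed by a select).
fn ct_shr(mut x: u64, shift: u32) -> u64 {
    assert!(shift < u64::BITS);
    let mut i = 0;
    while (1u32 << i) < u64::BITS {
        let shifted = x >> (1u32 << i); // the shr_vartime(1 << i) step
        // All-ones mask iff bit i of `shift` is set; stands in for select.
        let mask = 0u64.wrapping_sub(((shift >> i) & 1) as u64);
        x = (shifted & mask) | (x & !mask);
        i += 1;
    }
    x
}

fn main() {
    for shift in 0..64 {
        assert_eq!(ct_shr(0xdead_beef_cafe_f00d, shift), 0xdead_beef_cafe_f00d >> shift);
    }
    println!("ok");
}
```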

Moreover, the idea to

fn shr(&self, shift: u32) -> Self {
    let (bit_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
    self.intra_limb_shr(bit_shift).limb_shr(limb_shift)
}

was so apparent to me that I assumed there must be a good reason it had not been implemented 🙈 After all, first ideas are usually terrible ideas, especially when working on a large repo like this, with much more intelligent brains than mine working on it.
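The split idea can be sketched on a little-endian [u64; N] as follows (illustrative, variable-time code; the intra_limb_shr/limb_shr helpers are rendered here as two passes over the limbs):

```rust
/// Right-shift a little-endian multi-word integer by splitting `shift`
/// into a whole-limb part and an intra-limb bit part.
fn shr<const N: usize>(x: &[u64; N], shift: u32) -> [u64; N] {
    assert!(shift < 64 * N as u32);
    let (limb_shift, bit_shift) = ((shift / 64) as usize, shift % 64);
    let mut out = [0u64; N];
    // Whole-limb shift: limb i takes the value of limb i + limb_shift.
    for i in 0..N - limb_shift {
        out[i] = x[i + limb_shift];
    }
    // Intra-limb shift: each limb also receives the low bits of its
    // (still unshifted) higher neighbor as a carry.
    if bit_shift != 0 {
        for i in 0..N {
            out[i] >>= bit_shift;
            if i + 1 < N {
                out[i] |= out[i + 1] << (64 - bit_shift);
            }
        }
    }
    out
}

fn main() {
    // (MAX + 0b1011 * 2^64) >> 65 drops all of limb 0 plus bit 0 of limb 1.
    let x = [u64::MAX, 0b1011, 0, 0];
    assert_eq!(shr(&x, 65), [0b101, 0, 0, 0]);
    println!("ok");
}
```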

... and how we might be able to speed up Uint::shr even further.

I did some benchmarking of my own (I'm not familiar enough with tools such as Godbolt to use them for this) and got very stable results when using Limb::shr. This suggests that it might indeed be constant time.

When we use it to replace the Limb::carrying_shr call in Uint::intra_limb_carrying_shr, we save another 44ns when benchmarking on my machine, dropping U2048::shr to ~70ns. Not bad if you ask me! 😄

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 27, 2025

@tarcieri @fjarri
I updated the code and improved the benchmark set. Below are my local results.

The table lists

  • vartime: the vartime performance,
  • old: the old shr implementation,
  • split: splitting shift into its "intra-limb" and full-limb components, using the faster Limb::carrying_shr for the former, and
  • fast split: splitting shift as well as speeding up full limb shifting.

@fjarri, you asked

How much does this part contribute to the overall speedup?

To answer that question: in the "fast split" variant I now use this technique for all limb shifts (not just the eight-limb window we had before). Your question is answered by the difference between the last two columns.

I added a footnote to highlight one result I don't understand... 🤔

limbs   vartime      old        split        fast split
4       2.5372 ns    85.502 ns  8.8319 ns    8.8421 ns
8       4.3408 ns    116.84 ns  13.628 ns    14.009 ns
16      8.0908 ns    164.06 ns  29.176 ns    29.036 ns
32      16.188 ns    290.44 ns  195.14 ns ¹  70.491 ns
64      42.217 ns    576.61 ns  141.09 ns    136.71 ns
128     72.712 ns    1.5387 µs  660.17 ns    281.96 ns
256     182.35 ns    2.8131 µs  1.2757 µs    665.62 ns

And the raw data, in case you'd like to look at the variances:

right shift/overflowing_shr_vartime/4
                        time:   [2.3705 ns 2.5372 ns 2.6880 ns]
                        change: [-2.6951% +4.9023% +13.295%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
right shift/overflowing_shr/4
                        time:   [85.384 ns 85.502 ns 85.674 ns]
                        change: [-0.0424% +0.2264% +0.6129%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
right shift/split_overflowing_shr/4
                        time:   [8.8258 ns 8.8319 ns 8.8377 ns]
                        change: [-1.7688% -1.5376% -1.3147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/fast_split_overflowing_shr/4
                        time:   [8.8354 ns 8.8421 ns 8.8487 ns]
                        change: [-0.5675% -0.1341% +0.2004%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/8
                        time:   [4.1154 ns 4.3408 ns 4.5350 ns]
                        change: [-4.0490% +2.0512% +8.4408%] (p = 0.52 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/8
                        time:   [116.79 ns 116.84 ns 116.90 ns]
                        change: [+0.0388% +0.1941% +0.3382%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe
right shift/split_overflowing_shr/8
                        time:   [13.600 ns 13.628 ns 13.654 ns]
                        change: [+1.2040% +1.6944% +2.0893%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/fast_split_overflowing_shr/8
                        time:   [13.873 ns 14.009 ns 14.155 ns]
                        change: [-2.5360% -1.9567% -1.3958%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/overflowing_shr_vartime/16
                        time:   [7.6717 ns 8.0908 ns 8.4453 ns]
                        change: [-6.9666% -1.1241% +5.3446%] (p = 0.72 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/16
                        time:   [164.02 ns 164.06 ns 164.12 ns]
                        change: [-0.5762% -0.4384% -0.3061%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/16
                        time:   [29.108 ns 29.176 ns 29.265 ns]
                        change: [-0.2187% +0.2073% +0.6774%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
right shift/fast_split_overflowing_shr/16
                        time:   [29.011 ns 29.036 ns 29.060 ns]
                        change: [-1.0360% -0.6637% -0.3492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/32
                        time:   [15.419 ns 16.188 ns 16.845 ns]
                        change: [-7.6622% -2.9743% +1.9597%] (p = 0.22 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/32
                        time:   [290.37 ns 290.44 ns 290.53 ns]
                        change: [-2.0446% -1.9147% -1.8026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/32
                        time:   [195.10 ns 195.14 ns 195.19 ns]
                        change: [-2.6424% -2.4363% -2.2503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/fast_split_overflowing_shr/32
                        time:   [70.229 ns 70.491 ns 70.755 ns]
                        change: [-9.4592% -9.2123% -8.9452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  11 (11.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/64
                        time:   [40.703 ns 42.217 ns 43.531 ns]
                        change: [-5.7245% -2.3106% +1.2632%] (p = 0.22 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/overflowing_shr/64
                        time:   [576.30 ns 576.61 ns 577.01 ns]
                        change: [+0.1246% +0.2759% +0.4082%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
right shift/split_overflowing_shr/64
                        time:   [140.89 ns 141.09 ns 141.27 ns]
                        change: [+0.0042% +0.2691% +0.4997%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
right shift/fast_split_overflowing_shr/64
                        time:   [136.50 ns 136.71 ns 136.89 ns]
                        change: [+0.0655% +0.2786% +0.4818%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
right shift/overflowing_shr_vartime/128
                        time:   [69.208 ns 72.712 ns 75.757 ns]
                        change: [-5.4828% -0.2478% +4.8264%] (p = 0.93 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/128
                        time:   [1.5383 µs 1.5387 µs 1.5391 µs]
                        change: [+5.4762% +6.1564% +6.6780%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/split_overflowing_shr/128
                        time:   [659.83 ns 660.17 ns 660.51 ns]
                        change: [+4.7803% +4.9904% +5.1604%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
right shift/fast_split_overflowing_shr/128
                        time:   [279.45 ns 281.96 ns 284.17 ns]
                        change: [-1.7589% -0.9949% -0.1996%] (p = 0.02 < 0.05)
                        Change within noise threshold.
right shift/overflowing_shr_vartime/256
                        time:   [176.04 ns 182.35 ns 187.73 ns]
                        change: [-1.6892% +1.9530% +5.5595%] (p = 0.31 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/256
                        time:   [2.8119 µs 2.8131 µs 2.8144 µs]
                        change: [+1.1322% +1.3257% +1.4943%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/split_overflowing_shr/256
                        time:   [1.2748 µs 1.2757 µs 1.2766 µs]
                        change: [-0.9445% -0.7247% -0.5119%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
right shift/fast_split_overflowing_shr/256
                        time:   [662.83 ns 665.62 ns 668.21 ns]
                        change: [+0.2058% +0.5884% +0.9747%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

Footnotes

  1. this value concerns me...

@@ -165,7 +165,7 @@ impl<const LIMBS: usize> Uint<LIMBS> {
}

/// Borrow the limbs of this [`Uint`] mutably.
pub fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
pub const fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
Member

Nice!

src/uint/shr.rs Outdated
///
/// Returns `None` if `shift >= Self::BITS`.
pub const fn split_overflowing_shr(&self, shift: u32) -> ConstCtOption<Self> {
let (intra_limb_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
Member

FYI: clippy has an integer_division_remainder_used lint it would be nice to eventually turn on in this crate, if you can find a solution which doesn't use these operators

Contributor Author

done.

@tarcieri
Member

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

@erik-3milabs
Contributor Author

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

That was exactly what I was thinking 😅 @fjarri do you have a better proposal?

src/uint/shr.rs Outdated
///
/// Panics if `shift >= Limb::BITS`.
#[inline(always)]
const fn intra_limb_carrying_shr_internal(&self, shift: u32) -> Self {
Member

Nit: not being pub makes this implicitly internal

Contributor Author

Fixed.

@tarcieri
Copy link
Member

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

Comment on lines +121 to +125
/// Split `shift` into `shift % Limb::BITS` (its intra-limb-shift component), and
/// `shift / Limb::BITS` (its limb-shift component).
///
/// This function achieves this without using a division/remainder operation.
pub(crate) const fn decompose_shift(shift: u32) -> (u32, u32) {
Contributor Author

Given that this function is used in both shl.rs and shr.rs, is there perhaps a better spot to put this?

Contributor

It's literally 3 lines of code. Maybe just copy it around?
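Since Limb::BITS is a power of two (8 × the pointer width), the remainder and quotient reduce to a mask and a shift, which sidesteps the integer_division_remainder_used lint. A sketch matching the signature quoted above, with LIMB_BITS as a stand-in constant:

```rust
const LIMB_BITS: u32 = 64; // stand-in for Limb::BITS; assumed a power of two

/// Split `shift` into (shift % LIMB_BITS, shift / LIMB_BITS) without using
/// the division/remainder operators, relying on LIMB_BITS being a power of
/// two: the remainder is a mask, the quotient a right shift.
const fn decompose_shift(shift: u32) -> (u32, u32) {
    (shift & (LIMB_BITS - 1), shift >> LIMB_BITS.trailing_zeros())
}

fn main() {
    assert_eq!(decompose_shift(0), (0, 0));
    assert_eq!(decompose_shift(63), (63, 0));
    assert_eq!(decompose_shift(130), (2, 2));
    println!("ok");
}
```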

@erik-3milabs
Contributor Author

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

I still can't think of anything better than intra_limb_*. I made it private for now.

I went ahead and "mirrored" the implementation to replace the existing Uint::shl implementation.

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

@tarcieri One thing before we merge this:

On my machine, the benchmarks are fluctuating a teeny-tiny bit: it looks like

Uint::<LIMBS>::ONE.shr(Uint::<LIMBS>::BITS/2 + 10)

is consistently slower than

Uint::<LIMBS>::ONE.shr(10)

The difference is almost nothing (~0.5%) but it is there.
What is the tolerance this library works with?

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

Hmm, the compiler (partially) optimizes out my zero shift 🤔:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [55.765 ns 56.493 ns 57.146 ns]
                        change: [-1.1124% +0.0880% +1.3286%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
left shift/intra_limb_carrying_shl(1)/64
                        time:   [56.976 ns 57.773 ns 58.464 ns]
                        change: [-1.0465% +0.2521% +1.5104%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [56.803 ns 57.610 ns 58.300 ns]
                        change: [-1.2401% -0.0472% +1.2422%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(0)/128
                        time:   [103.58 ns 106.12 ns 108.38 ns]
                        change: [-3.7659% -0.9938% +1.7795%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(1)/128
                        time:   [143.28 ns 145.11 ns 146.72 ns]
                        change: [-1.4626% -0.2726% +0.9269%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [143.67 ns 145.68 ns 147.50 ns]
                        change: [-1.1970% +0.0733% +1.4213%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(0)/256
                        time:   [211.32 ns 216.86 ns 221.59 ns]
                        change: [-3.1180% -0.3510% +2.4557%] (p = 0.79 > 0.05)
                        No change in performance detected.
left shift/intra_limb_carrying_shl(1)/256
                        time:   [278.80 ns 282.88 ns 286.51 ns]
                        change: [-1.9677% -0.3737% +1.2130%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [279.81 ns 283.84 ns 287.33 ns]
                        change: [-1.3217% +0.2101% +1.8418%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
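One possible way to keep the compiler from constant-folding a shift it can prove is zero is to launder the shift amount through std::hint::black_box, rather than deriving an opaque zero by hand. A minimal stdlib-only sketch (the timing loop is illustrative, not a criterion benchmark):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Without black_box, a literal 0 lets the compiler fold `i >> 0` away
    // entirely; black_box keeps the value opaque, so the measured code path
    // is the same one that runs for nonzero shifts.
    let shift: u32 = black_box(0);
    let mut acc = 0u64;
    let start = Instant::now();
    for i in 0..1_000_000u64 {
        acc = acc.wrapping_add(black_box(i >> shift));
    }
    println!("acc = {acc}, took {:?}", start.elapsed());
}
```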

@erik-3milabs
Contributor Author

erik-3milabs commented Jan 30, 2025

full_limb_overflowing_shl might also not be as constant-time as we were hoping: full_limb_overflowing_shl(LIMBS-1) is consistently a (significant) outlier. How does that work, though? 🤔

left shift/full_limb_overflowing_shl(0)/64
                        time:   [133.76 ns 134.28 ns 134.75 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/64
                        time:   [132.82 ns 133.30 ns 133.74 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/full_limb_overflowing_shl(5)/64
                        time:   [133.07 ns 133.52 ns 133.96 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/64
                        time:   [134.69 ns 135.09 ns 135.48 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(0)/128
                        time:   [269.71 ns 271.83 ns 273.78 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(1)/128
                        time:   [269.01 ns 270.96 ns 272.70 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/128
                        time:   [269.63 ns 271.79 ns 273.74 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/128
                        time:   [275.54 ns 277.45 ns 279.22 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(0)/256
                        time:   [636.37 ns 638.88 ns 641.05 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/256
                        time:   [635.93 ns 638.91 ns 641.63 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/256
                        time:   [636.87 ns 639.65 ns 642.13 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(LIMBS-1)/256
                        time:   [647.82 ns 650.69 ns 653.35 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

@erik-3milabs
Contributor Author

I think this idea just went bust.

I benchmarked intra_limb_carrying_shl some more. It has become evident that this function does not execute in constant time. Perhaps it is the compiler, perhaps the CPU. Either way: it makes this code unusable.

I'm out of ideas for now. Unless one of you has an idea on preventing this, I propose we close this PR for now.

@erik-3milabs erik-3milabs marked this pull request as draft January 31, 2025 11:30
@erik-3milabs erik-3milabs mentioned this pull request Jan 31, 2025
@fjarri
Contributor

fjarri commented Jan 31, 2025

I don't really see anything that would account for variable-time execution. How are you benchmarking it, exactly?

@erik-3milabs
Contributor Author

@fjarri let's return to this.

I don't really see anything that would account for variable-time execution. How are you benchmarking it, exactly?

fn shl_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>) {
    // A value that is always zero, but that the compiler cannot prove to be
    // zero: a random value below 2^10, shifted right by 11 bits.
    let zero = Uint::<LIMBS>::random_mod(&mut OsRng, &Uint::ONE.shl(10).to_nz().unwrap()).shr(11);

    // `shift` is always zero, but opaque to the compiler.
    let shift = zero.as_limbs()[0].0 as u32;
    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(0)", LIMBS), |b| {
        b.iter_batched(
            || Uint::<LIMBS>::ONE,
            |x| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(Limb::BITS-1)", LIMBS), |b| {
        b.iter_batched(
            || (Uint::<LIMBS>::ONE, Limb::BITS-1),
            |(x, shift)| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
}

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");
    shl_benchmarks::<{ U4096::LIMBS }>(&mut group);
    shl_benchmarks::<{ U8192::LIMBS }>(&mut group);
    shl_benchmarks::<{ U16384::LIMBS }>(&mut group);
    shl_benchmarks::<{ U32768::LIMBS }>(&mut group);

    group.finish();
}

will give this output on my machine:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [63.015 ns 64.204 ns 65.263 ns]
                        change: [+1.4139% +3.7152% +5.9009%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [69.013 ns 69.628 ns 70.243 ns]
                        change: [-10.035% -8.3799% -6.8060%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
left shift/intra_limb_carrying_shl(0)/128
                        time:   [113.22 ns 116.19 ns 119.08 ns]
                        change: [-10.609% -5.6487% -1.1644%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [149.33 ns 152.70 ns 156.34 ns]
                        change: [-6.8919% -4.3440% -1.5910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
left shift/intra_limb_carrying_shl(0)/256
                        time:   [223.04 ns 231.61 ns 240.23 ns]
                        change: [-4.8209% -2.3135% +0.2843%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) low mild
  8 (8.00%) high mild
  8 (8.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [288.18 ns 292.85 ns 297.30 ns]
                        change: [-12.275% -8.1139% -4.3964%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) low mild
  10 (10.00%) high mild
  4 (4.00%) high severe
left shift/intra_limb_carrying_shl(0)/512
                        time:   [506.98 ns 516.80 ns 526.50 ns]
                        change: [-2.7622% -1.0253% +0.7468%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 18 outliers among 100 measurements (18.00%)
  5 (5.00%) low severe
  6 (6.00%) high mild
  7 (7.00%) high severe
left shift/intra_limb_carrying_shl(Limb::BITS-1)/512
                        time:   [634.03 ns 643.39 ns 652.63 ns]
                        change: [-9.6863% -7.9263% -6.0933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  8 (8.00%) high mild
  5 (5.00%) high severe

Based on these results, I'm led to believe that either

  1. there is something wrong with the way I benchmarked this code, or
  2. the compiler somehow understands that my non-constant zero is actually zero and optimizes these operations away, or
  3. the CPU executes this algorithm in non-constant time

What am I missing?
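Hypothesis 2 can be probed directly: if the compiler can prove the shift is zero at the call site, it is free to specialize the whole loop away, and `black_box` on the *output* alone does not prevent that. A minimal sketch (illustrative only, using a simplified fixed 4-limb shift rather than the PR's code) of making the shift amount itself opaque to the optimizer:

```rust
use core::hint::black_box;

// Simplified 4-limb carrying left shift, for illustration only.
fn shl_words(limbs: &[u64; 4], shift: u32) -> [u64; 4] {
    debug_assert!(shift < 64);
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for (o, &l) in out.iter_mut().zip(limbs) {
        *o = (l << shift) | carry;
        // High bits displaced into the next limb; the double shift avoids an
        // invalid shift amount of 64 when `shift == 0`.
        carry = (l >> 1) >> (63 - shift);
    }
    out
}

fn main() {
    let x = [1u64, 2, 3, 4];
    // Transparent: the optimizer can see `0` and may fold the shift away.
    let transparent = shl_words(&x, 0);
    // Opaque: `black_box` forces the optimizer to treat the shift as an
    // arbitrary runtime value, so the generated code cannot specialize on it.
    let opaque = shl_words(&x, black_box(0));
    assert_eq!(transparent, opaque); // identical results; only codegen differs
}
```

If the timing gap disappears once the shift amount is routed through `black_box` in the benchmark setup rather than only around the result, that would point at hypothesis 2 rather than the CPU.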

@fjarri
Copy link
Contributor

fjarri commented Mar 12, 2025

I don't think this is conclusive; this could really just be variation caused by OS scheduling. https://crates.io/crates/dudect-bencher gives more accurate results, but it's not perfect either. I remember there was some tool mentioned during the audit that analyzes the actual generated assembly, but I cannot recall at the moment what it was called.

@dvdplm
Copy link
Contributor

dvdplm commented Mar 12, 2025

@erik-3milabs Maybe you can use a seedable RNG so that each run uses the exact same numbers: if you still see variability you can't account for, then that'll be the OS scheduling and similar shenanigans. Once you know how large this "unaccountable variability" is you can then vary the seed and check that the performance is still stable between measurements (or at least never larger than the "unaccountable variability").

@tarcieri
Copy link
Member

I remember there was some tool mentioned during the audit that analyzes the actual generated assembly, but I cannot recall at the moment what it was called.

I started looking into BINSEC and cargo checkct for these purposes but ran into some build problems

If this is really the problem, for example:

the compiler somehow understands that my non-constant zero is actually zero and optimizes these operations away,

...it caught a similar issue in curve25519-dalek where the compiler spotted an opportunity to insert a branch to optimize away an all-zero mask

@erik-3milabs
Copy link
Contributor Author

@fjarri @dvdplm please let me know what you think of the following.

I re-did the experiment, this time using a seedable RNG to obtain my non-const zero. The benchmark code is as follows:

#[inline]
fn shl_zero_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>, seed: u64) {
    let mut rng = ChaCha8Rng::seed_from_u64(seed);

    // note: is zero
    let zero = Uint::<LIMBS>::random_mod(&mut rng, &Uint::ONE.shl(10).to_nz().unwrap()).shr(11);

    // note: always zero
    let shift = zero.as_limbs()[0].0 as u32;

    group.bench_function(BenchmarkId::new("intra_limb_carrying_shl(0)", LIMBS*100 + seed as usize), |b| {
        b.iter_batched(
            || Uint::<LIMBS>::ONE,
            |x| black_box(x.intra_limb_carrying_shl(shift)),
            BatchSize::SmallInput,
        )
    });
}

#[inline]
fn shl_max_benchmarks<const LIMBS: usize>(group: &mut BenchmarkGroup<WallTime>) {
    group.bench_function(
        BenchmarkId::new("intra_limb_carrying_shl(MAX)", LIMBS),
        |b| {
            b.iter_batched(
                || (Uint::<LIMBS>::ONE, Limb::BITS - 1),
                |(x, shift)| black_box(x.intra_limb_carrying_shl(shift)),
                BatchSize::SmallInput,
            )
        },
    );
}

With these benchmark functions, I first ran the following experiment 7 times (I tried 10, but I interrupted my machine, garbling the results of the eighth round, so I decided to discard them):

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");
    
    shl_zero_benchmarks::<{ U256::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U512::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U1024::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U2048::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U4096::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U8192::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U16384::LIMBS }>(&mut group, 0);
    shl_zero_benchmarks::<{ U32768::LIMBS }>(&mut group, 0);
    shl_max_benchmarks::<{ U256::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U512::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U1024::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U2048::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U4096::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U8192::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U16384::LIMBS }>(&mut group);
    shl_max_benchmarks::<{ U32768::LIMBS }>(&mut group);
}

Dividing the difference between the minimum and maximum by the number of limbs yields the following table:

| LIMBS | ZERO | MAX  |
|------:|-----:|-----:|
|     4 | 0.02 | 0.02 |
|     8 | 0.11 | 0.02 |
|    16 | 0.09 | 0.11 |
|    32 | 0.02 | 0.03 |
|    64 | 0.04 | 0.03 |
|   128 | 0.02 | 0.02 |
|   256 | 0.03 | 0.01 |
|   512 | 0.04 | 0.02 |

From this, I conclude that the "variability" due to CPU scheduling on my machine never exceeds 0.12 ns per limb.

Then, I performed the second experiment:

fn bench_shl(c: &mut Criterion) {
    let mut group = c.benchmark_group("left shift");

    for i in 0..10 {
        shl_zero_benchmarks::<{ U256::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U512::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U1024::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U2048::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U4096::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U8192::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U16384::LIMBS }>(&mut group, i);
        shl_zero_benchmarks::<{ U32768::LIMBS }>(&mut group, i);
    }
}

Again, I aggregated the results by integer size, and computed (max() - min())/#limbs. I've included the results below.

| LIMBS | ZERO |
|------:|-----:|
|     4 | 0.15 |
|     8 | 0.04 |
|    16 | 0.01 |
|    32 | 0.05 |
|    64 | 0.04 |
|   128 | 0.03 |
|   256 | 0.01 |
|   512 | 0.03 |

The highest value of "0.15" for four limbs exceeds the previously found bound. As such, I think it is safe to conclude that the variability due to the RNG + CPU scheduling is less than 0.16 ns per limb.

We now look at the maxima obtained for the zero-shift and the minima obtained for the max-shift during the first experiment. These are as follows:

| #LIMBS | MAX ZERO-SHIFT | MIN MAX-SHIFT |   DIFF | DIFF/LIMBS |
|-------:|---------------:|--------------:|-------:|-----------:|
|      4 |          3.355 |         3.353 |   0.00 |       0.00 |
|      8 |         15.068 |        16.827 |   1.76 |       0.22 |
|     16 |         23.633 |        21.433 |  -2.20 |      -0.14 |
|     32 |         34.907 |        47.844 |  12.94 |       0.40 |
|     64 |         58.580 |        69.848 |  11.27 |       0.18 |
|    128 |        119.660 |       150.580 |  30.92 |       0.24 |
|    256 |        236.170 |       295.680 |  59.51 |       0.23 |
|    512 |        545.000 |       652.750 | 107.75 |       0.21 |

For 32 or more limbs, the difference between these two measurements clearly exceeds the aforementioned variability due to RNG or CPU scheduling. As such, I'm confident that this indicates non-constant-time behavior (possibly due to caching?).

WDYT?

@dvdplm
Copy link
Contributor

dvdplm commented Mar 20, 2025

First of all, I am by no means a constant-time expert and I am definitely out of my depth here. The only way to actually know what is going on is to check the assembly.
Doing CT in Rust is a "best effort" thing (true for any other language that isn't assembly, probably).

If we posit that your measurements prove actual non-CT, the question becomes "how much non-CT does an adversary really need to be able to pull off an attack?". I really don't know the answer to that, beyond saying "the less we have the better".

If your new impl is "more CT" than the old one, and also faster, I think that is all we need to know and that we should merge the PR.

As a tangential note I read Pornin's paper on "Optimized Binary GCD for Modular Inversion" yesterday and found this note on page 7:

The last Intel CPU on which shifts were not constant-time with regard to the shift count was the Pentium IV, in the early 2000s, but that CPU does not offer lzcnt.

If that holds for Arm and AMD CPUs as well, wouldn't that indicate that Limb-level shifts are guaranteed to be CT?
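For what it's worth, even when the hardware shift instruction is constant-time, the compiler can still insert a value-dependent branch around it. One defense (a sketch for illustration, not what crypto-bigint does) is to make the word-level variable shift branchless by construction, selecting each partial shift with an all-ones/all-zeros mask:

```rust
/// Branchless variable left shift of a 64-bit word: process the shift amount
/// one bit at a time, selecting between `r` and `r << (1 << i)` with a mask
/// derived from bit `i` of `shift`. There is no value-dependent branch for
/// the compiler to reintroduce. (A sketch only, not crypto-bigint's code.)
fn ct_shl_word(x: u64, shift: u32) -> u64 {
    debug_assert!(shift < 64);
    let mut r = x;
    for i in 0..6 {
        // mask = u64::MAX if bit `i` of `shift` is set, else 0
        let mask = (((shift >> i) & 1) as u64).wrapping_neg();
        // Both arms are always computed; the mask selects one of them.
        r = (r & !mask) | ((r << (1u32 << i)) & mask);
    }
    r
}

fn main() {
    assert_eq!(ct_shl_word(1, 5), 32);
    assert_eq!(ct_shl_word(3, 63), 3u64 << 63);
}
```

The cost is six mask-select steps per word instead of one shift, so this only pays off if the plain shift really is being compiled into a branch.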

@erik-3milabs
Copy link
Contributor Author

Whoops, wrong button 🙈

If your new impl is "more CT" than the old one, and also faster, I think that is all we need to know and that we should merge the PR.

Perhaps this got lost in communication, but I'm only claiming that the implementation introduced in this PR exhibits non-constant-time behavior; running the same benchmark suite on the master branch shows no discernible difference in runtime between shifting by 0 or by Uint::<LIMBS>::BITS - 1.

As a tangential note I read Pornin's paper on "Optimized Binary GCD for Modular Inversion" yesterday and found this note on page 7:

The last Intel CPU on which shifts were not constant-time with regard to the shift count was the Pentium IV, in the early 2000s, but that CPU does not offer lzcnt.

If that holds for Arm and AMD CPUs as well, wouldn't that indicate that Limb-level shifts are guaranteed to be CT?

This is one of the reasons why I started this discussion. I, too, was under the impression that shifts were constant-time with respect to the shift amount, so I was surprised to find that this code suggests the contrary. Perhaps the compiler discards several operations based on the value of the shift?

I don't know how to rewrite the code such that it both (i) preserves the performance improvement and (ii) runs in constant time. Unless anybody else has ideas, I propose to close this PR.
