Trust Quorum: Prepare phase retries and testing #8000

andrewjstone · 2025-04-17T20:49:53Z

This PR lets a trust quorum node handle prepare acknowledgements and send retries as time advances via `Node::tick` calls.

The vast majority of the code is test code. `coordinator.rs` is the start of a property-based test of the behavior of a node that is coordinating reconfigurations. The coordinating node itself is the system under test (SUT), and an abstract model keeps enough information to assert properties about the behavior of the SUT. A `TestInput` is generated which contains an initial configuration for the coordinating node and a generated list of abstract `Action`s to be executed by the test. Each action has a corresponding method on the `TestState` for handling it. These methods update the model state, then the SUT state, and then verify any properties they can.

The `Action` enum is going to grow in the next few PRs so that reconfigurations beyond the initial configuration will run and messages can be dropped. These additions will correspond with an expansion of the `Node` implementation to allow recovering key shares from past committed configurations and to handle `Commit` and `Cancel` API calls, which are ultimately triggered from Nexus, as described in RFD 238.
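The model/SUT split described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `coordinator.rs`: the `Sut`, `Model`, and `DeliverAck` names, the `u8` peer ids, and the millisecond-based `tick` are all assumptions made for the example, and the actions are supplied by hand rather than generated by proptest.

```rust
use std::collections::BTreeSet;

/// Hypothetical abstract actions; the real `Action` enum is not shown here.
#[derive(Debug, Clone)]
pub enum Action {
    /// Advance time by this many milliseconds.
    Tick(u64),
    /// Deliver a prepare acknowledgement from the given peer.
    DeliverAck(u8),
}

/// Toy system under test: a coordinator waiting for prepare acks.
pub struct Sut {
    pending: BTreeSet<u8>,
    elapsed_ms: u64,
    retry_timeout_ms: u64,
}

impl Sut {
    /// Advance time and return the peers that should be re-sent a prepare.
    pub fn tick(&mut self, ms: u64) -> Vec<u8> {
        self.elapsed_ms += ms;
        if self.elapsed_ms < self.retry_timeout_ms {
            return Vec::new();
        }
        self.elapsed_ms = 0;
        self.pending.iter().copied().collect()
    }

    pub fn handle_ack(&mut self, from: u8) {
        self.pending.remove(&from);
    }
}

/// Abstract model: just enough state to predict what the SUT should do.
pub struct Model {
    unacked: BTreeSet<u8>,
    since_last_send_ms: u64,
}

pub struct TestState {
    sut: Sut,
    model: Model,
    retry_timeout_ms: u64,
}

impl TestState {
    pub fn new(peers: &[u8], retry_timeout_ms: u64) -> Self {
        let set: BTreeSet<u8> = peers.iter().copied().collect();
        TestState {
            sut: Sut { pending: set.clone(), elapsed_ms: 0, retry_timeout_ms },
            model: Model { unacked: set, since_last_send_ms: 0 },
            retry_timeout_ms,
        }
    }

    /// Run every action, stopping at the first violated property.
    pub fn run(&mut self, actions: &[Action]) -> Result<(), String> {
        for action in actions {
            match action {
                Action::Tick(ms) => self.action_tick(*ms)?,
                Action::DeliverAck(from) => self.action_deliver_ack(*from)?,
            }
        }
        Ok(())
    }

    fn action_tick(&mut self, ms: u64) -> Result<(), String> {
        // Update the model first and derive the expected retries from it.
        self.model.since_last_send_ms += ms;
        let expected: Vec<u8> =
            if self.model.since_last_send_ms >= self.retry_timeout_ms {
                self.model.since_last_send_ms = 0;
                self.model.unacked.iter().copied().collect()
            } else {
                Vec::new()
            };
        // Then drive the SUT and check the property.
        let actual = self.sut.tick(ms);
        if actual == expected {
            Ok(())
        } else {
            Err(format!("retries {actual:?} != expected {expected:?}"))
        }
    }

    fn action_deliver_ack(&mut self, from: u8) -> Result<(), String> {
        self.model.unacked.remove(&from);
        self.sut.handle_ack(from);
        if self.sut.pending == self.model.unacked {
            Ok(())
        } else {
            Err(format!("pending set diverged after ack from {from}"))
        }
    }
}
```

In a property-based test the `Vec<Action>` passed to `TestState::run` would come from a generator, so each failing case shrinks to a short action list that reproduces the divergence between model and SUT.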

This is the initial commit of the "real" trust quorum code as described in RFD 238. This commit provides some of the foundation needed to implement the trust quorum reconfiguration protocol. The protocol itself will be implemented in a `Node` type in a [sans-io](https://sans-io.readthedocs.io/) style with property-based tests for the full protocol, similar to LRTQ. Async code will utilize the protocol, forward messages over sprockets channels, and handle requests from Nexus. This initial code was split out to keep the PR small, although it has already been used in some preliminary protocol code that will come in a follow-up PR.

This change removes the requirement to carry encrypted rack shares in every `Prepare` message. `gfss` can generate them given enough encrypted shares, because its interpolate method is more generic than other implementations and doesn't only calculate the secret. It also includes fixes from other review comments and removes the `Error` type altogether, as it is no longer used.

This code builds upon #7859, which itself builds upon #7891. Those need to be merged in order first. A `Node` is the sans-io entity driving the trust quorum protocol. This PR starts the work of creating the `Node` type and coordinating an initial configuration. There's a property-based test that generates an initial `ReconfigureMsg`, which is then used to call `Node::coordinate_reconfiguration`. That call internally sets up a `CoordinatorState` and creates the messages to be sent. We verify that a new `PersistentState` is returned and that messages are sent. This needs a bit more documentation and testing, so it's still WIP.


andrewjstone · 2025-04-17T20:55:57Z

trust-quorum/src/persistent_state.rs

@@ -18,9 +18,6 @@ use std::collections::BTreeMap;
 /// All the persistent state for this protocol
 #[derive(Debug, Clone, Serialize, Deserialize, Default)]
 pub struct PersistentState {
-    // Ledger generation


There's no reason that the trust quorum protocol code should know how the persistent state is stored.

trust-quorum/tests/coordinator.rs

sunshowers

Looks good, just some minor comments.

sunshowers · 2025-04-18T21:17:26Z

trust-quorum/src/coordinator_state.rs

    /// When the reconfiguration started
+    #[allow(unused)]


Suggested change

-    #[allow(unused)]
+    #[expect(unused)]

Presumably this is useful for debugging.

I didn't even know about expect until your last PR. Thanks!
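For readers who have not met `#[expect]` either, here is a small sketch of the suggestion. `#[expect(...)]` suppresses a lint like `#[allow(...)]` does, but the compiler warns if the expected lint never fires, so the attribute cannot silently outlive the code it was written for. It requires Rust 1.81 or later. The `CoordinatorSketch` type and its fields are hypothetical and only stand in for the real `CoordinatorState`.

```rust
use std::time::Instant;

/// Hypothetical stand-in for a coordinator's in-memory state.
pub struct CoordinatorSketch {
    /// When the reconfiguration started. Nothing reads this yet, so the
    /// unused-field lint is expected to fire. Once some code starts reading
    /// the field, the compiler reports the expectation as unfulfilled,
    /// which is the cue to delete the attribute.
    #[expect(unused)]
    start_time: Instant,
    /// Number of retries sent so far.
    retries: u32,
}

impl CoordinatorSketch {
    pub fn new(now: Instant) -> Self {
        CoordinatorSketch { start_time: now, retries: 0 }
    }

    /// Record one retry and return the running total.
    pub fn record_retry(&mut self) -> u32 {
        self.retries += 1;
        self.retries
    }
}
```

With `#[allow(unused)]` the same code compiles identically, but the attribute would stay in place without complaint after the field gains a reader.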

sunshowers · 2025-04-18T21:18:16Z

trust-quorum/src/coordinator_state.rs

@@ -49,6 +52,7 @@ impl CoordinatorState {
    /// Return the newly constructed `CoordinatorState` along with this node's
    /// `PrepareMsg` so that it can be persisted.
    pub fn new_uninitialized(
+        log: Logger,


Should this add a component context via log.new indicating that it is the coordinator_state component?

Done in 50ebf64

sunshowers · 2025-04-18T21:18:41Z

trust-quorum/src/coordinator_state.rs

@@ -86,6 +90,7 @@ impl CoordinatorState {

    /// A reconfiguration from one group to another
    pub fn new_reconfiguration(
+        log: Logger,


Same q here -- add component context?

Done in 50ebf64

sunshowers · 2025-04-18T21:22:11Z

trust-quorum/src/coordinator_state.rs

@@ -101,18 +106,21 @@ impl CoordinatorState {
            new_shares,
        };

-        Ok(CoordinatorState::new(now, msg, config, op))
+        Ok(CoordinatorState::new(log, now, msg, config, op))
    }

    // Intentionallly private!


Suggested change

-    // Intentionallly private!
+    // Intentionally private!

I guess this is because the public constructors do error checking? If so could this comment briefly mention that?

Done in 50ebf64
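The pattern the reviewer is guessing at, public constructors that validate and a private one that does not, might look like the sketch below. The `Config` type, the `ConfigError` variants, and the rule that the threshold may not exceed the member count are assumptions for illustration; only the minimum cluster size of 3 and the minimum threshold of 2 come from this review.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    TooFewMembers,
    ThresholdTooLow,
    ThresholdTooHigh,
}

/// Hypothetical configuration with invariants on its fields.
#[derive(Debug)]
pub struct Config {
    members: usize,
    threshold: usize,
}

impl Config {
    /// Public constructor: checks every invariant before building a value.
    pub fn new_checked(
        members: usize,
        threshold: usize,
    ) -> Result<Self, ConfigError> {
        if members < 3 {
            return Err(ConfigError::TooFewMembers);
        }
        if threshold < 2 {
            return Err(ConfigError::ThresholdTooLow);
        }
        if threshold > members {
            return Err(ConfigError::ThresholdTooHigh);
        }
        Ok(Self::new(members, threshold))
    }

    // Intentionally private: performs no validation, so it must only be
    // reachable through public constructors that have already checked the
    // invariants.
    fn new(members: usize, threshold: usize) -> Self {
        Config { members, threshold }
    }

    pub fn threshold(&self) -> usize {
        self.threshold
    }

    pub fn members(&self) -> usize {
        self.members
    }
}
```

Keeping the unchecked constructor private means code outside the module can only obtain a `Config` through `new_checked`, so every value it sees has passed the checks.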

sunshowers · 2025-04-18T21:22:56Z

trust-quorum/src/coordinator_state.rs

@@ -137,39 +145,76 @@ impl CoordinatorState {
    // This method is "in progress" - allow unused parameters for now
    #[allow(unused)]


Suggested change

-    #[allow(unused)]
+    #[expect(unused)]

Done in 50ebf64

sunshowers · 2025-04-18T21:44:06Z

trust-quorum/tests/coordinator.rs

+    #[weight(50)]
+    Tick(
+        #[strategy(
+            (RETRY_TIMEOUT_MS/4..RETRY_TIMEOUT_MS)


Should this go past RETRY_TIMEOUT_MS to something like 5/4 times that value?

Great idea!

Done in 50ebf64
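A short sketch of why the upper bound matters. Because `a..b` is a half-open range, every tick drawn from `RETRY_TIMEOUT_MS/4..RETRY_TIMEOUT_MS` is strictly shorter than the timeout, so a retry can only ever be triggered by an accumulation of ticks. The value of `RETRY_TIMEOUT_MS`, the helper names, and the assumption that a retry fires once elapsed time is at least the timeout are all made up for the example; the exact range adopted in 50ebf64 is not shown here.

```rust
use std::ops::Range;

/// Hypothetical timeout value, chosen only to make the arithmetic easy.
pub const RETRY_TIMEOUT_MS: u64 = 100;

/// The tick range as originally written in the strategy.
pub fn original_range() -> Range<u64> {
    RETRY_TIMEOUT_MS / 4..RETRY_TIMEOUT_MS
}

/// One way the widened range could look, using the suggested 5/4 factor.
pub fn widened_range() -> Range<u64> {
    RETRY_TIMEOUT_MS / 4..RETRY_TIMEOUT_MS * 5 / 4
}

/// True if at least one tick in `range` reaches the timeout on its own.
pub fn single_tick_can_expire(range: &Range<u64>) -> bool {
    range.clone().any(|ms| ms >= RETRY_TIMEOUT_MS)
}
```

With these numbers the original range covers 25 to 99 ms and the widened one 25 to 124 ms, so only the widened range exercises the case where a single `Tick` action crosses the retry deadline.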

sunshowers · 2025-04-18T21:44:23Z

trust-quorum/tests/coordinator.rs

+    Tick(
+        #[strategy(
+            (RETRY_TIMEOUT_MS/4..RETRY_TIMEOUT_MS)
+            .prop_map(|ms| Duration::from_millis(ms))


I think this can just be:

Suggested change

-            .prop_map(|ms| Duration::from_millis(ms))
+            .prop_map(Duration::from_millis)

Thanks! Fixed in 50ebf64
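The same simplification applies to any adapter that takes a function of one argument. The sketch below uses the standard library's `Iterator::map` rather than proptest's `prop_map` so that it runs without extra dependencies; the two helper function names are hypothetical.

```rust
use std::time::Duration;

/// Wraps the call in a closure that only forwards its argument.
pub fn with_closure(ms: &[u64]) -> Vec<Duration> {
    ms.iter().copied().map(|ms| Duration::from_millis(ms)).collect()
}

/// Passes the function path directly; the behavior is identical.
pub fn with_fn_path(ms: &[u64]) -> Vec<Duration> {
    ms.iter().copied().map(Duration::from_millis).collect()
}
```

Both functions return the same values. The second form is shorter and is what Clippy's `redundant_closure` lint recommends.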

sunshowers · 2025-04-18T21:44:53Z

trust-quorum/tests/coordinator.rs

+    /// still be duplicated due to the shift implementation used. Therefore we
+    /// instead just choose from a constrained set of usize values that we can
+    /// use directly as indexes into our fixed size structure for all tests.


Yeah, this makes sense.

Thanks! Your idea to choose from a fixed size universe was super helpful!

sunshowers · 2025-04-18T21:45:47Z

trust-quorum/tests/coordinator.rs

+    /// very efficient as opposed to a `prop_flat_map`. When we calculate the
+    /// threshold from the index we use max(2, Index), since the minimum
+    /// threshold is always 2.


This does bias the index slightly away from being uniformly at random -- I guess it isn't a huge deal but another way to do it might be 2 + index.index(...).

I think if we do it that way, we could end up going over the max index. I'm not too worried about this distribution :D

sunshowers · 2025-04-18T21:46:35Z

trust-quorum/tests/coordinator.rs

+    let threshold =
+        Threshold(usize::max(2, config.threshold.index(members.len())) as u8);


Hmm I thought index.index panicked if the value passed in was 0.

It does, but we never generate a set of members smaller than 3 (MIN_CLUSTER_SIZE).

To clarify, we just don't want the index to be 0 or 1. The length of the set is never smaller than 3.
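The trade-off discussed in the last two threads can be made concrete with a small sketch. `scaled_index` is a hypothetical stand-in that maps a raw random value into `0..size`; it is not proptest's `Index` implementation, and it only mirrors the documented behavior that the result is below `size` and that a size of 0 panics.

```rust
/// Hypothetical stand-in for an index generator: maps `raw` into `0..size`.
pub fn scaled_index(raw: u64, size: usize) -> usize {
    assert!(size > 0, "cannot index into an empty collection");
    ((raw as u128 * size as u128) >> 64) as usize
}

/// The approach used in the test: clamp the index up to the minimum
/// threshold of 2. The result never exceeds `members - 1`, but indexes 0, 1
/// and 2 all collapse onto 2, which biases the distribution.
pub fn threshold_clamped(raw: u64, members: usize) -> usize {
    usize::max(2, scaled_index(raw, members))
}

/// The alternative raised in review: offset the index by 2. This spreads
/// values more evenly, but the result can exceed the number of members.
pub fn threshold_offset(raw: u64, members: usize) -> usize {
    2 + scaled_index(raw, members)
}
```

For a 3-member set the clamped version always yields 2, while the offset version can yield 4, which is more than the number of members. That is the overflow concern raised in the reply, and avoiding it would need an adjusted range (for example indexing into `members - 2` values), which is a further assumption rather than something proposed in the thread.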

andrewjstone and others added 30 commits March 21, 2025 23:18

clippy

b350c50

small fixes

cb75951

WIP: Shamir Secret Sharing over GF(2^8)

f081ed0

Some review cleanup

5ebc785

make split_secret_impl which takes an rng

6f0e276

pass in a TestRng for proptests

3e64fac

small-cleanup

d500b1a

box up some shares

621569d

Merge branch 'main' into ajs/shamir

05ec36c

Boxes instead of Vecs for Polynomial and Secret

47d056e

hakari

7de5edf

workspace lints and other cargo stuff

7f90eef

docs

9c2b9cd

Rework API and add share creation test

b31e001

fix comment for fn rename

0576c82

clippy

3d31277

Merge branch 'ajs/shamir' into ajs/realtq-1

7fdd238

hakari

ff68bfe

clippy

43f0b96

workspace-hack

08b0e18

workspace-deps

bc07606

clippy and docs

20652fc

Add another test

12b0023

Another test

0013d01

a few more error test conditions

0bc6136

remove unnecessary regression

921aadf

comment about sans-io

473f7a5

andrewjstone added 4 commits April 7, 2025 23:43

comment cleanup

b1b3976

review fixes

34fa618

Store coordinator_id in ValidatedReconfigureMsg

a5323a5

andrewjstone requested a review from sunshowers April 17, 2025 20:49

self review fixes

1d979ce

sunshowers approved these changes Apr 18, 2025

Base automatically changed from ajs/realtq-2 to main April 21, 2025 22:28

andrewjstone added 3 commits April 21, 2025 23:04

Merge branch 'main' into ajs/realtq-3

71810cd

review fixes

50ebf64

clippy

d501e43

andrewjstone enabled auto-merge (squash) April 22, 2025 00:21

andrewjstone merged commit 0bebe3a into main Apr 22, 2025
16 checks passed

andrewjstone deleted the ajs/realtq-3 branch April 22, 2025 01:35
