-
Notifications
You must be signed in to change notification settings - Fork 43
Trust Quorum: Prepare phase retries and testing #8000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is the initial commit of the "real" trust quorum code as described in RFD 238. This commit provides some of the foundation needed to implement the trust quorum reconfiguration protocol. The protocol itself will be implemented in a `Node` type in a [sans-io](https://sans-io.readthedocs.io/) style with property based tests for the full protocol, similar to LRTQ. Async code will utilize the protocol and forward messages over sprockets channels, and handle requests from Nexus. This initial code was split out to keep the PR small, although it has been used in some preliminary protocol code already that will come in a follow up PR.
This change removes the requirement to carry encrypted rack shares in every `Prepare` message. gfss can generate them given enough encrypted shares because its interpolate method is more generic than other implementations and doesn't only calculate the secret. Also fix up based on some other review comments. This removes the `Error` type altogether as it's no longer used.
This code builds upon #7859, which itself builds upon #7891. Those need to be merged in order first. A `Node` is the sans-io entity driving the trust quorum protocol. This PR starts the work of creating the `Node` type and coordinating an initial configuration. There's a property based test that generates an initial `ReconfigureMsg` that is then used to call `Node::coordinate_reconfiguration`, which will internally setup a `CoordinatorState` and create a bunch of messages to be sent. We verify that a new `PersistentState` is returned and that messages are sent. This needs a bit more documentation and testing, so it's still WIP.
This PR implements the ability for a trust quorum node to handle prepare acknowledgements and send retries when time has advanced via `Node::tick` calls. The vast majority of the code is test code. `coordinator.rs` is the start of a property based test to test the behavior of a node that is coordinating reconfigurations. The coordinating node itself is the system under test (SUT), and there is an abstract model that keeps enough information to allow asserting properties about the behavior of the SUT. A `TestInput` is generated which contains an initial configuration for the coordinating node and a generated list of abstract `Action`s to be executed by the test. Each action has a corresponding method on the `TestState` for handling it. These methods update the model state, SUT state, and then verify any properties they can. The `Action` enum is going to grow in the next few PRs such that reconfigurations beyond the initial configuration will run and messages can be dropped. These will correspond with an expansion of the `Node` implementation to allow recovering key shares from past committed configuration and the ability to handle `Commit` and `Cancel` API calls which ultimately are triggered from Nexus, as described in RFD 238.
@@ -18,9 +18,6 @@ use std::collections::BTreeMap; | |||
/// All the persistent state for this protocol | |||
#[derive(Debug, Clone, Serialize, Deserialize, Default)] | |||
pub struct PersistentState { | |||
// Ledger generation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason that the trust quorum protocol code should know how the persistent state is stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just some minor comments.
/// When the reconfiguration started | ||
#[allow(unused)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[allow(unused)] | |
#[expect(unused)] |
Presumably this is useful for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't even know about expect
until your last PR. Thanks!
@@ -49,6 +52,7 @@ impl CoordinatorState { | |||
/// Return the newly constructed `CoordinatorState` along with this node's | |||
/// `PrepareMsg` so that it can be persisted. | |||
pub fn new_uninitialized( | |||
log: Logger, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this add a component context via log.new
indicating that it is the coordinator_state
component?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 50ebf64
@@ -86,6 +90,7 @@ impl CoordinatorState { | |||
|
|||
/// A reconfiguration from one group to another | |||
pub fn new_reconfiguration( | |||
log: Logger, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same q here -- add component
context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 50ebf64
@@ -101,18 +106,21 @@ impl CoordinatorState { | |||
new_shares, | |||
}; | |||
|
|||
Ok(CoordinatorState::new(now, msg, config, op)) | |||
Ok(CoordinatorState::new(log, now, msg, config, op)) | |||
} | |||
|
|||
// Intentionallly private! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Intentionallly private! | |
// Intentionally private! |
I guess this is because the public constructors do error checking? If so could this comment briefly mention that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 50ebf64
@@ -137,39 +145,76 @@ impl CoordinatorState { | |||
// This method is "in progress" - allow unused parameters for now | |||
#[allow(unused)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[allow(unused)] | |
#[expect(unused)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 50ebf64
trust-quorum/tests/coordinator.rs
Outdated
#[weight(50)] | ||
Tick( | ||
#[strategy( | ||
(RETRY_TIMEOUT_MS/4..RETRY_TIMEOUT_MS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go past RETRY_TIMEOUT_MS
to something like 5/4 times that value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 50ebf64
trust-quorum/tests/coordinator.rs
Outdated
Tick( | ||
#[strategy( | ||
(RETRY_TIMEOUT_MS/4..RETRY_TIMEOUT_MS) | ||
.prop_map(|ms| Duration::from_millis(ms)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be
.prop_map(|ms| Duration::from_millis(ms)) | |
.prop_map(Duration::from_millis) |
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed in 50ebf64
/// still be duplicated due to the shift implementation used. Therefore we | ||
/// instead just choose from a constrained set of usize values that we can | ||
/// use directly as indexes into our fixed size structure for all tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Your idea to choose from a fixed size universe was super helpful!
/// very efficient as opposed to a `prop_flat_map`. When we calculate the | ||
/// threshold from the index we use max(2, Index), since the minimum | ||
/// threshold is always 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does bias the index slightly away from being uniformly at random -- I guess it isn't a huge deal but another way to do it might be 2 + index.index(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we do it that way, we could end up going over the max index. I'm not too worried about this distribution :D
let threshold = | ||
Threshold(usize::max(2, config.threshold.index(members.len())) as u8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I thought index.index
panicked if the value passed in was 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, but we never generate a set of members smaller than 3 (MIN_CLUSTER_SIZE
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, we just don't want the index to be 0 or 1. The length of the set is never smaller than 3.
This PR implements the ability for a trust quorum node to handle prepare acknowledgements and send retries when time has advanced via
Node::tick
calls.The vast majority of the code is test code.
coordinator.rs
is the start of a property based test to test the behavior of a node that is coordinating reconfigurations. The coordinating node itself is the system under test (SUT), and there is an abstract model that keeps enough information to allow asserting properties about the behavior of the SUT. ATestInput
is generated which contains an initial configuration for the coordinating node and a generated list of abstractAction
s to be executed by the test. Each action has a corresponding method on theTestState
for handling it. These methods update the model state, SUT state, and then verify any properties they can.The
Action
enum is going to grow in the next few PRs such that reconfigurations beyond the initial configuration will run and messages can be dropped. These will correspond with an expansion of theNode
implementation to allow recovering key shares from past committed configuration and the ability to handleCommit
andCancel
API calls which ultimately are triggered from Nexus, as described in RFD 238.