Skip to content

Trust Quorum: Start implementing Node #7929

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Apr 21, 2025
Merged

Trust Quorum: Start implementing Node #7929

merged 34 commits into from
Apr 21, 2025

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Apr 5, 2025

This code builds upon #7859, which itself builds upon #7891. Those need to be merged in order first.

A Node is the sans-io entity driving the trust quorum protocol. This PR starts the work of creating the Node type and coordinating an initial configuration.

There's a property based test that generates an initial ReconfigureMsg that is then used to call Node::coordinate_reconfiguration, which will internally setup a CoordinatorState and create a bunch of messages to be sent. We verify that a new PersistentState is returned and that messages are sent.

Some of validation code is also tested. We test a subset of error conditions, but ensure all the error check methods themselves actually get called among the checks we choose.

andrewjstone and others added 24 commits March 21, 2025 23:18
This is the initial commit of the "real" trust quorum code as described
in RFD 238. This commit provides some of the foundation needed to
implement the trust quorum reconfiguration protocol.

The protocol itself will be implemented in a `Node` type in a
[sans-io](https://sans-io.readthedocs.io/) style with property based
tests for the full protocol, similar to LRTQ. Async code will utilize
the protocol and forward messages over sprockets channels, and handle
requests from Nexus.

This initial code was split out to keep the PR small, although
it has been used in some preliminary protocol code already that will
come in a follow up PR.
This change removes the requirement to carry encrypted rack shares in
every `Prepare` message. gfss can generate them given enough encrypted
shares because its interpolate method is more generic than other
implementations and doesn't only calculate the secret.

Also fix up based on some other review comments. This removes the
`Error` type altogether as it's no longer used.
This code builds upon #7859, which itself builds upon #7891. Those
need to be merged in order first.

A `Node` is the sans-io entity driving the trust quorum protocol. This
PR starts the work of creating the `Node` type and coordinating an
initial configuration.

There's a property based test that generates an initial `ReconfigureMsg`
that is then used to call `Node::coordinate_reconfiguration`, which will
internally setup a `CoordinatorState` and create a bunch of messages
to be sent. We verify that a new `PersistentState` is returned and that
messages are sent.

This needs a bit more documentation and testing, so it's still WIP.
@andrewjstone andrewjstone changed the title WIP: Trust Quorum: Start implementing Node Trust Quorum: Start implementing Node Apr 7, 2025
@andrewjstone andrewjstone marked this pull request as ready for review April 7, 2025 20:37
};

let state = CoordinatorState::new(my_platform_id, now, msg, config, op);
Ok((state, my_prepare_msg.unwrap()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would we expect my_prepare_msg to be None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never. This is the Precondition discussed in the function doc comment. As part of creating the ValidatedReconfigureMsg we ensure that this node is part of the configuration and will have a Prepare message.

That being said, I really dislike the way I implemented this. In order to extract my_prepare_msg from the loop above I made my_prepare_msg a mutable option initialized to None. This is the root of the issue. I could also just always include it in prepares and then pull the value out, but that would also return an Option.

I'm really not sure of a better way forward here, although I could certainly add a safety comment above the unwrap. I'm open to other suggestions as well.

Copy link
Contributor Author

@andrewjstone andrewjstone Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Actually, I could have Configuration::new separate out this node's share and return it separately. That would prevent the need to filter here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either a refactor of a comment explaining the precondition would be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did refactor a bit, and also added some comments. Now the shares are always packaged up with the digests so that they don't get out of order by accident. However, there's still a need for an Option.

I also made all fields of ValidatedReconfigureMsg private so that it cannot be constructed without running through validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think I want to add the coordinator platform_id into the ValidatedReconfigureMsg so that it is indeed guaranteed known to be part of the membership. Some code could always call new_uninitialized with a different platform_id from itself, which would obviously be wrong, but isn't statically guaranteed not to happen.

This would also remove the extra parameter to the method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think I want to add the coordinator platform_id into the ValidatedReconfigureMsg

I went ahead and did this in a5323a5.

It had nice knock-on effects in that now Configuration and CoordinatorState have one fewer constructor parameter.

Base automatically changed from ajs/realtq-1 to main April 21, 2025 20:41
@andrewjstone andrewjstone enabled auto-merge (squash) April 21, 2025 21:05
@andrewjstone andrewjstone merged commit 74a5c0c into main Apr 21, 2025
17 checks passed
@andrewjstone andrewjstone deleted the ajs/realtq-2 branch April 21, 2025 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants