Report unstable peers from TSS #1228

Merged 37 commits from hc/off-chain-peer-reporting into master on Jan 30, 2025
Conversation

@HCastano (Collaborator) commented Dec 20, 2024

This is a follow-up to #1215.

In this PR we enable TSS servers to report each other on-chain for "offline" or
connection-related failures during signing (note that this PR doesn't cover DKG or
proactive refresh), using the reporting extrinsic introduced in #1215.

To figure out when to report, I walked through the signing flow to see where
connections were being made (either inbound or outbound).

For outbound connections we report all the errors from open_protocol_connections().
For inbound, we report peers that we expect to connect to us during signing but
never do.
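
Roughly, the outbound half looks like this (a simplified sketch - the exact
signatures of open_protocol_connections() and the reporting helper in this PR
differ):

// Any error from opening outbound protocol connections is handed to the
// reporting path, which extracts the reportable peers and submits the
// reporting extrinsic from #1215 for each of them.
if let Err(error) =
    open_protocol_connections(&validators_info, &session_id, &signer, &app_state).await
{
    return handle_protocol_errors(&api, &rpc, &signer, &session_id, error).await;
}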

As far as testing goes, it was a bit tricky to hit all the different networking/connection
codepaths programmatically. In the end I did manage to get one unit test for inbound
connections and one for outbound connections (thanks for the help here @ameba23).

For the errors I couldn't test programmatically, I did some manual testing with hacked
Docker images (inserting crashes at connection points), but I definitely could've missed
something here.

@@ -139,6 +139,8 @@ async fn handle_socket_result(socket: WebSocket, app_state: AppState) {
if let Err(err) = handle_socket(socket, app_state).await {
tracing::warn!("Websocket connection closed unexpectedly {:?}", err);
// TODO here we should inform the chain that signing failed
//
// TODO (Nando): Report error up here?
HCastano (Collaborator, Author):

I think the answer here is no, since this would be bubbled up when we try to make a (failed) ws/ connection anyway

ameba23 (Contributor):

I haven't totally thought it through, but this seems reasonable. The only errors we could catch here but not elsewhere relate to incoming connections which don't correspond to an existing listener, e.g. connections from random people or with a bad 'subscribe' message. I think we can just ignore these.

@ameba23 (Contributor) commented Dec 21, 2024

As for (2):

  • A TSS server just doesn't show up. This should be easy to test.
  • A TSS server connects and submits a valid subscribe message, but then drops the connection. We can write a test helper that does this - I think there is already something similar for testing bad subscribe messages.
  • A TSS server drops the connection in the middle of a protocol run. This is hard to test. The only thing I can think of is to wrap setup_client
    pub async fn setup_client() -> KvManager {
    in a tokio::time::timeout which kills the TSS server after some amount of time which is likely to be in the middle of the protocol (see the sketch after this list). Since the signing protocol is fast, this might be too unreliable. But it would probably work for the DKG/reshare protocols. Kinda ugly though.
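
A minimal sketch of that last idea (the cutoff duration and the pending-future trick are assumptions, not what the final test would look like):

use std::time::Duration;
use tokio::time::timeout;

// Run the TSS server future under a timeout so it gets dropped partway
// through a protocol run. The 500ms cutoff is a guess.
let server = async {
    let _kv = setup_client().await; // existing test helper
    std::future::pending::<()>().await // keep the server alive
};
// When the timeout elapses the future is dropped, tearing the server
// down abruptly - hopefully mid-protocol.
assert!(timeout(Duration::from_millis(500), server).await.is_err());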

@ameba23 (Contributor) commented Dec 21, 2024

Probably the most common problem will be that a TSS server is just not available for whatever reason.

I think it's worth noting that the relayer node will immediately know this, because they will be unable to make the sign_tx request. But currently they do nothing about it other than logging an error:

tracing::error!("Error in tokio::spawn task: {:?}", e);

Really, they should tell the other TSS servers that they can abandon the signing protocol. There is no point in waiting for this one to connect, as it has failed to receive the signature request.

Otherwise, if the unresponsive node is due to make an outgoing ws connection, it's hard to say how long we should wait for them before reporting unresponsiveness. Are they offline or just being slow?

This is going to be the `offender` account ID which gets submitted on-chain.
This way we have access to the `api` which we can use to submit an offline report on-chain.
@HCastano (Collaborator, Author) commented:

@ameba23 thanks for the feedback!

A TSS server just doesn't show up. This should be easy to test.

This would just end up with a failure at the point of the relayer, not necessarily the
signers (which we don't deal with right now).

This ties into your second comment though; we can have a follow-up in which we just deal
with the relayer <-> signer failure case.

A TSS server connects and submits a valid subscribe message, but then drops the connection.
We can write a test helper that does this - I think there is already something similar for
testing bad subscribe messages.

Okay yeah, I can try to look into this.

A TSS server drops the connection in the middle of a protocol run. This is hard to
test. The only thing I can think of is to wrap setup_client

Hmm, this doesn't seem ideal, and if we want to run it in tests we may end up with
nondeterministic tests... I'll poke around the code here and see if I can think of
anything else.

@HCastano force-pushed the hc/off-chain-peer-reporting branch from 69a97cc to 5be281c on January 20, 2025 20:03
This error will end up getting bubbled up and handled by the `ws/` handler
at a later point in time.
@HCastano (Collaborator, Author) commented:

Peg, would you be able to point me to the bad subscribe test you were talking about?

I was playing around with throwing some errors into the protocol test helpers, but it looks like they're too low-level for what we want to trigger in this case.

@ameba23 (Contributor) commented Jan 21, 2025

Peg, would you be able to point me to the bad subscribe test you were talking about?

Here we start the signing process and then send a subscribe message from someone not in the signing group:

// create a SubscribeMessage from a party who is not in the signing committee

So I guess we would change it so that it is someone who is in the signing group who sends a subscribe message and then does nothing. The problem is, we need to stop that node from connecting as normal at the same time. So I guess we would need spawn_testing_validators to somehow omit one node. Or maybe there's some other way to go about it. (A rough sketch of the "subscribe then do nothing" part is below.)
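
A sketch of that helper, assuming tokio-tungstenite and hand-waving the subscribe message format (the real helper would reuse the existing test utilities):

use futures_util::SinkExt;
use tokio_tungstenite::{connect_async, tungstenite::Message};

// Hypothetical helper: open a ws connection, send a valid subscribe
// message for a party who *is* in the signing group, then drop the
// socket without participating in the protocol.
async fn subscribe_then_vanish(url: &str, subscribe_msg: Vec<u8>) {
    let (mut ws, _response) = connect_async(url).await.unwrap();
    ws.send(Message::Binary(subscribe_msg)).await.unwrap();
    // `ws` is dropped here, closing the connection before any protocol
    // messages are exchanged.
}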

@@ -91,6 +91,8 @@ pub async fn do_signing(
)
.await?;

// TODO (Nando): Maybe report here since we're expecting a connection from somebody and they
@HCastano (Collaborator, Author) commented Jan 28, 2025:

@ameba23 do you know if there's a way to hold other parties accountable here?

My current problem: I know in open_protocol_connections() that there are some parties that will initiate the protocol with us. However, if they don't, then we trigger this timeout and bail - I'd like to be able to know who they were so I can report them.

I can figure out who we are expecting to connect with us using the implementation details in open_protocol_connections() (bit of a leak imo), but from that I can't really assign "blame" to the party who didn't end up connecting with us (I only know who I expected to connect, but not who actually failed to connect).

(See test_reports_peer_if_they_dont_initiate_a_signing_session for what I'd ideally want to get passing)

Another question here. Say we're expecting two peers to connect with us, and only one of them does. I think because there's some data that went through the channel this check would pass, right? If that's true then it might also be tricky for me to report the second peer from here.

@ameba23 (Contributor) commented Jan 28, 2025:

I only know who I expected to connect, but not who actually failed to connect

Not sure if this answers your question, but we can see which nodes did not yet connect (or which we did not yet connect to) for a given session ID using AppState.listener_state, which is a hashmap of session IDs to Listeners. Listener.validators tracks the remaining TSS nodes we have not yet connected to or received a connection from.

Whenever we connect to or receive a connection from one of these we call listener.subscribe() which internally removes that TSS node from the remaining validators.

So if we want to report all peers who failed to make an incoming connection when the timeout was reached, we could maybe do something like this with lines 97-99 below:

match timeout(Duration::from_secs(SETUP_TIMEOUT_SECONDS), rx_ready).await {
  Ok(ready) => {
    let broadcast_out = ready??;
    Channels(broadcast_out, rx_from_others)
  }
  Err(_) => {
    // On timeout, look up this session's listener and report every
    // validator which never connected to us.
    let listeners = app_state.listeners.lock()?;
    let listener = listeners.get(&session_id).ok_or(ProtocolErr::NoSessionId)?;
    let remaining_validators: Vec<_> =
      listener.validators.keys().map(|id| SubxtAccountId32(id)).collect();
    return Err(ProtocolErr::TimedOutWaitingForConnectionsFrom(remaining_validators));
  }
}

This code won't work as it is, but you get the idea.

Since by this point open_protocol_connections has returned successfully, we can be sure that all the remaining_validators we find in this way are ones which were supposed to initiate a connection to us (and not ones which we failed to connect to ourselves).

HCastano (Collaborator, Author):

Super helpful, thanks 🙏

return Err(error.to_string());
}

tracing::debug!("Reporting `{:?}` for `{}`", peers_to_report.clone(), error.to_string());
HCastano (Collaborator, Author):

In the logs these (and basically all the other AccountIds from this PR) are formatted as byte arrays. I can make them SS58 addresses, but this would potentially give us logs that are inconsistent with other events which use byte arrays. While they're harder to read, I'd rather have more consistency across our logs - but let me know what you think.

ameba23 (Contributor):

+1 for being consistent everywhere. If I remember right, subxt doesn't have the SS58 codec but sp-core does. At one point this was an issue, as in some places we didn't have sp-core, but I'm pretty sure we sorted that out and all our crates which would need to log something have it now.
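
For reference, a minimal sketch of the sp-core route (assuming the raw 32 bytes are already in hand):

use sp_core::crypto::{AccountId32, Ss58Codec};

// Turn the raw bytes we currently log into an SS58 address.
let account = AccountId32::new(raw_bytes); // raw_bytes: [u8; 32]
tracing::debug!("Reporting `{}`", account.to_ss58check());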

HCastano (Collaborator, Author):

Opened #1271. Could be a Friday activity for someone lol

@HCastano HCastano marked this pull request as ready for review January 29, 2025 03:01
@ameba23 (Contributor) left a comment:

💯 this is a big step towards getting slashing finished.

As I understand it, the scope of this PR is only to report errors from the signing protocol, but it looks like the groundwork has been done to add this to DKG and resharing; basically we would just need to add a call to handle_protocol_errors.

As for the hard-to-test cases, I'm happy to leave it for now; if we don't come up with a way of doing automated tests, we can maybe add instructions for doing a manual test to the pre-release checklist.

I haven't looked in much detail at how these errors get handled on the chain side. We should bear in mind that many of them could also be our own fault (e.g. a failed incoming connection because we lost connectivity ourselves, or encryption errors because of corrupt data on our end). But I guess this is accounted for in that the slashing pallet requires a sufficient number of reports from different peers before taking action.

Ok(Channels(broadcast_out, rx_from_others))
},
Err(e) => {
let unsubscribed_peers = state.unsubscribed_peers(session_id).map_err(|_| {
ameba23 (Contributor):

Why do we need to map the error here? Aren't we mapping it to what it already was? Or am I missing something?

HCastano (Collaborator, Author):

There are two unsubscribed_peers() methods. The one on ListenerState (which is what's used here) returns a SubscribeErr, so we have to map it.

The one on AppState returns a ProtocolErr, but we don't have access to that here.
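
So the shape of it is roughly this (the ProtocolErr variant name here is made up for illustration):

// ListenerState::unsubscribed_peers gives back a SubscribeErr, but this
// context needs a ProtocolErr, hence the map_err.
let unsubscribed_peers = state.unsubscribed_peers(session_id).map_err(|_| {
    ProtocolErr::SessionError("Unable to determine unsubscribed peers".to_string())
})?;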

@@ -768,6 +767,210 @@ async fn test_fails_to_sign_if_non_signing_group_participants_are_used() {
clean_tests();
}

#[tokio::test]
#[serial]
async fn test_reports_peer_if_they_dont_participate_in_signing() {
ameba23 (Contributor):

Can we either rename the test to something like test_reports_peer_if_they_reject_our_signing_protocol_connection, or put a comment saying something like that?

It took me a while to figure out that the difference between this test and the one below is that this one is a failed outgoing connection and the other one is a failed incoming connection.


// This is a non-reportable error, so we don't do any further processing with the error
if peers_to_report.is_empty() {
return Err(error.to_string());
Member:

As for the response here: shouldn't this be Ok(), in that the function did not error out and everything progressed normally within the context of this function?

HCastano (Collaborator, Author):

I do agree with you - but if we change this to an Ok(String), the caller will end up needing to treat the Ok case as a failure/error, because that's what it is. By only returning errors we save ourselves this and can just do an unwrap_err() at the call site.

We also don't do any extra processing in the Err case - we log it and move on. If we had some different behaviour at the call site, then yes, it could make sense.
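
For example, at a call site that expects the failure (a sketch - the argument list and message text are assumptions):

// Because handle_protocol_errors only ever returns `Err`, callers that
// expect a failure can assert on it directly.
let result = handle_protocol_errors(&api, &rpc, &signer, &session_id, protocol_error).await;
let message = result.unwrap_err();
assert!(message.contains("Timed out"));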

}

if failed_reports.is_empty() {
Err(error.to_string())
Member:

same here

@HCastano HCastano merged commit 8545d45 into master Jan 30, 2025
7 of 8 checks passed
@HCastano HCastano deleted the hc/off-chain-peer-reporting branch January 30, 2025 01:35