Report unstable peers from TSS #1228
Conversation
@@ -139,6 +139,8 @@ async fn handle_socket_result(socket: WebSocket, app_state: AppState) {
    if let Err(err) = handle_socket(socket, app_state).await {
        tracing::warn!("Websocket connection closed unexpectedly {:?}", err);
        // TODO here we should inform the chain that signing failed
        //
        // TODO (Nando): Report error up here?
I think the answer here is no, since this would be bubbled up when we try to make a (failed) `ws/` connection anyways.
I haven't totally thought it through, but it seems reasonable. The only errors we could catch here but not elsewhere are ones relating to incoming connections which do not relate to an existing listener, e.g. connections from random people or with a bad 'subscribe' message. I think we can just ignore these.
As for (2):
Probably the most common problem will be that a TSS server is just not available for whatever reason. I think it's worth noting that the relayer node will immediately know this, because they will be unable to make the connection to send them the signature request.
Really here they should tell the other TSS servers that they can abandon the signing protocol. There is no point in waiting for this one to connect, as they have failed to receive the signature request. Otherwise, if the unresponsive node is due to make an outgoing ws connection, it's hard to say how long we should wait for them before reporting unresponsiveness. Are they offline or just being slow?
This is going to be the `offender` account ID which gets submitted on-chain.
This way we have access to the `api` which we can use to submit an offline report on-chain.
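For reference, a minimal sketch of what submitting such a report via the `api` could look like using subxt's dynamic transaction API. The pallet name, call name, and argument shape below are placeholders for illustration only, not the actual Entropy metadata or the reporting extrinsic from #1215:

use subxt::{dynamic::Value, tx::Signer, OnlineClient, PolkadotConfig};

// Hypothetical sketch only: submit an "offline" report for a single offender.
async fn report_offline_peer<S: Signer<PolkadotConfig> + Send + Sync>(
    api: &OnlineClient<PolkadotConfig>,
    signer: &S,
    offender: [u8; 32],
) -> Result<(), subxt::Error> {
    // Build the call dynamically so the sketch doesn't depend on generated metadata types.
    let report_tx = subxt::dynamic::tx(
        "Slashing",             // placeholder pallet name
        "report_unstable_peer", // placeholder call name
        vec![Value::from_bytes(offender)],
    );

    api.tx()
        .sign_and_submit_then_watch_default(&report_tx, signer)
        .await?
        .wait_for_finalized_success()
        .await?;

    Ok(())
}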
@ameba23 thanks for the feedback!
This would just end up with a failure at the point of the relayer, not necessarily the TSS servers. This ties into your second comment though; we can have a follow-up in which we just deal with that.
Okay yeah, I can try and look into this.
Hmm this doesn't seem ideal, and if we want to run it in tests we may end up with
Force-pushed from 69a97cc to 5be281c
This error will end up getting bubbled up and handled by the `ws/` handler at a later point in time.
Peg, would you be able to point me to the bad subscribe test you were talking about? I was playing around with throwing some errors into the protocol test helpers but it looks like they're too low level for what we want to trigger in this case.
Here we start the signing process and then send a subscribe message from someone not in the signing group:
So I guess we would change it so that it is someone who is in the signing group who sends a subscribe message and then does nothing. The problem is, we need to stop that node from connecting as normal at the same time. So I guess we would need
This takes down Bob via hardcoding - but proves that the test can work.
Looks like the tests still pass with him in though...
We do this by manually setting the signers each time, instead of just using the ones from the chain.
These are helpful for checking offence reports
There's no code for the actual reporting bit here yet though
I didn't end up using it for testing.
@@ -91,6 +91,8 @@ pub async fn do_signing(
    )
    .await?;

    // TODO (Nando): Maybe report here since we're expecting a connection from somebody and they
@ameba23 do you know if there's a way to hold other parties accountable here?
My current problem: I know in `open_protocol_connections()` that there are some parties that will initiate the protocol with us. However, if they don't, then we trigger this timeout and bail - I'd like to be able to know who they were so I can report them.
I can figure out who we are expecting to connect with us using the implementation details in `open_protocol_connections()` (bit of a leak imo), but from that I can't really assign "blame" to the party who didn't end up connecting with us (I only know who I expected to connect, but not who actually failed to connect).
(See `test_reports_peer_if_they_dont_initiate_a_signing_session` for what I'd ideally want to get passing.)
Another question here. Say we're expecting two peers to connect with us, and only one of them does. I think because there's some data that went through the channel this check would pass, right? If that's true then it might also be tricky for me to report the second peer from here.
I only know who I expected to connect, but not who actually failed to connect
Not sure if this answers your question, but we can see which nodes did not yet connect (or which we did not yet connect to) for a given session id using `AppState.listener_state`, which is a hashmap of session IDs to `Listener`s. `Listener.validators` tracks the remaining TSS nodes we have not yet connected to or received a connection from. Whenever we connect to or receive a connection from one of these we call `listener.subscribe()`, which internally removes that TSS node from the remaining `validators`.
So if we want to report all peers who failed to make an incoming connection when the timeout was reached, we could maybe do something like this with lines 97-99 below:
match timeout(Duration::from_secs(SETUP_TIMEOUT_SECONDS), rx_ready).await {
    Ok(ready) => {
        let broadcast_out = ready??;
        Channels(broadcast_out, rx_from_others)
    },
    Err(_) => {
        let listeners = app_state.listeners.lock()?;
        let listener = listeners.get(&session_id).ok_or(ProtocolErr::NoSessionId)?;
        let remaining_validators: Vec<_> =
            listener.validators.keys().map(|id| SubxtAccountId32(id)).collect();
        return Err(ProtocolErr::TimedOutWaitingForConnectionsFrom(remaining_validators));
    },
}
This code won't work as it is, but you get the idea. Since by this point `open_protocol_connections` has returned successfully, we can be sure that all the `remaining_validators` we find in this way are ones which were supposed to initiate a connection to us (and not ones which we failed to connect to ourselves).
Super helpful, thanks 🙏
Doesn't look like we need to report this error without the corresponding peers
Will bring this up in the review
        return Err(error.to_string());
    }

    tracing::debug!("Reporting `{:?}` for `{}`", peers_to_report.clone(), error.to_string());
In the logs these (and basically all the other `AccountId`s from this PR) are formatted as byte arrays. I can make them SS58 addresses, but this would potentially give us inconsistent logs with other events which use byte arrays. While they're harder to read I'd rather have more consistency across our logs - but lmk what you think.
+1 for being consistent everywhere. If I remember right, subxt doesn't have the SS58 codec but sp-core does. At one point this was an issue, as in some places we didn't have sp-core, but I'm pretty sure we sorted that and all our crates which would need to log something have it now.
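For what it's worth, a minimal sketch of the sp-core route (assuming the crate already depends on sp-core; the helper name is just illustrative):

use sp_core::crypto::{AccountId32, Ss58Codec};

// Illustrative helper: render a raw 32-byte account ID as an SS58 address for logging.
fn account_id_to_ss58(raw: [u8; 32]) -> String {
    // `to_ss58check` uses the default Substrate prefix; a chain-specific prefix could be
    // passed with `to_ss58check_with_version` instead.
    AccountId32::new(raw).to_ss58check()
}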
Opened #1271. Could be a Friday activity for someone lol
💯 this is a big step towards getting slashing finished.
As I understand it, the scope of this PR is only to report errors from the signing protocol, but it looks like the groundwork has been done to add this to DKG and resharing, and basically we would just need to add a call to `handle_protocol_errors`.
As for the hard-to-test cases I'm happy to leave it for now, and if we don't come up with a way of doing automated tests we can maybe have instructions for doing a manual test in the pre-release checklist.
I haven't looked in much detail at how these errors get handled on the chain side. We should bear in mind that many of them could also be our own fault (e.g. a failed incoming connection because we lost connectivity ourselves, or encryption errors because of corrupt data on our end). But I guess this is accounted for in that the slashing pallet gets a sufficient number of reports from different peers before taking action.
            Ok(Channels(broadcast_out, rx_from_others))
        },
        Err(e) => {
            let unsubscribed_peers = state.unsubscribed_peers(session_id).map_err(|_| {
Why do we need to map the error here? Aren't we mapping it to what it already was? Or am I missing something?
There are two `unsubscribed_peers()` methods - the one on `ListenerState` (which is what's used here) returns a `SubscribeErr`, so we have to map it. The one on `AppState` returns a `ProtocolErr`, but we don't have access to that here.
@@ -768,6 +767,210 @@ async fn test_fails_to_sign_if_non_signing_group_participants_are_used() {
    clean_tests();
}

#[tokio::test]
#[serial]
async fn test_reports_peer_if_they_dont_participate_in_signing() {
Can we either rename the test to something like `test_reports_peer_if_they_reject_our_signing_protocol_connection`, or put a comment saying something like that?
It took me a while to figure out that the difference between this test and the one below is that this is a failed outgoing connection and the other one is a failed incoming connection.
    // This is a non-reportable error, so we don't do any further processing with the error
    if peers_to_report.is_empty() {
        return Err(error.to_string());
As for the response here, shouldn't this be `Ok()`? As in, the function here did not error out - everything progressed normally within the context of this function.
I do agree with you - but if we change this to an `Ok<String>` it'll mean that the caller will end up needing to treat the `Ok` case as a failure/error, because that's what it is. By only returning errors here we can save ourselves this and just do an `unwrap_err()` at the call site.
We also don't do any extra processing with the `Err` case - we log it and move on. If we had some different behaviour at the call site, then yes, it could make sense.
    }

    if failed_reports.is_empty() {
        Err(error.to_string())
same here
This is a follow up to #1215.
In this PR we enable TSS servers to report each other on-chain for "offline" or
connection related failures during signing (note that this PR doesn't include DKG or
proactive refresh) using the reporting extrinsic introduced in #1215.
When trying to figure out when to report I basically walked through the signing flow to
see where connections were being made (either inbound or outbound).
For outbound connections we basically report all the errors from
`open_protocol_connections()`. For inbound, we report peers that we're expecting to
connect to us during signing but never do. A condensed sketch of the inbound half follows below.
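All names in this sketch are illustrative stand-ins based on the discussion in this thread, not the exact PR code; it just shows the "timeout, then report whoever never connected" shape:

use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

// Assumed timeout value, for illustration only.
const SETUP_TIMEOUT_SECONDS: u64 = 20;

type PeerId = [u8; 32];

// Illustrative error type: who we gave up waiting for.
#[derive(Debug)]
struct TimedOutWaitingFor(Vec<PeerId>);

// Wait for all expected inbound connections; on timeout, return the peers that are
// still missing so the caller can report them on-chain.
async fn wait_for_inbound_peers(
    rx_ready: oneshot::Receiver<()>,
    remaining_peers: impl FnOnce() -> Vec<PeerId>,
) -> Result<(), TimedOutWaitingFor> {
    match timeout(Duration::from_secs(SETUP_TIMEOUT_SECONDS), rx_ready).await {
        // Everyone connected in time (or the sender side was dropped); nothing to report.
        Ok(_) => Ok(()),
        // Timed out: anyone still in the listener's remaining set never connected to us.
        Err(_) => Err(TimedOutWaitingFor(remaining_peers())),
    }
}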
As far as testing goes it was a bit tricky to hit all the different networking/connection
codepaths programmatically. In the end I did manage to get one unit test for inbound
connections and one for outbound connections (thanks for the help here @ameba23).
For those errors I couldn't test programmatically I did do some manual testing with hacked
Docker images (inserting crashes at connection points) - but I definitely could've missed
something here.