Report unstable peers from TSS #1228

Draft · HCastano wants to merge 3 commits into master

Conversation

HCastano (Collaborator)

This is the WIP follow-up to #1215.

In this PR the plan is to have TSS servers use the reporting extrinsic introduced in
#1215 whenever they experience some sort of "offline" or connection failure from a peer
during signing.

There are two main areas of work for this:

  1. Figuring out what constitutes valid errors to report
  2. Figuring out how to do automated testing for this

For (1), I've just been going through the signing flow for now, annotating the one function
where I think all the important connections are made.

For (2), I'm not really sure how to approach this yet. So far I've just been testing this
manually using hacked Docker images with panics at points where connections should be
made.

@ameba23 if you get a chance to take a peek at this I'd appreciate some ideas on (1) and
(2) from you.
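
Roughly, the intended flow looks something like the sketch below (placeholder types and function names throughout; the real version would go through the chain api handle and the reporting extrinsic from #1215):

    // Placeholder types standing in for the chain api handle and the offending
    // peer's account ID from the signing session.
    struct ChainApi;
    struct PeerAccountId([u8; 32]);

    enum SigningError {
        PeerConnectionFailed(PeerAccountId),
        Other(String),
    }

    impl ChainApi {
        // Placeholder for submitting the reporting extrinsic introduced in #1215.
        async fn report_unstable_peer(&self, _offender: &PeerAccountId) {
            // Real version: build and submit the offence report on-chain.
        }
    }

    // The rough idea: if a signing run fails because a peer dropped or never
    // connected, report that peer; any other error is just propagated.
    async fn handle_signing_outcome(
        api: &ChainApi,
        outcome: Result<(), SigningError>,
    ) -> Result<(), SigningError> {
        if let Err(SigningError::PeerConnectionFailed(ref offender)) = outcome {
            api.report_unstable_peer(offender).await;
        }
        outcome
    }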

This is going to be the `offender` account ID which gets submitted on-chain.
This way we have access to the `api` which we can use to submit an offline report on-chain.
@@ -139,6 +139,8 @@ async fn handle_socket_result(socket: WebSocket, app_state: AppState) {
    if let Err(err) = handle_socket(socket, app_state).await {
        tracing::warn!("Websocket connection closed unexpectedly {:?}", err);
        // TODO here we should inform the chain that signing failed
        //
        // TODO (Nando): Report error up here?
HCastano (Collaborator, Author)

I think the answer here is no, since this would be bubbled up when we try to make a (failed) ws connection anyway.

Contributor

I haven't totally thought it through, but this seems reasonable. The only errors we could catch here but not elsewhere relate to incoming connections which do not correspond to an existing listener, e.g. connections from random people or with a bad 'subscribe' message. I think we can just ignore these.

ameba23 (Contributor) commented Dec 21, 2024

As for (2):

  • A TSS server just doesn't show up. This should be easy to test.
  • A TSS connects and submits a valid subscribe message, but then drops the connection. We can write a test helper that does this - I think there is already something similar for testing bad subscribe messages.
  • A TSS server drops the connection in the middle of a protocol run. This is hard to test. The only thing I can think of is to wrap setup_client (pub async fn setup_client() -> KvManager) in a tokio::time::timeout which kills the TSS server after some amount of time, which is likely to land in the middle of the protocol - see the sketch after this list. Since the signing protocol is fast, this might be too unreliable, but it would probably work for the DKG/reshare protocols. Kinda ugly though.
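
A rough sketch of that last idea (the helper name, the lifetime value, and the assumption that the server only stays up while the setup_client future is being polled are illustrative, not taken from the codebase):

    use std::time::Duration;

    // Hypothetical test helper: run a TSS server which disappears partway through
    // a protocol run. Assumes dropping the future (when the timeout fires) tears
    // the server down.
    async fn run_tss_server_that_dies_early(lifetime: Duration) {
        let result = tokio::time::timeout(lifetime, async {
            let _kv_manager = setup_client().await;
            // Keep polling so the server stays alive until the timeout fires.
            std::future::pending::<()>().await
        })
        .await;

        // Err(Elapsed) is the expected outcome: we killed the server on purpose.
        assert!(result.is_err(), "expected the TSS server to be killed by the timeout");
    }

Whether this actually interrupts a protocol run depends on when the timeout fires relative to the protocol starting, which is the unreliability concern above.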

ameba23 (Contributor) commented Dec 21, 2024

Probably the most common problem will be that a TSS server is just not available for whatever reason.

I think it's worth noting that the relayer node will immediately know this, because they will be unable to make the sign_tx request. But currently they do nothing about it, other than logging an error:

tracing::error!("Error in tokio::spawn task: {:?}", e);

Really here they should tell the other TSS servers that they can abandon the signing protocol. There is no point in waiting for this one to connect, as they have failed to receive the signature request.

Otherwise, if the unresponsive node is due to make an outgoing ws connection, it's hard to say how long we should wait for them before reporting unresponsiveness. Are they offline or just being slow?
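
One way to make that trade-off concrete is to pick a deadline and treat anything slower as unresponsive. A sketch under assumptions (the 10 second figure and the direct use of tokio-tungstenite here are illustrative, not what the codebase currently does):

    use std::time::Duration;
    use tokio_tungstenite::connect_async;

    // Illustrative deadline; choosing the right value for "slow vs offline" is the open question.
    const CONNECTION_TIMEOUT: Duration = Duration::from_secs(10);

    async fn connect_or_give_up(url: &str) -> Result<(), String> {
        match tokio::time::timeout(CONNECTION_TIMEOUT, connect_async(url)).await {
            // Connected in time: carry on with the signing protocol.
            Ok(Ok((_ws_stream, _response))) => Ok(()),
            // Handshake or transport error: we know something is wrong and can report now.
            Ok(Err(err)) => Err(format!("connection to peer failed: {err}")),
            // No answer before the deadline: offline or just slow? Past this point we report anyway.
            Err(_) => Err("peer unresponsive: connection attempt timed out".into()),
        }
    }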
