Report unstable peers from TSS #1228

Merged 37 commits from hc/off-chain-peer-reporting into master on Jan 30, 2025
Conversation

@HCastano (Collaborator) commented Dec 20, 2024

This is a follow-up to #1215.

In this PR we enable TSS servers to report each other on-chain for "offline" or
connection-related failures during signing (note that this PR doesn't cover DKG or
proactive refresh), using the reporting extrinsic introduced in #1215.

To figure out when to report, I walked through the signing flow to see where
connections were being made (either inbound or outbound).

For outbound connections we report all the errors from open_protocol_connections().
For inbound, we report peers that we expect to connect to us during signing but
never do.
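
Roughly, the outbound half looks like this (a simplified sketch - the exact
signatures of open_protocol_connections() and the reporting helper in this PR
differ):

// Any error from opening outbound protocol connections is handed to the
// reporting path, which extracts the reportable peers and submits the
// reporting extrinsic from #1215 for each of them.
if let Err(error) =
    open_protocol_connections(&validators_info, &session_id, &signer, &app_state).await
{
    return handle_protocol_errors(&api, &rpc, &signer, &session_id, error).await;
}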

As far as testing goes, it was a bit tricky to hit all the different networking/connection
codepaths programmatically. In the end I did manage to get one unit test for inbound
connections and one for outbound connections (thanks for the help here @ameba23).

For the errors I couldn't test programmatically, I did some manual testing with hacked
Docker images (inserting crashes at connection points), but I definitely could've missed
something here.

@@ -139,6 +139,8 @@ async fn handle_socket_result(socket: WebSocket, app_state: AppState) {
if let Err(err) = handle_socket(socket, app_state).await {
tracing::warn!("Websocket connection closed unexpectedly {:?}", err);
// TODO here we should inform the chain that signing failed
//
// TODO (Nando): Report error up here?
HCastano (Collaborator, Author):

I think the answer here is no, since this would be bubbled up when we try to make a (failed) ws/ connection anyway

ameba23 (Contributor):

I haven't totally thought it through, but this seems reasonable. The only errors we could catch here but not elsewhere relate to incoming connections which don't correspond to an existing listener, e.g. connections from random people or with a bad 'subscribe' message. I think we can just ignore these.

@ameba23 (Contributor) commented Dec 21, 2024

As for (2):

  • A TSS server just doesn't show up. This should be easy to test.
  • A TSS server connects and submits a valid subscribe message, but then drops the connection. We can write a test helper that does this - I think there is already something similar for testing bad subscribe messages.
  • A TSS server drops the connection in the middle of a protocol run. This is hard to test. The only thing I can think of is to wrap setup_client
    pub async fn setup_client() -> KvManager {
    in a tokio::time::timeout which kills the TSS server after some amount of time which is likely to be in the middle of the protocol (see the sketch after this list). Since the signing protocol is fast, this might be too unreliable. But it would probably work for the DKG/reshare protocols. Kinda ugly though.
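
A minimal sketch of that last idea (the cutoff duration and the pending-future trick are assumptions, not what the final test would look like):

use std::time::Duration;
use tokio::time::timeout;

// Run the TSS server future under a timeout so it gets dropped partway
// through a protocol run. The 500ms cutoff is a guess.
let server = async {
    let _kv = setup_client().await; // existing test helper
    std::future::pending::<()>().await // keep the server alive
};
// When the timeout elapses the future is dropped, tearing the server
// down abruptly - hopefully mid-protocol.
assert!(timeout(Duration::from_millis(500), server).await.is_err());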

@ameba23 (Contributor) commented Dec 21, 2024

Probably the most common problem will be that a TSS server is just not available for whatever reason.

I think it's worth noting that the relayer node will immediately know this, because they will be unable to make the sign_tx request. But currently they do nothing about it other than logging an error:

tracing::error!("Error in tokio::spawn task: {:?}", e);

Really, they should tell the other TSS servers that they can abandon the signing protocol. There is no point in waiting for this one to connect, as it has failed to receive the signature request.

Otherwise, if the unresponsive node is due to make an outgoing ws connection, it's hard to say how long we should wait for them before reporting unresponsiveness. Are they offline or just being slow?

This is going to be the `offender` account ID which gets submitted on-chain.
This way we have access to the `api` which we can use to submit an offline report on-chain.
@HCastano (Collaborator, Author) commented:

@ameba23 thanks for the feedback!

A TSS server just doesn't show up. This should be easy to test.

This would just end up with a failure at the point of the relayer, not necessarily the
signers (which we don't deal with right now).

This ties into your second comment though; we can have a follow-up in which we just deal
with the relayer <-> signer failure case.

A TSS server connects and submits a valid subscribe message, but then drops the connection.
We can write a test helper that does this - I think there is already something similar for
testing bad subscribe messages.

Okay yeah, I can try to look into this.

A TSS server drops the connection in the middle of a protocol run. This is hard to
test. The only thing I can think of is to wrap setup_client

Hmm, this doesn't seem ideal, and if we want to run it in tests we may end up with
nondeterministic tests... I'll poke around the code here and see if I can think of
anything else.

@HCastano force-pushed the hc/off-chain-peer-reporting branch from 69a97cc to 5be281c on January 20, 2025 20:03
This error will end up getting bubbled up and handled by the `ws/` handler
at a later point in time.
@HCastano (Collaborator, Author) commented:

Peg, would you be able to point me to the bad subscribe test you were talking about?

I was playing around with throwing some errors into the protocol test helpers, but it looks like they're too low-level for what we want to trigger in this case.

@ameba23 (Contributor) commented Jan 21, 2025

Peg, would you be able to point me to the bad subscribe test you were talking about?

Here we start the signing process and then send a subscribe message from someone not in the signing group:

// create a SubscribeMessage from a party who is not in the signing committee

So I guess we would change it so that it is someone who is in the signing group who sends a subscribe message and then does nothing. The problem is, we need to stop that node from connecting as normal at the same time. So I guess we would need spawn_testing_validators to somehow omit one node. Or maybe there's some other way to go about it. (A rough sketch of the "subscribe then do nothing" part is below.)
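
A sketch of that helper, assuming tokio-tungstenite and hand-waving the subscribe message format (the real helper would reuse the existing test utilities):

use futures_util::SinkExt;
use tokio_tungstenite::{connect_async, tungstenite::Message};

// Hypothetical helper: open a ws connection, send a valid subscribe
// message for a party who *is* in the signing group, then drop the
// socket without participating in the protocol.
async fn subscribe_then_vanish(url: &str, subscribe_msg: Vec<u8>) {
    let (mut ws, _response) = connect_async(url).await.unwrap();
    ws.send(Message::Binary(subscribe_msg)).await.unwrap();
    // `ws` is dropped here, closing the connection before any protocol
    // messages are exchanged.
}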

@@ -91,6 +91,8 @@ pub async fn do_signing(
)
.await?;

// TODO (Nando): Maybe report here since we're expecting a connection from somebody and they
@HCastano (Collaborator, Author) commented Jan 28, 2025:

@ameba23 do you know if there's a way to hold other parties accountable here?

My current problem: I know in open_protocol_connections() that there are some parties that will initiate the protocol with us. However, if they don't, then we trigger this timeout and bail - I'd like to be able to know who they were so I can report them.

I can figure out who we are expecting to connect with us using the implementation details in open_protocol_connections() (bit of a leak imo), but from that I can't really assign "blame" to the party who didn't end up connecting with us (I only know who I expected to connect, but not who actually failed to connect).

(See test_reports_peer_if_they_dont_initiate_a_signing_session for what I'd ideally want to get passing)

Another question here. Say we're expecting two peers to connect with us, and only one of them does. I think because there's some data that went through the channel this check would pass, right? If that's true then it might also be tricky for me to report the second peer from here.

@ameba23 (Contributor) commented Jan 28, 2025:

I only know who I expected to connect, but not who actually failed to connect

Not sure if this answers your question, but we can see which nodes did not yet connect (or which we did not yet connect to) for a given session ID using AppState.listener_state, which is a hashmap of session IDs to Listeners. Listener.validators tracks the remaining TSS nodes we have not yet connected to or received a connection from.

Whenever we connect to or receive a connection from one of these we call listener.subscribe() which internally removes that TSS node from the remaining validators.

So if we want to report all peers who failed to make an incoming connection when the timeout was reached, we could maybe do something like this with lines 97-99 below:

match timeout(Duration::from_secs(SETUP_TIMEOUT_SECONDS), rx_ready).await {
  Ok(ready) => {
    let broadcast_out = ready??;
    Channels(broadcast_out, rx_from_others)
  }
  Err(_) => {
    // On timeout, look up this session's listener and report every
    // validator which never connected to us.
    let listeners = app_state.listeners.lock()?;
    let listener = listeners.get(&session_id).ok_or(ProtocolErr::NoSessionId)?;
    let remaining_validators: Vec<_> =
      listener.validators.keys().map(|id| SubxtAccountId32(id)).collect();
    return Err(ProtocolErr::TimedOutWaitingForConnectionsFrom(remaining_validators));
  }
}

This code won't work as it is, but you get the idea.

Since by this point open_protocol_connections has returned successfully, we can be sure that all the remaining_validators we find in this way are ones which were supposed to initiate a connection to us (and not ones which we failed to connect to ourselves).

HCastano (Collaborator, Author):

Super helpful, thanks 🙏

return Err(error.to_string());
}

tracing::debug!("Reporting `{:?}` for `{}`", peers_to_report.clone(), error.to_string());
HCastano (Collaborator, Author):

In the logs these (and basically all the other AccountIds from this PR) are formatted as byte arrays. I can make them SS58 addresses, but this would potentially give us logs that are inconsistent with other events which use byte arrays. While they're harder to read, I'd rather have more consistency across our logs - but let me know what you think.

ameba23 (Contributor):

+1 for being consistent everywhere. If I remember right, subxt doesn't have the SS58 codec but sp-core does. At one point this was an issue, as in some places we didn't have sp-core, but I'm pretty sure we sorted that out and all our crates which would need to log something have it now.
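
For reference, a minimal sketch of the sp-core route (assuming the raw 32 bytes are already in hand):

use sp_core::crypto::{AccountId32, Ss58Codec};

// Turn the raw bytes we currently log into an SS58 address.
let account = AccountId32::new(raw_bytes); // raw_bytes: [u8; 32]
tracing::debug!("Reporting `{}`", account.to_ss58check());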

HCastano (Collaborator, Author):

Opened #1271. Could be a Friday activity for someone lol

@HCastano HCastano marked this pull request as ready for review January 29, 2025 03:01
@ameba23 (Contributor) left a comment:

💯 this is a big step towards getting slashing finished.

As I understand it, the scope of this PR is only to report errors from the signing protocol, but it looks like the groundwork has been done to add this to DKG and resharing; basically we would just need to add a call to handle_protocol_errors.

As for the hard-to-test cases, I'm happy to leave it for now; if we don't come up with a way of doing automated tests, we can maybe add instructions for doing a manual test to the pre-release checklist.

I haven't looked in much detail at how these errors get handled on the chain side. We should bear in mind that many of them could also be our own fault (e.g. a failed incoming connection because we lost connectivity ourselves, or encryption errors because of corrupt data on our end). But I guess this is accounted for in that the slashing pallet requires a sufficient number of reports from different peers before taking action.

Ok(Channels(broadcast_out, rx_from_others))
},
Err(e) => {
let unsubscribed_peers = state.unsubscribed_peers(session_id).map_err(|_| {
ameba23 (Contributor):

Why do we need to map the error here? Aren't we mapping it to what it already was? Or am I missing something?

HCastano (Collaborator, Author):

There are two unsubscribed_peers() methods. The one on ListenerState (which is what's used here) returns a SubscribeErr, so we have to map it.

The one on AppState returns a ProtocolErr, but we don't have access to that here.
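
So the shape of it is roughly this (the ProtocolErr variant name here is made up for illustration):

// ListenerState::unsubscribed_peers gives back a SubscribeErr, but this
// context needs a ProtocolErr, hence the map_err.
let unsubscribed_peers = state.unsubscribed_peers(session_id).map_err(|_| {
    ProtocolErr::SessionError("Unable to determine unsubscribed peers".to_string())
})?;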

@@ -768,6 +767,210 @@ async fn test_fails_to_sign_if_non_signing_group_participants_are_used() {
clean_tests();
}

#[tokio::test]
#[serial]
async fn test_reports_peer_if_they_dont_participate_in_signing() {
ameba23 (Contributor):

Can we either rename the test to something like test_reports_peer_if_they_reject_our_signing_protocol_connection, or put a comment saying something like that?

It took me a while to figure out that the difference between this test and the one below is that this one is a failed outgoing connection and the other one is a failed incoming connection.


// This is a non-reportable error, so we don't do any further processing with the error
if peers_to_report.is_empty() {
return Err(error.to_string());
Member:

As for the response here: shouldn't this be Ok(), in that the function did not error out and everything progressed normally within the context of this function?

HCastano (Collaborator, Author):

I do agree with you - but if we change this to an Ok(String), the caller will end up needing to treat the Ok case as a failure/error, because that's what it is. By only returning errors we save ourselves this and can just do an unwrap_err() at the call site.

We also don't do any extra processing in the Err case - we log it and move on. If we had some different behaviour at the call site, then yes, it could make sense.
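
For example, at a call site that expects the failure (a sketch - the argument list and message text are assumptions):

// Because handle_protocol_errors only ever returns `Err`, callers that
// expect a failure can assert on it directly.
let result = handle_protocol_errors(&api, &rpc, &signer, &session_id, protocol_error).await;
let message = result.unwrap_err();
assert!(message.contains("Timed out"));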

}

if failed_reports.is_empty() {
Err(error.to_string())
Member:

same here

@HCastano HCastano merged commit 8545d45 into master Jan 30, 2025
7 of 8 checks passed
@HCastano HCastano deleted the hc/off-chain-peer-reporting branch January 30, 2025 01:35