-
Notifications
You must be signed in to change notification settings - Fork 764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libp2p may be falsely re-establishing connections for reserved peers, or dropping messages #1499
Comments
@paritytech/networking can you look into this with high priority? Afair you have done some changes in substream handling recently? |
@bkchr last change to anything related to substreams was the This sounds a little like the finality lagging issue we were seeing earlier this year where the substream was one-sided but the node was not able to detect it because it didn't have anything to send. |
I have couple of questions:
This immediate reconnection after disconnection is suspicious because there should be back-off mechanism for closed substreams. The back-off is not working if the whole connection is closed which would indicate that the validator/collator pair had only one protocol open and that the validator/collator was redialed immediately after the connection was closed. |
What I find weird is that authority discovery is doing something which to me doesn't make sense:
It keeps disconnecting and then reconnecting to the same nodes. I'm not sure why it disconnects the peer in the first place but then reconnecting to it in a few milliseconds doesn't sound correct. Any comment on this? Grafana for validator One issue in the current implementation of What I think can happen is that if local node closes a connection to remote and opens it again before remote node can send any data, the new inbound substream (from remote's PoV) is discarded and this is noticed by the local node only when it tries to send data. Since the I need to check if any of that is actually true but I think this ping-pong in the connection state caused by the authority discovery is playing a role here. |
I have enabled that, what particular log lines should I check? |
|
Just a guess: is it possible that a collator removes validators for some reason from the reserved peers and this causes the disconnect? EDIT. May be it's expected behavior, and if not — we'll need to also enable |
They are likely connected on sync and
That's correct. Validators do not add collators to reserved peers on any protocol, and collators add validators to their reserved peers on the Collation protocol only. |
There's one potential issue with gossip-support subsystem running on collators and potentially removing all validators from the reserved
I think that might also trigger a disconnect. We should probably not even run gossip-support on collators. |
I found this one from the logs
@ordian would that affect the collation set though? |
I don't know if the current code affects collation peerset, I expect that this doesn't polkadot-sdk/polkadot/node/network/bridge/src/validator_discovery.rs Lines 95 to 100 in ee6eeb7
But yeah, the issue might be in collator side issuing disconnecting requests polkadot-sdk/polkadot/node/network/collator-protocol/src/collator_side/validators_buffer.rs Line 95 in ee6eeb7
|
I think we should try disabling those disconnection requests, at least if they're targeting a node that will be added a few moments later again. I don't see the purpose of those but I could be misunderstanding something. If those are disabled and the issue still persists, it would give us some new information. |
I can see how the underlying But this does not fix the underlying issue, that the connection is being falsely "opened" just because it was closed just before. Is there any good fix for that? |
I don't know any good fix off the top of my head but I'll keep looking. Disappointing results in #1526 though. Looks like the immediate reconnections are gone but now the authority discovery only finds one then keep disconnecting and connecting to that one node and sometimes switching between which node it connects to. But the backoff mechanisms looks to be working since It's also advertising collation after each new connection. |
I've performed some tests, and it looks like @altonen's hypothesis is correct. I've added a custom ( Here is the relevant logs for the "client":
For the "server":
(If we wait longer, the nodes get disconnected on the default peer set and the dummy protocol connection logic gets screwed, but this is not relevant.) This can explain why when we connect to the validator for the second time we don't receive a |
Merging paritytech/substrate#13396 (reverted) would fix the issue, but when we merged it (and fixed the bug), the syncing got screwed, because it relied on this bug present under the conditions of insufficient inbound slots on the network. |
There is at least one problem in the handling of reserved peers that I can find. Here are relevant logs. What happens is that at 17:04:24.116 Then there are few changes in the validators set and I assume one of these updates doesn't contain
After these updates to reserved peers (but before this polkadot-sdk/polkadot/node/network/protocol/src/peer_set.rs Lines 103 to 110 in 61be78c
After another change in the set at 17:04:24.159, the connection is still kept open but after this change, a collation message is received from
since this is a debug it could be harmless but after the latest update at to peer set 17:04:24.159, At 17:04:48.066 there is another change in the validator set, this time removing peers and now the After that another reconnection is made and this time it works and the nodes send each other messages. I think this immediate disconnection/reconnection causes the peers to have differing view of the connection and that causes the node that does the connecting to have an inbound substream that is not active, causing it to receive no messages. Could someone who knows the this collator/validator discovery code better go over this explanation and see if I've misunderstood something about this process? Right now, based on what's in the logs, it looks like Really nice work Dmitry! I would prefer fixing #508 instead since I've once already got my ass handed to me because of closed inbound substreams and AFAIK there isn't a single protocol in Polkadot SDK which uses this feature. Regardless of which substream we use to detect if the connection is closed, the end result will be the same and some people may have zero peers if they don't open their firewall. FWIW, #1370 will cause that same thing to happen because the block announcement handshake will fail if the remote node is full. |
In our scenario it is quite unlikely - eras are common but the validator set is not changing.
This is unrelated and is addressed in #1475
Yes, this sounds like what we observed - but the network bridge is likely not the issue here (it simply wraps Is the recommendation to merge #1563 in favor of #1526 and this should fix the issue? |
In testing the Polkadot collator-protocol for asynchronous backing, we have noticed some anomalous behavior, where collators are sometimes unable to broadcast their collations to validators because validators don't send them a
ViewUpdate
message (which they are supposed to do upon initiating the connection).The behavior that we're seeing, repeatedly, is that collators initiating a connection immediately following a disconnection appear to open up a substream that is at least initially one-sided: the collator sends its view message, but receives nothing in return and therefore does not advertise its collation.
The underlying code is using
ProtocolHandle::set_reserved_peers
with some validators to connect to. The (collator) node is not a validator, and the validators do not put this node in theirreserved_peers
. The validators also disconnect the collator after some time, after which the collator is likely to initiate a reconnect. This reconnection is what appears tenuous. Versi is currently running #1410, which contains the debugging logs seen in the Grafana links here.https://grafana.teleport.parity.io/goto/X-qj7ViIR?orgId=1 This is the general query I have been using for debugging.
Between XX:48:00.083 and XX:48:06 in the following Grafana query you can see an example of this - the validators disconnect, the collator sees an immediate reconnect, but no view message is propagated https://grafana.teleport.parity.io/goto/2K9d44mSR?orgId=1 .
Reproduced here
For an example of how it should look, with logs on both sides:
expand
The expected behavior is that the
NotificationStreamOpened
event should not land on one side unless it is clear that it will land on both sides, and that no messages will be dropped.The text was updated successfully, but these errors were encountered: