Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

Closed
ElFantasma opened this issue Jan 27, 2025 · 0 comments · Fixed by #1885
Closed

Comments

@ElFantasma
Copy link
Contributor

When we fail during handshake or during connection, we do update the kademlia table to reflect the peer disconnection, but we don't do that when we fail to establish the tcp connection with peer. Or when we fail to decode the initial message.

We should do the same, but that code is commented out as it makes a test to fail.

We should comment out the test or find another way to test that scenario to enable the kademlia update on failure.

(Note this code is undergoing some refactor, so this ticket may be outdated. But the issue may still be present in some other form)

github-merge-queue bot pushed a commit that referenced this issue Feb 11, 2025
**Motivation**
We observed that our peer count was significantly lower compared to
other execution clients.

**Description**
This PR refines the exploratory work done in #1875. While debugging, we
developed tools to track individual peer traces (see
[here](https://github.com/MarcosNicolau/eth-el-clients-debugging) if
interested). Through this, we discovered that although we were
exchanging messages with many peers, they were not appearing in the
stats. This indicated that they were not being inserted into the
`KademliaTable`.

The root cause was that the Bucket was already full, which happens
quickly since each bucket only holds 10 entries. Below is a list of
changes that improved the peer situation:
- Peers were connected but not properly inserted into the table: We
temporarily fixed this by forcing peer insertion, even if it exceeded
the protocol’s allowed limit. An alternative solution could be
maintaining a secondary table for P2P peers, but that would be better
suited for the incoming refactor.
- Frame decoding verification was causing connection crashes:
Specifically, an assertion on `header-data` was failing. While the spec
dictates that it should be RLP-encoded zero, some clients would not send
the encoded values. Additionally, we found that no other clients seem to
enforce verification on this field.
-  Added logging for message exchanges between peers.
- Refactored functions that were over-locking the KademliaTable: This
became particularly noticeable when a high number of peers were
discovered.

**Other Findings**
- Incorrect capability exchange logic: We attempt to connect to peers
with the snap capability without first verifying if they support our eth
protocol version. This leads to failed connections. A separate PR will
address this, this is the cause of the observed error message related to
capability exchange failures due to mismatched eth and snap protocols.
- Most disconnections occur because peers have already reached their
connection limit (sending us `Too many peers` as the reason).
- A future improvement could be implementing a reconnection mechanism.
- The majority of our connections are incoming.

Closes #1811
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant