-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811
Comments
github-merge-queue bot
pushed a commit
that referenced
this issue
Feb 11, 2025
**Motivation** We observed that our peer count was significantly lower compared to other execution clients. **Description** This PR refines the exploratory work done in #1875. While debugging, we developed tools to track individual peer traces (see [here](https://github.com/MarcosNicolau/eth-el-clients-debugging) if interested). Through this, we discovered that although we were exchanging messages with many peers, they were not appearing in the stats. This indicated that they were not being inserted into the `KademliaTable`. The root cause was that the Bucket was already full, which happens quickly since each bucket only holds 10 entries. Below is a list of changes that improved the peer situation: - Peers were connected but not properly inserted into the table: We temporarily fixed this by forcing peer insertion, even if it exceeded the protocol’s allowed limit. An alternative solution could be maintaining a secondary table for P2P peers, but that would be better suited for the incoming refactor. - Frame decoding verification was causing connection crashes: Specifically, an assertion on `header-data` was failing. While the spec dictates that it should be RLP-encoded zero, some clients would not send the encoded values. Additionally, we found that no other clients seem to enforce verification on this field. - Added logging for message exchanges between peers. - Refactored functions that were over-locking the KademliaTable: This became particularly noticeable when a high number of peers were discovered. **Other Findings** - Incorrect capability exchange logic: We attempt to connect to peers with the snap capability without first verifying if they support our eth protocol version. This leads to failed connections. A separate PR will address this, this is the cause of the observed error message related to capability exchange failures due to mismatched eth and snap protocols. - Most disconnections occur because peers have already reached their connection limit (sending us `Too many peers` as the reason). - A future improvement could be implementing a reconnection mechanism. - The majority of our connections are incoming. Closes #1811
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When we fail during handshake or during connection, we do update the kademlia table to reflect the peer disconnection, but we don't do that when we fail to establish the tcp connection with peer. Or when we fail to decode the initial message.
We should do the same, but that code is commented out as it makes a test to fail.
We should comment out the test or find another way to test that scenario to enable the kademlia update on failure.
(Note this code is undergoing some refactor, so this ticket may be outdated. But the issue may still be present in some other form)
The text was updated successfully, but these errors were encountered: