bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

ElFantasma · 2025-01-27T17:03:25Z

When we fail during handshake or during connection, we do update the kademlia table to reflect the peer disconnection, but we don't do that when we fail to establish the tcp connection with peer. Or when we fail to decode the initial message.

We should do the same, but that code is commented out as it makes a test to fail.

We should comment out the test or find another way to test that scenario to enable the kademlia update on failure.

(Note this code is undergoing some refactor, so this ticket may be outdated. But the issue may still be present in some other form)

**Motivation** We observed that our peer count was significantly lower compared to other execution clients. **Description** This PR refines the exploratory work done in #1875. While debugging, we developed tools to track individual peer traces (see [here](https://github.com/MarcosNicolau/eth-el-clients-debugging) if interested). Through this, we discovered that although we were exchanging messages with many peers, they were not appearing in the stats. This indicated that they were not being inserted into the `KademliaTable`. The root cause was that the Bucket was already full, which happens quickly since each bucket only holds 10 entries. Below is a list of changes that improved the peer situation: - Peers were connected but not properly inserted into the table: We temporarily fixed this by forcing peer insertion, even if it exceeded the protocol’s allowed limit. An alternative solution could be maintaining a secondary table for P2P peers, but that would be better suited for the incoming refactor. - Frame decoding verification was causing connection crashes: Specifically, an assertion on `header-data` was failing. While the spec dictates that it should be RLP-encoded zero, some clients would not send the encoded values. Additionally, we found that no other clients seem to enforce verification on this field. - Added logging for message exchanges between peers. - Refactored functions that were over-locking the KademliaTable: This became particularly noticeable when a high number of peers were discovered. **Other Findings** - Incorrect capability exchange logic: We attempt to connect to peers with the snap capability without first verifying if they support our eth protocol version. This leads to failed connections. A separate PR will address this, this is the cause of the observed error message related to capability exchange failures due to mismatched eth and snap protocols. - Most disconnections occur because peers have already reached their connection limit (sending us `Too many peers` as the reason). - A future improvement could be implementing a reconnection mechanism. - The majority of our connections are incoming. Closes #1811

github-project-automation bot added this to ethrex_l1 Jan 27, 2025

MarcosNicolau mentioned this issue Feb 7, 2025

fix(l1): peers tcp connections #1885

Merged

MarcosNicolau closed this as completed in #1885 Feb 11, 2025

github-project-automation bot moved this to Done in ethrex_l1 Feb 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

ElFantasma commented Jan 27, 2025

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection. #1811

Comments

ElFantasma commented Jan 27, 2025