
fix(l1): peers tcp connections #1885

Merged: 11 commits merged into main from fix/peers-tcp-connections on Feb 11, 2025

Conversation

@MarcosNicolau (Contributor) commented Feb 6, 2025

Motivation
We observed that our peer count was significantly lower than that of other execution clients.

Description
This PR refines the exploratory work done in #1875. While debugging, we developed tools to track individual peer traces (see here if interested). Through this, we discovered that although we were exchanging messages with many peers, they were not appearing in the stats. This indicated that they were not being inserted into the KademliaTable.

The root cause was that the Bucket was already full, which happens quickly since each bucket only holds 10 entries. Below is a list of changes that improved the peer situation:

  • Peers were connected but not properly inserted into the table: We temporarily fixed this by forcing peer insertion, even if it exceeds the protocol’s allowed limit (a rough sketch follows this list). An alternative solution could be maintaining a secondary table for P2P peers, but that would be better suited for the upcoming refactor.
  • Frame decoding verification was causing connection crashes: Specifically, an assertion on header-data was failing. While the spec dictates that it should be RLP-encoded zero, some clients send different values, and we found that no other client seems to enforce verification on this field.
  • Added logging for message exchanges between peers.
  • Refactored functions that were over-locking the KademliaTable: This became particularly noticeable when a high number of peers were discovered.
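
A rough sketch of the forced insertion mentioned in the first item above; the names (`Bucket`, `PeerData`, `insert_forced`, `MAX_NODES_PER_BUCKET`) are illustrative, not the actual ethrex API:

```rust
// Illustrative only: bypass the per-bucket limit once we already hold a working
// TCP/RLPx connection with the peer, so it shows up in the table and the stats.
const MAX_NODES_PER_BUCKET: usize = 10;

struct PeerData {
    node_id: [u8; 64],
    // ...endpoint, scores, channels, etc. omitted
}

struct Bucket {
    peers: Vec<PeerData>,
}

impl Bucket {
    /// Regular insert: respects the spec limit of 10 entries per bucket.
    fn try_insert(&mut self, peer: PeerData) -> bool {
        if self.peers.len() < MAX_NODES_PER_BUCKET {
            self.peers.push(peer);
            true
        } else {
            false
        }
    }

    /// Forced insert: used after a successful TCP connection, even if the
    /// bucket is already full (may exceed MAX_NODES_PER_BUCKET).
    fn insert_forced(&mut self, peer: PeerData) {
        if !self.peers.iter().any(|p| p.node_id == peer.node_id) {
            self.peers.push(peer);
        }
    }
}
```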

Other Findings

  • Incorrect capability exchange logic: We attempt to connect to peers with the snap capability without first verifying that they support our eth protocol version, which leads to failed connections. A separate PR will address this; it is the cause of the observed error messages about capability exchange failures due to mismatched eth and snap protocols (a rough sketch of such a check follows this list).
  • Most disconnections occur because peers have already reached their connection limit (sending us "Too many peers" as the reason).
  • A future improvement could be implementing a reconnection mechanism.
  • The majority of our connections are incoming.
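
Regarding the capability point, a hedged sketch of the kind of check a follow-up PR could add: only keep `snap` in the shared capability set when the peer also advertises an `eth` version we support. `Capability`, `SUPPORTED_ETH_VERSIONS`, and `negotiate` are hypothetical names, not ethrex's actual types:

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
struct Capability {
    name: &'static str,
    version: u8,
}

// Hypothetical: the eth versions our client speaks.
const SUPPORTED_ETH_VERSIONS: &[u8] = &[68];

/// Sketch: a shared eth version is a precondition for enabling snap.
fn negotiate(peer_caps: &[Capability]) -> Result<Vec<Capability>, &'static str> {
    let eth = peer_caps
        .iter()
        .find(|c| c.name == "eth" && SUPPORTED_ETH_VERSIONS.contains(&c.version))
        .ok_or("no matching eth capability, aborting connection attempt")?;

    let mut shared = vec![eth.clone()];
    if peer_caps.iter().any(|c| c.name == "snap" && c.version == 1) {
        shared.push(Capability { name: "snap", version: 1 });
    }
    Ok(shared)
}
```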

Closes #1811


github-actions bot commented Feb 6, 2025

| File                                                                     | Lines | Diff |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/discv4/server.rs   | 677   | +1   |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/kademlia.rs        | 492   | +1   |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/network.rs         | 169   | +20  |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/rlpx/connection.rs | 414   | +20  |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/rlpx/frame.rs      | 194   | -1   |
+--------------------------------------------------------------------------+-------+------+

Total lines added: +42
Total lines removed: 1
Total lines changed: 43

@MarcosNicolau MarcosNicolau marked this pull request as ready for review February 6, 2025 19:45
@MarcosNicolau MarcosNicolau requested a review from a team as a code owner February 6, 2025 19:45
@MarcosNicolau MarcosNicolau self-assigned this Feb 6, 2025
@Oppen (Contributor) commented Feb 6, 2025

Will you make a PR with the tooling? I think it would be a good thing to have in the repo.

@@ -693,7 +694,7 @@ pub(super) mod tests {
broadcast,
};

-let discv4 = Discv4Server::try_new(ctx).await?;
+let discv4 = Discv4Server::try_new(ctx.clone()).await?;

Contributor:

Is ctx safe to clone? How does storage behave?

Contributor Author:

Yes, it is safe! You can see the Store struct here; cloning just increments the reference count.
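
For context, a minimal illustration of why such a clone is cheap, assuming the Store is a thin handle around an `Arc` (the `StoreEngine` trait and field name here are illustrative, not the exact definition):

```rust
use std::sync::Arc;

// Illustrative pattern: a Store that is just a cheap handle around shared state.
// Cloning bumps the Arc's reference count; it does not copy the underlying data.
trait StoreEngine: Send + Sync {}

#[derive(Clone)]
struct Store {
    engine: Arc<dyn StoreEngine>,
}
```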

/// Inserts a node into the table, even if the bucket is filled.
/// # Returns
/// A tuple containing:
/// 1. PeerData: none if the peer was already in the table or as a potential replacement

Contributor:

I would expect the opposite here, that is, if it was in the table I may want to check the old value. How is it currently used?

Contributor Author:

I agree with you; that would be the expected and usual behaviour. But this is a copy of a previous function (see here and here for a usage example). I wouldn't change this behaviour now, as the upcoming networking refactor will likely address it.

@Oppen (Contributor) commented Feb 6, 2025

I have a few high level questions:

  1. How do we determine the bucket for a given node?
  2. Is the bucket getting full or the whole table?
  3. If it's just the bucket, did you consider it might be a distribution problem (that is, we might have too many collisions), and that maybe we need to change how we determine the bucket? Or is that an untouchable part of the protocol?

crates/networking/p2p/net.rs: outdated review thread (resolved)
@@ -122,10 +122,6 @@ impl Decoder for RLPxCodec {
let mut temp_ingress_aes = self.ingress_aes.clone();
temp_ingress_aes.apply_keystream(header_text);

-// header-data = [capability-id, context-id]
-// Both are unused, and always zero
-assert_eq!(&header_text[3..6], &(0_u8, 0_u8).encode_to_vec());

Contributor:

Was this check a bug?

Contributor Author:

This check is ok, and it aligns with the spec (see here), but:

  1. The fields are not being used anywhere in the protocol now.
  2. Via the logs, I noticed that some clients send a different value, yet the connection still worked correctly.
  3. Additionally, I found that the rest of the execution clients (at least geth, besu, and nethermind) don't enforce this header to be RLP-encoded zero, so I preferred to simply remove the check (a sketch of the relaxed decoding follows).
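
A sketch of the relaxed decoding described above: keep processing the frame and only log when header-data differs from the encoded zeros. `header_text` refers to the decoder shown in the diff; the `tracing` macro, the helper name, and the hardcoded expected bytes are assumptions rather than the actual change:

```rust
/// Sketch only: tolerate non-zero header-data instead of asserting on it.
/// `header_text` is the decrypted frame header from the RLPx decoder.
fn check_header_data(header_text: &[u8]) {
    // RLP encoding of the header-data list [capability-id, context-id] = [0, 0],
    // i.e. the value the removed assertion expected.
    const EXPECTED: [u8; 3] = [0xc2, 0x80, 0x80];
    if header_text[3..6] != EXPECTED {
        // Other clients (geth, besu, nethermind) do not enforce this field either,
        // so just note the mismatch and keep the connection alive.
        tracing::debug!("peer sent non-zero header-data; ignoring");
    }
}
```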

@MarcosNicolau (Contributor Author):

> I have a few high level questions:
>
>   1. How do we determine the bucket for a given node?
>   2. Is the bucket getting full or the whole table?
>   3. If it's just the bucket, did you consider it might be a distribution problem (that is, we might have too many collisions), and that maybe we need to change how we determine the bucket? Or is that an untouchable part of the protocol?

  1. The bucket is determined by the XOR distance between the keccak256 of the peer's node public key and the keccak256 of our local node's public key (a small illustrative sketch follows below).
  2. The bucket is getting filled, not the whole table; the temporary fix is to bypass the spec max size when we know the peer TCP connection was successful.
  3. I suspect that the issue might lie in the way we are discarding peers. I am actually debugging that, but for the time being, I would leave this solution in place to fix the peer situation ASAP.

Most of this can be read in the spec here.
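
An illustrative sketch of that bucket derivation, assuming both node IDs have already been hashed with keccak256; the function name and the 0..255 indexing are assumptions for clarity, not the exact ethrex implementation:

```rust
/// Illustrative sketch: XOR the two keccak256 digests and take the position of
/// the highest differing bit as the bucket index (closer peers land in lower buckets).
fn bucket_index(local_hash: &[u8; 32], peer_hash: &[u8; 32]) -> usize {
    for i in 0..32 {
        let xor = local_hash[i] ^ peer_hash[i];
        if xor != 0 {
            // 255 when the most significant bit differs, 0 when only the last bit does.
            return 255 - (i * 8 + xor.leading_zeros() as usize);
        }
    }
    0 // identical hashes: same node
}
```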

@MarcosNicolau MarcosNicolau requested a review from Oppen February 7, 2025 19:38
@MarcosNicolau MarcosNicolau added this pull request to the merge queue Feb 11, 2025
Merged via the queue into main with commit c38c601 Feb 11, 2025
22 checks passed
@MarcosNicolau MarcosNicolau deleted the fix/peers-tcp-connections branch February 11, 2025 14:46
Successfully merging this pull request may close these issues.

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection.