
fix(l1): peers tcp connections #1885

Merged: 11 commits merged into main from fix/peers-tcp-connections on Feb 11, 2025

Conversation

@MarcosNicolau (Contributor) commented Feb 6, 2025

Motivation
We observed that our peer count was significantly lower than that of other execution clients.

Description
This PR refines the exploratory work done in #1875. While debugging, we developed tools to track individual peer traces (see here if interested). Through this, we discovered that although we were exchanging messages with many peers, they were not appearing in the stats. This indicated that they were not being inserted into the KademliaTable.

The root cause was that the Bucket was already full, which happens quickly since each bucket only holds 10 entries. Below is a list of changes that improved the peer situation:

  • Peers were connected but not properly inserted into the table: We temporarily fixed this by forcing peer insertion, even if it exceeds the protocol’s allowed limit (a rough sketch follows this list). An alternative solution could be maintaining a secondary table for P2P peers, but that would be better suited for the upcoming refactor.
  • Frame decoding verification was causing connection crashes: Specifically, an assertion on header-data was failing. While the spec dictates that it should be RLP-encoded zero, some clients send different values, and we found that no other client seems to enforce verification on this field.
  • Added logging for message exchanges between peers.
  • Refactored functions that were over-locking the KademliaTable: This became particularly noticeable when a high number of peers were discovered.
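
A rough sketch of the forced insertion mentioned in the first item above; the names (`Bucket`, `PeerData`, `insert_forced`, `MAX_NODES_PER_BUCKET`) are illustrative, not the actual ethrex API:

```rust
// Illustrative only: bypass the per-bucket limit once we already hold a working
// TCP/RLPx connection with the peer, so it shows up in the table and the stats.
const MAX_NODES_PER_BUCKET: usize = 10;

struct PeerData {
    node_id: [u8; 64],
    // ...endpoint, scores, channels, etc. omitted
}

struct Bucket {
    peers: Vec<PeerData>,
}

impl Bucket {
    /// Regular insert: respects the spec limit of 10 entries per bucket.
    fn try_insert(&mut self, peer: PeerData) -> bool {
        if self.peers.len() < MAX_NODES_PER_BUCKET {
            self.peers.push(peer);
            true
        } else {
            false
        }
    }

    /// Forced insert: used after a successful TCP connection, even if the
    /// bucket is already full (may exceed MAX_NODES_PER_BUCKET).
    fn insert_forced(&mut self, peer: PeerData) {
        if !self.peers.iter().any(|p| p.node_id == peer.node_id) {
            self.peers.push(peer);
        }
    }
}
```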

Other Findings

  • Incorrect capability exchange logic: We attempt to connect to peers with the snap capability without first verifying that they support our eth protocol version, which leads to failed connections. A separate PR will address this; it is the cause of the observed error messages about capability exchange failures due to mismatched eth and snap protocols (a rough sketch of such a check follows this list).
  • Most disconnections occur because peers have already reached their connection limit (sending us "Too many peers" as the reason).
  • A future improvement could be implementing a reconnection mechanism.
  • The majority of our connections are incoming.
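
Regarding the capability point, a hedged sketch of the kind of check a follow-up PR could add: only keep `snap` in the shared capability set when the peer also advertises an `eth` version we support. `Capability`, `SUPPORTED_ETH_VERSIONS`, and `negotiate` are hypothetical names, not ethrex's actual types:

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
struct Capability {
    name: &'static str,
    version: u8,
}

// Hypothetical: the eth versions our client speaks.
const SUPPORTED_ETH_VERSIONS: &[u8] = &[68];

/// Sketch: a shared eth version is a precondition for enabling snap.
fn negotiate(peer_caps: &[Capability]) -> Result<Vec<Capability>, &'static str> {
    let eth = peer_caps
        .iter()
        .find(|c| c.name == "eth" && SUPPORTED_ETH_VERSIONS.contains(&c.version))
        .ok_or("no matching eth capability, aborting connection attempt")?;

    let mut shared = vec![eth.clone()];
    if peer_caps.iter().any(|c| c.name == "snap" && c.version == 1) {
        shared.push(Capability { name: "snap", version: 1 });
    }
    Ok(shared)
}
```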

Closes #1811


github-actions bot commented Feb 6, 2025

| File                                                                     | Lines | Diff |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/discv4/server.rs   | 677   | +1   |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/kademlia.rs        | 492   | +1   |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/network.rs         | 169   | +20  |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/rlpx/connection.rs | 414   | +20  |
+--------------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/rlpx/frame.rs      | 194   | -1   |
+--------------------------------------------------------------------------+-------+------+

Total lines added: +42
Total lines removed: 1
Total lines changed: 43

@MarcosNicolau MarcosNicolau marked this pull request as ready for review February 6, 2025 19:45
@MarcosNicolau MarcosNicolau requested a review from a team as a code owner February 6, 2025 19:45
@MarcosNicolau MarcosNicolau self-assigned this Feb 6, 2025
@Oppen (Contributor) commented Feb 6, 2025

Will you make a PR with the tooling? I think it would be a good thing to have in the repo.

@@ -693,7 +694,7 @@ pub(super) mod tests {
broadcast,
};

-let discv4 = Discv4Server::try_new(ctx).await?;
+let discv4 = Discv4Server::try_new(ctx.clone()).await?;

Contributor:

Is ctx safe to clone? How does storage behave?

Contributor Author:

Yes, it is safe! You can see the Store struct here; cloning just increments the reference count.
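
For context, a minimal illustration of why such a clone is cheap, assuming the Store is a thin handle around an `Arc` (the `StoreEngine` trait and field name here are illustrative, not the exact definition):

```rust
use std::sync::Arc;

// Illustrative pattern: a Store that is just a cheap handle around shared state.
// Cloning bumps the Arc's reference count; it does not copy the underlying data.
trait StoreEngine: Send + Sync {}

#[derive(Clone)]
struct Store {
    engine: Arc<dyn StoreEngine>,
}
```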

/// Inserts a node into the table, even if the bucket is filled.
/// # Returns
/// A tuple containing:
/// 1. PeerData: none if the peer was already in the table or as a potential replacement

Contributor:

I would expect the opposite here, that is, if it was in the table I may want to check the old value. How is it currently used?

Contributor Author:

I agree with you; that would be the expected and usual behaviour. But this is a copy of a previous function (see here and here for a usage example). I wouldn't change this behaviour now, as the upcoming networking refactor will likely address it.

@Oppen (Contributor) commented Feb 6, 2025

I have a few high level questions:

  1. How do we determine the bucket for a given node?
  2. Is the bucket getting full or the whole table?
  3. If it's just the bucket, did you consider it might be a distribution problem (that is, we might have too many collisions), and that maybe we need to change how we determine the bucket? Or is that an untouchable part of the protocol?

crates/networking/p2p/net.rs: outdated review thread (resolved)
@@ -122,10 +122,6 @@ impl Decoder for RLPxCodec {
let mut temp_ingress_aes = self.ingress_aes.clone();
temp_ingress_aes.apply_keystream(header_text);

-// header-data = [capability-id, context-id]
-// Both are unused, and always zero
-assert_eq!(&header_text[3..6], &(0_u8, 0_u8).encode_to_vec());

Contributor:

Was this check a bug?

Contributor Author:

This check is ok, and it aligns with the spec (see here), but:

  1. The fields are not being used anywhere in the protocol now.
  2. Via the logs, I noticed that some clients send a different value, yet the connection still worked correctly.
  3. Additionally, I found that the rest of the execution clients (at least geth, besu, and nethermind) don't enforce this header to be RLP-encoded zero, so I preferred to simply remove the check (a sketch of the relaxed decoding follows).
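
A sketch of the relaxed decoding described above: keep processing the frame and only log when header-data differs from the encoded zeros. `header_text` refers to the decoder shown in the diff; the `tracing` macro, the helper name, and the hardcoded expected bytes are assumptions rather than the actual change:

```rust
/// Sketch only: tolerate non-zero header-data instead of asserting on it.
/// `header_text` is the decrypted frame header from the RLPx decoder.
fn check_header_data(header_text: &[u8]) {
    // RLP encoding of the header-data list [capability-id, context-id] = [0, 0],
    // i.e. the value the removed assertion expected.
    const EXPECTED: [u8; 3] = [0xc2, 0x80, 0x80];
    if header_text[3..6] != EXPECTED {
        // Other clients (geth, besu, nethermind) do not enforce this field either,
        // so just note the mismatch and keep the connection alive.
        tracing::debug!("peer sent non-zero header-data; ignoring");
    }
}
```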

@MarcosNicolau (Contributor Author):

> I have a few high level questions:
>
>   1. How do we determine the bucket for a given node?
>   2. Is the bucket getting full or the whole table?
>   3. If it's just the bucket, did you consider it might be a distribution problem (that is, we might have too many collisions), and that maybe we need to change how we determine the bucket? Or is that an untouchable part of the protocol?

  1. The bucket is determined by the XOR distance between the keccak256 of the peer's node public key and the keccak256 of our local node's public key (a small illustrative sketch follows below).
  2. The bucket is getting filled, not the whole table; the temporary fix is to bypass the spec max size when we know the peer TCP connection was successful.
  3. I suspect that the issue might lie in the way we are discarding peers. I am actually debugging that, but for the time being, I would leave this solution in place to fix the peer situation ASAP.

Most of this can be read in the spec here.
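
An illustrative sketch of that bucket derivation, assuming both node IDs have already been hashed with keccak256; the function name and the 0..255 indexing are assumptions for clarity, not the exact ethrex implementation:

```rust
/// Illustrative sketch: XOR the two keccak256 digests and take the position of
/// the highest differing bit as the bucket index (closer peers land in lower buckets).
fn bucket_index(local_hash: &[u8; 32], peer_hash: &[u8; 32]) -> usize {
    for i in 0..32 {
        let xor = local_hash[i] ^ peer_hash[i];
        if xor != 0 {
            // 255 when the most significant bit differs, 0 when only the last bit does.
            return 255 - (i * 8 + xor.leading_zeros() as usize);
        }
    }
    0 // identical hashes: same node
}
```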

@MarcosNicolau MarcosNicolau requested a review from Oppen February 7, 2025 19:38
@MarcosNicolau MarcosNicolau added this pull request to the merge queue Feb 11, 2025
Merged via the queue into main with commit c38c601 Feb 11, 2025
22 checks passed
@MarcosNicolau MarcosNicolau deleted the fix/peers-tcp-connections branch February 11, 2025 14:46
Successfully merging this pull request may close these issues.

bug(l1): Delete peer from kademlia table on failure when establishing tcp connection.