Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sync of beacon nodes on MacOS Holesky host #201

Open
jakubgs opened this issue Sep 24, 2024 · 18 comments
Open

Fix sync of beacon nodes on MacOS Holesky host #201

jakubgs opened this issue Sep 24, 2024 · 18 comments
Assignees

Comments

@jakubgs
Copy link
Member

jakubgs commented Sep 24, 2024

Currently two nodes on macm2-01.ih-eu-mda1.nimbus.holesky are failing the BeaconChainNoAggregatesReceived metrics check:

image

And the API does confirm that:

[email protected]:~ % for port in $(seq 9301 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"979007","sync_distance":"1627664","is_syncing":true,"is_optimistic":true,"el_offline":true}}
{"data":{"head_slot":"1400927","sync_distance":"1205744","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"2606671","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
@jakubgs jakubgs self-assigned this Sep 24, 2024
@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

If we look at stable node on macm2-01.ih-eu-mda1.nimbus.holesky we can see that the sync distance is just growing:

image

The testing node has a similar pattern:

image

While unstable appears to be fine:

image

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

The interesting part is that both stable:

image

And testing:

image

Are unable to get to max peers, while unstable gets there without issues:

image

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

But LibP2P ports for all 3 nodes are correctly open on the firewall:

 > nmap -Pn -p9001-9003 macm2-01.ih-eu-mda1.nimbus.holesky.status.im
Starting Nmap 7.94 ( https://nmap.org ) at 2024-09-24 15:13 CEST
Nmap scan report for macm2-01.ih-eu-mda1.nimbus.holesky.status.im (185.181.230.76)
Host is up (0.057s latency).
rDNS record for 185.181.230.76: no-rdns.innovahosting.net

PORT     STATE SERVICE
9001/tcp open  tor-orport
9002/tcp open  dynamid
9003/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds

So what is different about stable and testing that makes connectivity so bad?

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

What's also interesting is that their respective Geth nodes on `` have different matching state too:

[email protected]:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x0","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x0","startingBlock":"0x0","syncedAccountBytes":"0x0","syncedAccounts":"0x0","syncedBytecodeBytes":"0x0","syncedBytecodes":"0x0","syncedStorage":"0x0","syncedStorageBytes":"0x0","txIndexFinishedBlocks":"0x0","txIndexRemainingBlocks":"0x1"}}

The Geth node for unstable is not synced, while ones for stable and testing appear to be synced:

[email protected]:~ % sudo du -hsc /docker/*
131G	/docker/geth-holesky-01
150G	/docker/geth-holesky-02
4.4G	/docker/geth-holesky-03
284G	total

This is very confusing.

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

I found some EL connection failures in stable node logs:

{"lvl":"DBG","ts":"2024-09-24 16:30:46.299+03:00","msg":"Sending message to RPC server","topics":"JSONRPC-HTTP-CLIENT","address":{"oResultPrivate":true,"vResultPrivate":{"id":"holesky-01.ih-eu-mda1.nimbus.geth.wg:8551","scheme":"NonSecure","hostname":"holesky-01.ih-eu-mda1.nimbus.geth.wg","port":8551,"path":"","query":"","anchor":"","username":"","password":"","addresses":[{"family":"IPv4","address_v4":[10,15,0,12],"port":8551}]}},"msg_len":342,"name":"engine_forkchoiceUpdatedV1"}
{"lvl":"DBG","ts":"2024-09-24 16:30:46.300+03:00","msg":"Cancelled POST Request with JSON-RPC","topics":"JSONRPC-HTTP-CLIENT","e":"Future operation cancelled!"}

And indeed the port is closed:

[email protected]:~ % sudo nmap -Pn -p8551-8553 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:31 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00085s latency).

PORT     STATE  SERVICE
8551/tcp closed unknown
8552/tcp open   unknown
8553/tcp open   unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But the one for testing node is open, so that doesn't explain low peers at all.

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

The geth-holesky-01-node logs on holesky-01.ih-eu-mda1.nimbus.geth show a lot of these:

WARN [09-24|13:30:34.987] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!

But that would make sense since that BN is not actually synced.

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

Oh, I see, the port start is wrong, it should start at 8552:

[email protected]:~ % sudo nmap -Pn -p8552-8554 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:36 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00079s latency).

PORT     STATE SERVICE
8552/tcp open  unknown
8553/tcp open  unknown
8554/tcp open  rtsp-alt

Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds

This would also explain why last Geth node is not synced. Since it has no BN connected to it.

@jakubgs
Copy link
Member Author

jakubgs commented Sep 24, 2024

Fixed Geth ports:

@jakubgs
Copy link
Member Author

jakubgs commented Sep 25, 2024

This drop in peers of stable node is bizzare:

image

It was doing okay and then it crashes and burns:

image

@jakubgs
Copy link
Member Author

jakubgs commented Sep 25, 2024

There's not a single error in logs, just some warning:

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep WRN
{"lvl":"WRN","ts":"2024-09-25 07:31:20.367+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":3,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"DBG","ts":"2024-09-25 07:35:24.368+03:00","msg":"Peer connection upgraded","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":1}
{"lvl":"DBG","ts":"2024-09-25 07:35:25.524+03:00","msg":"Peer disconnected","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":0}
{"lvl":"WRN","ts":"2024-09-25 07:35:27.286+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:35:55.855+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":1,"new_peers":[],"current_peers":79,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:07.224+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":76,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:34.411+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":5,"new_peers":[],"current_peers":78,"wanted_peers":320}

But the disconnects are higher then number of successful connections:

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Lost connection to peer' | wc -l
     255

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Peer successfully connected' | wc -l
     133

@jakubgs
Copy link
Member Author

jakubgs commented Sep 25, 2024

At the same time the unstable node had absolutely no issues with connectivity on the same host:

image

If network issues were the problem this node would have them too, but it doesn't.

@jakubgs
Copy link
Member Author

jakubgs commented Sep 25, 2024

Here's the full log at that time, I can't find anything relevant: service.20240925_0747.log.gz

@jakubgs
Copy link
Member Author

jakubgs commented Sep 26, 2024

I have purged testing node as stable appears to be syncing better than before:

image

@jakubgs
Copy link
Member Author

jakubgs commented Oct 9, 2024

As we can see from this graph the purging of stable node has not helped it sync:

Image

@jakubgs
Copy link
Member Author

jakubgs commented Oct 9, 2024

This is the most bizzarre part, today suddenly around 5am the stable node just lost a bunch of peers for no clear reason:

Image

Kinda looks like a bug to me. But during the same time testing node was struggling to get any reasonable amount of peers:

Image

And unstable was totally unaffected:

Image

https://grafana.infra.status.im/d/pgeNfj2Wz23/nimbus-beacon-node?orgId=1&var-instance=macm2-01.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-stable&from=1728425848167&to=1728469048167&query=fluf

@jakubgs
Copy link
Member Author

jakubgs commented Oct 9, 2024

Here are logs from that time from the stable node: service.20241009_0515.log.gz

@jakubgs
Copy link
Member Author

jakubgs commented Oct 9, 2024

What the fuck is happening here?

Image

@jakubgs
Copy link
Member Author

jakubgs commented Oct 9, 2024

The most bizzarre part must be that incoming traffic at the same time drops to nearly zero:

Image

While outgoing goes up matching the raise in peers:

Image

How does any of this make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant