Fix sync of beacon nodes on MacOS Holesky host #201

jakubgs · 2024-09-24T13:07:59Z

Currently two nodes on macm2-01.ih-eu-mda1.nimbus.holesky are failing the BeaconChainNoAggregatesReceived metrics check:

And the API does confirm that:

[email protected]:~ % for port in $(seq 9301 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"979007","sync_distance":"1627664","is_syncing":true,"is_optimistic":true,"el_offline":true}}
{"data":{"head_slot":"1400927","sync_distance":"1205744","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"2606671","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}

The text was updated successfully, but these errors were encountered:

jakubgs · 2024-09-24T13:10:33Z

If we look at stable node on macm2-01.ih-eu-mda1.nimbus.holesky we can see that the sync distance is just growing:

The testing node has a similar pattern:

While unstable appears to be fine:

jakubgs · 2024-09-24T13:12:01Z

The interesting part is that both stable:

And testing:

Are unable to get to max peers, while unstable gets there without issues:

jakubgs · 2024-09-24T13:13:36Z

But LibP2P ports for all 3 nodes are correctly open on the firewall:

 > nmap -Pn -p9001-9003 macm2-01.ih-eu-mda1.nimbus.holesky.status.im
Starting Nmap 7.94 ( https://nmap.org ) at 2024-09-24 15:13 CEST
Nmap scan report for macm2-01.ih-eu-mda1.nimbus.holesky.status.im (185.181.230.76)
Host is up (0.057s latency).
rDNS record for 185.181.230.76: no-rdns.innovahosting.net

PORT     STATE SERVICE
9001/tcp open  tor-orport
9002/tcp open  dynamid
9003/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds

So what is different about stable and testing that makes connectivity so bad?

jakubgs · 2024-09-24T13:16:36Z

What's also interesting is that their respective Geth nodes on `` have different matching state too:

[email protected]:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x0","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x0","startingBlock":"0x0","syncedAccountBytes":"0x0","syncedAccounts":"0x0","syncedBytecodeBytes":"0x0","syncedBytecodes":"0x0","syncedStorage":"0x0","syncedStorageBytes":"0x0","txIndexFinishedBlocks":"0x0","txIndexRemainingBlocks":"0x1"}}

The Geth node for unstable is not synced, while ones for stable and testing appear to be synced:

[email protected]:~ % sudo du -hsc /docker/*
131G	/docker/geth-holesky-01
150G	/docker/geth-holesky-02
4.4G	/docker/geth-holesky-03
284G	total

This is very confusing.

jakubgs · 2024-09-24T13:32:28Z

I found some EL connection failures in stable node logs:

{"lvl":"DBG","ts":"2024-09-24 16:30:46.299+03:00","msg":"Sending message to RPC server","topics":"JSONRPC-HTTP-CLIENT","address":{"oResultPrivate":true,"vResultPrivate":{"id":"holesky-01.ih-eu-mda1.nimbus.geth.wg:8551","scheme":"NonSecure","hostname":"holesky-01.ih-eu-mda1.nimbus.geth.wg","port":8551,"path":"","query":"","anchor":"","username":"","password":"","addresses":[{"family":"IPv4","address_v4":[10,15,0,12],"port":8551}]}},"msg_len":342,"name":"engine_forkchoiceUpdatedV1"}
{"lvl":"DBG","ts":"2024-09-24 16:30:46.300+03:00","msg":"Cancelled POST Request with JSON-RPC","topics":"JSONRPC-HTTP-CLIENT","e":"Future operation cancelled!"}

And indeed the port is closed:

[email protected]:~ % sudo nmap -Pn -p8551-8553 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:31 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00085s latency).

PORT     STATE  SERVICE
8551/tcp closed unknown
8552/tcp open   unknown
8553/tcp open   unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But the one for testing node is open, so that doesn't explain low peers at all.

jakubgs · 2024-09-24T13:36:29Z

The geth-holesky-01-node logs on holesky-01.ih-eu-mda1.nimbus.geth show a lot of these:

WARN [09-24|13:30:34.987] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!

But that would make sense since that BN is not actually synced.

jakubgs · 2024-09-24T13:37:01Z

Oh, I see, the port start is wrong, it should start at 8552:

[email protected]:~ % sudo nmap -Pn -p8552-8554 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:36 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00079s latency).

PORT     STATE SERVICE
8552/tcp open  unknown
8553/tcp open  unknown
8554/tcp open  rtsp-alt

Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds

This would also explain why last Geth node is not synced. Since it has no BN connected to it.

jakubgs · 2024-09-24T13:47:42Z

Fixed Geth ports:

infra-nimbus#126afac1 - holesky: fix Geth port numbers on macm2-01

jakubgs · 2024-09-25T08:38:44Z

This drop in peers of stable node is bizzare:

It was doing okay and then it crashes and burns:

jakubgs · 2024-09-25T08:39:04Z

There's not a single error in logs, just some warning:

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep WRN
{"lvl":"WRN","ts":"2024-09-25 07:31:20.367+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":3,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"DBG","ts":"2024-09-25 07:35:24.368+03:00","msg":"Peer connection upgraded","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":1}
{"lvl":"DBG","ts":"2024-09-25 07:35:25.524+03:00","msg":"Peer disconnected","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":0}
{"lvl":"WRN","ts":"2024-09-25 07:35:27.286+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:35:55.855+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":1,"new_peers":[],"current_peers":79,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:07.224+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":76,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:34.411+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":5,"new_peers":[],"current_peers":78,"wanted_peers":320}

But the disconnects are higher then number of successful connections:

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Lost connection to peer' | wc -l
     255

[email protected]:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Peer successfully connected' | wc -l
     133

jakubgs · 2024-09-25T08:39:59Z

At the same time the unstable node had absolutely no issues with connectivity on the same host:

If network issues were the problem this node would have them too, but it doesn't.

jakubgs · 2024-09-25T08:41:12Z

Here's the full log at that time, I can't find anything relevant: service.20240925_0747.log.gz

jakubgs · 2024-09-26T06:43:15Z

I have purged testing node as stable appears to be syncing better than before:

jakubgs · 2024-10-09T10:08:34Z

As we can see from this graph the purging of stable node has not helped it sync:

jakubgs · 2024-10-09T10:10:05Z

This is the most bizzarre part, today suddenly around 5am the stable node just lost a bunch of peers for no clear reason:

Kinda looks like a bug to me. But during the same time testing node was struggling to get any reasonable amount of peers:

And unstable was totally unaffected:

https://grafana.infra.status.im/d/pgeNfj2Wz23/nimbus-beacon-node?orgId=1&var-instance=macm2-01.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-stable&from=1728425848167&to=1728469048167&query=fluf

jakubgs · 2024-10-09T10:16:31Z

Here are logs from that time from the stable node: service.20241009_0515.log.gz

jakubgs · 2024-10-09T10:22:05Z

What the fuck is happening here?

jakubgs · 2024-10-09T10:24:27Z

The most bizzarre part must be that incoming traffic at the same time drops to nearly zero:

While outgoing goes up matching the raise in peers:

How does any of this make sense?

jakubgs self-assigned this Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sync of beacon nodes on MacOS Holesky host #201

Fix sync of beacon nodes on MacOS Holesky host #201

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024 •

edited

Loading

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024 •

edited

Loading

jakubgs commented Sep 24, 2024

jakubgs commented Sep 25, 2024

jakubgs commented Sep 25, 2024 •

edited

Loading

jakubgs commented Sep 25, 2024

jakubgs commented Sep 25, 2024

jakubgs commented Sep 26, 2024

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024 •

edited

Loading

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024

Fix sync of beacon nodes on MacOS Holesky host #201

Fix sync of beacon nodes on MacOS Holesky host #201

Comments

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024 • edited Loading

jakubgs commented Sep 24, 2024

jakubgs commented Sep 24, 2024 • edited Loading

jakubgs commented Sep 24, 2024

jakubgs commented Sep 25, 2024

jakubgs commented Sep 25, 2024 • edited Loading

jakubgs commented Sep 25, 2024

jakubgs commented Sep 25, 2024

jakubgs commented Sep 26, 2024

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024 • edited Loading

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024

jakubgs commented Oct 9, 2024

jakubgs commented Sep 24, 2024 •

edited

Loading

jakubgs commented Sep 24, 2024 •

edited

Loading

jakubgs commented Sep 25, 2024 •

edited

Loading

jakubgs commented Oct 9, 2024 •

edited

Loading