Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexer Causes Unclean Shutdown Race Condition #1387

Open
crypto-services opened this issue Dec 12, 2024 · 4 comments
Open

Indexer Causes Unclean Shutdown Race Condition #1387

crypto-services opened this issue Dec 12, 2024 · 4 comments
Assignees

Comments

@crypto-services
Copy link

crypto-services commented Dec 12, 2024

System information

Bor client version: 0.5.1

Heimdall client version: 1.0.10

OS & Version: Ubuntu 22.04/24.04

Environment: Polygon Mainnet

Type of node: All

Overview of the problem

When attempting to shutdown the Bor service (SIGINT) the process appears unable to exit if the indexer was running at the time. This results in the SIGINT timing out triggering SIGKILL which either dumps the recent state or worse corrupts the db.

Reproduction Steps

Happens often when stopping the Bor service with SIGINT.

Logs / Traces / Output / Error Messages

Gracefully shutting down agent...
INFO [12-12|03:06:12.072] HTTP server stopped                      endpoint=127.0.0.1:8545
INFO [12-12|03:06:12.072] IPC endpoint closed                      url=/home/polygon/.bor/data/bor.ipc
INFO [12-12|03:06:12.072] Stats daemon stopped
INFO [12-12|03:06:12.073] Ethereum protocol stopped
INFO [12-12|03:06:12.073] Transaction pool stopped
INFO [12-12|03:06:12.074] Waiting background transaction indexer to exit
INFO [12-12|03:06:22.080] Looking for peers                        peercount=1 tried=135 static=5
INFO [12-12|03:06:32.085] Looking for peers                        peercount=1 tried=168 static=5
...

The node is not able to reach step Writing cached state to disk and is eventually killed. This does not happen when the indexer isn't running at the moment of shutdown.

@crypto-services
Copy link
Author

This has just occurred again while updating one of our sentries. We've updated 6 of our nodes in the past couple of days and it happened on 50% of them all with the same behaviour. We have 10 minutes timeout set for our service and we're deploying/updating with Ansible so there is no change to the process.

All print Waiting background transaction indexer to exit then Looking for peers for 10 minutes followed by Killing process 3706391 (bor) with signal SIGKILL. Writing cached state to disk never occurs which causes the node to loose unwritten blocks as per the image underneath.

Screenshot 2024-12-13 at 09 02 27

@marcello33
Copy link
Contributor

Hey @crypto-services have you tried with a more recent version of bor?

@crypto-services
Copy link
Author

Hey @crypto-services have you tried with a more recent version of bor?

The issue still appears to be present on v1.5.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
@marcello33 @manav2401 @crypto-services and others