
JVM crash when restarting teku (24.10.3) running rocksDB #8939

Closed
tbenr opened this issue Dec 19, 2024 · 3 comments · May be fixed by #9080
@tbenr (Contributor) commented Dec 19, 2024

Teku is shutting down
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000001, pid=692, tid=1558552
#
# JRE version: OpenJDK Runtime Environment (21.0.5+11) (build 21.0.5+11-Ubuntu-1ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (21.0.5+11-Ubuntu-1ubuntu122.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [librocksdbjni4770057874007981762.so+0x68f758]  rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::unique_ptr<char [], std::default_delete<char []> >*) const+0xa58
{"@timestamp":"2024-12-19T11:12:26,793","level":"INFO","thread":"Thread-12","class":"Javalin","message":"Stopping Javalin ...","throwable":""}
{"@timestamp":"2024-12-19T11:12:26,794","level":"INFO","thread":"Thread-12","class":"Server","message":"Stopped Server@60a43613{STOPPING}[11.0.23,sto=0]","throwable":""}
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /opt/teku/teku-24.10.3/core.692)
#
@rolfyone (Contributor) commented

did we get the stack from the core dump?

@gfukushima (Contributor) commented

After doing a fair bit of investigation on this, I've come to two main conclusions.

  1. There are some operations we're currently performing that are surprisingly costly. Long-range streams over some columns can take a minute in some cases; see the blob and block pruner timer metrics before Reuse earliest blob slot #9031 gets merged. We should avoid these unnecessarily long streams, since they use iterators which, as the RocksDB wiki explains, can pin resources (https://github.com/facebook/rocksdb/wiki/Iterator#resource-pinned-by-iterators-and-iterator-refreshing). See the iterator sketch after this list.
    I've pushed a few PRs to get rid of some of the costly streaming we were using unnecessarily in the pruners, since we already hold the earliest entries of the blobs and blocks columns in DB variables.

  2. Our current implementation doesn't necessarily stop all the services/channels in a timely manner. We do call the stop methods, but in some cases they don't actually ensure that things have stopped. Some of the segfault stack traces show calls coming from the CombinedStorageChannelSplitter, which should have been stopped before the storage service gets stopped.

    There are likely executions that started prior to the database shutdown and are still holding resources, as mentioned in item 1, since we do have checks before creating new streams at the RocksDB level. See the shutdown-ordering sketch below.
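
To make the resource-pinning point in item 1 concrete, here's a minimal Java sketch (not Teku's actual code; method names and key encoding are illustrative assumptions) contrasting a long-range RocksJava iterator scan with a point lookup of a cached earliest-entry key:

```java
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.Slice;

public class IteratorPinningSketch {

  // Costly pattern: a long-range scan keeps the iterator, and the SST files /
  // memtables it pins, alive until it is closed. If the scan takes a minute,
  // those resources are held for a minute.
  static long countEntriesInRange(final RocksDB db, final byte[] from, final byte[] to) {
    try (Slice upperBound = new Slice(to);
        ReadOptions readOptions = new ReadOptions().setIterateUpperBound(upperBound);
        RocksIterator it = db.newIterator(readOptions)) {
      long count = 0;
      for (it.seek(from); it.isValid(); it.next()) {
        count++;
      }
      // try-with-resources closes the iterator deterministically, releasing what it pins.
      return count;
    }
  }

  // Cheaper pattern: keep the "earliest entry" under a dedicated key and read it
  // with a point lookup instead of streaming the whole column.
  static byte[] readEarliestEntry(final RocksDB db, final byte[] earliestEntryKey)
      throws RocksDBException {
    return db.get(earliestEntryKey); // no iterator, nothing stays pinned afterwards
  }
}
```

The important part is that the iterator is scoped as tightly as possible and closed promptly, so whatever it pins is released as soon as the scan finishes.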
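
For item 2, a rough sketch of the ordering we'd want on shutdown: refuse new work and drain in-flight operations before the native RocksDB handle is closed. The class and method names here (StorageServiceSketch, doRead) are hypothetical, not Teku's actual service wiring:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;

public class StorageServiceSketch implements AutoCloseable {

  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  // Party 0 is the service itself; every in-flight operation registers a party.
  private final Phaser inFlight = new Phaser(1);
  private volatile boolean stopped = false;

  public CompletableFuture<byte[]> readAsync(final byte[] key) {
    if (stopped) {
      return CompletableFuture.failedFuture(new IllegalStateException("storage stopped"));
    }
    inFlight.register();
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            return doRead(key); // the real implementation would call into RocksDB here
          } finally {
            inFlight.arriveAndDeregister();
          }
        },
        executor);
  }

  @Override
  public void close() {
    stopped = true; // reject new operations first...
    inFlight.arriveAndAwaitAdvance(); // ...then wait for in-flight ones to drain
    executor.shutdown();
    try {
      executor.awaitTermination(30, TimeUnit.SECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    // Only now is it safe to close the RocksDB instance; closing it while a
    // native read is still running is exactly the kind of thing that segfaults.
  }

  private byte[] doRead(final byte[] key) {
    return new byte[0]; // placeholder for the actual RocksDB lookup
  }
}
```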

@gfukushima (Contributor) commented

There have been a few PRs merged into main that should significantly reduce this:
#9031
#9046
#9054
#9066

Ultimately, better configuration when creating/initializing RocksDB should get us to a better place with iterator performance and reduce the time those iterators hold resources.
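
To illustrate the kind of tuning meant here (the values are assumptions for illustration, not what Teku actually configures), RocksJava exposes table/cache options that affect how expensive iterators are and how long they hold on to files and blocks:

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbOptionsSketch {

  static RocksDB open(final String path) throws RocksDBException {
    RocksDB.loadLibrary();

    // Illustrative values only; sensible numbers depend on the workload.
    final BlockBasedTableConfig tableConfig =
        new BlockBasedTableConfig()
            .setBlockCache(new LRUCache(256 * 1024 * 1024)) // shared block cache
            .setFilterPolicy(new BloomFilter(10)) // cheaper point lookups
            .setCacheIndexAndFilterBlocks(true); // keep index/filter blocks in the cache

    final Options options =
        new Options()
            .setCreateIfMissing(true)
            .setTableFormatConfig(tableConfig)
            .setMaxOpenFiles(-1); // avoid re-opening table files during scans

    return RocksDB.open(options, path);
  }
}
```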
