
JVM crash when restarting teku (24.10.3) running rocksDB #8939

Closed
tbenr opened this issue Dec 19, 2024 · 3 comments · May be fixed by #9080
@tbenr (Contributor) commented Dec 19, 2024

Teku is shutting down
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000001, pid=692, tid=1558552
#
# JRE version: OpenJDK Runtime Environment (21.0.5+11) (build 21.0.5+11-Ubuntu-1ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (21.0.5+11-Ubuntu-1ubuntu122.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [librocksdbjni4770057874007981762.so+0x68f758]  rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::unique_ptr<char [], std::default_delete<char []> >*) const+0xa58
{"@timestamp":"2024-12-19T11:12:26,793","level":"INFO","thread":"Thread-12","class":"Javalin","message":"Stopping Javalin ...","throwable":""}
{"@timestamp":"2024-12-19T11:12:26,794","level":"INFO","thread":"Thread-12","class":"Server","message":"Stopped Server@60a43613{STOPPING}[11.0.23,sto=0]","throwable":""}
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /opt/teku/teku-24.10.3/core.692)
#
@rolfyone (Contributor) commented

did we get the stack from the core dump?

@gfukushima (Contributor) commented

After doing a fair bit of investigation on this, I've come to two main conclusions.

  1. There are some operations we're currently performing that are surprisingly costly. Long-range streams over some columns can take a minute in some cases; see the blob and block pruner timer metrics before Reuse earliest blob slot #9031 gets merged. We should avoid these unnecessarily long streams, since they use iterators which, as the RocksDB wiki explains, can pin resources (https://github.com/facebook/rocksdb/wiki/Iterator#resource-pinned-by-iterators-and-iterator-refreshing). See the iterator sketch after this list.
    I've pushed a few PRs to get rid of some of the costly streaming we were using unnecessarily in the pruners, since we already hold the earliest entries of the blobs and blocks columns in DB variables.

  2. Our current implementation doesn't necessarily stop all the services/channels in a timely manner. We do call the stop methods, but in some cases they don't actually ensure that things have stopped. Some of the segfault stack traces show calls coming from the CombinedStorageChannelSplitter, which should have been stopped before the storage service gets stopped.

    There are likely executions that started prior to the database shutdown and are still holding resources, as mentioned in item 1, since we do have checks before creating new streams at the RocksDB level. See the shutdown-ordering sketch below.
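
To make the resource-pinning point in item 1 concrete, here's a minimal Java sketch (not Teku's actual code; method names and key encoding are illustrative assumptions) contrasting a long-range RocksJava iterator scan with a point lookup of a cached earliest-entry key:

```java
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.Slice;

public class IteratorPinningSketch {

  // Costly pattern: a long-range scan keeps the iterator, and the SST files /
  // memtables it pins, alive until it is closed. If the scan takes a minute,
  // those resources are held for a minute.
  static long countEntriesInRange(final RocksDB db, final byte[] from, final byte[] to) {
    try (Slice upperBound = new Slice(to);
        ReadOptions readOptions = new ReadOptions().setIterateUpperBound(upperBound);
        RocksIterator it = db.newIterator(readOptions)) {
      long count = 0;
      for (it.seek(from); it.isValid(); it.next()) {
        count++;
      }
      // try-with-resources closes the iterator deterministically, releasing what it pins.
      return count;
    }
  }

  // Cheaper pattern: keep the "earliest entry" under a dedicated key and read it
  // with a point lookup instead of streaming the whole column.
  static byte[] readEarliestEntry(final RocksDB db, final byte[] earliestEntryKey)
      throws RocksDBException {
    return db.get(earliestEntryKey); // no iterator, nothing stays pinned afterwards
  }
}
```

The important part is that the iterator is scoped as tightly as possible and closed promptly, so whatever it pins is released as soon as the scan finishes.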
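
For item 2, a rough sketch of the ordering we'd want on shutdown: refuse new work and drain in-flight operations before the native RocksDB handle is closed. The class and method names here (StorageServiceSketch, doRead) are hypothetical, not Teku's actual service wiring:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;

public class StorageServiceSketch implements AutoCloseable {

  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  // Party 0 is the service itself; every in-flight operation registers a party.
  private final Phaser inFlight = new Phaser(1);
  private volatile boolean stopped = false;

  public CompletableFuture<byte[]> readAsync(final byte[] key) {
    if (stopped) {
      return CompletableFuture.failedFuture(new IllegalStateException("storage stopped"));
    }
    inFlight.register();
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            return doRead(key); // the real implementation would call into RocksDB here
          } finally {
            inFlight.arriveAndDeregister();
          }
        },
        executor);
  }

  @Override
  public void close() {
    stopped = true; // reject new operations first...
    inFlight.arriveAndAwaitAdvance(); // ...then wait for in-flight ones to drain
    executor.shutdown();
    try {
      executor.awaitTermination(30, TimeUnit.SECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    // Only now is it safe to close the RocksDB instance; closing it while a
    // native read is still running is exactly the kind of thing that segfaults.
  }

  private byte[] doRead(final byte[] key) {
    return new byte[0]; // placeholder for the actual RocksDB lookup
  }
}
```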

@gfukushima (Contributor) commented

There have been a few PRs merged into main that should significantly reduce this:
#9031
#9046
#9054
#9066

Ultimately, better configuration when creating/initializing RocksDB should get us to a better place with iterator performance and reduce the time those iterators hold resources.
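
To illustrate the kind of tuning meant here (the values are assumptions for illustration, not what Teku actually configures), RocksJava exposes table/cache options that affect how expensive iterators are and how long they hold on to files and blocks:

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbOptionsSketch {

  static RocksDB open(final String path) throws RocksDBException {
    RocksDB.loadLibrary();

    // Illustrative values only; sensible numbers depend on the workload.
    final BlockBasedTableConfig tableConfig =
        new BlockBasedTableConfig()
            .setBlockCache(new LRUCache(256 * 1024 * 1024)) // shared block cache
            .setFilterPolicy(new BloomFilter(10)) // cheaper point lookups
            .setCacheIndexAndFilterBlocks(true); // keep index/filter blocks in the cache

    final Options options =
        new Options()
            .setCreateIfMissing(true)
            .setTableFormatConfig(tableConfig)
            .setMaxOpenFiles(-1); // avoid re-opening table files during scans

    return RocksDB.open(options, path);
  }
}
```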
