ingest/ledgerbackend: Improve thread-safety of stellarCoreRunner.close() #5307
What
While looking at ingestion logs I observed an unexpected error:
`error preparing range: error starting prepare range: the previous Stellar-Core instance is still running`
It occurred when we rolled out a new version of core.
The error occurred in this block of code:
go/ingest/ledgerbackend/captive_core_backend.go
Lines 456 to 460 in ab3a926
I was surprised that stellar-core was still running even though we had just called `c.stellarCoreRunner.close()`. I expected the `close()` function on `stellarCoreRunner` to block until the core process was reaped. However, it turns out that is not the case in this scenario because the `close()` function is called by multiple goroutines.
Once the new core binary is detected, the file watcher goroutine will first call `close()`:
go/ingest/ledgerbackend/file_watcher.go
Line 66 in ab3a926
A concurrent call to `close()` then returns immediately based on this condition:
go/ingest/ledgerbackend/stellar_core_runner.go
Lines 555 to 559 in ab3a926
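The early return can be illustrated with a minimal, self-contained sketch. It is not the actual `stellarCoreRunner` code: `flagRunner`, `onceRunner`, the `reap` field, and the two helper functions are hypothetical names invented for illustration, and `reap` merely stands in for waiting on the core process to exit. The sketch contrasts a guard flag, where a second caller returns while the process is still running, with `sync.Once`, whose `Do` does not return to any caller until the one call to the function has returned.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flagRunner is a hypothetical runner (not the real stellarCoreRunner) whose
// close() returns early if another goroutine has already started closing.
type flagRunner struct {
	mu      sync.Mutex
	closing bool
	reaped  atomic.Bool
	reap    func() // assumption: stands in for waiting on the process to exit
}

func (r *flagRunner) close() {
	r.mu.Lock()
	if r.closing {
		r.mu.Unlock()
		return // early return: the process may still be running
	}
	r.closing = true
	r.mu.Unlock()
	r.reap()
	r.reaped.Store(true)
}

// onceRunner is the same hypothetical runner, guarded by sync.Once instead.
type onceRunner struct {
	once   sync.Once
	reaped atomic.Bool
	reap   func()
}

func (r *onceRunner) close() {
	// Concurrent callers of Do block until the first call's function returns.
	r.once.Do(func() {
		r.reap()
		r.reaped.Store(true)
	})
}

// secondCloseWithFlag reports whether the process was reaped by the time a
// second, concurrent close() returned on a flagRunner.
func secondCloseWithFlag() bool {
	started, release := make(chan struct{}), make(chan struct{})
	r := &flagRunner{reap: func() { close(started); <-release }}
	first := make(chan struct{})
	go func() { r.close(); close(first) }()
	<-started // the first close() is now waiting on the "process"
	r.close() // second call hits the guard and returns immediately
	saw := r.reaped.Load()
	close(release) // let the "process" exit
	<-first
	return saw
}

// secondCloseWithOnce performs the same experiment on an onceRunner.
func secondCloseWithOnce() bool {
	started, release := make(chan struct{}), make(chan struct{})
	r := &onceRunner{reap: func() { close(started); <-release }}
	first := make(chan struct{})
	go func() { r.close(); close(first) }()
	<-started
	second := make(chan bool)
	go func() { r.close(); second <- r.reaped.Load() }()
	close(release) // the second close() cannot return before reaping finishes
	saw := <-second
	<-first
	return saw
}

func main() {
	fmt.Println("guard flag, reaped when second close() returned:", secondCloseWithFlag())
	fmt.Println("sync.Once,  reaped when second close() returned:", secondCloseWithOnce())
}
```

In this sketch the guard-flag variant reports `false` and the `sync.Once` variant reports `true`, since every caller of `close()` on `onceRunner` returns only after the simulated process has been reaped.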
This PR fixes this issue by guaranteeing that the core process will be reaped after any invocation of `close()`, even if `close()` is called concurrently. This behavior is implemented by using https://pkg.go.dev/sync#Once
Why
This bug does not prevent ingestion from progressing, because upon errors we will still retry to ingest, and eventually `PrepareRange()` is able to succeed once the core process is reaped. However, it is still good to get rid of this error case for the following reasons:
Known limitations
[N/A]