
ingest/ledgerbackend: Improve thread-safety of stellarCoreRunner.close() #5307

Merged

merged 4 commits into stellar:master on May 10, 2024

Conversation

@tamirms (Contributor) commented on May 9, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

While looking at ingestion logs, I observed an unexpected error, "error preparing range: error starting prepare range: the previous Stellar-Core instance is still running", which occurred when we rolled out a new version of core:

 level=warning msg="detected new version of captive core binary /usr/bin/stellar-core , aborting session." service=ingest subservice=stellar-core
 level=info msg="default: got signal 2, shutting down" service=ingest subservice=stellar-core
 level=info msg="Processed ledger" commit=true duration=0.732382788 ledger=true sequence=51570776 service=ingest state=true
 level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=51570775)" next_state="resume(latestSuccessfullyProcessedLedger=51570776)" service=ingest level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=51570776)" error="error preparing range: error starting prepare range: the previous Stellar-Core instance is still running" next_state=start service=ingest level=info msg="default: Application destructing" service=ingest subservice=stellar-core
 level=info msg="default: Application destroyed" pid=31974 service=ingest subservice=stellar-core

The error occurred in this block of code:

// Make sure Stellar-Core is terminated before starting a new instance.
processExited, _ := c.stellarCoreRunner.getProcessExitError()
if !processExited {
	return false, errors.New("the previous Stellar-Core instance is still running")
}

I was surprised that stellar-core was still running even though we had just called c.stellarCoreRunner.close(). I expected the close() function on stellarCoreRunner to block until the core process was reaped. However, it turns out that is not the case in this scenario, because close() is called by multiple goroutines.

Once the new core binary is detected, the file watcher goroutine calls close() first, at the point where it logs:

runner.log.Warnf("detected new version of captive core binary %s , aborting session.", runner.executablePath)

This invocation blocks until the core process is reaped, which takes a few seconds. While that is happening, PrepareRange() also calls close(), and that call returns immediately based on this condition:

// check if we have already closed
if storagePath == "" {
	r.lock.Unlock()
	return nil
}

This PR fixes the issue by guaranteeing that the core process will have been reaped before any invocation of close() returns, even if close() is called concurrently. This behavior is implemented using https://pkg.go.dev/sync#Once, as sketched below.
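
As a sketch of that approach (again with an illustrative runner type rather than the real stellarCoreRunner code), sync.Once.Do runs the wrapped function exactly once and blocks every caller, including concurrent ones, until that function has returned:

package main

import (
	"fmt"
	"sync"
	"time"
)

type runner struct {
	closeOnce sync.Once
	closeErr  error
}

func (r *runner) close() error {
	r.closeOnce.Do(func() {
		// stand-in for terminating stellar-core and reaping the process;
		// this body runs exactly once no matter how many goroutines call close()
		time.Sleep(2 * time.Second)
	})
	// Do blocks concurrent callers until the function above has returned,
	// so every close() call observes a fully reaped process.
	return r.closeErr
}

func main() {
	r := &runner{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			start := time.Now()
			r.close()
			fmt.Printf("caller %d returned after %s\n", id, time.Since(start))
		}(i)
	}
	wg.Wait() // both callers report roughly 2 seconds: neither returned early
}

With this pattern, the PrepareRange() guard shown above can no longer observe a half-closed runner: by the time any close() call returns, the process has exited.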

Why

This bug does not prevent ingestion from progressing, because upon errors we still retry, and eventually PrepareRange() succeeds once the core process is reaped. However, it is still worth eliminating this error case for the following reasons:

  • reduce unnecessary delays caused by ingestion retries
  • reduce error messages in the ingestion logs
  • we will soon alert on ingestion errors, so it's good to remove cases that are not actually errors

Known limitations

[N/A]

@tamirms tamirms marked this pull request as ready for review May 9, 2024 13:57
@tamirms tamirms requested a review from a team May 9, 2024 13:58
@tamirms tamirms enabled auto-merge (squash) May 10, 2024 10:07
@tamirms tamirms merged commit a4e5a3f into stellar:master May 10, 2024
31 checks passed