
ingest/ledgerbackend: Improve thread-safety of stellarCoreRunner.close() #5307

Merged

merged 4 commits into stellar:master on May 10, 2024

Conversation

@tamirms (Contributor) commented on May 9, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

While looking at ingestion logs, I observed an unexpected error, "error preparing range: error starting prepare range: the previous Stellar-Core instance is still running", which occurred when we rolled out a new version of core:

 level=warning msg="detected new version of captive core binary /usr/bin/stellar-core , aborting session." service=ingest subservice=stellar-core
 level=info msg="default: got signal 2, shutting down" service=ingest subservice=stellar-core
 level=info msg="Processed ledger" commit=true duration=0.732382788 ledger=true sequence=51570776 service=ingest state=true
 level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=51570775)" next_state="resume(latestSuccessfullyProcessedLedger=51570776)" service=ingest level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=51570776)" error="error preparing range: error starting prepare range: the previous Stellar-Core instance is still running" next_state=start service=ingest level=info msg="default: Application destructing" service=ingest subservice=stellar-core
 level=info msg="default: Application destroyed" pid=31974 service=ingest subservice=stellar-core

The error occurred in this block of code:

// Make sure Stellar-Core is terminated before starting a new instance.
processExited, _ := c.stellarCoreRunner.getProcessExitError()
if !processExited {
	return false, errors.New("the previous Stellar-Core instance is still running")
}

I was surprised that stellar-core was still running even though we had just called c.stellarCoreRunner.close(). I expected the close() function on stellarCoreRunner to block until the core process was reaped. However, it turns out that is not the case in this scenario, because close() is called by multiple goroutines.

Once the new core binary is detected, the file watcher goroutine calls close() first, at the point where it logs:

runner.log.Warnf("detected new version of captive core binary %s , aborting session.", runner.executablePath)

This invocation blocks until the core process is reaped, which takes a few seconds. While that is happening, PrepareRange() also calls close(), and that call returns immediately based on this condition:

// check if we have already closed
if storagePath == "" {
	r.lock.Unlock()
	return nil
}

This PR fixes the issue by guaranteeing that the core process will have been reaped before any invocation of close() returns, even if close() is called concurrently. This behavior is implemented using https://pkg.go.dev/sync#Once, as sketched below.
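
As a sketch of that approach (again with an illustrative runner type rather than the real stellarCoreRunner code), sync.Once.Do runs the wrapped function exactly once and blocks every caller, including concurrent ones, until that function has returned:

package main

import (
	"fmt"
	"sync"
	"time"
)

type runner struct {
	closeOnce sync.Once
	closeErr  error
}

func (r *runner) close() error {
	r.closeOnce.Do(func() {
		// stand-in for terminating stellar-core and reaping the process;
		// this body runs exactly once no matter how many goroutines call close()
		time.Sleep(2 * time.Second)
	})
	// Do blocks concurrent callers until the function above has returned,
	// so every close() call observes a fully reaped process.
	return r.closeErr
}

func main() {
	r := &runner{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			start := time.Now()
			r.close()
			fmt.Printf("caller %d returned after %s\n", id, time.Since(start))
		}(i)
	}
	wg.Wait() // both callers report roughly 2 seconds: neither returned early
}

With this pattern, the PrepareRange() guard shown above can no longer observe a half-closed runner: by the time any close() call returns, the process has exited.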

Why

This bug does not prevent ingestion from progressing, because upon errors we still retry, and eventually PrepareRange() succeeds once the core process is reaped. However, it is still worth eliminating this error case for the following reasons:

  • reduce unnecessary delays caused by ingestion retries
  • reduce error messages in the ingestion logs
  • we will soon alert on ingestion errors, so it's good to remove cases that are not actually errors

Known limitations

[N/A]

@tamirms tamirms marked this pull request as ready for review May 9, 2024 13:57
@tamirms tamirms requested a review from a team May 9, 2024 13:58
@tamirms tamirms enabled auto-merge (squash) May 10, 2024 10:07
@tamirms tamirms merged commit a4e5a3f into stellar:master May 10, 2024
31 checks passed