Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest/ledgerbackend: Stopping captive core can be slow #5349

Closed
2opremio opened this issue Jun 16, 2024 · 5 comments
Closed

ingest/ledgerbackend: Stopping captive core can be slow #5349

2opremio opened this issue Jun 16, 2024 · 5 comments

Comments

@2opremio
Copy link
Contributor

We only cancel the running processes after acquiring the lock (which can be really slow in some cases)

func (c *CaptiveStellarCore) Close() error {
	c.stellarCoreLock.RLock()
	defer c.stellarCoreLock.RUnlock()

	c.closed = true

	// after the CaptiveStellarCore context is canceled all subsequent calls to PrepareRange() will fail
	c.cancel()
	[...]
}

I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.

Related to #5347 ?

@2opremio 2opremio changed the title ingest/ledgerbackend: Stopping captive core is slow ingest/ledgerbackend: Stopping captive core can be slow Jun 16, 2024
@2opremio
Copy link
Contributor Author

For context: I found this during an integration test which spends most of the time waiting for captive core to close

@2opremio
Copy link
Contributor Author

2opremio commented Jun 16, 2024

Uhm, actually moving the cancel() out of the criticial section didn't help.

For more context, core was trying to catchup when closing it. It took 20 seconds to close it.

time="2024-06-16T16:45:17.338+02:00" level=warning msg="Process: process 65163 exited 22: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=warning msg="Process: process 65166 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=warning msg="Process: process 65170 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.385+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=warning msg="Process: process 65173 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.412+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=warning msg="Process: process 65176 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.438+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="Process: process 65179 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="History: Catchup failed" pid=65148 subservice=stellar-core

@tamirms
Copy link
Contributor

tamirms commented Jun 16, 2024

I think the issue is that there is a code path where we execute stellar-core catchup before stellar-core run:

if from > 2 {
cmd, err = r.createCmd("catchup", fmt.Sprintf("%d/0", from-1))
} else {
cmd, err = r.createCmd("catchup", "2/0")
}
if err != nil {
return errors.Wrap(err, "error creating command")
}
if err = cmd.Run(); err != nil {
return errors.Wrap(err, "error runing stellar-core catchup")
}

In that code path we do not abort the stellar-core catchup command in case the context is canceled. We can fix this issue by constructing the Command instance using https://pkg.go.dev/os/exec#CommandContext and also configuring the Cmd.WaitDelay and Cmd.Cancel properties. Basically, we need to use the same techniques described in
#5347

@tamirms
Copy link
Contributor

tamirms commented Jun 16, 2024

I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.

yes, I think we need to also move the cancel before acquiring the lock

@2opremio 2opremio mentioned this issue Jun 16, 2024
@2opremio
Copy link
Contributor Author

Great analysis, thanks!

@tamirms tamirms moved this from Backlog to In Progress in Platform Scrum Jun 26, 2024
@tamirms tamirms added this to the platform sprint 48 milestone Jun 26, 2024
@tamirms tamirms moved this from In Progress to Needs Review in Platform Scrum Jun 26, 2024
@tamirms tamirms closed this as completed Jun 28, 2024
@github-project-automation github-project-automation bot moved this from Needs Review to Done in Platform Scrum Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

3 participants
@tamirms @2opremio and others