Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Horizon: missing information passed to captive-core when configured to run "on disk" #4538

Open
MonsieurNicolas opened this issue Aug 11, 2022 · 20 comments

Comments

@MonsieurNicolas
Copy link
Contributor

What version are you using?

v19.02.19.0

What did you do?

We had a testnet outage that revealed that captive-core reconstructed a ledger and replayed transactions from an untrusted archive.

What did you expect to see?

We should always verify data before replaying. In this case we should always pass down proofs thatt were acquired from the network.

What did you see instead?

captive-core replayed transactions from a bad archive, horizon ingested this data and corrupted itself. This required clearing core and horizon.

Additional information

the root cause of the issue is that:

  • when running using the --in-memory flag, core uses --start-at-ledger and --start-at-hash to bootstrap trust (ledger information comes from the latest ledger that Horizon ingested). This works as expected
  • when running "on disk", Horizon invokes core in two steps. The first one is the problematic one: when running catchup (offline catchup), core needs to "anchor" trust somewhere. The way to do it is by storing trusted hashes in a file and passing it down to core using the --trusted-checkpoint-hashes option. Note that unlike --start-at-ledger the hashes that can be passed to core this way must be "checkpoint ledgers" and catchup has to be invoked to reconstruct that checkpoint (ie: the "to" ledger must be a checkpoint ledger).

Related core issue stellar/stellar-core#3503 to surface this better in the future as it's a footgun for sure.

@MonsieurNicolas
Copy link
Contributor Author

not sure if you discussed this during your release triage @ire-and-curses but as this created an outage before, we are looking at prioritizing the linked issue for core's January release

@MonsieurNicolas
Copy link
Contributor Author

@ire-and-curses can you confirm that this is getting (or was) prioritized for your next release? Core is currently blocked on this one to merge stellar/stellar-core#3615

@mollykarcher
Copy link
Contributor

@MonsieurNicolas this just came to my attention and we unfortunately did not prioritize this for our upcoming sprint/release. We don't have much capacity leading up to the next soroban release, but will prioritize getting it in the queue for our next sprint.

@MonsieurNicolas
Copy link
Contributor Author

no problem - when would that next sprint ship roughly?

@tsachiherman
Copy link
Contributor

@MonsieurNicolas given that the first week of Jan will be devoted to the upcoming release, and the next week or two would be devoted to the hackathon, I'd realistically plan to have this issue being reviewed, analyzed and handled by end of Jan.

@ire-and-curses ire-and-curses moved this from Backlog to Next Sprint Proposal in Platform Scrum Jan 5, 2023
@mollykarcher mollykarcher moved this from Next Sprint Proposal to Current Sprint in Platform Scrum Jan 5, 2023
@bartekn
Copy link
Contributor

bartekn commented Jan 10, 2023

The information in "Additional information" section is a little bit wrong. Currently, Horizon in on disk mode will run catchup command only when Horizon db is empty. This way it will init Stellar-Core state at the starting ledger which is the latest checkpoint. When restarting, Horizon runs run command so it simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).

Additionally, when in memory mode, Horizon with a clean DB takes the values for --start-at-ledger and --start-at-hash from history archives. In this case, there is no command Horizon can use to start Stellar-Core safely.

This all makes me think that maybe we should take all these cases into account when working on this? How long does it take to generate the trusted hashes file? Maybe Stellar-Core could generate the file if it's not passed and store it in storage directory?

@MonsieurNicolas
Copy link
Contributor Author

The information in "Additional information" section is a little bit wrong. Currently, Horizon in on disk mode will run catchup command only when Horizon db is empty. This way it will init Stellar-Core state at the starting ledger which is the latest checkpoint. When restarting, Horizon runs run command so it simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).

this is not true: there are situations (as we experienced during the testnet outage), where Horizon resets the node and runs "catchup" without any trusted anchor.

Additionally, when in memory mode, Horizon with a clean DB takes the values for --start-at-ledger and --start-at-hash from history archives. In this case, there is no command Horizon can use to start Stellar-Core safely.

This is confusing to me:

  • either Horizon already ingested the checkpoint (same case than when it needs to reset core), or
  • it needs to bootstrap trust which can be done just by asking core to generate the trusted hashes (command that was added as part of the design of captive core for that purpose).

This all makes me think that maybe we should take all these cases into account when working on this? How long does it take to generate the trusted hashes file? Maybe Stellar-Core could generate the file if it's not passed and store it in storage directory?

The reason it's a separate command is that it allows Horizon to verify archives as well, so it does not make sense to let Horizon rebuild state without having that information (and if rebuilt state, it should have that trust information already). This was part of the original design for captive-core 2+ years ago.

I do agree that at some point we probably need to simplify all this: there is a bunch of weird code in Horizon that was added to make "captive-core" work without "in-memory", but this ended up re-creating issues that are solved in core already. That should be a separate issue on how to clean things up after we've migrated to "buckets db". @marta-lokhova FYI

@bartekn
Copy link
Contributor

bartekn commented Jan 20, 2023

The information in "Additional information" section is a little bit wrong. Currently, Horizon in on disk mode will run catchup command only when Horizon db is empty. This way it will init Stellar-Core state at the starting ledger which is the latest checkpoint. When restarting, Horizon runs run command so it simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).

this is not true: there are situations (as we experienced during the testnet outage), where Horizon resets the node and runs "catchup" without any trusted anchor.

I was describing the current code but, yes, it's possible that during that outage something you explained (clear storage dir when horizon DB is not empty) was possible. We fixed the issue in 621d634.

This is confusing to me:

  • either Horizon already ingested the checkpoint (same case than when it needs to reset core), or
  • it needs to bootstrap trust which can be done just by asking core to generate the trusted hashes (command that was added as part of the design of captive core for that purpose).

Horizon currently doesn't pass trusted hashes to Stellar-Core anywhere in the code. This is part of the problem I described in the previous comment.

The reason it's a separate command is that it allows Horizon to verify archives as well, so it does not make sense to let Horizon rebuild state without having that information (and if rebuilt state, it should have that trust information already). This was part of the original design for captive-core 2+ years ago.

Horizon doesn't need trusted hashes. It depends 100% on Stellar-Core for tx meta and for buckets it just checks buckets hash in ledger header it gets from Stellar-Core - if hashes doesn't match the DB tx that inserts entries to Horizon DB is cancelled.

@mollykarcher mollykarcher moved this from Current Sprint to Next Sprint Proposal in Platform Scrum Mar 9, 2023
@mollykarcher mollykarcher moved this from Next Sprint Proposal to Backlog in Platform Scrum Apr 18, 2023
@mollykarcher
Copy link
Contributor

To summarize some offline conversation about this issue:

  • RPC had the same issue, but that was resolved here
  • Horizon live/forward ingestion could not replay from an untrusted archive because of how it verifies bucket hashes on download
  • Horizon reingestion could replay from an untrusted archive, so we want to adjust reingest to pass in trusted checkpoint hashes pulled from core to catchup

@mollykarcher mollykarcher moved this from Backlog to To Do in Platform Scrum Apr 26, 2024
@chowbao chowbao added this to the platform sprint 49 milestone Jul 2, 2024
@sreuland sreuland self-assigned this Jul 23, 2024
@sreuland sreuland moved this from To Do to In Progress in Platform Scrum Jul 23, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 19, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 20, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 26, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 26, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 26, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 27, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 27, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 27, 2024
sreuland added a commit to sreuland/go that referenced this issue Aug 28, 2024
…md flag parsing as it's needed now since captive core does run mode
sreuland added a commit to sreuland/go that referenced this issue Aug 29, 2024
@sreuland
Copy link
Contributor

sreuland commented Sep 3, 2024

moving this to blocked for now, after initial solution was proposed on #5431, review highlighted need to take step back and first capture/evaluate design options for obtaining trusted hashes at reingest time on doc, once these are reviewed, development can proceed with best option.

@sreuland
Copy link
Contributor

we had a design review with team on 9/18 of the options outlined in trusted hash design options, the meeting wrapped up with an action item for @ThomasBrady to investigate --verify-checkpoints potential usage of skip lists for optimizing runtime duration. We will have another meeting to review outcome from that investigation to determine best usage of --verify-checkpoints, i.e. can its invocation be automated internally in captive core sdk wrapper or will it need to be run out-of-band by user.

@marta-lokhova
Copy link
Contributor

@sreuland unfortunately, the skip list in core has been broken since genesis (it will be fixed in protocol 22, but only for new ledgers).

Is the concern that even after the initial file generation from genesis, it'll be too expensive to generate newer hashes?

@sreuland
Copy link
Contributor

sreuland commented Oct 1, 2024

@marta-lokhova , yes, it was related to runtime duration of verify-checkpoints, however, in recent days we've had further discussion on skip list and a suggestion was made to repair it which would in turn enable catchup to emit ledger metadata on pipe with network anchored hashes, this appears to have gained consensus. I've summarized it as another option in our investigation of trusted hashes.

given the repair of skip list, it would eliminate the need for us to change the captive core wrapper sdk, as it already invokes catchup and would just obtain trusted hashes in ledger metadata emitted from meta pipe transparently when the underlying core bin version has the repair. Has a core ticket been created for skip list repair? I would like to refer to it here before closing this as no-op.

@ThomasBrady
Copy link

We have this PR for skip list repair stellar/stellar-core#4470 the PR says p22 but it is not going to be in the next release.

@sreuland
Copy link
Contributor

sreuland commented Oct 1, 2024

@ThomasBrady, thanks! will stellar/stellar-core#4470 propagate the repair to existing history archives? or is that mechanism anticipated as a separate ticket/pr?

this bug covers reingestion of history which potentially could be from already printed checkpoint archives, so, would like to include reference to any other efforts that may be required before catchup will emit anchored hashes on ledgers in tx-meta pipe by default.

@sreuland
Copy link
Contributor

sreuland commented Oct 2, 2024

@ThomasBrady , just to confirm, we don't anticipate changing captive core wrapper sdk to use verify-checkpoints in the interim while skip list repairs are performed, you were investigating enhancing verify-checkpoints performance, and just want to make sure that wasn't contingent on this use case, thanks!

@ThomasBrady
Copy link

Hey, yeah the verify-checkpoints changes are largely motivated by this use case. The work is already implemented and in review stellar/stellar-core#4487 and will most likely be included in v22.1.

AIUI there are a few components of the skip list repair including: a correct skip list in each ledger header -- the PR linked above -- which will probably not be fixed until p23, and then also designing and implementing a solution to backfill/distribute the pre-23 skip lists -- which has no current timeline. I can check on if we have a more concrete timeline, but I imagine those components won't be available until Q1 2025 at the earliest.

@sreuland
Copy link
Contributor

sreuland commented Oct 2, 2024

@ThomasBrady ,thanks for additional insight.

@mollykarcher , does this type of tentative timeline into next year for skip list rollout sound reasonable as far as holding off on changing captive core wrapper sdk to use verify-checkpoints in interim?

@sreuland
Copy link
Contributor

moving this to Backlog for further review on interim strategy, refer to channel discussion and rollout options for more context.

@mollykarcher
Copy link
Contributor

I still don't believe this is worth it to prioritize a different short-term fix ahead of the long-term fix we want (skiplist being fixed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

10 participants