-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Horizon: missing information passed to captive-core when configured to run "on disk" #4538
Comments
not sure if you discussed this during your release triage @ire-and-curses but as this created an outage before, we are looking at prioritizing the linked issue for core's January release |
@ire-and-curses can you confirm that this is getting (or was) prioritized for your next release? Core is currently blocked on this one to merge stellar/stellar-core#3615 |
@MonsieurNicolas this just came to my attention and we unfortunately did not prioritize this for our upcoming sprint/release. We don't have much capacity leading up to the next soroban release, but will prioritize getting it in the queue for our next sprint. |
no problem - when would that next sprint ship roughly? |
@MonsieurNicolas given that the first week of Jan will be devoted to the upcoming release, and the next week or two would be devoted to the hackathon, I'd realistically plan to have this issue being reviewed, analyzed and handled by end of Jan. |
The information in "Additional information" section is a little bit wrong. Currently, Horizon in on disk mode will run Additionally, when in memory mode, Horizon with a clean DB takes the values for This all makes me think that maybe we should take all these cases into account when working on this? How long does it take to generate the trusted hashes file? Maybe Stellar-Core could generate the file if it's not passed and store it in storage directory? |
this is not true: there are situations (as we experienced during the testnet outage), where Horizon resets the node and runs "catchup" without any trusted anchor.
This is confusing to me:
The reason it's a separate command is that it allows Horizon to verify archives as well, so it does not make sense to let Horizon rebuild state without having that information (and if rebuilt state, it should have that trust information already). This was part of the original design for captive-core 2+ years ago. I do agree that at some point we probably need to simplify all this: there is a bunch of weird code in Horizon that was added to make "captive-core" work without "in-memory", but this ended up re-creating issues that are solved in core already. That should be a separate issue on how to clean things up after we've migrated to "buckets db". @marta-lokhova FYI |
I was describing the current code but, yes, it's possible that during that outage something you explained (clear storage dir when horizon DB is not empty) was possible. We fixed the issue in 621d634.
Horizon currently doesn't pass trusted hashes to Stellar-Core anywhere in the code. This is part of the problem I described in the previous comment.
Horizon doesn't need trusted hashes. It depends 100% on Stellar-Core for tx meta and for buckets it just checks buckets hash in ledger header it gets from Stellar-Core - if hashes doesn't match the DB tx that inserts entries to Horizon DB is cancelled. |
To summarize some offline conversation about this issue:
|
…sts that use catchup
…md flag parsing as it's needed now since captive core does run mode
moving this to blocked for now, after initial solution was proposed on #5431, review highlighted need to take step back and first capture/evaluate design options for obtaining trusted hashes at reingest time on doc, once these are reviewed, development can proceed with best option. |
we had a design review with team on 9/18 of the options outlined in trusted hash design options, the meeting wrapped up with an action item for @ThomasBrady to investigate |
@sreuland unfortunately, the skip list in core has been broken since genesis (it will be fixed in protocol 22, but only for new ledgers). Is the concern that even after the initial file generation from genesis, it'll be too expensive to generate newer hashes? |
@marta-lokhova , yes, it was related to runtime duration of verify-checkpoints, however, in recent days we've had further discussion on skip list and a suggestion was made to repair it which would in turn enable given the repair of skip list, it would eliminate the need for us to change the captive core wrapper sdk, as it already invokes |
We have this PR for skip list repair stellar/stellar-core#4470 the PR says p22 but it is not going to be in the next release. |
@ThomasBrady, thanks! will stellar/stellar-core#4470 propagate the repair to existing history archives? or is that mechanism anticipated as a separate ticket/pr? this bug covers reingestion of history which potentially could be from already printed checkpoint archives, so, would like to include reference to any other efforts that may be required before |
@ThomasBrady , just to confirm, we don't anticipate changing captive core wrapper sdk to use |
Hey, yeah the AIUI there are a few components of the skip list repair including: a correct skip list in each ledger header -- the PR linked above -- which will probably not be fixed until p23, and then also designing and implementing a solution to backfill/distribute the pre-23 skip lists -- which has no current timeline. I can check on if we have a more concrete timeline, but I imagine those components won't be available until Q1 2025 at the earliest. |
@ThomasBrady ,thanks for additional insight. @mollykarcher , does this type of tentative timeline into next year for skip list rollout sound reasonable as far as holding off on changing captive core wrapper sdk to use |
moving this to Backlog for further review on interim strategy, refer to channel discussion and rollout options for more context. |
I still don't believe this is worth it to prioritize a different short-term fix ahead of the long-term fix we want (skiplist being fixed). |
What version are you using?
v19.02.19.0
What did you do?
We had a testnet outage that revealed that captive-core reconstructed a ledger and replayed transactions from an untrusted archive.
What did you expect to see?
We should always verify data before replaying. In this case we should always pass down proofs thatt were acquired from the network.
What did you see instead?
captive-core replayed transactions from a bad archive, horizon ingested this data and corrupted itself. This required clearing core and horizon.
Additional information
the root cause of the issue is that:
--in-memory
flag, core uses--start-at-ledger
and--start-at-hash
to bootstrap trust (ledger information comes from the latest ledger that Horizon ingested). This works as expectedcatchup
(offline catchup), core needs to "anchor" trust somewhere. The way to do it is by storing trusted hashes in a file and passing it down to core using the--trusted-checkpoint-hashes
option. Note that unlike--start-at-ledger
the hashes that can be passed to core this way must be "checkpoint ledgers" and catchup has to be invoked to reconstruct that checkpoint (ie: the "to" ledger must be a checkpoint ledger).Related core issue stellar/stellar-core#3503 to surface this better in the future as it's a footgun for sure.
The text was updated successfully, but these errors were encountered: