[epic] dry-run horizon database truncation #5236

Closed
5 of 8 tasks
mollykarcher opened this issue Mar 7, 2024 · 1 comment

mollykarcher commented Mar 7, 2024

We'll be truncating SDF Horizon's history retention to 1 year later this year. To our knowledge, most partners that enable history retention keep 1-3 months of history, so there could be issues that only present with this data profile (partial history, but with a large retention window) that we simply haven't seen or heard of yet.

We should dry-run the truncation and mirror traffic to it for some amount of time, observing the performance impact and resolving any issues that arise from this process. Given the timing and the need to continue using staging to test/issue releases of Horizon prior to the truncation, we should not be doing this on the staging cluster and will need to spin up a new/independent one.

At a minimum:

  • Spin up another staging-like cluster of Horizon (https://github.com/stellar/ops/issues/2900)
  • Upgrade PostgreSQL 12 ➡️ 16 (services/horizon: upgrade psql support to most recent versions #4831)
  • Enable reaping on that instance and set the retention to 1 year
    • There are different ways this can be accomplished and we need time to evaluate that. For example, we could turn on reaping on the whole DB and see what happens (which may result in a lockup due to the massive amount of data that needs to be reaped, plus a possible full vacuum) or we could start from scratch and ingest a year+ of data and then enable reaping, or there may be other options.
    • After discussion, it appears we must approach this by reaping the whole DB, because reingestion may take on the order of months. The hope 🤞 is that because the database will be much smaller, the full vacuum will be feasible without any extra operational concerns.
    • Some of the reaping performance epic ([Epic] Improving Reap Performance of History Lookup Tables #4870) may be applicable and needed here to ensure reaping performs acceptably on the truncated 1-year pubnet history DB.
    • The periodic reaping frequency should be configurable (services/horizon: Reap in batches of 100k ledgers per second, to play nicely with others #3823)
  • Document the operational plan needed to eventually repeat this process on the live blue/green production clusters (https://github.com/stellar/go-internal/issues/18)
  • Mirror traffic from production to this cluster
  • Brainstorm how we could (or if we need to) simulate load from transaction submission
  • Observe, identify, and resolve (or defer/prioritize) any performance degradation
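
To give a rough sense of the sizes involved in the "reap the whole DB" approach above, here is a minimal Go sketch. It is not Horizon's reaper code. It assumes an average ledger close time of about 5 seconds (an approximation), that retention is expressed as a count of ledgers, and it borrows the 100k-ledger batch size from the title of #3823. The names `retentionLedgers`, `planReapBatches`, and `ledgerRange`, and the ledger numbers in `main`, are hypothetical and only for illustration.

```go
package main

import "fmt"

// Assumption: pubnet closes a ledger roughly every 5 seconds on average.
const assumedLedgerCloseSeconds = 5

// retentionLedgers converts a retention window in days into an approximate
// number of ledgers, under the close-time assumption above.
func retentionLedgers(days int) uint32 {
	return uint32(days * 24 * 60 * 60 / assumedLedgerCloseSeconds)
}

// ledgerRange is an inclusive range of ledger sequence numbers.
type ledgerRange struct {
	From, To uint32
}

// planReapBatches splits the ledgers that fall outside the retention window
// into inclusive ranges of at most batchSize ledgers each. The newest
// `retention` ledgers (ending at `latest`) are kept; everything from
// `oldest` up to latest-retention is scheduled for reaping.
func planReapBatches(oldest, latest, retention, batchSize uint32) []ledgerRange {
	if batchSize == 0 || latest <= retention {
		return nil
	}
	cutoff := latest - retention // last ledger that should be reaped
	if oldest > cutoff {
		return nil
	}
	var batches []ledgerRange
	from := oldest
	for {
		to := from + batchSize - 1
		// Clamp to the cutoff, also guarding against uint32 wraparound.
		if to > cutoff || to < from {
			to = cutoff
		}
		batches = append(batches, ledgerRange{From: from, To: to})
		if to == cutoff {
			break
		}
		from = to + 1
	}
	return batches
}

func main() {
	retention := retentionLedgers(365)
	// Hypothetical ledger numbers, chosen only to make the example concrete.
	batches := planReapBatches(1, 50_000_000, retention, 100_000)
	fmt.Println("approx. ledgers retained for 1 year:", retention)
	fmt.Println("reap batches needed:", len(batches))
}
```

Splitting the initial reap into bounded ranges like this is one way to avoid a single huge delete. Whether it avoids the lockup concern mentioned above is exactly what the dry run would need to show.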

sreuland commented May 16, 2024

@tamirms, what is the latest status on spinning up the test DB cluster for this reaping test effort? I think you mentioned it was in progress but pending due to PG16 issues?

I ask because @aditya1702 and I are triaging reports of the reaper SQL becoming non-performant in Horizon deployments that ingest pubnet into their DB, such as the reaper timeouts a community member observed in issues/5299 and #5320.

Triaging the reported problem is very similar to doing the dry-run validation effort, so we can probably converge on this and join the effort to obtain reaper results in a staging environment, since it helps both cases?

Projects
Status: Done
5 participants
@Shaptic @tamirms @mollykarcher @sreuland and others