migrate more backups to BackBlaze to reduce costs #291

Open · 10 of 13 tasks
rahulbot opened this issue May 24, 2024 · 8 comments
@rahulbot (Contributor) commented May 24, 2024

Following up on #270, we want to continue migrating backups from S3 to B2. This should include:

  • rss-fetcher postgres backups (old files migrated, new files written to B2)
  • start writing production WARC files to B2
  • start writing 2022 CSV backfill WARC files to B2
  • start writing 2022 RSS backfill WARC files to B2
  • stop writing production WARC files to S3
  • stop writing 2022 backfill WARC files to S3
  • transfer old story-indexer archive (WARC) files, some files at ramos:/srv/data/docker/indexer/worker_data/archiver/
  • create public mediacloud-public bucket, requires verified email address
  • transfer rss-fetcher synthetic RSS files: files in tarbell:/space/dokku/data/storage/rss-fetcher-storage/rss-output-files/
  • transfer historic synthetic RSS files: files in tarbell:/space/S3/mediacloud-public/daily-rss/
  • web-app postgres backups (old files migrated, new files written to B2)
  • ES snapshots
  • other mish-mash of historical files on S3?

2024-06-26 update: all production stacks (daily, 2022 CSV, and 2022 RSS) are writing to both S3 and B2.
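For the transfer items in the checklist above, one possible approach is to copy objects between the two stores over B2's S3-compatible API. The sketch below is illustrative only: the destination bucket name and the inline credentials are assumptions, and the actual migration could just as well use rclone or the b2 CLI.

import boto3

# Illustrative sketch: copy backup objects from an S3 bucket to a B2 bucket via
# B2's S3-compatible API. The destination bucket name and the inline credentials
# are placeholders; the source bucket name is taken from the paths in this issue.
SRC_BUCKET = "mediacloud-rss-fetcher-backup"
DST_BUCKET = "mediacloud-rss-fetcher-backup"   # assumed B2 bucket of the same name

s3 = boto3.client("s3")  # AWS credentials from the usual environment/config files
b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east-005.backblazeb2.com",  # B2 S3-compatible endpoint
    aws_access_key_id="B2_APPLICATION_KEY_ID",      # placeholder
    aws_secret_access_key="B2_APPLICATION_KEY",     # placeholder
)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Stream each object out of S3 and straight into B2 without touching disk.
        body = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        b2.upload_fileobj(body, DST_BUCKET, key)
        print("copied", key)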

@rahulbot rahulbot added enhancement New feature or request infrastructure labels May 24, 2024
@rahulbot rahulbot added this to the Production Beta 7 milestone May 24, 2024
@philbudne (Contributor) commented May 24, 2024

  1. Do you want to migrate old rss-fetcher PG dumps to B2?
  2. Thoughts on a retention policy? (I wrote a program that can keep N of yearly, monthly, weekly (Sunday), and daily dumps.)
  3. Re: RSS files: make a public (mediacloud-public) bucket? Subdir names: daily-rss (for rss-fetcher), legacy-rss (for the legacy system)?
  4. web-app PG dumps: migrate old ones? Retention policy? (See 1 & 2 above.)

@rahulbot (Contributor, Author) commented:

  1. Old rss-fetcher PG dumps: I don't think we need them, though it might be good to grab a handful and transfer them for longevity: perhaps the first of each month in 2024 so far?
  2. Retention: For rss-fetcher and web-app my first thought is: keep the last ∞ yearly (i.e. all), last 6 monthly, last 8 weekly, and last 30 daily. Totally open to alternatives (see the sketch after this list).
  3. Synthetic RSS files: Your suggestion sounds great. It isn't codified anywhere, but I do feel we have a responsibility to keep our "daily discovered url" files available publicly in perpetuity. And to be honest, moving serving of these from S3 to B2 will surface any users we don't know about who are consuming them, which will be good to know about.
  4. web-app dumps: I'd treat these the same way as I suggested for (1) and (2) above, i.e. grab a reasonable set of monthlies to migrate and then apply the same retention policy.
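
A minimal sketch of the kind of keep-N retention selection discussed in (2). This is illustrative only, not philbudne's actual program; the defaults follow the proposal above (all yearly, 6 monthly, 8 weekly Sunday dumps, 30 daily).

from datetime import date

# Illustrative keep-N retention selection over dated dumps (not the real program).
# Defaults follow the proposal above: keep all yearly, 6 monthly, 8 weekly (Sunday),
# and 30 daily dumps.
def select_keepers(dump_dates, monthly=6, weekly=8, daily=30):
    dates = sorted(set(dump_dates), reverse=True)  # newest first
    keep = set()

    # Daily: the newest N dumps.
    keep.update(dates[:daily])

    # Weekly: the newest N dumps that fall on a Sunday.
    sundays = [d for d in dates if d.weekday() == 6]
    keep.update(sundays[:weekly])

    # Monthly: the newest dump in each of the N most recent months that have one.
    months_seen = []
    for d in dates:
        if (d.year, d.month) not in months_seen:
            months_seen.append((d.year, d.month))
            if len(months_seen) <= monthly:
                keep.add(d)

    # Yearly: the newest dump in every year (i.e. keep all yearlies).
    years_seen = set()
    for d in dates:
        if d.year not in years_seen:
            years_seen.add(d.year)
            keep.add(d)

    return sorted(keep)  # everything not returned is a candidate for deletion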

@philbudne (Contributor) commented:

Today I:

  • created a B2 mediacloud-mcweb-backup folder
  • migrated selected rss-fetcher PG backups from S3 to B2; files at tarbell:/space/S3/mediacloud-rss-fetcher-backup/xfer
  • migrated selected mcweb PG backups from S3 to B2; files at tarbell:/space/S3/mediacloud-mcweb-backup/xfer
  • changed mcweb PG backups to B2 and tested
  • downloaded legacy RSS files to tarbell:/space/S3/mediacloud-public/daily-rss/
  • updated the checklist at the top of this issue

I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.
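
For reference, a sketch of what creating the public bucket with b2sdk could look like once the email is verified. This is not necessarily the exact call that failed, and the key ID/key values are placeholders.

from b2sdk.v2 import B2Api, InMemoryAccountInfo

# Illustrative only; key ID/key values are placeholders, and this may not be
# the exact call that failed above.
info = InMemoryAccountInfo()
api = B2Api(info)
api.authorize_account("production", "B2_APPLICATION_KEY_ID", "B2_APPLICATION_KEY")

# Create the public bucket; this is the step that was rejected until the
# account email address was verified.
api.create_bucket("mediacloud-public", "allPublic")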

@philbudne (Contributor) commented:

Regarding WARC files:

There are about 10K WARC files, taking up 1.8 TB, on ramos (November 2023 through early March 2024).
There are about 75K WARC files in the S3 mediacloud-indexer-archive bucket, taking up about 13 TB.

So we could be talking about $1000 to transfer the WARC files we don't have locally.
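
(A back-of-envelope check, assuming standard S3 internet egress pricing of roughly $0.09/GB: 13 TB ≈ 13,000 GB × $0.09/GB ≈ $1,170, so ~$1000 is the right order of magnitude. The ~1.8 TB already on ramos can be uploaded to B2 without paying S3 egress.)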

@philbudne (Contributor) commented:

Now writing new current-day WARC files to both B2 and S3

@rahulbot (Contributor, Author) commented:

> I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.

@philbudne I was able to poke around the settings page and verify my email. Please test again at your convenience and let me know if it still fails.

@philbudne (Contributor) commented Jun 7, 2024

Did a bit of googling on how to set ES to use a specific S3 API URL for Backblaze:

elastic/elasticsearch#21283 (comment)

B2 has S3 compatible API. It works fine for us. We are using a snapshot like this:

{
  "type": "s3",
  "settings": {
    "bucket": "elastic-backup",
    "region": "",
    "endpoint": "s3.us-west-001.backblazeb2.com"
  }
}

In our case the endpoint would be s3.us-east-005.backblazeb2.com
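
A sketch of registering such a repository against our cluster, assuming the repository-s3 plugin is installed and the B2 application key is already in the Elasticsearch keystore (s3.client.default.access_key / s3.client.default.secret_key); the cluster URL, repository name, and bucket name below are placeholders.

import requests

# Illustrative: register a B2 bucket as an S3-type snapshot repository.
# Host, repository name, and bucket name are placeholders; credentials live in
# the Elasticsearch keystore, not in this request.
ES_URL = "http://localhost:9200"
REPO = "b2_snapshots"

repo_config = {
    "type": "s3",
    "settings": {
        "bucket": "mediacloud-elastic-backup",          # placeholder bucket name
        "region": "",
        "endpoint": "s3.us-east-005.backblazeb2.com",   # our B2 endpoint
    },
}

resp = requests.put(f"{ES_URL}/_snapshot/{REPO}", json=repo_config)
resp.raise_for_status()

# Read the repository back to confirm registration.
print(requests.get(f"{ES_URL}/_snapshot/{REPO}").json())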

@pgulley (Member) commented Jul 24, 2024

I've broken out the task of "closing S3 writes" into a new issue (#316). I'll leave this as a reference to the longer-term task of extracting data from S3 once we're no longer writing to it.

@pgulley pgulley modified the milestones: 2 - July, long-term Jul 24, 2024