Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add decent error recovery and Ctrl-C handling #14

Open
jwodder opened this issue Nov 18, 2024 · 11 comments
Open

Add decent error recovery and Ctrl-C handling #14

jwodder opened this issue Nov 18, 2024 · 11 comments
Assignees
Labels
needs design decisions UX Relating to the user experience
Milestone

Comments

@jwodder
Copy link
Member

jwodder commented Nov 18, 2024

Prior discussion from dandi/dandi-utils#3:

jwodder:

Do you have any ideas or preferences for how the backup program should handle download failures? What about crashes or a Ctrl-C in the middle of a backup?

yarikoptic:

  • Ideally it should be robust to download failures, and be able to "track back", e.g. if it was under git like with backups2datalad -- git reset --hard; git clean -dfx . But here it is not feasible, hence we might want to explicitly "model" the "rollback" regime to get back to prior state -- remove fresh (complete or incomplete) downloads, undo mv's already done -- might be worth keeping a journal of operations or just being able to take prior state and "match" it.

jwodder:

Please elaborate on exactly what behavior you want.

yarikoptic:

rollback or match the prior state: add a function which would ensure that current tree is matching specific inventory one:

  • for a folder, go through the union of paths found in inventory, .dandi-s3-backup-versions.json, and on the drive (excluding .old. ones)
    • if file is on drive and/or .dandi-s3-backup-versions.json but not in inventory - remove file at {path} and from .dandi-s3-backup-versions.json
    • if file record in inventory not matching the one in .dandi-s3-backup-versions.json - remove from drive and the .dandi-s3-backup-versions.json
    • if file is not present on drive but in inventory - if there is a corresponding {path}.old.{versionid}.{etag} -- rename it to {path}, adjust .dandi-s3-backup-versions.json accordingly
      • if there is no .old. file -- fail, shouldn't happen

But while thinking about it, I realized that overall approach does not cover the case of key switching between being a file and directory.

  • When file becomes a directory -- all is easy, prior version gets renamed into {path}.old.{versionid}.{etag}, and for new {path}/ folder created.
  • When directory becomes a file -- just rename directory to {path}.old.dandi-s3-backup if such does not exist yet. If exists already -- nothing to be done.
    • need to add check for path to not end with .old.dandi-s3-backup into conflict detection above
    • to reconstruct some prior key for versionId we would need to inspect all parents to potentially carrying the .old.dandi-s3-backup suffix now

jwodder:

for a folder, go through the union of paths found in inventory,

Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.

yarikoptic:

yes, there is a scalability concern as we are expecting hundreds of millions entries (e.g https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted though -- might be quite easy since then all files in a folder would be a sequential batch and we would process that sequential batch from inventory + files on drive and in .dandi-s3-backup-versions.json only for that folder, which would be either tiny or some thousands -- not more at once.

jwodder:

But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.

yarikoptic:

Indeed. FWIW, aiming for incremental changes, I think we can minimize the amount of time when interruption would lead to requiring such a full blown roll back. E.g. if we

  1. do full analysis of what keys need to be downloaded, renamed, and removed without any changes to the data on drive. Interruption would result in non need for roll back or any other actions to cleanup
  2. do all necessary downloads into temporary space, e.g. .dandi-s3-backup-downloads/{versionid}.{etag} at the top folder. If interrupted in this stage -- would just need to rm -rf .dandi-s3-backup-downloads/.
  3. final stage: "expensive" to recover from if interrupted, hence interruption should be guarded (e.g. at least 3 CTrl-C's within 5 seconds, otherwise do not react): perform all planned rm, and mvs, and remove empty .dandi-s3-backup-downloads/ at the end (all files should be gone)
    • actually if we establish journal of those rm and mvs we could probably play them back as well leading to "lighter" way to recover, but I am not sure I would 100% trust it, thus running full fsck would still be desired for paranoid me.

WDYT? may be some better ways?

jwodder:

An alternative: Don't rollback on errors/Ctrl-C, just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by running the command with the date that the backup was previously at).

yarikoptic:

Re Alternative: it reads like what i have suggested as "rollback or match the prior state" feature above and you had a legitimate concern

But wouldn't the rollback have to be run against the entire backup tree, which would be huge?

or did you think back then I was suggesting it as something to do right upon error/Ctrl-C?

Also -- how would you know if current state of the backup is "legit" and not some partial one? I think we better have an explicit annotation for that.

In either case, I think that separating out analysis, from "fetching" and then actual "doing" as I suggested above might benefit greatly in minimizing time for when we could leave the backup in some incomplete/broken state. Don't you think so?

@jwodder jwodder added needs design decisions UX Relating to the user experience labels Nov 18, 2024
@jwodder jwodder added this to the v0.1.0 milestone Nov 18, 2024
@jwodder
Copy link
Member Author

jwodder commented Nov 25, 2024

@yarikoptic Is your suggested "analysis" step intended to be run over the entire file tree before doing anything? That would involve storing every single key in memory at once, which would likely be problematic.

@yarikoptic
Copy link
Member

Do you think it would be too tricky to allow for two modes -- "staged" (the "analysis" based above) and "interleaved"? I guess if was in Python could be at the level of what_todo "generator" vs "list" decision making.

My thinking: we would need a full heavy list only in the initial backup. That is where we can proceed in "interleaved" mode, just getting a new key from the inventory, checking if present locally and if not -- yield the "action" to do to the next step. In "staged" -- it would establish the full list of actions to tackle first, and since changes should not be multitude if ran frequently -- should fit into memory easily IMHO.

@jwodder
Copy link
Member Author

jwodder commented Nov 25, 2024

@yarikoptic That seems doable, though I really don't like the names "interleaved" and "staged". What about "immediate" and "pre-planned"?

I also have to point out that, based on my observations of previous runs of the program, the "staged"/"pre-planned" mode, when run on the dandiarchive inventories, will have to spend about 10 and a half hours just going over all the items before it gets around to actually doing anything.

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not? (Just what is the fsck you're referring to? I don't see how the fsck(8) command is relevant here.)

How do you expect removal of files from the local backup that don't correspond to any items in the inventories (#33) to work? Determining whether a file should be deleted basically requires storing all the CSV entries in memory at once.

@jwodder jwodder self-assigned this Nov 26, 2024
@yarikoptic
Copy link
Member

I have no strong opinion on names. "immediate" and "pre-planned" sound good.

will have to spend about 10 and a half hours just going over all the items before it gets around to actually doing anything.

I wonder why and either could be avoided e.g. via parallelization .

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not? (Just what is the fsck you're referring to?

sorry for likely using inappropriate terminology -- I often use fsck word for consistency-check. In the scope of this it would entail seeing local backup consistent with what is observed in inventory and the .s3invsync.versions.json (formerly .dandi-s3-backup-versions.json) files.

How do you expect removal of files from the local backup that don't correspond to any items in the inventories (#33) to work? Determining whether a file should be deleted basically requires storing all the CSV entries in memory at once.

Nothing 100% solid idea yet, but followed up on #33 (comment)

@jwodder
Copy link
Member Author

jwodder commented Dec 9, 2024

@yarikoptic

I wonder why and either could be avoided e.g. via parallelization .

Because there are over 400 CSVs to process, each with 300,000 lines, and they're already being processed concurrently (see the --inventory-jobs option).


You didn't answer this question:

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not?

@yarikoptic
Copy link
Member

I wonder why and either could be avoided e.g. via parallelization .

Because there are over 400 CSVs to process, each with 300,000 lines, and they're already being processed concurrently (see the --inventory-jobs option).

sure -- lots of work... do we know what it spends most of the time on? I am thinking about the path-filter and "printing"/logging . As for path-filter -- if regex, likely is the cpu intensive part and if was replaced with --path-prefix-filter to use simpler analog of python's ".startwith" but might shorten time considerably?

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not?

I guess so since we do need to be able to get into a "consistent state".

@jwodder
Copy link
Member Author

jwodder commented Jan 14, 2025

@yarikoptic I'm going to repeat my suggestion from an earlier comment:

An alternative: Don't rollback on errors/Ctrl-C, just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by running the command with the date that the backup was previously at).

We know the program can do (b), and (a) is possible as long as the program was allowed to clean up and wasn't killed by SIGTERM or SIGKILL.1 Thus, I suggest that we don't implement any clever rollbacks on errors and instead just leave the error handling as-is.

Footnotes

  1. If the program was killed by a signal, I believe the only consequence that could lead to errors on subsequent runs would be if the keys in an .s3invsync.versions.json file weren't updated to match the non-old filenames in the containing directory. It'd be possible to handle this scenario as well, likely by just redownloading whatever files weren't added to the JSON database.

@yarikoptic
Copy link
Member

I wonder if we could/should at least get some .s3invsync.state.json in the top directory which would reflect the state of the backup: latest invocation options, and whether was completed... or may be even a list of N last invocations? Then subsequent run on semi broken one could error out unless invoked for the same date state in attempt to mitigate/use (a)?

@jwodder
Copy link
Member Author

jwodder commented Jan 14, 2025

@yarikoptic What exactly would the goal of that be? Why must an incomplete run only be followed by a run with the same options?

@yarikoptic
Copy link
Member

I felt like for what you formulated as

so that a failed command can be immediately rerun, hopefully this time to completion

but I guess indeed no need to be the same -- it might then just "update" it to a new state. But overall, I would just like to have an opportunity to know that prior run was not completed ok.

@jwodder
Copy link
Member Author

jwodder commented Jan 14, 2025

@yarikoptic How about something like this: We add an .s3invsync.state.json file to the root of the backup containing the following fields:

  • last_backup_started — start time of the last backup
  • last_successful_backup_finished — end time of the last successful backup

Then, if last_successful_backup_finished is less than last_backup_started, it means the last backup was unsuccessful (or is currently in progress).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs design decisions UX Relating to the user experience
Projects
None yet
Development

No branches or pull requests

2 participants