Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete files from backup that don't match any inventory items #33

Open
jwodder opened this issue Nov 19, 2024 · 10 comments · May be fixed by #117
Open

Delete files from backup that don't match any inventory items #33

jwodder opened this issue Nov 19, 2024 · 10 comments · May be fixed by #117

Comments

@jwodder
Copy link
Member

jwodder commented Nov 19, 2024

No description provided.

@yarikoptic
Copy link
Member

If we are to avoid storing the full list of files in memory to figure out which aren't in inventory, some raw ideas to see if anything sticks

"treeification"/sorting of inventory paths

I think s5cmd relies on order as well, and does sorting of lists disk to accommodate large listings.
If listing is sorted, could be "zipped" with (sorted) local tree navigation and remove local files as soon as not observed in sorted inventory listing.

cons:

  • would require n*log(n) time to sort first if inventory not sorted.
  • would require "smart" sorting to operate in reasonable time and memory available

Move in place

abuse filesystem: upon initiating backup, do mv * .moveback/ and upon encountering a new key, move all prior versions (could be a full folder if a single file like likely to be in our dandi keystore for blobs/) in place. At the end .moveback/ would contain files which are no longer in inventory. Then we could utilize mtime to enforce some policy to retain only up until some age.

pros:

  • no extra memory involvement at all

cons:

  • would be quick for initial mv but then due to need to recreate all subfolders
  • recovery from interrupted operation would also be tricky

store only "hash" of the path

At 500 million keys, if we use e.g. md5sum we need only 16 bytes per each file, should require only ~8GB of RAM to store the listing of all paths in inventory. We would need it though as a "set", so might take more but still not prohibitively. Then for each path on filesystem we could check if in the set and remove if not.

cons:

  • not sure how fast lookup in a set of such size would be and how likely false positives/collisions
  • would still be n*log(n) if lookup log(n) so similar to "sort first" anyways

@jwodder
Copy link
Member Author

jwodder commented Dec 9, 2024

@yarikoptic Regarding sorting inventory paths, while each individual CSV file seems to be sorted, the order in which the CSVs are listed in the manifest does not preserve overall ordering, so getting a properly-sorted list would still involve retrieving & storing all keys in memory first.

@yarikoptic
Copy link
Member

ha -- interesting! So, we could potentially fetch first e.g. 1kb of each csv (quick), extract and thus order CSVs and proceed in sorted order across them?

@jwodder
Copy link
Member Author

jwodder commented Dec 19, 2024

@yarikoptic Here's a sketch of a potential way to implement this:

  • Download each CSV file, store them in a temporary directory all at once, and fetch the key from the first line of each one

    • Alternatively, as you mentioned, we could just fetch an initial portion of each file, enough to get the first line, but the files are gzipped, so we'd need to at least determine how long the gzip header is.
  • Iterate over the CSVs one at a time, ordered by first-line key

    • Since we'd still be processing the same number of objects at once, I don't think switching to linear CSV processing would slow things down much.
  • For each CSV entry, before downloading, add the entry's path to a "tree tracker" structure that stores all entries seen so far in the slice of the directory tree that's currently being processed.

    • If the tree tracker is given paths that aren't in sorted order, error out immediately.
  • When the tree tracker receives a path that indicates that we're past the "current directory" (e.g., receiving foo/quux/apple.txt after foo/bar/banana.txt indicates that we've seen everything in foo/bar/), spawn a task to delete extraneous files in that directory. The task is given a list of all files & directories that should be in the directory, and everything else is deleted.

    • I might be able to code this so that the "delete extraneous files" task doesn't start until after all downloads to the directory have completed. This would ensure that temporary download files left behind for some reason are cleaned up, and it would prevent problems that might arise from iterating over a directory that's being modified. On the other hand, if one file in the directory takes a long time to download, we could end up with a backlog of tasks.
    • Question: How should --path-filter be handled for this? Should paths that don't match the regex always be deleted? Should the deletion code act as though --path-filter wasn't given? Something else?

Thoughts?

@yarikoptic
Copy link
Member

yarikoptic commented Dec 19, 2024

question: does code now wait for download of all inventory files or starts processing entries as soon as first inventory file comes in? or would it be hard to refactor to start process as soon as first one comes in?

I am asking because if we could start processing (CPU bound, not IO) as soon as the first (in desired order) CSV is fetched, then I would prefer the "Alternatively," since it would allow for a quick fetch of first blocks [*], order quickly and then start downloading/using CSV files -- likely would be a wortwhile optimization.

  • indeed worth checking on gzip header.. chatgpt suggest that it is at least 10 bytes and suggests to fetch 100, so fetching 1kb must be enough for header + first row(s)

your logic sounds correct to me besides seems not spelling out logic for root directory (nothing special there I guess though -- just at the end of the loop when leaving it - do the same). ping @asmacdo for a review of it since we had to go through a similar one in dandi/dandi-hub#199 (comment) so worth extra "comprehension".

re Question: good question.

  • The difference between two alternatives, if I got it right, is that if --path-filter really just tells what to download, but otherwise performs the same actions (so e.g. if there is no early shortcut of not looking into files under b/ if --path-filter ^a), then in the latter ("as though --path-filter wasn't given") case we would update/download files matching the --path-filter and then remove only those which were removed on S3.
  • Someone could have used --path-filter for some "fixup of sync for only some files", but ATM we use it as overall restriction on what to consider for syncing, so I would not expect it to "get changed" for a single run.
    • with that in mind -- first first choice ("always delete not matching") sounds good enough and might be easier to implement. But would warrant a warning in documentation/--help.
    • if you see people using --path-filter for some runs only, might be worth going with the 2nd option.

overall -- either of those would fit our use case AFAIK, you decide.

@jwodder
Copy link
Member Author

jwodder commented Dec 19, 2024

@yarikoptic

does code now wait for download of all inventory files or starts processing entries as soon as first inventory file comes in?

Entries are processed as soon as the first CSV is downloaded. Specifically, there are at most 201 CSV tasks running at all times; each one downloads a CSV to disk and then, once it's fully downloaded, it reads through it and feeds the entries to the object-downloading tasks. Once the end of a CSV file is reached, the respective task starts work on a new CSV.

Note that, regardless of whether CSVs' headers are fetched in advance, my proposal requires only one CSV to be iterated over at a time in order to ensure that entries are processed in strictly-ascending sorted order. Also, fetching headers would still require fetching the header for each & every CSV before beginning to iterate over any of them in full, in order to ensure we process the CSVs in the right order.

Footnotes

  1. This number can be changed via the --inventory-jobs option.

@jwodder jwodder linked a pull request Jan 7, 2025 that will close this issue
4 tasks
@jwodder
Copy link
Member Author

jwodder commented Jan 8, 2025

@yarikoptic FYI, I started out by implementing fetching the initial portion of the CSVs, sorting by first key, and then operating on one CSV at a time. As you can see from the run report, this doubled the runtime for the test case.

@yarikoptic
Copy link
Member

yarikoptic commented Jan 8, 2025

thanks for the update. where do you see the double (seems more like 50% increase)?

Runtime: 2 hours, 48.913 seconds
Maximum memory usage: 101.5 MB
Maximum CPU usage: 414.0%

Runtime: 2 hours, 55 minutes, 42.074 seconds
Maximum memory usage: 190.6 MB
Maximum CPU usage: 872.0%

it did use twice the cpu though (which is kinda odd but I hope ok given that it needs to do more analysis now for delete)

@yarikoptic
Copy link
Member

but overall point that indeed took longer... but just a single sample so may be bandwidth etc effected... may be due to higher cpu utilization got indeed a bit slower with downloads though. Would need some kind of profiling to figure out, but I guess it is ok for now as long as it would be taking care about deletions correctly as well.

@jwodder
Copy link
Member Author

jwodder commented Jan 8, 2025

@yarikoptic You need to compare against this run. The one you're looking at is for the dandiarchive-inventory bucket, which I did not test on due to it only having a single CSV file in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants