-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete files from backup that don't match any inventory items #33
Comments
If we are to avoid storing the full list of files in memory to figure out which aren't in inventory, some raw ideas to see if anything sticks "treeification"/sorting of inventory pathsI think s5cmd relies on order as well, and does sorting of lists disk to accommodate large listings. cons:
Move in placeabuse filesystem: upon initiating backup, do pros:
cons:
store only "hash" of the pathAt 500 million keys, if we use e.g. md5sum we need only 16 bytes per each file, should require only ~8GB of RAM to store the listing of all paths in inventory. We would need it though as a "set", so might take more but still not prohibitively. Then for each path on filesystem we could check if in the set and remove if not. cons:
|
@yarikoptic Regarding sorting inventory paths, while each individual CSV file seems to be sorted, the order in which the CSVs are listed in the manifest does not preserve overall ordering, so getting a properly-sorted list would still involve retrieving & storing all keys in memory first. |
ha -- interesting! So, we could potentially fetch first e.g. 1kb of each csv (quick), extract and thus order CSVs and proceed in sorted order across them? |
@yarikoptic Here's a sketch of a potential way to implement this:
Thoughts? |
question: does code now wait for download of all inventory files or starts processing entries as soon as first inventory file comes in? or would it be hard to refactor to start process as soon as first one comes in? I am asking because if we could start processing (CPU bound, not IO) as soon as the first (in desired order) CSV is fetched, then I would prefer the "Alternatively," since it would allow for a quick fetch of first blocks [*], order quickly and then start downloading/using CSV files -- likely would be a wortwhile optimization.
your logic sounds correct to me besides seems not spelling out logic for root directory (nothing special there I guess though -- just at the end of the loop when leaving it - do the same). ping @asmacdo for a review of it since we had to go through a similar one in dandi/dandi-hub#199 (comment) so worth extra "comprehension". re Question: good question.
overall -- either of those would fit our use case AFAIK, you decide. |
Entries are processed as soon as the first CSV is downloaded. Specifically, there are at most 201 CSV tasks running at all times; each one downloads a CSV to disk and then, once it's fully downloaded, it reads through it and feeds the entries to the object-downloading tasks. Once the end of a CSV file is reached, the respective task starts work on a new CSV. Note that, regardless of whether CSVs' headers are fetched in advance, my proposal requires only one CSV to be iterated over at a time in order to ensure that entries are processed in strictly-ascending sorted order. Also, fetching headers would still require fetching the header for each & every CSV before beginning to iterate over any of them in full, in order to ensure we process the CSVs in the right order. Footnotes
|
@yarikoptic FYI, I started out by implementing fetching the initial portion of the CSVs, sorting by first key, and then operating on one CSV at a time. As you can see from the run report, this doubled the runtime for the test case. |
thanks for the update. where do you see the double (seems more like 50% increase)? Runtime: 2 hours, 48.913 seconds
Runtime: 2 hours, 55 minutes, 42.074 seconds it did use twice the cpu though (which is kinda odd but I hope ok given that it needs to do more analysis now for delete) |
but overall point that indeed took longer... but just a single sample so may be bandwidth etc effected... may be due to higher cpu utilization got indeed a bit slower with downloads though. Would need some kind of profiling to figure out, but I guess it is ok for now as long as it would be taking care about deletions correctly as well. |
@yarikoptic You need to compare against this run. The one you're looking at is for the dandiarchive-inventory bucket, which I did not test on due to it only having a single CSV file in the first place. |
No description provided.
The text was updated successfully, but these errors were encountered: