Delete obsolete Parquet and DuckDB files #2384
We should also delete
Note that, if the dataset has been deleted from the Hub, there is no branch to delete :)
The only cases:
I think we can close.
https://huggingface.co/datasets/Cnam-LMSSC/vibravox/discussions/4#66854a97118934e841dcf35c
It is now an issue because the private storage is billed: https://huggingface.co/docs/hub/en/storage-limits#storage-plans. Note that currently the dataset viewer is enabled on private datasets for PRO users and Enterprise Hub orgs, and thus the parquet and duckdb files are taken into account in the storage limit. See #1774 for a way to reduce the storage for them (and for HF in public repos).
Various ways to implement it

The first way is in a task called by the webhook service (see the sketch below). Possible drawbacks:
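A minimal sketch of what this first way could look like, assuming a hypothetical cleanup task called by the webhook service. Only `HfApi.delete_branch` is a real `huggingface_hub` call; the function name, parameters, and the assumption that the duckdb files live on the same branch are illustrative, not the actual dataset-viewer code.

```python
# Hypothetical sketch of a cleanup task triggered by the webhook service when a
# dataset no longer qualifies for the viewer (e.g. blocked, disabled, or private
# without PRO/Enterprise). Only HfApi.delete_branch is a real huggingface_hub call.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

PARQUET_REVISION = "refs/convert/parquet"


def delete_obsolete_files(dataset: str, hf_token: str) -> None:
    """Delete the conversion branch that stores the generated files for this dataset."""
    api = HfApi(token=hf_token)
    try:
        api.delete_branch(repo_id=dataset, branch=PARQUET_REVISION, repo_type="dataset")
    except HfHubHTTPError:
        # Nothing to do if the repository or the branch is already gone
        # (e.g. the dataset itself was deleted from the Hub).
        pass
```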
Another way I see is to create a new kind of job that aims at deleting things: it would have the highest priority, but should be blocked by other existing jobs for the same dataset, so that these have time to finish and we avoid conflicts.

A third way is to be able to kill the started jobs before deleting the branches. I'm not sure how we would implement the ability to interrupt a job. Maybe we could have a collection of the dataset states (old idea): we would mark the dataset as "deleted" (or delete it from the collection), and every job would check that the dataset still exists (and has the correct revision) before running, or even multiple times (before every critical step) in jobs that take time.

Interested in your opinions @AndreaFrancis @lhoestq (+ I haven't touched the code for a while, and I might be forgetting things)
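A rough sketch of the "check the dataset state before every critical step" idea from the third way. The `DatasetState` store and every name below are made up for illustration; this is not existing code.

```python
# Illustrative sketch: a job re-checks a hypothetical "dataset states" collection
# before each critical step, and aborts if the dataset was deleted or updated.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DatasetState:
    revision: str
    deleted: bool = False


class DatasetDeletedError(Exception):
    """Raised when the dataset was deleted or updated while the job was running."""


def check_still_current(states: Dict[str, DatasetState], dataset: str, revision: str) -> None:
    state: Optional[DatasetState] = states.get(dataset)
    if state is None or state.deleted or state.revision != revision:
        raise DatasetDeletedError(f"{dataset}@{revision} is gone or outdated, aborting")


def run_job(
    states: Dict[str, DatasetState],
    dataset: str,
    revision: str,
    steps: List[Callable[[], None]],
) -> None:
    # Re-check before every critical step so a long job notices a deletion quickly.
    for step in steps:
        check_still_current(states, dataset, revision)
        step()
```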
Also, when a pipeline runs, it keeps old entries around even if some steps fail. We should probably clear those out to avoid confusion (but I am unsure, since I also haven't touched the code in a long time).

I wonder if we could determine which Kubernetes container runs each job; that might make it easier to stop them if needed. Or maybe we could use something like Kafka? We already have a way to stop jobs, but we'd need to be careful handling messages, since there could be a lot at once.

I was also thinking about a new type of job specifically for datasets. Maybe this could be the first job in the pipeline; then, if the dataset changes, we could automatically delete anything that depends on it. There might still be some leftovers, but we could have a separate process to clean those up.

Also, we could probably do much of this cleaning during the backfill process. Right now, it just checks if the dataset exists; we could also check if it is blocked, private, or disabled, and delete everything related to it. We could run this cleaner more often than the regular backfill (which might be easier to implement; it's similar to doing it on the webhook, though I'm not sure whether there is any advantage or disadvantage to doing it in the service versus in a Kubernetes job).
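A sketch of what such an extended backfill/cron cleaner could check. All helpers are passed in as placeholders; none of these names exist in the codebase.

```python
# Sketch of the extended backfill/cron cleaner described above. All callbacks are
# hypothetical stand-ins for checks that would live in the dataset-viewer services.
from typing import Callable, Iterable


def clean_obsolete_datasets(
    datasets: Iterable[str],
    exists_on_hub: Callable[[str], bool],
    is_blocked: Callable[[str], bool],
    is_private_without_viewer: Callable[[str], bool],
    is_disabled: Callable[[str], bool],
    delete_artifacts: Callable[[str], None],
) -> None:
    """Delete the cached entries, parquet/duckdb files and branches of datasets
    that no longer qualify for the viewer (deleted, blocked, private, disabled)."""
    for dataset in datasets:
        if (
            not exists_on_hub(dataset)
            or is_blocked(dataset)
            or is_private_without_viewer(dataset)
            or is_disabled(dataset)
        ):
            delete_artifacts(dataset)
```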
Good points. Is it worth doing this big refactoring?
The viewer is enabled if (PRO or PUBLIC and not

Moreover we use a lock when we write to

When the cleaning job runs, it locks the branch, then it checks that the user didn't re-enable the viewer in the meantime, and finally cleans the branch.

WDYT? Anyway, if there is a 0.0001% chance that this doesn't work, we can still add a backfill-like task later.
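A sketch of the lock / re-check / clean sequence described above. The lock context manager and both callbacks are hypothetical stand-ins for the existing mechanisms.

```python
# Sketch of the "lock the branch, re-check, then clean" sequence. The lock below
# is a no-op placeholder for the real lock used when writing to the parquet branch.
from contextlib import contextmanager
from typing import Callable, Iterator

PARQUET_REVISION = "refs/convert/parquet"


@contextmanager
def branch_lock(dataset: str, branch: str) -> Iterator[None]:
    # Placeholder: the real implementation would take a distributed lock.
    yield


def clean_parquet_branch(
    dataset: str,
    viewer_is_enabled: Callable[[str], bool],
    delete_branch: Callable[[str, str], None],
) -> None:
    with branch_lock(dataset, PARQUET_REVISION):
        # The user may have re-enabled the viewer in the meantime: keep the branch.
        if viewer_is_enabled(dataset):
            return
        delete_branch(dataset, PARQUET_REVISION)
```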
Of all the mentioned options, the cron job mentioned by @AndreaFrancis might be the most adequate tradeoff:
yup sounds good
OK. I'll work on it after #3131 (which should be more impactful IMHO) |
Replaces #1613 and #980.

When we call `delete_dataset()`, we should remove all the parquet and duckdb files. And maybe even the `refs/convert/parquet` branch altogether?
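A hedged sketch of what an extended `delete_dataset()` could do, per the proposal above. Both callbacks are hypothetical; this is not the actual dataset-viewer implementation.

```python
# Sketch of an extended delete_dataset(): the existing cache cleanup plus the
# deletion of the converted files proposed in this issue.
from typing import Callable


def delete_dataset(
    dataset: str,
    delete_cache_entries: Callable[[str], None],
    delete_parquet_and_duckdb: Callable[[str], None],
) -> None:
    # Existing behaviour: drop the cached responses and pending jobs.
    delete_cache_entries(dataset)
    # Proposed addition: remove the parquet and duckdb files
    # (e.g. by deleting the refs/convert/parquet branch altogether).
    delete_parquet_and_duckdb(dataset)
```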