
Delete obsolete Parquet and DuckDB files #2384

Open · severo opened this issue Feb 2, 2024 · 12 comments

Labels: bug (Something isn't working), P1 (Not as needed as P0, but still important/wanted)

severo (Collaborator) commented Feb 2, 2024

Replaces #1613 and #980.

When we call delete_dataset(), we should remove all the Parquet and DuckDB files.

And maybe even delete the refs/convert/parquet branch altogether?
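A minimal sketch of what that cleanup could look like, using huggingface_hub's HfApi.delete_branch. The helper name and the idea of calling it from delete_dataset() are assumptions, and refs/convert/duckdb is included per the comment below:

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError, RepositoryNotFoundError

# Branches populated by the dataset viewer; refs/convert/duckdb is mentioned below.
CONVERT_BRANCHES = ("refs/convert/parquet", "refs/convert/duckdb")

def delete_convert_branches(dataset: str, token: str) -> None:
    """Best-effort deletion of the viewer's convert branches for a dataset (hypothetical helper)."""
    api = HfApi(token=token)
    for branch in CONVERT_BRANCHES:
        try:
            api.delete_branch(repo_id=dataset, branch=branch, repo_type="dataset")
        except RepositoryNotFoundError:
            return  # dataset already deleted from the Hub: no branch to delete
        except HfHubHTTPError:
            pass  # branch doesn't exist: nothing to clean
```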

severo added the 'bug' and 'P1' labels on Feb 2, 2024
AndreaFrancis (Contributor) commented:
We should also delete refs/convert/duckdb.

severo (Collaborator, Author) commented Apr 29, 2024

Note that, if the dataset has been deleted from the Hub, there is no branch to delete :)

severo (Collaborator, Author) commented Jun 19, 2024

The only remaining cases:

  • the dataset is put on the blocklist (generally because its jobs take too much time)
  • the dataset is switched to private -> not a big issue to still have the files

I think we can close.

severo closed this as completed on Jun 19, 2024
severo (Collaborator, Author) commented Jul 8, 2024

https://huggingface.co/datasets/Cnam-LMSSC/vibravox/discussions/4#66854a97118934e841dcf35c

Yes, they are only updated when the viewer is updated; disabling the viewer didn't remove those files as it should have. We'll work on a mechanism to clean the branch when the viewer is disabled.

severo reopened this on Jul 8, 2024
severo self-assigned this on Jan 22, 2025
severo (Collaborator, Author) commented Jan 22, 2025

  • the dataset is switched to private -> not a big issue to still have the files

It is now an issue, because private storage is billed: https://huggingface.co/docs/hub/en/storage-limits#storage-plans.

Note that the dataset viewer is currently enabled on private datasets for PRO users and Enterprise Hub orgs, so the Parquet and DuckDB files count toward the storage limit. See #1774 for a way to reduce this storage for them (and for HF in public repos).

severo (Collaborator, Author) commented Jan 22, 2025

Various ways to implement it:

The first way is a task called by the webhook service. Possible drawback: started jobs might generate conflicts and recreate the branches after their deletion.

Another way is to create a new kind of job dedicated to deleting things. It would have the highest priority, but would be blocked by the other existing jobs for the same dataset, so that these have time to finish and we avoid conflicts.

A third way is to be able to kill the started jobs before deleting the branches. I'm not sure how we would implement the ability to interrupt a job. Maybe we could have a collection of dataset states (old idea): we would mark the dataset as "deleted" (or delete it from the collection), and every job would check that the dataset still exists (+ has the correct revision) before running, or even multiple times (before every critical step) in jobs that take time.
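A minimal sketch of that third option, assuming a MongoDB collection of dataset states; the collection name, fields, and job steps are all hypothetical:

```python
from pymongo import MongoClient

# Hypothetical collection of dataset states (name and fields are assumptions).
client = MongoClient("mongodb://localhost:27017")
dataset_states = client["maintenance"]["dataset_states"]

def dataset_is_current(dataset: str, revision: str) -> bool:
    """True if the dataset is still present with the expected revision and not marked deleted."""
    state = dataset_states.find_one({"dataset": dataset})
    return (
        state is not None
        and state.get("revision") == revision
        and not state.get("deleted", False)
    )

def heavy_step_1(dataset: str, revision: str) -> None:
    ...  # e.g. convert to Parquet

def heavy_step_2(dataset: str, revision: str) -> None:
    ...  # e.g. build the DuckDB index

def run_job(dataset: str, revision: str) -> None:
    # Check before starting...
    if not dataset_is_current(dataset, revision):
        return
    heavy_step_1(dataset, revision)
    # ...and again before every critical step, so a deletion can interrupt the job.
    if not dataset_is_current(dataset, revision):
        return
    heavy_step_2(dataset, revision)
```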

Interested in your opinions @AndreaFrancis @lhoestq (+ I haven't touched the code for a while, and I might be forgetting things)

AndreaFrancis (Contributor) commented:
Also, when a pipeline runs, it keeps old stuff even if some steps fail. We should probably clear that out to avoid confusion (but I am unsure, since I also haven't touched the code for a long time).

I wonder if we could determine which Kubernetes container runs each job. That might make it easier to stop them if needed. Or maybe we could use something like Kafka? We already have a way to stop jobs, but we'd need to be careful handling the messages, since there could be a lot at once.

I was also thinking about a new type of job specifically for datasets. Maybe this could be the first job in the pipeline. Then, if the dataset changes, we could automatically delete anything that depends on it. There might still be some things left over, but we could have a separate process to clean those up.

Also, we could probably do much of this cleaning during the backfill process. Right now, it just checks whether the dataset exists. We could also check whether it's blocked, private, or disabled, and delete everything related to it. We could run this cleaner more often than the regular backfill (which might be easier to implement; it's similar to doing it on the webhook, but I'm not sure whether there is any advantage/disadvantage to doing it in the service or in a Kubernetes job).
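A rough sketch of that backfill-time check, reusing the hypothetical delete_convert_branches helper from the first sketch; the should_clean criteria and the blocklist handling are assumptions:

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

def should_clean(dataset: str, blocklist: set[str], token: str) -> bool:
    """Assumed criteria: blocked, deleted, or private datasets get their convert branches removed."""
    if dataset in blocklist:
        return True
    api = HfApi(token=token)
    try:
        info = api.dataset_info(dataset)
    except RepositoryNotFoundError:
        return True  # the dataset no longer exists on the Hub
    # Checking `viewer: false` in the dataset card is left out of this sketch (assumption).
    return bool(info.private)

def cleaning_pass(datasets: list[str], blocklist: set[str], token: str) -> None:
    """Cron-style pass over known datasets, deleting obsolete convert branches."""
    for dataset in datasets:
        if should_clean(dataset, blocklist, token):
            delete_convert_branches(dataset, token)  # hypothetical helper from the first sketch
```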

severo (Collaborator, Author) commented Jan 23, 2025

Good points. Is it worth doing this big refactoring?

lhoestq (Member) commented Jan 28, 2025

"started jobs might generate conflicts and recreate the branches after their deletion"

The viewer is enabled if (PRO or PUBLIC and not viewer: false), right? So we do have a way to prevent existing jobs from writing to the refs/convert/parquet branch of a disabled dataset.

Moreover, we use a lock when we write to refs/convert/parquet, so we could add a new job that cleans the branch and uses the same locking mechanism.

When the cleaning job runs, it locks the branch, then checks that the user didn't re-enable the viewer in the meantime, and finally cleans the branch.

WDYT? Anyway, if there is a 0.0001% case where this doesn't work, we can still add a backfill-like task later.
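A minimal sketch of that lock-check-clean sequence; branch_lock stands in for the existing locking mechanism, viewer_is_enabled is a placeholder for the real check, and delete_convert_branches is the hypothetical helper from the first sketch:

```python
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def branch_lock(dataset: str, branch: str) -> Iterator[None]:
    """Stand-in for the lock already taken when jobs write to refs/convert/parquet."""
    # acquire(dataset, branch) would go here
    try:
        yield
    finally:
        # release(dataset, branch) would go here
        pass

def viewer_is_enabled(dataset: str) -> bool:
    """Stand-in for the real check: (PRO or PUBLIC and not viewer: false)."""
    raise NotImplementedError

def clean_branch_job(dataset: str, token: str) -> None:
    """Lock the branch, re-check the viewer state, then clean (sequence from the comment above)."""
    with branch_lock(dataset, "refs/convert/parquet"):
        if viewer_is_enabled(dataset):
            # The user re-enabled the viewer while this job was queued: do nothing.
            return
        delete_convert_branches(dataset, token)
```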

severo (Collaborator, Author) commented Jan 28, 2025

Of all the mentioned options, the cron job mentioned by @AndreaFrancis might offer the best tradeoff:

  • it's the simplest option: only a script, no new job, no change in the jobs' logic, no new collection
  • drawback: we might have to wait for some hours before cleaning the storage. Totally acceptable, no?

lhoestq (Member) commented Jan 28, 2025

yup sounds good

severo (Collaborator, Author) commented Jan 28, 2025

OK. I'll work on it after #3131 (which should be more impactful, IMHO).
