
Delete obsolete Parquet and DuckDB files #2384

Open · severo opened this issue Feb 2, 2024 · 12 comments

Labels: bug (Something isn't working), P1 (Not as needed as P0, but still important/wanted)

severo (Collaborator) commented Feb 2, 2024

Replaces #1613 and #980.

When we call delete_dataset(), we should remove all the Parquet and DuckDB files.

And maybe even delete the refs/convert/parquet branch altogether?
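A minimal sketch of what that cleanup could look like, using huggingface_hub's HfApi.delete_branch. The helper name and the idea of calling it from delete_dataset() are assumptions, and refs/convert/duckdb is included per the comment below:

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError, RepositoryNotFoundError

# Branches populated by the dataset viewer; refs/convert/duckdb is mentioned below.
CONVERT_BRANCHES = ("refs/convert/parquet", "refs/convert/duckdb")

def delete_convert_branches(dataset: str, token: str) -> None:
    """Best-effort deletion of the viewer's convert branches for a dataset (hypothetical helper)."""
    api = HfApi(token=token)
    for branch in CONVERT_BRANCHES:
        try:
            api.delete_branch(repo_id=dataset, branch=branch, repo_type="dataset")
        except RepositoryNotFoundError:
            return  # dataset already deleted from the Hub: no branch to delete
        except HfHubHTTPError:
            pass  # branch doesn't exist: nothing to clean
```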

severo added the 'bug' and 'P1' labels on Feb 2, 2024
AndreaFrancis (Contributor) commented:
We should also delete refs/convert/duckdb.

severo (Collaborator, Author) commented Apr 29, 2024

Note that, if the dataset has been deleted from the Hub, there is no branch to delete :)

severo (Collaborator, Author) commented Jun 19, 2024

The only remaining cases:

  • the dataset is put on the blocklist (generally because its jobs take too much time)
  • the dataset is switched to private -> not a big issue to still have the files

I think we can close.

severo closed this as completed on Jun 19, 2024
severo (Collaborator, Author) commented Jul 8, 2024

https://huggingface.co/datasets/Cnam-LMSSC/vibravox/discussions/4#66854a97118934e841dcf35c

Yes, they are only updated when the viewer is updated; disabling the viewer didn't remove those files as it should have. We'll work on a mechanism to clean the branch when the viewer is disabled.

severo reopened this on Jul 8, 2024
severo self-assigned this on Jan 22, 2025
severo (Collaborator, Author) commented Jan 22, 2025

  • the dataset is switched to private -> not a big issue to still have the files

It is now an issue, because private storage is billed: https://huggingface.co/docs/hub/en/storage-limits#storage-plans.

Note that the dataset viewer is currently enabled on private datasets for PRO users and Enterprise Hub orgs, so the Parquet and DuckDB files count toward the storage limit. See #1774 for a way to reduce this storage for them (and for HF in public repos).

severo (Collaborator, Author) commented Jan 22, 2025

Various ways to implement it:

The first way is a task called by the webhook service. Possible drawback: started jobs might generate conflicts and recreate the branches after their deletion.

Another way is to create a new kind of job dedicated to deleting things. It would have the highest priority, but would be blocked by the other existing jobs for the same dataset, so that these have time to finish and we avoid conflicts.

A third way is to be able to kill the started jobs before deleting the branches. I'm not sure how we would implement the ability to interrupt a job. Maybe we could have a collection of dataset states (old idea): we would mark the dataset as "deleted" (or delete it from the collection), and every job would check that the dataset still exists (+ has the correct revision) before running, or even multiple times (before every critical step) in jobs that take time.
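A minimal sketch of that third option, assuming a MongoDB collection of dataset states; the collection name, fields, and job steps are all hypothetical:

```python
from pymongo import MongoClient

# Hypothetical collection of dataset states (name and fields are assumptions).
client = MongoClient("mongodb://localhost:27017")
dataset_states = client["maintenance"]["dataset_states"]

def dataset_is_current(dataset: str, revision: str) -> bool:
    """True if the dataset is still present with the expected revision and not marked deleted."""
    state = dataset_states.find_one({"dataset": dataset})
    return (
        state is not None
        and state.get("revision") == revision
        and not state.get("deleted", False)
    )

def heavy_step_1(dataset: str, revision: str) -> None:
    ...  # e.g. convert to Parquet

def heavy_step_2(dataset: str, revision: str) -> None:
    ...  # e.g. build the DuckDB index

def run_job(dataset: str, revision: str) -> None:
    # Check before starting...
    if not dataset_is_current(dataset, revision):
        return
    heavy_step_1(dataset, revision)
    # ...and again before every critical step, so a deletion can interrupt the job.
    if not dataset_is_current(dataset, revision):
        return
    heavy_step_2(dataset, revision)
```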

Interested in your opinions @AndreaFrancis @lhoestq (+ I haven't touched the code for a while, and I might be forgetting things)

AndreaFrancis (Contributor) commented:
Also, when a pipeline runs, it keeps old stuff even if some steps fail. We should probably clear that out to avoid confusion (but I am unsure, since I also haven't touched the code for a long time).

I wonder if we could determine which Kubernetes container runs each job. That might make it easier to stop them if needed. Or maybe we could use something like Kafka? We already have a way to stop jobs, but we'd need to be careful handling the messages, since there could be a lot at once.

I was also thinking about a new type of job specifically for datasets. Maybe this could be the first job in the pipeline. Then, if the dataset changes, we could automatically delete anything that depends on it. There might still be some things left over, but we could have a separate process to clean those up.

Also, we could probably do much of this cleaning during the backfill process. Right now, it just checks whether the dataset exists. We could also check whether it's blocked, private, or disabled, and delete everything related to it. We could run this cleaner more often than the regular backfill (which might be easier to implement; it's similar to doing it on the webhook, but I'm not sure whether there is any advantage/disadvantage to doing it in the service or in a Kubernetes job).
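A rough sketch of that backfill-time check, reusing the hypothetical delete_convert_branches helper from the first sketch; the should_clean criteria and the blocklist handling are assumptions:

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

def should_clean(dataset: str, blocklist: set[str], token: str) -> bool:
    """Assumed criteria: blocked, deleted, or private datasets get their convert branches removed."""
    if dataset in blocklist:
        return True
    api = HfApi(token=token)
    try:
        info = api.dataset_info(dataset)
    except RepositoryNotFoundError:
        return True  # the dataset no longer exists on the Hub
    # Checking `viewer: false` in the dataset card is left out of this sketch (assumption).
    return bool(info.private)

def cleaning_pass(datasets: list[str], blocklist: set[str], token: str) -> None:
    """Cron-style pass over known datasets, deleting obsolete convert branches."""
    for dataset in datasets:
        if should_clean(dataset, blocklist, token):
            delete_convert_branches(dataset, token)  # hypothetical helper from the first sketch
```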

severo (Collaborator, Author) commented Jan 23, 2025

Good points. Is it worth doing this big refactoring?

lhoestq (Member) commented Jan 28, 2025

"started jobs might generate conflicts and recreate the branches after their deletion"

The viewer is enabled if (PRO or PUBLIC and not viewer: false), right? So we do have a way to prevent existing jobs from writing to the refs/convert/parquet branch of a disabled dataset.

Moreover, we use a lock when we write to refs/convert/parquet, so we could add a new job that cleans the branch and uses the same locking mechanism.

When the cleaning job runs, it locks the branch, then checks that the user didn't re-enable the viewer in the meantime, and finally cleans the branch.

WDYT? Anyway, if there is a 0.0001% case where this doesn't work, we can still add a backfill-like task later.
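A minimal sketch of that lock-check-clean sequence; branch_lock stands in for the existing locking mechanism, viewer_is_enabled is a placeholder for the real check, and delete_convert_branches is the hypothetical helper from the first sketch:

```python
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def branch_lock(dataset: str, branch: str) -> Iterator[None]:
    """Stand-in for the lock already taken when jobs write to refs/convert/parquet."""
    # acquire(dataset, branch) would go here
    try:
        yield
    finally:
        # release(dataset, branch) would go here
        pass

def viewer_is_enabled(dataset: str) -> bool:
    """Stand-in for the real check: (PRO or PUBLIC and not viewer: false)."""
    raise NotImplementedError

def clean_branch_job(dataset: str, token: str) -> None:
    """Lock the branch, re-check the viewer state, then clean (sequence from the comment above)."""
    with branch_lock(dataset, "refs/convert/parquet"):
        if viewer_is_enabled(dataset):
            # The user re-enabled the viewer while this job was queued: do nothing.
            return
        delete_convert_branches(dataset, token)
```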

severo (Collaborator, Author) commented Jan 28, 2025

Of all the mentioned options, the cron job mentioned by @AndreaFrancis might offer the best tradeoff:

  • it's the simplest option: only a script, no new job, no change in the jobs' logic, no new collection
  • drawback: we might have to wait for some hours before cleaning the storage. Totally acceptable, no?

lhoestq (Member) commented Jan 28, 2025

yup sounds good

severo (Collaborator, Author) commented Jan 28, 2025

OK. I'll work on it after #3131 (which should be more impactful, IMHO).
