
v0.14.0: Filesystem API, Webhook Server, upload improvements, keep-alive connections, and more

@Wauplin Wauplin released this 18 Apr 19:25

HfFileSystem: interact with the Hub through the Filesystem API

We introduce HfFileSystem, a pythonic filesystem interface compatible with fsspec. Built on top of HfApi, it offers typical filesystem operations like cp, mv, ls, du, glob, get_file and put_file.

>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

# List all files in a directory
>>> fs.ls("datasets/myself/my-dataset/data", detail=False)
['datasets/myself/my-dataset/data/train.csv', 'datasets/myself/my-dataset/data/test.csv']

# Read the content of a remote file as text
>>> train_data = fs.read_text("datasets/myself/my-dataset/data/train.csv")

Its biggest advantage is that it provides ready-to-use integrations with popular libraries like Pandas, DuckDB and Zarr.

import pandas as pd

# Read a remote CSV file into a dataframe
df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

# Write a dataframe to a remote CSV file
df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")

For a more detailed overview, have a look at this guide.

Webhook Server

WebhooksServer lets you implement, debug and deploy webhook endpoints on the Hub without any overhead. Creating a new endpoint is as easy as decorating a Python function.

# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
    if payload.repo.type == "dataset" and payload.event.action == "update":
        # Trigger a training job if a dataset is updated
        ...

For more details, check out this twitter thread or the documentation guide.

Note that this feature is experimental, which means the API or its behavior might change without prior notice. A warning is displayed to the user when using it. As it is experimental, we would love to get feedback!
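The filtering logic of a handler like the one above can be checked locally before deploying anything. The following is a minimal sketch that uses `types.SimpleNamespace` objects as stand-ins for a real payload. Only the `repo.type` and `event.action` fields shown above are assumed, and `should_trigger_training` and `fake_payload` are hypothetical helper names, not part of huggingface_hub.

```python
import asyncio
from types import SimpleNamespace


def should_trigger_training(payload) -> bool:
    """Return True when the payload describes an update to a dataset repo."""
    return payload.repo.type == "dataset" and payload.event.action == "update"


async def trigger_training(payload) -> str:
    # Same condition as the endpoint above, without the decorator,
    # so it can run without a webhook server.
    if should_trigger_training(payload):
        return "training triggered"
    return "ignored"


def fake_payload(repo_type: str, action: str) -> SimpleNamespace:
    """Build a stand-in object exposing only the two fields used here."""
    return SimpleNamespace(
        repo=SimpleNamespace(type=repo_type),
        event=SimpleNamespace(action=action),
    )


print(asyncio.run(trigger_training(fake_payload("dataset", "update"))))
print(asyncio.run(trigger_training(fake_payload("model", "update"))))
```

Keeping the condition in a plain function makes it easy to unit-test separately from the experimental server.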

Some upload QOL improvements

Faster upload with hf_transfer

huggingface_hub now integrates with hf_transfer, a Rust-based library that uploads large files in chunks and concurrently. Expect a 3x speed-up if your bandwidth allows it!
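A minimal sketch of how one might opt in, assuming the `hf_transfer` package is installed and that the feature is toggled through the `HF_HUB_ENABLE_HF_TRANSFER` environment variable. Neither detail is stated in these notes, so check the documentation for your version. `upload_checkpoint` is a hypothetical helper.

```python
import os

# Assumption: the flag is read when huggingface_hub is imported,
# so it is set here before the first import.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def upload_checkpoint(local_path: str, repo_id: str):
    """Upload one file; the import is deferred so the flag is already set."""
    from huggingface_hub import upload_file

    return upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=repo_id,
    )
```

Setting the variable in the shell before launching the script should work equally well and avoids any import-order concerns.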

Upload in multiple commits

Uploading a large folder in one go can be frustrating if an error (e.g. a connection error) occurs while committing. It is now possible to upload a folder in multiple (smaller) commits. If a commit fails, you can re-run the script and resume the upload. Commits are pushed to a dedicated PR. Once the upload is complete, the PR is merged into the main branch, resulting in a single commit in your git history.

from huggingface_hub import upload_folder

upload_folder(
    folder_path="local/checkpoints",
    repo_id="username/my-dataset",
    repo_type="dataset",
    multi_commits=True, # resumable multi-upload
    multi_commits_verbose=True,
)

Note that this feature is also experimental, meaning its behavior might be updated in the future.
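The resume behavior can be pictured with a small standard-library sketch. This is not the library's implementation: `plan_batches`, `upload_in_batches` and the `push_batch` callback are hypothetical names, and the set of completed batch indices stands in for whatever state the dedicated PR keeps on the Hub.

```python
from typing import Callable, List, Sequence, Set


def plan_batches(files: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the file list into fixed-size batches, one per commit."""
    return [list(files[i:i + batch_size]) for i in range(0, len(files), batch_size)]


def upload_in_batches(
    files: Sequence[str],
    batch_size: int,
    push_batch: Callable[[List[str]], None],
    done: Set[int],
) -> Set[int]:
    """Push every batch not yet recorded in `done`; record each success."""
    for index, batch in enumerate(plan_batches(files, batch_size)):
        if index in done:
            continue  # already committed in a previous run
        push_batch(batch)  # may raise, e.g. on a connection error
        done.add(index)
    return done


# Simulate one failure on the second batch, then a successful re-run.
files = [f"ckpt-{i}.bin" for i in range(5)]
done: Set[int] = set()
pushed: List[List[str]] = []
state = {"failed_once": False}


def flaky_push(batch: List[str]) -> None:
    if batch[0] == "ckpt-2.bin" and not state["failed_once"]:
        state["failed_once"] = True
        raise ConnectionError("simulated network failure")
    pushed.append(batch)


try:
    upload_in_batches(files, 2, flaky_push, done)
except ConnectionError:
    pass  # the first run stops after batch 0 succeeded

upload_in_batches(files, 2, flaky_push, done)  # the second run skips batch 0
print(sorted(done), pushed)
```

The point of the sketch is that a re-run only redoes the work that was not recorded as complete, which is what makes the upload resumable.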

Upload validation

More validation now happens before files are committed to the Hub: upload_folder ignores any .git folder it finds, and invalid paths fail early.

  • Fix path_in_repo validation when committing files by @Wauplin in #1382
  • Raise issue if trying to upload .git/ folder + ignore .git/ folder in upload_folder by @Wauplin in #1408
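As an illustration of this kind of pre-validation, here is a rough standard-library sketch. `is_git_path`, `validate_path_in_repo` and `filter_uploadable` are hypothetical helpers, and the specific rules (rejecting empty, `.` and `..` components) are assumptions, not the library's actual checks.

```python
from typing import Iterable, List


def is_git_path(path: str) -> bool:
    """True if any component of the path is exactly a `.git` folder."""
    return ".git" in path.split("/")


def validate_path_in_repo(path: str) -> str:
    """Fail early on paths that cannot name a file inside a repo."""
    # An empty component means a leading, trailing or doubled slash.
    if any(part in ("", ".", "..") for part in path.split("/")):
        raise ValueError(f"Invalid path in repo: {path!r}")
    return path


def filter_uploadable(paths: Iterable[str]) -> List[str]:
    """Drop `.git` content, then validate what remains."""
    return [validate_path_in_repo(p) for p in paths if not is_git_path(p)]


print(filter_uploadable(["data/train.csv", ".git/config", ".gitignore"]))
```

Note that `.gitignore` is kept: only components named exactly `.git` are filtered out in this sketch.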

Keep-alive connections between requests

huggingface_hub now reuses the same HTTP session internally. The goal is to keep the connection open across multiple calls to the Hub, which ultimately saves a lot of time. For instance, updating metadata in a README became 40% faster, while listing all models from the Hub is 60% faster. This has no impact on atomic calls (e.g. 1 standalone GET call).
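To see what keep-alive means in practice, here is a self-contained sketch using only the standard library. It does not show how huggingface_hub shares its session internally; it only shows the underlying idea: several requests sent over one open connection. The local test server, `_Handler` and `reuse_connection` are all hypothetical scaffolding.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class _Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections open by default

    def setup(self):
        super().setup()
        self.server.connections += 1  # called once per TCP connection

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def reuse_connection(n_requests: int = 3):
    """Send n_requests GETs over one connection.

    Returns the status codes and the number of TCP connections the server saw.
    """
    server = HTTPServer(("127.0.0.1", 0), _Handler)
    server.connections = 0
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    statuses = []
    try:
        for _ in range(n_requests):
            conn.request("GET", "/")
            response = conn.getresponse()
            response.read()  # drain the body so the socket can be reused
            statuses.append(response.status)
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return statuses, server.connections


print(reuse_connection(3))
```

Opening a fresh connection for every request would instead pay the connection setup cost each time, which is the overhead a shared session avoids.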

Custom sleep time for Spaces

It is now possible to programmatically set a custom sleep time on your upgraded Space. After X seconds of inactivity, your Space will go to sleep to save you some $$$.

from huggingface_hub import set_space_sleep_time

# Put your Space to sleep after 1h of inactivity
set_space_sleep_time(repo_id=repo_id, sleep_time=3600)

Breaking change

  • fsspec has been added as a main dependency. It's a lightweight Python library required for HfFileSystem.

No other breaking change expected in this release.

Bugfixes & small improvements

File-related

A lot of effort has been invested in making huggingface_hub's cache system more robust, especially when working with symlinks on Windows. Hope everything's fixed by now.

  • Fix relative symlinks in cache by @Wauplin in #1390
  • Hotfix - use relative symlinks whenever possible by @Wauplin in #1399
  • [hot-fix] Malicious repo can overwrite any file on disk by @Wauplin in #1429
  • Fix symlinks on different volumes on Windows by @Wauplin in #1437
  • [FIX] bug "Invalid cross-device link" error when using snapshot_download to local_dir with no symlink by @thaiminhpv in #1439
  • Raise after download if file size is not consistent by @Wauplin in #1403

ETag-related

Following a server-side configuration issue, we made the way huggingface_hub retrieves ETags from the Hub more robust and more future-proof.

  • Update file_download.py by @Wauplin in #1406
  • 🧹 Use HUGGINGFACE_HEADER_X_LINKED_ETAG const by @julien-c in #1405
  • Normalize both possible variants of the Etag to remove potentially invalid path elements by @dwforbes in #1428
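For context, an ETag header can arrive in a weak form (`W/"abc"`) or in a strong, quoted form (`"abc"`). Below is a minimal sketch of normalizing both to a bare identifier. `normalize_etag` is a hypothetical name, and this is not necessarily what the PRs listed above implement.

```python
from typing import Optional


def normalize_etag(etag: Optional[str]) -> Optional[str]:
    """Strip the weak-validator prefix and surrounding quotes from an ETag."""
    if etag is None:
        return None
    if etag.startswith("W/"):
        etag = etag[2:]  # drop the weak-validator marker
    return etag.strip('"')


print(normalize_etag('W/"abc123"'), normalize_etag('"abc123"'))
```

A bare identifier like this is safer to use as part of a file name than the raw header value, since the raw value may contain a slash or quotes.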

Documentation-related

Misc

Internal stuff

  • Fix CI by @Wauplin in #1392
  • PR should not fail if codecov is bad by @Wauplin (direct commit on main)
  • remove cov check in PR by @Wauplin (direct commit on main)
  • Fix restart space test by @Wauplin (direct commit on main)
  • fix move repo test by @Wauplin (direct commit on main)