Handle Rate Limits in Pagination with Automatic Retries #2970
Conversation
Thanks for opening the PR @Weyaxi! As discussed in DMs, I think it's a reasonable solution 👍
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@@ -42,7 +42,7 @@ def paginate(path: str, params: Dict, headers: Dict) -> Iterable:
     next_page = _get_next_page(r)
     while next_page is not None:
         logger.debug(f"Pagination detected. Requesting next page: {next_page}")
-        r = session.get(next_page, headers=headers)
+        r = http_backoff("GET", next_page, max_retries=20, retry_on_status_codes=429, headers=headers)
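For context, here is a sketch of how the `paginate` helper reads with this change applied. It is reconstructed from the diff above, so the exact surrounding lines in `_pagination.py` may differ:

```python
from typing import Dict, Iterable

# get_session, hf_raise_for_status, http_backoff, _get_next_page and logger are
# module-level helpers in huggingface_hub.utils (see the import note below).

def paginate(path: str, params: Dict, headers: Dict) -> Iterable:
    """Fetch a paginated Hub endpoint and lazily yield every item."""
    # The first request still goes through the shared session.
    session = get_session()
    r = session.get(path, params=params, headers=headers)
    hf_raise_for_status(r)
    yield from r.json()

    # Follow "next" links; each follow-up request now retries on HTTP 429
    # with exponential backoff instead of failing the whole iteration.
    next_page = _get_next_page(r)
    while next_page is not None:
        logger.debug(f"Pagination detected. Requesting next page: {next_page}")
        r = http_backoff("GET", next_page, max_retries=20, retry_on_status_codes=429, headers=headers)
        hf_raise_for_status(r)
        yield from r.json()
        next_page = _get_next_page(r)
```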
Should add `from . import get_session, hf_raise_for_status, http_backoff, logging` in the import section.
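Formatted as a code change, the requested addition to the import section of `_pagination.py` would read:

```python
from . import get_session, hf_raise_for_status, http_backoff, logging
```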
Hey, I forgot, sorry. Should be done with this:
Let's wait for green CI and we're good
looks good to me, thank you!
It seems this test is broken: https://github.com/huggingface/huggingface_hub/blob/main/tests/test_utils_pagination.py#L9 Would you like to check what's wrong with it? Most likely the `http_backoff` method has to be mocked as well (same as currently done with `get_session`). Let me know if you need help.
(pushed 03f8ff9 to fix the mocked tests)
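For anyone following along, here is a rough sketch of how such a test can mock `http_backoff` alongside `get_session`. The test name, URLs, and response plumbing are hypothetical, and it assumes `_get_next_page` follows the standard `Link` header via `response.links`; the real `tests/test_utils_pagination.py` differs in its details:

```python
from unittest.mock import MagicMock, patch

from huggingface_hub.utils._pagination import paginate


@patch("huggingface_hub.utils._pagination.http_backoff")
@patch("huggingface_hub.utils._pagination.get_session")
def test_paginate_retries_next_pages(mock_get_session: MagicMock, mock_http_backoff: MagicMock) -> None:
    # First page: returned by the mocked session, with a "next" link so that
    # paginate() keeps iterating.
    first_page = MagicMock()
    first_page.json.return_value = [1, 2]
    first_page.links = {"next": {"url": "https://hf.co/api/models?page=2"}}
    mock_get_session.return_value.get.return_value = first_page

    # Second (last) page: returned by the mocked http_backoff, no "next" link.
    last_page = MagicMock()
    last_page.json.return_value = [3]
    last_page.links = {}
    mock_http_backoff.return_value = last_page

    assert list(paginate("/models", params={}, headers={})) == [1, 2, 3]
    mock_http_backoff.assert_called_once()
```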
This PR enhances the `paginate` function in `src/huggingface_hub/utils/_pagination.py` by implementing automatic retries using the `http_backoff` function. Previously, when attempting to list all models via `api.list_models()`, the process would fail due to rate limiting (HTTP 429) after approximately 500K models (this number depends on many factors). This led to data loss and required restarting the process from the beginning. Additionally, after restarting, the process would hit the rate limit again after another 500K models, making it impossible to scrape all models in a single run.

Changes & Improvements:
- Replaced `session.get` with `http_backoff` to enable automatic retries.
- Set `max_retries=20` and `retry_on_status_codes=429` to handle API rate limits gracefully.
- All functions relying on `paginate` benefit from improved stability (though this rate limiting is most likely to occur only when listing >1.5M models).

Why This Change?
- The improvement applies to every function that relies on `paginate`, not just model listing.

This PR was initially planned to implement retry logic directly within `_pagination.py` instead of using the `http_backoff` function. However, thanks to @Wauplin's great guidance in pointing out that `http_backoff` already exists and is highly stable, this approach is now much better! 🙂

Thanks, @Wauplin! 🚀
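As a usage-level illustration (nothing below is new API; `list_models()` already paginates via `paginate` under the hood), the retry behavior is transparent to callers:

```python
from huggingface_hub import HfApi

api = HfApi()

# Iterating over all models triggers many paginated requests. With this PR,
# a 429 response on any page is retried with backoff instead of aborting the
# whole listing, so the loop can complete in a single run.
count = sum(1 for _ in api.list_models())
print(f"Listed {count} models.")
```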
The screenshot below shows the behavior before this PR: an exception was thrown when the rate limit was hit, and restarting the process would only result in the same issue occurring again. As stated earlier, this made it impossible to access the remaining models in a simple way. 🙂