Requesting new option to avoid hitting daily_captures_limit #39

Open
barkoder opened this issue Jan 28, 2025 · 6 comments

Comments

@barkoder

$ curl -s -H 'Accept: application/json' -H 'Authorization: LOW AUTH:KEY' https://web.archive.org/save/status/user
{"available":8,"daily_captures":10199,"daily_captures_limit":10000,"processing":0}
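For reference, the headroom implied by this response can be computed directly from the two counters. A minimal sketch, assuming the flat single-line JSON shown above; `json_int` and `remaining_captures` are hypothetical helper names, not part of spn.sh:

```shell
# Extract one non-negative integer field from the status JSON without jq.
# Assumes the flat JSON format shown in the response above.
# $1 = JSON text, $2 = field name
json_int() {
  printf '%s\n' "$1" | sed -n 's/.*"'"$2"'": *\([0-9][0-9]*\).*/\1/p'
}

# Print the captures left today (negative once the limit is exceeded).
# Returns 1 if either counter is missing from the input.
# $1 = JSON text
remaining_captures() {
  limit=$(json_int "$1" daily_captures_limit)
  used=$(json_int "$1" daily_captures)
  [ -n "$limit" ] && [ -n "$used" ] || return 1
  echo $((limit - used))
}

# Example with the response above:
#   remaining_captures '{"available":8,"daily_captures":10199,"daily_captures_limit":10000,"processing":0}'
#   prints -199, i.e. the limit has already been exceeded by 199 captures
```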

The Wayback Machine has been slowly reducing the available page captures per user. It used to be 100,000, then 80,000, then 40,000. Now it's a mere 10,000.

I've been hitting that limit every day lately.

So I'm requesting a new option.

If daily_captures_limit - daily_captures is below a certain specified number, then

  • user-specifiable suboption A: sleep the job until the next UTC day begins.

OR

  • user-specifiable suboption B: safely break out of the job immediately.
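
The requested behaviour boils down to one comparison plus a mode switch. A rough sketch of that decision; `reserve_action`, the threshold argument, and the mode names `sleep` and `exit` are all made up for illustration and are not existing spn.sh options:

```shell
# Hypothetical helper: decide what the job should do next.
# $1 = remaining captures (daily_captures_limit - daily_captures)
# $2 = number of captures to keep in reserve
# $3 = mode: "sleep" (suboption A) or "exit" (suboption B)
# Prints one of: continue, sleep, exit
reserve_action() {
  if [ "$1" -ge "$2" ]; then
    echo continue      # enough headroom, keep capturing
  elif [ "$3" = "exit" ]; then
    echo exit          # suboption B: break out of the job
  else
    echo sleep         # suboption A: wait for the next UTC day
  fi
}
```

For example, `reserve_action 3999 4000 sleep` prints `sleep`, while `reserve_action 4000 4000 sleep` prints `continue`, because the reserve is only touched once the headroom drops below the threshold.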

I need this feature because there are other important URLs that I need to reserve some captures for, and I don't want to hit the daily_captures_limit.

Thanks!

@bac0id
Contributor

bac0id commented Feb 13, 2025

Hi. As an alternative, you can sign up for additional Internet Archive accounts for additional captures.

@barkoder
Author

About 24 hours after I opened this issue, the available captures went back up to 40,000.

> Hi. As an alternative, you can sign up for additional Internet Archive accounts for additional captures.

I could, but I don't want to burden (or abuse) their free and wonderful service. If they've reduced the number of available captures per user on their end, I'll assume the IA servers are overloaded in some capacity.
I honestly don't mind waiting it out.

I need this feature specifically because I archive ~2000-4000 "endangered" URLs each day. I unequivocally need ~4000 slots to be free (reserved) on my account for these URLs.

I don't want spn.sh eating up all the available capture slots while I'm archiving non-endangered URLs. So while I'm spn-ing the non-endangered URLs, if daily_captures_limit - daily_captures < 4000, I want spn.sh to sleep until the beginning of the next UTC day and then continue capturing.
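
How long to sleep can be derived from the current Unix time alone, assuming (as this request does) that the daily counter resets at 00:00 UTC; nothing in this thread confirms when IA actually resets it. `seconds_until_utc_midnight` is a hypothetical name, not an spn.sh function:

```shell
# Print the number of seconds until the next 00:00:00 UTC.
# Unix time counts exactly 86400 seconds per day, so UTC midnight
# always falls on a multiple of 86400.
# $1 = current Unix timestamp (optional; defaults to the current time)
seconds_until_utc_midnight() {
  now=${1:-$(date +%s)}
  echo $((86400 - now % 86400))
}

# Sketch of use:
#   sleep "$(seconds_until_utc_midnight)"
```

Called exactly at midnight, the function returns a full 86400 seconds, so a real implementation would probably want to re-check the counters before sleeping.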

@bac0id
Contributor

bac0id commented Feb 14, 2025

spn.sh can resume an aborted session. I think a new option to restrict the number of captures per session can be added. So spn.sh can pause itself, and then you can restart spn.sh as a scheduled job with cron. But this approach only works for a single device, not for multiple devices using the same IA authentication.

If you want to pause spn.sh based on daily_captures, that means extra communication with IA to get the latest daily_captures, which may slow down the sending of capture tasks. My impression is that the IA server limits the communication frequency to approximately once every 2 seconds, as I frequently encounter SSL errors during communication.

@barkoder
Author

I've not encountered SSL errors, but IA does sometimes return nothing when I check daily_captures.

I have to wrap the request in a while loop and retry until I get an actual result.

while true ; do curl -s -H 'Accept: application/json' -H 'Authorization: LOW S3:PASS' https://web.archive.org/save/status/user && break ; sleep 10 ; done
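
One caveat with the one-liner above: without `-f`, curl exits 0 for any completed HTTP response, so if the empty result arrives as a successful response the loop breaks anyway. A possible variant that retries until the output actually contains the counter; `is_valid_status` and `retry_status` are hypothetical names, and the check assumes the JSON format shown at the top of this issue:

```shell
# True if the text contains a numeric daily_captures field.
# $1 = response text
is_valid_status() {
  printf '%s' "$1" | grep -q '"daily_captures": *[0-9]'
}

# Run a command until its output passes is_valid_status.
# $1 = maximum attempts, $2 = delay between attempts in seconds,
# remaining arguments = the command to run.
# Prints the valid output and returns 0, or returns 1 after the last attempt.
retry_status() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if out=$("$@") && is_valid_status "$out"; then
      printf '%s\n' "$out"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With the same request as above, this would be called as `retry_status 30 10 curl -s -H 'Accept: application/json' -H 'Authorization: LOW S3:PASS' https://web.archive.org/save/status/user`, where 30 attempts is an arbitrary cap.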

I personally don't mind the penalty (if any) incurred in sending capture tasks.
But you could let the user configure the polling frequency depending on their needs.

If the user wants a more aggressive (more accurate) capture slot reservation and is okay with any capture speed penalty associated with it, they can set the polling interval to a LOW number, e.g. having spn.sh compute daily_captures_limit - daily_captures once every 10s.
But if the user doesn't mind losing a few more capture slots (less strict, less accurate), and/or is NOT okay with any associated capture speed penalty, they can set the polling interval to a HIGH number, e.g. once every 10m.

Let the user set whether spn.sh should check free slots and do the arithmetic every 10 seconds or 10 minutes.
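
A minimal sketch of such a configurable interval, assuming the capture loop keeps the time of the last status check in a variable; `status_check_due`, `last_check` and `poll_interval` are hypothetical names, not existing spn.sh identifiers:

```shell
# Hypothetical helper: is it time to poll /save/status/user again?
# $1 = current time, $2 = time of the last check (both Unix timestamps)
# $3 = user-chosen polling interval in seconds (10 for 10s, 600 for 10m)
status_check_due() {
  [ $(($1 - $2)) -ge "$3" ]
}

# Intended use inside a capture loop (sketch only, not spn.sh code):
#   now=$(date +%s)
#   if status_check_due "$now" "$last_check" "$poll_interval"; then
#     last_check=$now
#     ... fetch the status and compare the headroom with the reserve ...
#   fi
```

Between checks the loop sends capture tasks without contacting the status endpoint, so a longer interval costs fewer extra requests but lets more captures slip past the reserve before the job notices.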

No cron please.

Thanks!

@bac0id
Contributor

bac0id commented Feb 18, 2025

> If the user wants a more aggressive (more accurate) capture slot reservation and is okay with any capture speed penalty associated with it, they can set the polling interval to a LOW number, e.g. having spn.sh compute daily_captures_limit - daily_captures once every 10s.

A good idea. Will this feature be added? @overcast07

By the way, I opened another issue #42 for SSL errors from IA. I did some research on it.

@overcast07
Owner

I'm willing to merge pull requests, so it could happen that way, but as you can see from the changelog I haven't written any code for this script in almost 2 years, so I can't guarantee that I would ever implement it myself. I agree that adding a feature to spread out the allotted captures evenly throughout the day would be helpful, though. Sorry if this is a disappointing answer.
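
For anyone considering a pull request, spreading the captures evenly could be as simple as dividing the time left in the day by the captures left. A rough sketch under that assumption; `pacing_delay` is a hypothetical name and nothing like it exists in spn.sh today:

```shell
# Print the delay in whole seconds to wait between two captures so that
# the remaining captures last until the counter resets.
# $1 = seconds left until the reset, $2 = captures left today
pacing_delay() {
  if [ "$2" -le 0 ]; then
    echo "$1"            # nothing left: wait for the reset
  else
    echo $(($1 / $2))    # integer division, rounds down
  fi
}
```

With a full day (86400 seconds) and 40,000 captures, `pacing_delay 86400 40000` prints `2`. Because the division rounds down, the job would finish somewhat before the end of the day rather than exactly at the reset.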
