Feature: ability to only save pages that haven't been archived yet #30

Open
exurd opened this issue Jul 20, 2023 · 4 comments
Comments

@exurd
Contributor

exurd commented Jul 20, 2023

There should be an option that checks whether a URL has already been archived in the Wayback Machine before saving it. For example, if you have a bunch of text files full of URLs and only want to send requests for URLs with no archived page (i.e. the first archive of a page), this setting would help.

Another thing to consider when adding this is how far back it should check. Maybe it could work by passing the option, followed by a timestamp?
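As a rough illustration of what such a check could look like, here is a minimal sketch that asks the Wayback Machine availability API (`https://archive.org/wayback/available`) whether any snapshot exists for a URL. The function names `has_snapshot` and `check_url` are made up for this example and are not part of spn.sh. The optional `timestamp` parameter only makes the API return the snapshot closest to that date, so deciding "how long ago" would still need a comparison against the returned snapshot's timestamp.

```shell
#!/usr/bin/env bash
# Sketch only: has_snapshot and check_url are hypothetical helper names,
# not functions that exist in spn.sh.

# Reads an availability-API JSON response on stdin.
# Succeeds (exit 0) if the response reports an available snapshot.
has_snapshot() {
    grep -Eq '"available"[[:space:]]*:[[:space:]]*true'
}

# Queries the availability API for one URL.
# Optional second argument: a timestamp (e.g. 20230720) to look near.
check_url() {
    local url="$1" timestamp="${2:-}"
    local args=(--data-urlencode "url=${url}")
    if [ -n "$timestamp" ]; then
        args+=(--data-urlencode "timestamp=${timestamp}")
    fi
    curl -sG 'https://archive.org/wayback/available' "${args[@]}" | has_snapshot
}

# Example (needs network access):
#   check_url 'https://example.com/' && echo "already archived"
```

Each call is one extra HTTP request per URL, so a check like this adds noticeable time on large lists.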

@exurd exurd changed the title Feature: ability to only save pages that haven't been archived Feature: ability to only save pages that haven't been archived yet Jul 20, 2023
@TheTechRobo

Just chiming in that I think this would be really slow.

@exurd
Contributor Author

exurd commented Jul 21, 2023

If I wanted to do this on the text files I have myself, I would currently need to do this:

  1. Turn the text file into a Google Sheet
  2. Put it into the "Batch process Google Sheets using archive.org services" app with the "Check if URLs are archived in the Wayback Machine" feature
  3. Export from Google Sheets to a CSV
  4. Convert that CSV to a usable text file (with the URLs only)
  5. Finally, send it into spn.sh

It takes around half an hour (or more) to hand-process each of those files. With multiple files, this gets tedious pretty quickly; merging and un-merging them would add two more steps to an already long process (and I wouldn't even know how to separate them again after they get processed).

If the script could do this, it would not only make this manual method obsolete, but it would also be quicker, since it wouldn't need to check every URL in one batch first.
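The manual steps above could, in principle, be collapsed into a small filter that runs per file, so nothing has to be merged and split again. This is only a sketch: `filter_unarchived` is a hypothetical helper, not something spn.sh provides. The "is it archived?" check is passed in as a command, on the assumption that it exits 0 for URLs that already have a capture.

```shell
#!/usr/bin/env bash
# Sketch only: filter_unarchived is a hypothetical helper, not part of spn.sh.

# Reads URLs from stdin (one per line) and prints only those for which the
# checker command fails, i.e. URLs the checker does not consider archived.
# Usage: filter_unarchived CHECKER [ARGS...] < urls.txt
filter_unarchived() {
    local url
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue            # skip blank lines
        if ! "$@" "$url" </dev/null; then    # checker must not consume our stdin
            printf '%s\n' "$url"
        fi
    done
}

# Example: keep one output file per input file (my_checker is a placeholder
# for whatever command decides whether a URL is already archived):
#   for f in *.txt; do
#       filter_unarchived my_checker < "$f" > "${f%.txt}.unarchived.txt"
#   done
```

Because each input file gets its own output file, the resulting lists can be sent to spn.sh one at a time.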

@AgostinoSturaro

> Another thing to consider when adding this is how far back it should check. Maybe it could work by passing the option, followed by a timestamp?

This should already be possible.
Use the -d flag to set the if_not_archived_within capture option, plus the -n option to avoid saving error pages.
See the SPN 2 Public API doc for the options.
Something like spn.sh -d 'if_not_archived_within=5y' -n should only save working pages not saved within the last 5 years.
Try it out. There are even options for the outlinks.
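For several URL list files, the command above could be wrapped in a loop. This is a minimal sketch under some assumptions: `save_if_stale` and the `SPN_CMD` variable are hypothetical names, and it assumes spn.sh accepts a URL list file as its last argument. The timedelta is passed through unchanged, so the accepted formats are whatever the SPN 2 Public API doc defines.

```shell
#!/usr/bin/env bash
# Sketch only: save_if_stale and SPN_CMD are hypothetical names.
# SPN_CMD defaults to "spn.sh"; point it at a stub command for a dry run.

# Runs spn.sh once per URL list file, asking SPN2 to skip pages that
# already have a capture newer than the given timedelta (-d) and not to
# save error pages (-n).
# Usage: save_if_stale TIMEDELTA FILE...
save_if_stale() {
    local within="$1" file
    shift
    for file in "$@"; do
        "${SPN_CMD:-spn.sh}" -d "if_not_archived_within=${within}" -n "$file"
    done
}

# Example:
#   save_if_stale 5y list1.txt list2.txt
```

Here the "already archived?" decision is made by the Save Page Now service itself, so no separate lookup is needed before each request.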

@NoodlesStamps

"The capture will start in ~ seconds because we are doing a lot of captures of ~ ~ right now" When this message appears, it seems that the archive will be duplicated.
