Fetch additional data resulting from SPN2 capture_outlinks function #23

Open
overcast07 opened this issue Mar 18, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@overcast07
Owner

Ideally, this script should be able to fetch data for outlinks captured using the server-side SPN2 outlinks function.

Any implementation of this would run into a particular challenge: polling the status API endpoint for a large number of outlinks could cause the server to return 429 errors if the rate of requests is too high. The overall rate of requests would have to be controlled in some way, accounting for the additional requests made.
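To make the rate concern concrete, here is a minimal sketch of a throttle that keeps requests a fixed number of seconds apart. The names `min_interval`, `last_request`, `seconds_until_next` and `throttle` are hypothetical and not taken from the script, and the 5-second value is only a placeholder, not a documented limit.

```bash
#!/usr/bin/env bash
min_interval=5     # hypothetical: minimum seconds between status requests
last_request=0     # epoch seconds of the most recent request

# Prints how many seconds to wait so that requests stay min_interval apart.
# $1 = time of the last request, $2 = current time (both in epoch seconds).
seconds_until_next() {
  local elapsed=$(( $2 - $1 ))
  if (( elapsed >= min_interval )); then
    echo 0
  else
    echo $(( min_interval - elapsed ))
  fi
}

# Sleeps if needed, then records the time of the request about to be made.
throttle() {
  local delay
  delay=$(seconds_until_next "$last_request" "$(date +%s)")
  if (( delay > 0 )); then
    sleep "$delay"
  fi
  last_request=$(date +%s)
}
```

Calling `throttle` right before each status request would space the requests out within a single process. Each child process gets its own copy of `last_request`, so a shared limit across children would need the timestamp to live somewhere common, such as a file or the main loop.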

One way to implement this would be to add a separate text file (spn2-outlinks.txt) to which outlink status IDs are added upon completion of the main capture job. A check for this file could be added at some point in the main while loops (the ones starting at lines 579, 609 and 680), and the child processes could be spawned from those loops. Importantly, this approach would allow the child processes to be immediate children of the main process, so they would be counted by jobs -p. The script would probably have to pause new job submissions while the child processes for the outlinks are spawned. A variable could be used to store remaining lines if the child processes for the outlinks are not spawned in one go.
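A rough sketch of that queue-file approach, under some assumptions: `spn2-outlinks.txt` is the file name proposed above, while `max_outlink_jobs`, `pending_outlinks`, `poll_outlink`, `load_outlink_queue` and `spawn_outlink_batch` are hypothetical names, and `poll_outlink` is only a placeholder for the real status polling.

```bash
#!/usr/bin/env bash
outlinks_file="spn2-outlinks.txt"   # file name from the proposal above
max_outlink_jobs=3                  # hypothetical cap on children spawned per pass
pending_outlinks=()                 # IDs carried over between passes of the main loop

# Hypothetical placeholder for polling the status of one outlink job.
poll_outlink() {
  echo "would poll status for job $1"
}

# Moves newly queued IDs from the file into the in-memory list.
# The file is renamed first so that later appends start a fresh file.
load_outlink_queue() {
  local line
  [[ -s "$outlinks_file" ]] || return 0
  mv "$outlinks_file" "$outlinks_file.tmp"
  while IFS= read -r line; do
    [[ -n "$line" ]] && pending_outlinks+=("$line")
  done < "$outlinks_file.tmp"
  rm -f "$outlinks_file.tmp"
}

# Spawns up to max_outlink_jobs children and keeps the rest for the next pass.
# This must run in the main shell (not in a subshell) so that the children
# are immediate children of the main process and show up in `jobs -p`.
spawn_outlink_batch() {
  local id
  local batch=("${pending_outlinks[@]:0:max_outlink_jobs}")
  pending_outlinks=("${pending_outlinks[@]:max_outlink_jobs}")
  for id in "${batch[@]}"; do
    poll_outlink "$id" &
  done
}
```

The main loops would call `load_outlink_queue` and then `spawn_outlink_batch` on each pass. There is still a small window in which a child could append to the file while it is being renamed, so a real implementation might want a lock around both the append and the rename.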

Alternatively, this could be done within each capture() child process immediately after the status API endpoint returns a successful capture and the list of outlinks. However, the extra polling would not be visible to the currently implemented check on the number of child processes (i.e. jobs -p), so the request rate of every part of the script would have to be reduced to account for it (unless the status API endpoint were polled very infrequently).

We would have to decide whether failed outlink captures should be retried. Presumably, the outlinks of these pages would not themselves be collected, so the failed outlink captures would have to be listed separately from the main failed.txt list. An extra variable would have to be passed to the capture function to indicate whether or not to set capture_outlinks=1.
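As a sketch of what the extra variable and the separate failure list might look like: `build_capture_args`, `log_failure` and the file name `failed-outlinks.txt` are hypothetical, and the curl options shown are assumptions about how the request would be built rather than what the script currently does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the curl arguments for one capture request.
# $1 = URL to capture
# $2 = 1 to request server-side outlink capture, anything else to skip it
build_capture_args() {
  capture_args=(--silent --data-urlencode "url=$1")
  if [[ "$2" == "1" ]]; then
    capture_args+=(--data "capture_outlinks=1")
  fi
}

# Hypothetical helper: records a failed URL in the list matching its origin,
# so that a retried outlink is not submitted with capture_outlinks=1 again.
# $1 = URL, $2 = 1 if the URL was an outlink of another capture
log_failure() {
  if [[ "$2" == "1" ]]; then
    echo "$1" >> failed-outlinks.txt   # assumed file name
  else
    echo "$1" >> failed.txt
  fi
}
```

A request would then look something like `build_capture_args "$url" 1` followed by `curl "${capture_args[@]}" "https://web.archive.org/save"`, with authentication headers added as the script already does elsewhere.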

This option would also need to interface appropriately with the -o, -x and -r options.

This idea was previously listed under "Future plans" in README.md, but I've removed that section since it was outdated.

@overcast07 overcast07 added the enhancement New feature or request label May 23, 2023
@overcast07
Owner Author

The POST parameter job_id_outlinks would allow the data for all of the outlinks of a capture to be obtained at the same time. The rate limiting issue mentioned in the original post might not apply if this method is used. A list of pending captures would have to be stored, and the JSON would have to be parsed/split properly.
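A sketch of how the job_id_outlinks request and the splitting step might look. Several things here are assumptions: that jq is installed, that the response is a JSON array with one object per outlink, and that each object carries `status`, `job_id` and `original_url` fields as the single-job status response does. The function names and the `LOW accesskey:secret` argument are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: requests the status of every outlink of one parent job.
# $1 = parent job ID, $2 = "accesskey:secret" (assumed authenticated request)
fetch_outlink_statuses() {
  curl --silent \
    --header "Accept: application/json" \
    --header "Authorization: LOW $2" \
    --data "job_id_outlinks=$1" \
    "https://web.archive.org/save/status"
}

# Reads the (assumed) JSON array on stdin and prints one line per outlink
# in the form status|job_id|original_url. The URL comes last so that any
# "|" inside it does not break the first two fields.
split_outlink_statuses() {
  jq -r '.[] | "\(.status // "unknown")|\(.job_id // "")|\(.original_url // "")"'
}

# Reads the split lines on stdin and prints the job IDs still pending,
# which is the list that would have to be stored and checked again later.
pending_outlink_ids() {
  local status job_id url
  while IFS='|' read -r status job_id url; do
    [[ "$status" == "pending" ]] && echo "$job_id"
  done
  return 0
}
```

With this shape, one request per parent capture replaces one request per outlink: `fetch_outlink_statuses "$job_id" "$auth" | split_outlink_statuses | pending_outlink_ids` would give the IDs to check again on a later pass.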
