Process URLs strictly in the given order #300

Open
bzc6p opened this issue Dec 30, 2015 · 1 comment

Comments


bzc6p commented Dec 30, 2015

If wpull experiences a problem fetching a URL, it skips it and processes it at the end. This is a reasonable approach in most cases.

But there are applications where it is important that no URLs are skipped and that they are processed in the given order, even if an error means they have to be retried for a long time. One such case is saving paginated lists that roll down as new elements arrive – leaving out one page and processing it later may lose some elements of the list, which has been updated in the meantime.

I tried to modify this behaviour, even by changing wpull's code, but it's more complex than I thought. Returning Actions.RETRY from the handle_error hook function doesn't solve it. In the get_next_url_record method in engine.py I tried changing the order in which URLs are looked up in the database (first error, then todo), but this gave only a partial solution because, as the log suggests, several URL records are processed at once; that is, new URLs are taken from the database (or from some cache, if several URLs are fetched from the db at the same time), so the next candidate is chosen before the previous one is finished. This seems to be a different kind of multithreaded behaviour than the one adjustable with --concurrency.
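For reference, the kind of hook meant here looks roughly like the sketch below, assuming the wpull 1.x hook-script mechanism (loaded with --python-script); the exact callback signature varies between hook API versions, so treat it as an illustration rather than the definitive interface.

```python
# Sketch of a wpull 1.x hook script; the handle_error signature shown
# here may differ between hook API versions.

def handle_error(url_info, record_info, error_info):
    # Ask wpull to retry the failed URL. This only re-queues it in the
    # error bucket; it does not make wpull retry it before moving on to
    # other pending URLs.
    return wpull_hook.actions.RETRY

# wpull_hook is a global that wpull injects into the hook script.
wpull_hook.callbacks.handle_error = handle_error
```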

An option telling wpull to keep the order of URLs would be a possible, though admittedly not very important, enhancement. So, besides leaving this here as an enhancement suggestion, I would like to ask whether there is a way – even by modifying the code – to turn off the multithreaded behaviour described above. The other possible workarounds that don't touch wpull itself (e.g. wpulling the URLs one by one) are far less efficient, as far as I can imagine.

Thank you in advance.

Contributor

JustAnotherArchivist commented Oct 10, 2018

If I understand the issue correctly, this should be partially fixed by 46f0ea5 in #393. Specifically, that commit fixes what you describe about URLs being fetched from the DB ahead of time.

However, wpull will still first retrieve all todos, then all errors. So even if you return Actions.RETRY from handle_error, it won't retry that URL immediately. Unfortunately, the URL prioritisation I also added in #393 won't help here either.

How about this workaround? Instead of adding all URLs in a list or through the command line at once, add them one by one in get_urls. So when https://example.org/list?page=<n> has been retrieved successfully and get_urls is called on it, return https://example.org/list?page=<n+1>. This way, wpull doesn't even know about the next page until you're ready to retrieve it. This definitely won't work in all cases though.
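A minimal sketch of that approach, assuming the wpull 1.x hook-script mechanism and the hypothetical https://example.org/list?page=<n> pattern from above (the get_urls signature varies between hook API versions, so adjust as needed):

```python
import re

# Sketch only: wpull_hook is the global that wpull injects into hook
# scripts, and the URL pattern is the hypothetical example from above.

def get_urls(filename, url_info, document_info):
    match = re.match(r'(https://example\.org/list\?page=)(\d+)$', url_info['url'])
    if not match:
        return None
    # This page was fetched successfully (get_urls is only called on
    # successful retrievals), so only now reveal the next page to wpull.
    next_url = match.group(1) + str(int(match.group(2)) + 1)
    return [{'url': next_url}]

wpull_hook.callbacks.get_urls = get_urls
```

Since get_urls is never called for a page that ultimately fails, the chain simply stops at that page, which may or may not be what you want.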
