Process URLs strictly in the given order #300

Open
bzc6p opened this issue Dec 30, 2015 · 1 comment

Comments


bzc6p commented Dec 30, 2015

If wpull experiences a problem fetching a URL, it skips it and processes it at the end. This is a reasonable approach in most cases.

But there are applications where it is important that no URLs are skipped and that they are processed in the given order, even if an error means they have to be retried for a long time. One such case is saving paginated lists that roll down as new elements arrive – leaving out one page and processing it later may lose some elements of the list, which has been updated in the meantime.

I tried to modify this behaviour, even by changing wpull's code, but it's more complex than I thought. Returning Actions.RETRY from the handle_error hook function doesn't solve it. In the get_next_url_record method in engine.py I tried changing the order in which URLs are looked up in the database (first error, then todo), but this gave only a partial solution because, as the log suggests, several URL records are processed at once; that is, new URLs are taken from the database (or from some cache, if several URLs are fetched from the db at the same time), so the next candidate is chosen before the previous one is finished. This seems to be a different kind of multithreaded behaviour than the one adjustable with --concurrency.
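For reference, the kind of hook meant here looks roughly like the sketch below, assuming the wpull 1.x hook-script mechanism (loaded with --python-script); the exact callback signature varies between hook API versions, so treat it as an illustration rather than the definitive interface.

```python
# Sketch of a wpull 1.x hook script; the handle_error signature shown
# here may differ between hook API versions.

def handle_error(url_info, record_info, error_info):
    # Ask wpull to retry the failed URL. This only re-queues it in the
    # error bucket; it does not make wpull retry it before moving on to
    # other pending URLs.
    return wpull_hook.actions.RETRY

# wpull_hook is a global that wpull injects into the hook script.
wpull_hook.callbacks.handle_error = handle_error
```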

An option telling wpull to keep the order of URLs would be a possible, though admittedly not very important, enhancement. So, besides leaving this here as an enhancement suggestion, I would like to ask whether there is a way – even by modifying the code – to turn off the multithreaded behaviour described above. The other possible workarounds that don't touch wpull itself (e.g. wpulling the URLs one by one) are far less efficient, as far as I can imagine.

Thank you in advance.

Contributor

JustAnotherArchivist commented Oct 10, 2018

If I understand the issue correctly, this should be partially fixed by 46f0ea5 in #393. Specifically, that commit fixes what you describe about URLs being fetched from the DB ahead of time.

However, wpull will still first retrieve all todos, then all errors. So even if you return Actions.RETRY from handle_error, it won't retry that URL immediately. Unfortunately, the URL prioritisation I also added in #393 won't help here either.

How about this workaround? Instead of adding all URLs in a list or through the command line at once, add them one by one in get_urls. So when https://example.org/list?page=<n> has been retrieved successfully and get_urls is called on it, return https://example.org/list?page=<n+1>. This way, wpull doesn't even know about the next page until you're ready to retrieve it. This definitely won't work in all cases though.
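A minimal sketch of that approach, assuming the wpull 1.x hook-script mechanism and the hypothetical https://example.org/list?page=<n> pattern from above (the get_urls signature varies between hook API versions, so adjust as needed):

```python
import re

# Sketch only: wpull_hook is the global that wpull injects into hook
# scripts, and the URL pattern is the hypothetical example from above.

def get_urls(filename, url_info, document_info):
    match = re.match(r'(https://example\.org/list\?page=)(\d+)$', url_info['url'])
    if not match:
        return None
    # This page was fetched successfully (get_urls is only called on
    # successful retrievals), so only now reveal the next page to wpull.
    next_url = match.group(1) + str(int(match.group(2)) + 1)
    return [{'url': next_url}]

wpull_hook.callbacks.get_urls = get_urls
```

Since get_urls is never called for a page that ultimately fails, the chain simply stops at that page, which may or may not be what you want.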
