URL prioritisation, split meta WARCs, and miscellaneous bug fixes #393
base: develop
Conversation
You might have noticed already, but your tests are failing in at least
Some other tests are probably failing from the earlier merge. Because it would be very tricky to continue test-driven development with failing tests, they should probably all be fixed (or some skipped?) as part of this PR (or in a PR to land before this one, if you wish).
I completely agree. I remember some tests failing back in February, but definitely not that many. And yeah, some of the failures definitely come from @falconkirtaran's changes. I'll look into it. To keep things separate, I will probably fix only the tests broken by my changes in this PR, then prepare a separate PR for the other broken tests.
My memory was correct: I did get some test failures back in February, but not all of these. As it turns out, this is caused by a version difference in Tornado: my development venv had Tornado 4.5.1 whereas the Travis CI test used 4.5.3. The problematic commit is tornadoweb/tornado@84bb2e2 (thanks for finding this, @ivan). It replaces "localhost" with "127.0.0.1" in tornado.testing.AsyncHTTPTestCase.get_url to avoid a connection delay on macOS. Because wpull treats "localhost" and "127.0.0.1" as different hosts, this breaks a variety of tests for different reasons (URLs getting skipped, cookies not set, etc.). I'll override get_url to restore the previous "localhost" behaviour.
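The workaround boils down to making the test URLs use "localhost" again. A minimal sketch (illustrative only; wpull's actual fix overrides get_url in its Tornado test base class):

```python
# Hypothetical helper: rewrite a 127.0.0.1 test URL back to "localhost",
# since wpull treats the two as different hosts (cookies, skip rules, etc.).
def localhost_url(url):
    return url.replace('//127.0.0.1:', '//localhost:', 1)

# In a test base class, this would wrap the URL that Tornado 4.5.3's
# get_url builds, e.g.:
#     def get_url(self, path):
#         return localhost_url(super().get_url(path))
```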
That looks much better. The remaining five errors are due to FalconK's changes. @ivan, could you review please when you get a chance?
s/priorisation/prioritisation/ or the z equivalent in your commit messages as well
I would recommend hard-wrapping your commit messages in the future because most of the git tools either don't wrap or character-wrap.
I just realised that my prioritisation code probably won't work as expected. It does assign priorities correctly, but it will still first process all todo URLs and only then go to errors. So if a prioritised URL errors out, it will still only be retried after all the other todos have been processed. It should instead work through all highest-priority URLs first, doing the todos and then retrying the errors, and only then go to the next lower priority.

Implementing this will probably require some changes in how the pipeline and the database interact though. Currently, the pipeline requests an item with either status todo or error, i.e. the pipeline manages when to switch to retries, so to speak. Instead, the database could return the items in the correct order itself.

This is mostly a note for myself so I remember to look into this.
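The intended ordering (exhaust todos and retry errors within one priority level before dropping to the next) can be expressed as a single query. A sketch using SQLite; the table and column names here are illustrative, not wpull's actual schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE urls (url TEXT, priority INTEGER, status TEXT)')
conn.executemany('INSERT INTO urls VALUES (?, ?, ?)', [
    ('a', 0, 'todo'),
    ('b', 5, 'error'),   # high priority, needs a retry
    ('c', 5, 'todo'),    # high priority, not yet tried
    ('d', 0, 'error'),
])
# Within each priority level, process todos before retrying errors;
# only then fall through to the next lower priority.
rows = conn.execute(
    "SELECT url FROM urls WHERE status IN ('todo', 'error') "
    "ORDER BY priority DESC, CASE status WHEN 'todo' THEN 0 ELSE 1 END"
).fetchall()
print([r[0] for r in rows])  # → ['c', 'b', 'a', 'd']
```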
Do you want me to edit the commit messages (and thereby rewrite the history of my branch, requiring a force-push) or not?
You can edit and force-push. I'm not sure how GitHub handles the review comments, but the PR would require a re-review anyway.
Force-pushed from e4cde6e to e6e1758
Here we go... For reference, the old branch is preserved at https://github.com/JustAnotherArchivist/wpull/tree/pr393-old.
I need to rebase again. Forgot to update the commit IDs in the commit messages.
Force-pushed from 78f6ace to f486d27
@ivan, could you re-review please?
tests are still red though... :/
@anarcat Yeah, those are due to @falconkirtaran's changes between 2.0.1 and 2.0.3. As mentioned earlier, I only fixed the test failures related to my changes. The remaining ones can be addressed afterwards.
okay, then maybe that can be banged out in #402
I was waiting on confirmation that the new queries (which now include priority) didn't regress performance (discussed on IRC; becoming slower here would be bad) before reviewing this again, but if someone else wants to review this, that is fine with me :-)
With the added index and INDEXED BY clause in d8fee1a, performance is somewhat worse than before. There really is no way around a small regression since the index is larger than before, so SQLite has more data to crunch on each check-out.

I tested the performance with a large (~10 GiB) DB that I had from a crashed ArchiveBot job. It has over 50 million URLs with roughly 15 million processed. I did not do extensive tests with other databases, but I think this one should cover a fairly extreme case. The test consists of checking out and back in 1000 entries from the DB. The script can be found here. The same script works with both 2.0.3 and d8fee1a; it uses code inspection to determine the number of arguments to check_out.

Averaged across 10 runs each, interleaved, on the same machine: 9.3 seconds for the 2.0.3 code, 12.1 seconds for d8fee1a. In other words, a performance penalty of about 30 %. That's not too surprising given the increase in index size: an index on just the status column is considerably smaller than one covering both priority and status.

This is certainly not a negligible performance impact, but I'm not sure there's any way to reduce it. At the same time, I am sure that there are other (and simpler?) ways to improve the database's performance by at least a factor of 20 (#427), so I don't think these 30 % matter too much.

I intend to do some more tests with different priority values in the table (instead of everything at zero). I'd expect the performance hit to be less severe then since the new index would be more efficient.
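The measurement procedure described above (check out and check back in 1000 entries, averaged over repeated runs) could be harnessed like this; the check_out/check_in callables stand in for the two code versions being compared, and their exact signatures here are assumptions:

```python
import time

def bench(check_out, check_in, n=1000, runs=10):
    """Time n check-out/check-in round trips against a URL table,
    averaged over several runs. check_out() is assumed to return an
    item; check_in(item) marks it processed again."""
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(n):
            item = check_out()
            check_in(item)
        totals.append(time.perf_counter() - start)
    return sum(totals) / len(totals)
```

Interleaving the runs of the two versions (as done above) helps cancel out cache-warming and background-load effects.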
SQLAlchemy 1.3.0 broke my implementation of the
Plugins are attached to existing HookableMixin instances as before. In addition, they are also kept in an attribute of the HookableMixin class to allow attaching when an instance is created after the plugin has been loaded. This commit restores the plugin loading behaviour of version 1.2.3. It therefore breaks backward compatibility in some cases; plugins written for versions 2.0.x may need to be adapted.
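The mechanism described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction, not wpull's actual code: callbacks registered at class level before any instance exists are replayed onto every instance created later.

```python
class HookableMixin:
    # Class-level registry (illustrative sketch): plugin callbacks
    # attached before an instance exists are picked up by every
    # instance created afterwards.
    _class_hooks = []

    def __init__(self):
        # Start with whatever plugins are already attached at class
        # level, then allow per-instance additions as before.
        self.hooks = list(self._class_hooks)

    @classmethod
    def register_plugin_hook(cls, callback):
        cls._class_hooks.append(callback)


def my_plugin_callback(item):
    pass

# The plugin loads first; an instance created afterwards still sees it.
HookableMixin.register_plugin_hook(my_plugin_callback)
instance = HookableMixin()
assert my_plugin_callback in instance.hooks
```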
When URLRecord.__init__ is executed, it first calls super().__init__, which resolves to URLPriorities' init method. However, because URLPriorities.__init__ does not call super().__init__, the chain never proceeds further up the hierarchy, so the fields of URLData and URLResult are never defined. NB: There's still a bug in that URLRecord.database_attributes is incorrect, but since that class is never used as a URLDatabaseMixin, that should not be a problem. It might be worth overriding the database_items method though to prevent such usage explicitly.
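The fix amounts to every class in the cooperative chain calling super().__init__. A minimal sketch of the pattern; the class names mirror those in the commit message, but the field names here are illustrative:

```python
class URLResult:
    def __init__(self):
        super().__init__()
        self.status_code = None  # illustrative field

class URLData:
    def __init__(self):
        super().__init__()
        self.post_data = None  # illustrative field

class URLPriorities:
    def __init__(self):
        # Without this super() call, the MRO chain stops here and the
        # URLData/URLResult fields above are never defined.
        super().__init__()
        self.priority = 0

class URLRecord(URLPriorities, URLData, URLResult):
    def __init__(self):
        super().__init__()

# MRO: URLRecord -> URLPriorities -> URLData -> URLResult -> object,
# so every __init__ in the chain runs exactly once.
record = URLRecord()
```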
Version 0.99999999 (eight nines) and up do not provide the `html5lib.tokenizer` API used by wpull.
…#355) Note that this is only implemented for the scrapers; wpull.url.URLInfo.parse should do the same thing.
As a side effect, this also makes wpull parse http://:@example.com:?@/ correctly; the test suite previously expected this to throw an exception, but that URL is perfectly parseable (although it does cause a validation error per the URL Standard).
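For comparison (wpull's URLInfo has its own parser), Python's stdlib urllib.parse also accepts this URL and recovers the components:

```python
from urllib.parse import urlsplit

# "http://:@example.com:?@/" has empty credentials before the "@",
# an empty port after the trailing colon, and "@/" as the query string.
parts = urlsplit('http://:@example.com:?@/')
print(parts.hostname)   # → example.com
print(repr(parts.username), repr(parts.password))  # both empty strings
print(repr(parts.query))  # the "@/" after the "?" is the query
```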
According to that commit, there is a 200 ms delay on macOS when connecting to localhost and only IPv4 is bound (which is what Tornado does). For that reason, they replaced "localhost" with "127.0.0.1" in tornado.testing.AsyncHTTPTestCase.get_url. Because wpull treats "localhost" and "127.0.0.1" as different hosts, this breaks a variety of tests for different reasons (URLs getting skipped, cookies not set, etc.) when using Tornado 4.5.3. As a workaround, this commit restores the previous behaviour of using "localhost". This may incur a performance penalty during testing on some platforms.
… hook instead of copying, and add a note to the documentation that modifying the objects passed into callbacks produces undefined behaviour.
Previously, wpull would first try all todos once, then retry all errors. However, priorities are supposed to take precedence, i.e. first all URLs of the highest priority should be retried until they are all either done or skipped, and only then should URLs with the next lower priority start getting processed. There is a backwards-incompatible change in the URLTable.check_out API: this method no longer takes any parameters. Previously, it took a status to filter by and a level parameter (which was never used anywhere); the function would first be called with status todo and, once there were no todo URLs left, again with status error. This logic has been moved to the URLTable, i.e. it is now the URLTable's duty to return items in that order.
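The shape of the API change can be sketched like this. This is an illustrative reconstruction, not wpull's implementation; the item storage and field names are assumptions:

```python
class NotFound(Exception):
    """Raised when no checkable-out item remains (illustrative)."""

class URLTable:
    def __init__(self, items):
        self.items = items  # dicts with 'url', 'priority', 'status'

    def check_out(self):
        # No status/level parameters any more: the table itself decides
        # the order -- higher priority first, and within one priority
        # level 'todo' before retrying 'error'.
        candidates = [i for i in self.items
                      if i['status'] in ('todo', 'error')]
        if not candidates:
            raise NotFound()
        item = min(candidates,
                   key=lambda i: (-i['priority'], i['status'] != 'todo'))
        item['status'] = 'in_progress'
        return item
```

The caller simply loops on check_out() until NotFound, instead of managing the todo-then-error switch itself.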
SQLite's query planner for some reason doesn't realise that it should use that index for the check_out query. Without the INDEXED BY, it's instead doing a full table scan for every check-out, which is obviously horrible for performance.
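The effect of INDEXED BY can be checked with EXPLAIN QUERY PLAN. A sketch with sqlite3; the table, index, and query here are illustrative, not wpull's actual schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE urls (url TEXT, priority INTEGER, status TEXT)')
conn.execute('CREATE INDEX ix_priority_status ON urls (priority, status)')

# INDEXED BY pins the planner to the named index; if the index cannot
# satisfy the query, SQLite raises an error instead of silently falling
# back to a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT url FROM urls INDEXED BY ix_priority_status "
    "WHERE priority = 0 AND status = 'todo'"
).fetchall()
print(plan)
```

Running the same EXPLAIN QUERY PLAN without the INDEXED BY clause is a quick way to see which plan the optimizer would otherwise pick.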
d8fee1a
to
6530b28
Compare
6530b28
to
37dfdf9
Compare
This PR contains the various things I implemented and fixed back in January and February on my fork, listed below.
Sorry for the huge PR. I was working based on version 2.0.3 (then only available on FalconK's fork) back then, and separating this into individual PRs for each implemented feature/fixed bug now would be really painful. Let me know if you want me to do it anyway.
- --priority-regex/--priority-domain and the get_priority hook can be used to determine in which order URLs are retrieved.
- Fixed --concurrent having no effect (--concurrent doesn't enable concurrency #339).
- --warc-split-meta
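The get_priority hook mentioned above might look something like this in a plugin. The hook name comes from this PR, but the argument names and the convention that a larger return value means "fetch sooner" are assumptions for illustration; the actual plugin registration API is not shown:

```python
import re

def get_priority(url_info, url_record):
    # Hypothetical example policy: bump sitemaps and feeds ahead of
    # ordinary pages so the crawl discovers new URLs early.
    if re.search(r'(sitemap|feed|rss)', url_info.url):
        return 10
    return 0
```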