URL prioritisation, split meta WARCs, and miscellaneous bug fixes #393

Open

wants to merge 41 commits into base: develop
Conversation

@JustAnotherArchivist (Contributor)

This PR contains the various things I've implemented and fixed back in January and February on my fork, listed below.

Sorry for the huge PR. I was working based on version 2.0.3 (then only available on FalconK's fork) back then, and separating this into individual PRs for each implemented feature/fixed bug now would be really painful. Let me know if you want me to do it anyway.

@ivan (Contributor) commented Oct 10, 2018

You might have noticed already, but your tests are failing in at least:

  File "/home/travis/build/ArchiveTeam/wpull/thematrix/wpull/testing/integration/priorisation_test.py", line 161, in test_app_priority_plugin_get_urls_with_priorities
    self.assertEqual(builder.factory['Statistics'].files, 11)
nose.proxy.AssertionError: 5 != 11

Some other tests are probably failing from the earlier merge. Because it would be very tricky to continue test-driven development with failing tests, they should probably all be fixed (or some skipped?) as part of this PR (or in a PR to land before this one, if you wish).

@JustAnotherArchivist (Contributor, Author)

I completely agree.

I remember some tests failing back in February, but definitely not that many. And yeah, some of the failures definitely come from @falconkirtaran's changes, e.g. the "unhashable type: dict" ones.

I'll look into it. To keep things separate, I will probably fix only the tests broken by my changes in this PR, then prepare a separate PR for the other broken tests.

@JustAnotherArchivist (Contributor, Author)

My memory was correct: I did get some test failures back in February, but not all of these. As it turns out, this is caused by a version difference in Tornado: my development venv had Tornado 4.5.1 whereas the Travis CI test used 4.5.3.

The problematic commit is tornadoweb/tornado@84bb2e2 (thanks for finding this, @ivan). It replaces localhost with 127.0.0.1 in the return value of tornado.testing.AsyncHTTPTestCase.get_url, reasoning that due to "apparently a recent macos change, connecting to localhost when the ipv6 port is unbound now incurs a 200ms delay, slowing a full test run down by a factor of 20".

Because wpull treats localhost as a different host than 127.0.0.1, all kinds of things in the test suite break (e.g. links to localhost get skipped, which causes the AssertionErrors that the number of files doesn't match the expected value).

I'll override tornado.testing.AsyncHTTPTestCase.get_url in our subclass to always return a localhost URL. Perhaps we should switch to 127.0.0.1 in the future if that delay mentioned in Tornado's commit message becomes a problem.
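For reference, a minimal sketch of what that override might look like (the base class name here is hypothetical, not the exact wpull test helper):

```python
import tornado.testing


class LocalhostHTTPTestCase(tornado.testing.AsyncHTTPTestCase):
    """Hypothetical base class for wpull's HTTP integration tests."""

    def get_url(self, path):
        # Tornado 4.5.2+ builds this URL with 127.0.0.1; force "localhost"
        # again so wpull's host handling in the test suite behaves as before.
        return '%s://localhost:%s%s' % (
            self.get_protocol(), self.get_http_port(), path)
```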

@JustAnotherArchivist (Contributor, Author)

That looks much better. The remaining five errors are due to FalconK's changes.

@ivan, could you review please when you get a chance?

@ivan (Contributor) left a comment

s/priorisation/prioritisation/ or the z equivalent in your commit messages as well

I would recommend hard-wrapping your commit messages in the future because most of the git tools either don't wrap or character-wrap.

Review comments (all since resolved, on now-outdated code) were left on:
- wpull/application/options.py (4 comments)
- wpull/application/tasks/rule.py (1 comment)
- wpull/warc/recorder.py (2 comments)
- wpull/warc/recorder_test.py (3 comments)
@JustAnotherArchivist (Contributor, Author)

I just realised that my prioritisation code probably won't work as expected. It does assign priorities correctly, but it will still process all todo URLs first and only then go to the errors. So if a prioritised URL errors out, it will still only be retried after all the other todos have been processed.

It should probably process all URLs of the highest priority first, doing the todos and then retrying the errors, and only then move on to the next lower priority. Implementing this will probably require some changes in how the pipeline and the database interact though. Currently, the pipeline requests an item with either status todo or error, i.e. the pipeline manages when to switch to retries, so to speak. The database could return NotFound when the highest priority among the errors is higher than that among the todos, so that the pipeline switches to errors and processes those first, but that's a really awful hack.
Instead, the todo vs. error logic should probably be moved to the database, and the pipeline would just retrieve an arbitrary item and schedule it for processing.

This is mostly a note for myself so I remember to look into this.
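For what it's worth, a minimal sketch of the pipeline side under the "move it into the database" variant (the NotFound import path and method names are assumptions, not necessarily wpull's actual API):

```python
from wpull.database.base import NotFound  # assumed import path


def next_url_record(url_table):
    # The URLTable decides the ordering internally (highest priority first,
    # todos before error retries); the pipeline only asks for the next item
    # and treats NotFound as "nothing left to do".
    try:
        return url_table.check_out()
    except NotFound:
        return None
```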

@anarcat mentioned this pull request Nov 2, 2018

@JustAnotherArchivist changed the title from "URL priorisation, split meta WARCs, and miscellaneous bug fixes" to "URL prioritisation, split meta WARCs, and miscellaneous bug fixes" Nov 3, 2018
@JustAnotherArchivist (Contributor, Author)

s/priorisation/prioritisation/ or the z equivalent in your commit messages as well

Do you want me to edit the commit messages (and thereby rewrite the history of my branch, requiring a force-push) or not?
I guess I'm fine with that, assuming GitHub handles it gracefully. Just want to make sure.

@ivan (Contributor) commented Nov 3, 2018

You can edit and force-push. I'm not sure how GitHub handles the review comments, but the PR would require a re-review anyway.

@JustAnotherArchivist (Contributor, Author)

Here we go... For reference, the old branch is preserved at https://github.com/JustAnotherArchivist/wpull/tree/pr393-old.

@JustAnotherArchivist (Contributor, Author)

I need to rebase again. Forgot to update the commit IDs in the commit messages.
I also still need to look into #393 (comment).
So not ready for merging yet.

@JustAnotherArchivist (Contributor, Author)

@ivan, could you re-review please?

@anarcat commented Nov 15, 2018

tests are still red though... :/

@JustAnotherArchivist (Contributor, Author)

@anarcat Yeah, those are due to @falconkirtaran's changes between 2.0.1 and 2.0.3. As mentioned earlier, I only fixed the test failures related to my changes. The remaining ones can be addressed afterwards.

@anarcat commented Nov 15, 2018

okay, then maybe that can be banged out in #402

@ivan (Contributor) commented Nov 15, 2018

I was waiting on confirmation that the new queries (which now include priority) didn't regress performance (discussed on IRC; becoming slower here would be bad) before reviewing this again, but if someone else wants to review this, that is fine with me :-)

@JustAnotherArchivist (Contributor, Author)

With the added index and INDEXED BY clause in d8fee1a, performance is somewhat worse than before. There really is no way around a small regression since the index is larger than before, so SQLite has more data to crunch on each check-out.

I tested the performance with a large (~10 GiB) DB that I had from a crashed ArchiveBot job. It has over 50 million URLs with roughly 15 million processed. I did not do extensive tests with other databases, but I think this one should cover a fairly extreme case. The test consists of checking out and back in 1000 entries from the DB. The script can be found here. The same script works with both 2.0.3 and d8fee1a; it uses code inspection to determine the number of arguments to check_out to distinguish between the two.

Averaged across 10 runs each, interleaved, on the same machine: 9.3 seconds for the 2.0.3 code, 12.1 seconds for d8fee1a. In other words, a performance penalty of about 30 %. That's not too surprising given the increase in index size: an index on just the status column covers 11 bytes per row while the new index covers the same 11 bytes for status plus 1 byte for priority (because it's all zero in this test, another extreme case since the index on priority is effectively useless) plus 3 or 4 bytes for id (not sure exactly how SQLite switches between the integer representations; cf. here) = 15-16 bytes in total or about 45 % more...

This is certainly not a negligible performance impact, but I'm not sure there's any way to reduce it. At the same time, I am sure that there are other (and simpler?) ways to improve the database's performance by at least a factor of 20 (#427), so I don't think these 30 % matter too much.

I intend to do some more tests with different priority values in the table (instead of everything at zero). I'd expect the performance hit to be less severe then since the new index would be more efficient.
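For reference, the benchmark essentially boils down to the following sketch (the real script is linked above; the check_in signature and the plain-string status value are simplified assumptions here):

```python
import inspect
import time


def benchmark_checkout(url_table, n=1000):
    """Check out and check back in n entries, returning elapsed seconds."""
    # 2.0.3's check_out() takes a status filter, the patched one takes no
    # arguments; inspect the signature so the same script works with both.
    takes_status = bool(inspect.signature(url_table.check_out).parameters)

    start = time.perf_counter()
    records = []
    for _ in range(n):
        records.append(url_table.check_out('todo') if takes_status
                       else url_table.check_out())
    for record in records:
        # Check the entries back in so repeated runs see the same data.
        url_table.check_in(record.url, 'todo')
    return time.perf_counter() - start
```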

@JustAnotherArchivist (Contributor, Author)

SQLAlchemy 1.3.0 broke my implementation of the INDEXED BY clause via sqlalchemy/sqlalchemy#4509. So that part will require some more work. Although in the context of #427, getting rid of SQLAlchemy entirely is an option, so I might leave that for later and just limit the SQLAlchemy version to <1.3.0 if nobody objects.
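Concretely, the cap would just be a version bound along these lines (the exact lower bound in wpull's setup.py may differ):

```python
# setup.py sketch: cap SQLAlchemy until the INDEXED BY support is reworked.
install_requires = [
    'sqlalchemy>=1.0,<1.3.0',
]
```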

Plugins are attached to existing HookableMixin instances as before. In addition,
they are also kept in an attribute of the HookableMixin class to allow attaching
when an instance is created after the plugin has been loaded.

This commit restores the plugin loading behaviour of version 1.2.3. It therefore
breaks backward compatibility in some cases; plugins written for versions 2.0.x
may need to be adapted.
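A rough sketch of that mechanism (heavily simplified; not the actual HookableMixin implementation):

```python
class HookableMixin:
    # Class-level registry of plugin attachment callbacks. The real code also
    # attaches plugins to instances that already exist; that part is omitted.
    _plugin_attachers = []

    def __init__(self):
        super().__init__()
        # Instances created after the plugin was loaded still get it attached.
        for attach in type(self)._plugin_attachers:
            attach(self)

    @classmethod
    def register_plugin(cls, attach):
        cls._plugin_attachers.append(attach)
```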
When URLRecord.__init__ is executed, it first calls super().__init__, which
resolves to URLPriorities's __init__ method. However, because URLPriorities.__init__
does not call super().__init__, the call chain never continues down the hierarchy,
so the fields of URLData and URLResult are never defined.

NB: There's still a bug in that URLRecord.database_attributes is incorrect, but
since that class is never used as a URLDatabaseMixin, that should not be a problem.
It might be worth overriding the database_items method though to prevent such
usage explicitly.
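This is the standard cooperative multiple inheritance pitfall; a self-contained illustration with simplified bodies (the field names here are made up):

```python
class URLResult:
    def __init__(self):
        super().__init__()
        self.status_code = None


class URLData:
    def __init__(self):
        super().__init__()
        self.content = None


class URLPriorities:
    def __init__(self):
        super().__init__()  # the fix: without this call, the chain stops here
        self.priority = 0


class URLRecord(URLPriorities, URLData, URLResult):
    def __init__(self):
        super().__init__()  # resolves to URLPriorities.__init__ first (MRO)


record = URLRecord()
assert hasattr(record, 'content') and hasattr(record, 'status_code')
```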
Version 0.99999999 (eight nines) and up do not provide the `html5lib.tokenizer` API used by wpull.
…#355)

Note that this is only implemented for the scrapers; wpull.url.URLInfo.parse should do the same thing.
As a side effect, this also makes wpull parse http://:@example.com:?@/ correctly;
the test suite previously expected this to throw an exception, but that URL is
perfectly parseable (although it does cause a validation error per the URL Standard).
According to that commit, there is a 200 ms delay on macOS when connecting to
localhost and only IPv4 is bound (which is what Tornado does). For that reason,
they replaced "localhost" with "127.0.0.1" in tornado.testing.AsyncHTTPTestCase.get_url.

Because wpull treats "localhost" and "127.0.0.1" as different hosts, this breaks
a variety of tests for different reasons (URLs getting skipped, cookies not set,
etc.) when using Tornado 4.5.3.

As a workaround, this commit restores the previous behaviour of using "localhost".
This may incur a performance penalty during testing on some platforms.
… hook

instead of copying, and add a note to the documentation that modifying the
objects passed into callbacks produces undefined behaviour.
Previously, wpull would first try all todos once, then retry all errors. However,
priorities are supposed to take precedence, i.e. first all URLs of the highest
priority should be retried until they are all either done or skipped, and only
then should URLs with the next lower priority start getting processed.

There is a backwards-incompatible change in the URLTable.check_out API: this
method no longer takes any parameters. Previously, it would get a status to filter
by, and there was also a level parameter (which was never used anywhere though),
and the function would be first called with status todo and, if there were no
todo URLs anymore, again with status error. This logic has been moved to the
URLTable, i.e. it is now the URLTable's duty to return items in that order.
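A sketch of the ordering the URLTable now has to produce, written as plain SQL in a Python string (the actual wpull query is built with SQLAlchemy, and the table and column names here are assumptions):

```python
# Highest priority first; within a priority, fresh 'todo' items before
# 'error' retries. Assumes larger priority values mean higher priority.
CHECK_OUT_SQL = '''
    SELECT id, url FROM urls
    WHERE status IN ('todo', 'error')
    ORDER BY priority DESC,
             CASE status WHEN 'todo' THEN 0 ELSE 1 END,
             id
    LIMIT 1
'''
```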
SQLite's query planner for some reason doesn't realise that it should use that index for the check_out query. Without the INDEXED BY, it's instead doing a full table scan for every check-out, which is obviously horrible for performance.
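To illustrate the mechanism with a standalone sqlite3 snippet (not wpull's actual schema or query): create the composite index, then compare the plan SQLite chooses with and without the INDEXED BY hint.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE urls (id INTEGER PRIMARY KEY, status TEXT, priority INTEGER);
    CREATE INDEX ix_urls_status_priority ON urls (status, priority, id);
''')

query = ("SELECT id FROM urls {hint} "
         "WHERE status = 'todo' ORDER BY priority DESC LIMIT 1")

for hint in ('', 'INDEXED BY ix_urls_status_priority'):
    plan = conn.execute('EXPLAIN QUERY PLAN ' + query.format(hint=hint)).fetchall()
    # The last column of each plan row names the strategy, e.g. a full
    # "SCAN TABLE urls" vs. "SEARCH TABLE urls USING INDEX ...". On the real
    # wpull table, the unhinted query was planned as a full table scan.
    print(hint or '(no hint)', '->', [row[-1] for row in plan])
```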