URL prioritisation, split meta WARCs, and miscellaneous bug fixes #393

Open

wants to merge 41 commits into base: develop
Conversation

@JustAnotherArchivist (Contributor)

This PR contains the various things I've implemented and fixed back in January and February on my fork, listed below.

Sorry for the huge PR. I was working based on version 2.0.3 (then only available on FalconK's fork) back then, and separating this into individual PRs for each implemented feature/fixed bug now would be really painful. Let me know if you want me to do it anyway.

@ivan (Contributor) commented Oct 10, 2018

You might have noticed already, but your tests are failing in at least:

  File "/home/travis/build/ArchiveTeam/wpull/thematrix/wpull/testing/integration/priorisation_test.py", line 161, in test_app_priority_plugin_get_urls_with_priorities
    self.assertEqual(builder.factory['Statistics'].files, 11)
nose.proxy.AssertionError: 5 != 11

Some other tests are probably failing from the earlier merge. Because it would be very tricky to continue test-driven development with failing tests, they should probably all be fixed (or some skipped?) as part of this PR (or in a PR to land before this one, if you wish).

@JustAnotherArchivist (Contributor, Author)

I completely agree.

I remember some tests failing back in February, but definitely not that many. And yeah, some of the failures definitely come from @falconkirtaran's changes, e.g. the "unhashable type: dict" ones.

I'll look into it. To keep things separate, I will probably fix only the tests broken by my changes in this PR, then prepare a separate PR for the other broken tests.

@JustAnotherArchivist (Contributor, Author)

My memory was correct: I did get some test failures back in February, but not all of these. As it turns out, this is caused by a version difference in Tornado: my development venv had Tornado 4.5.1 whereas the Travis CI test used 4.5.3.

The problematic commit is tornadoweb/tornado@84bb2e2 (thanks for finding this, @ivan). It replaces localhost with 127.0.0.1 in the return value of tornado.testing.AsyncHTTPTestCase.get_url, reasoning that due to "apparently a recent macos change, connecting to localhost when the ipv6 port is unbound now incurs a 200ms delay, slowing a full test run down by a factor of 20".

Because wpull treats localhost as a different host than 127.0.0.1, all kinds of things in the test suite break (e.g. links to localhost get skipped, which causes the AssertionErrors that the number of files doesn't match the expected value).

I'll override tornado.testing.AsyncHTTPTestCase.get_url in our subclass to always return a localhost URL. Perhaps we should switch to 127.0.0.1 in the future if that delay mentioned in Tornado's commit message becomes a problem.
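For reference, a minimal sketch of what that override might look like (the base class name here is hypothetical, not the exact wpull test helper):

```python
import tornado.testing


class LocalhostHTTPTestCase(tornado.testing.AsyncHTTPTestCase):
    """Hypothetical base class for wpull's HTTP integration tests."""

    def get_url(self, path):
        # Tornado 4.5.2+ builds this URL with 127.0.0.1; force "localhost"
        # again so wpull's host handling in the test suite behaves as before.
        return '%s://localhost:%s%s' % (
            self.get_protocol(), self.get_http_port(), path)
```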

@JustAnotherArchivist (Contributor, Author)

That looks much better. The remaining five errors are due to FalconK's changes.

@ivan, could you review please when you get a chance?

@ivan (Contributor) left a comment

s/priorisation/prioritisation/ or the z equivalent in your commit messages as well

I would recommend hard-wrapping your commit messages in the future because most of the git tools either don't wrap or character-wrap.

Review comments (all since resolved, on now-outdated code) were left on:
- wpull/application/options.py (4 comments)
- wpull/application/tasks/rule.py (1 comment)
- wpull/warc/recorder.py (2 comments)
- wpull/warc/recorder_test.py (3 comments)
@JustAnotherArchivist (Contributor, Author)

I just realised that my prioritisation code probably won't work as expected. It does assign priorities correctly, but it will still process all todo URLs first and only then go to the errors. So if a prioritised URL errors out, it will still only be retried after all the other todos have been processed.

It should probably process all URLs of the highest priority first, doing the todos and then retrying the errors, and only then move on to the next lower priority. Implementing this will probably require some changes in how the pipeline and the database interact though. Currently, the pipeline requests an item with either status todo or error, i.e. the pipeline manages when to switch to retries, so to speak. The database could return NotFound when the highest priority among the errors is higher than that among the todos, so that the pipeline switches to errors and processes those first, but that's a really awful hack.
Instead, the todo vs. error logic should probably be moved to the database, and the pipeline would just retrieve an arbitrary item and schedule it for processing.

This is mostly a note for myself so I remember to look into this.
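For what it's worth, a minimal sketch of the pipeline side under the "move it into the database" variant (the NotFound import path and method names are assumptions, not necessarily wpull's actual API):

```python
from wpull.database.base import NotFound  # assumed import path


def next_url_record(url_table):
    # The URLTable decides the ordering internally (highest priority first,
    # todos before error retries); the pipeline only asks for the next item
    # and treats NotFound as "nothing left to do".
    try:
        return url_table.check_out()
    except NotFound:
        return None
```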

@anarcat mentioned this pull request Nov 2, 2018

@JustAnotherArchivist changed the title from "URL priorisation, split meta WARCs, and miscellaneous bug fixes" to "URL prioritisation, split meta WARCs, and miscellaneous bug fixes" Nov 3, 2018
@JustAnotherArchivist (Contributor, Author)

s/priorisation/prioritisation/ or the z equivalent in your commit messages as well

Do you want me to edit the commit messages (and thereby rewrite the history of my branch, requiring a force-push) or not?
I guess I'm fine with that, assuming GitHub handles it gracefully. Just want to make sure.

@ivan (Contributor) commented Nov 3, 2018

You can edit and force-push. I'm not sure how GitHub handles the review comments, but the PR would require a re-review anyway.

@JustAnotherArchivist (Contributor, Author)

Here we go... For reference, the old branch is preserved at https://github.com/JustAnotherArchivist/wpull/tree/pr393-old.

@JustAnotherArchivist (Contributor, Author)

I need to rebase again. Forgot to update the commit IDs in the commit messages.
I also still need to look into #393 (comment).
So not ready for merging yet.

@JustAnotherArchivist (Contributor, Author)

@ivan, could you re-review please?

@anarcat commented Nov 15, 2018

tests are still red though... :/

@JustAnotherArchivist (Contributor, Author)

@anarcat Yeah, those are due to @falconkirtaran's changes between 2.0.1 and 2.0.3. As mentioned earlier, I only fixed the test failures related to my changes. The remaining ones can be addressed afterwards.

@anarcat commented Nov 15, 2018

okay, then maybe that can be banged out in #402

@ivan (Contributor) commented Nov 15, 2018

I was waiting on confirmation that the new queries (which now include priority) didn't regress performance (discussed on IRC; becoming slower here would be bad) before reviewing this again, but if someone else wants to review this, that is fine with me :-)

@JustAnotherArchivist (Contributor, Author)

With the added index and INDEXED BY clause in d8fee1a, performance is somewhat worse than before. There really is no way around a small regression since the index is larger than before, so SQLite has more data to crunch on each check-out.

I tested the performance with a large (~10 GiB) DB that I had from a crashed ArchiveBot job. It has over 50 million URLs with roughly 15 million processed. I did not do extensive tests with other databases, but I think this one should cover a fairly extreme case. The test consists of checking out and back in 1000 entries from the DB. The script can be found here. The same script works with both 2.0.3 and d8fee1a; it uses code inspection to determine the number of arguments to check_out to distinguish between the two.

Averaged across 10 runs each, interleaved, on the same machine: 9.3 seconds for the 2.0.3 code, 12.1 seconds for d8fee1a. In other words, a performance penalty of about 30 %. That's not too surprising given the increase in index size: an index on just the status column covers 11 bytes per row while the new index covers the same 11 bytes for status plus 1 byte for priority (because it's all zero in this test, another extreme case since the index on priority is effectively useless) plus 3 or 4 bytes for id (not sure exactly how SQLite switches between the integer representations; cf. here) = 15-16 bytes in total or about 45 % more...

This is certainly not a negligible performance impact, but I'm not sure there's any way to reduce it. At the same time, I am sure that there are other (and simpler?) ways to improve the database's performance by at least a factor of 20 (#427), so I don't think these 30 % matter too much.

I intend to do some more tests with different priority values in the table (instead of everything at zero). I'd expect the performance hit to be less severe then since the new index would be more efficient.
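For reference, the benchmark essentially boils down to the following sketch (the real script is linked above; the check_in signature and the plain-string status value are simplified assumptions here):

```python
import inspect
import time


def benchmark_checkout(url_table, n=1000):
    """Check out and check back in n entries, returning elapsed seconds."""
    # 2.0.3's check_out() takes a status filter, the patched one takes no
    # arguments; inspect the signature so the same script works with both.
    takes_status = bool(inspect.signature(url_table.check_out).parameters)

    start = time.perf_counter()
    records = []
    for _ in range(n):
        records.append(url_table.check_out('todo') if takes_status
                       else url_table.check_out())
    for record in records:
        # Check the entries back in so repeated runs see the same data.
        url_table.check_in(record.url, 'todo')
    return time.perf_counter() - start
```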

@JustAnotherArchivist (Contributor, Author)

SQLAlchemy 1.3.0 broke my implementation of the INDEXED BY clause via sqlalchemy/sqlalchemy#4509. So that part will require some more work. Although in the context of #427, getting rid of SQLAlchemy entirely is an option, so I might leave that for later and just limit the SQLAlchemy version to <1.3.0 if nobody objects.
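Concretely, the cap would just be a version bound along these lines (the exact lower bound in wpull's setup.py may differ):

```python
# setup.py sketch: cap SQLAlchemy until the INDEXED BY support is reworked.
install_requires = [
    'sqlalchemy>=1.0,<1.3.0',
]
```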

Plugins are attached to existing HookableMixin instances as before. In addition,
they are also kept in an attribute of the HookableMixin class to allow attaching
when an instance is created after the plugin has been loaded.

This commit restores the plugin loading behaviour of version 1.2.3. It therefore
breaks backward compatibility in some cases; plugins written for versions 2.0.x
may need to be adapted.
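A rough sketch of that mechanism (heavily simplified; not the actual HookableMixin implementation):

```python
class HookableMixin:
    # Class-level registry of plugin attachment callbacks. The real code also
    # attaches plugins to instances that already exist; that part is omitted.
    _plugin_attachers = []

    def __init__(self):
        super().__init__()
        # Instances created after the plugin was loaded still get it attached.
        for attach in type(self)._plugin_attachers:
            attach(self)

    @classmethod
    def register_plugin(cls, attach):
        cls._plugin_attachers.append(attach)
```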
When URLRecord.__init__ is executed, it first calls super().__init__, which
resolves to URLPriorities's __init__ method. However, because URLPriorities.__init__
does not call super().__init__, the call chain never continues down the hierarchy,
so the fields of URLData and URLResult are never defined.

NB: There's still a bug in that URLRecord.database_attributes is incorrect, but
since that class is never used as a URLDatabaseMixin, that should not be a problem.
It might be worth overriding the database_items method though to prevent such
usage explicitly.
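This is the standard cooperative multiple inheritance pitfall; a self-contained illustration with simplified bodies (the field names here are made up):

```python
class URLResult:
    def __init__(self):
        super().__init__()
        self.status_code = None


class URLData:
    def __init__(self):
        super().__init__()
        self.content = None


class URLPriorities:
    def __init__(self):
        super().__init__()  # the fix: without this call, the chain stops here
        self.priority = 0


class URLRecord(URLPriorities, URLData, URLResult):
    def __init__(self):
        super().__init__()  # resolves to URLPriorities.__init__ first (MRO)


record = URLRecord()
assert hasattr(record, 'content') and hasattr(record, 'status_code')
```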
Version 0.99999999 (eight nines) and up do not provide the `html5lib.tokenizer` API used by wpull.
…#355)

Note that this is only implemented for the scrapers; wpull.url.URLInfo.parse should do the same thing.
As a side effect, this also makes wpull parse http://:@example.com:?@/ correctly;
the test suite previously expected this to throw an exception, but that URL is
perfectly parseable (although it does cause a validation error per the URL Standard).
According to that commit, there is a 200 ms delay on macOS when connecting to
localhost and only IPv4 is bound (which is what Tornado does). For that reason,
they replaced "localhost" with "127.0.0.1" in tornado.testing.AsyncHTTPTestCase.get_url.

Because wpull treats "localhost" and "127.0.0.1" as different hosts, this breaks
a variety of tests for different reasons (URLs getting skipped, cookies not set,
etc.) when using Tornado 4.5.3.

As a workaround, this commit restores the previous behaviour of using "localhost".
This may incur a performance penalty during testing on some platforms.
… hook

instead of copying, and add a note to the documentation that modifying the
objects passed into callbacks produces undefined behaviour.
Previously, wpull would first try all todos once, then retry all errors. However,
priorities are supposed to take precedence, i.e. first all URLs of the highest
priority should be retried until they are all either done or skipped, and only
then should URLs with the next lower priority start getting processed.

There is a backwards-incompatible change in the URLTable.check_out API: this
method no longer takes any parameters. Previously, it would get a status to filter
by, and there was also a level parameter (which was never used anywhere though),
and the function would be first called with status todo and, if there were no
todo URLs anymore, again with status error. This logic has been moved to the
URLTable, i.e. it is now the URLTable's duty to return items in that order.
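A sketch of the ordering the URLTable now has to produce, written as plain SQL in a Python string (the actual wpull query is built with SQLAlchemy, and the table and column names here are assumptions):

```python
# Highest priority first; within a priority, fresh 'todo' items before
# 'error' retries. Assumes larger priority values mean higher priority.
CHECK_OUT_SQL = '''
    SELECT id, url FROM urls
    WHERE status IN ('todo', 'error')
    ORDER BY priority DESC,
             CASE status WHEN 'todo' THEN 0 ELSE 1 END,
             id
    LIMIT 1
'''
```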
SQLite's query planner for some reason doesn't realise that it should use that index for the check_out query. Without the INDEXED BY, it's instead doing a full table scan for every check-out, which is obviously horrible for performance.
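To illustrate the mechanism with a standalone sqlite3 snippet (not wpull's actual schema or query): create the composite index, then compare the plan SQLite chooses with and without the INDEXED BY hint.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE urls (id INTEGER PRIMARY KEY, status TEXT, priority INTEGER);
    CREATE INDEX ix_urls_status_priority ON urls (status, priority, id);
''')

query = ("SELECT id FROM urls {hint} "
         "WHERE status = 'todo' ORDER BY priority DESC LIMIT 1")

for hint in ('', 'INDEXED BY ix_urls_status_priority'):
    plan = conn.execute('EXPLAIN QUERY PLAN ' + query.format(hint=hint)).fetchall()
    # The last column of each plan row names the strategy, e.g. a full
    # "SCAN TABLE urls" vs. "SEARCH TABLE urls USING INDEX ...". On the real
    # wpull table, the unhinted query was planned as a full table scan.
    print(hint or '(no hint)', '->', [row[-1] for row in plan])
```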