Fails to recurse on www.stevenholcomb.com with lxml #422

Open
JustAnotherArchivist opened this issue Mar 23, 2019 · 0 comments

ArchiveBot job e3pq9nd3o10nud4gctgm2nnz0 for http://www.stevenholcomb.com/ (viewer) failed to recurse. It only grabbed the homepage, robots.txt, sitemap.xml, and two (broken) URLs in the sitemap.

I was able to reproduce this with a simpler command, `wpull --recursive --level inf --no-verbose --html-parser libxml2-lxml http://www.stevenholcomb.com/`, on one of my pipelines running wpull 2.0.3. With html5lib as the parser, it recurses correctly.
With commit ec24bba (PR #393), however, I'm unable to reproduce it on another machine (different Python version, libraries, etc.). So this may already be fixed, but it needs further investigation.

The server's sending UTF-16LE-encoded HTML (without advertising it in a header), which might play a role in this.
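
To illustrate why an undeclared UTF-16LE body could plausibly stop link extraction, here is a minimal standard-library sketch. It does not use wpull or lxml and makes no claim about what either actually does internally; `LinkCollector`, `extract_links`, `looks_like_utf16le`, and the sample markup are all hypothetical names invented for this example. It only shows that when UTF-16LE bytes are decoded as a single-byte encoding, the interleaved NUL characters keep a tag-based parser from recognising any `<a href>`, while decoding them as UTF-16LE yields the link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(
                value for name, value in attrs if name == "href" and value
            )


def extract_links(text):
    """Parse already-decoded HTML text and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


def looks_like_utf16le(data, threshold=0.4):
    """Crude heuristic (an assumption, not wpull's logic): mostly-ASCII text
    in UTF-16LE has a zero byte at most odd offsets."""
    odd_bytes = data[1::2]
    if not odd_bytes:
        return False
    return odd_bytes.count(0) / len(odd_bytes) >= threshold


# Made-up sample page, encoded as UTF-16LE without a BOM or charset header.
sample = '<html><body><a href="/about">About</a></body></html>'
raw = sample.encode("utf-16-le")

# Decoded as a single-byte encoding: every character is followed by NUL,
# so no start tag is recognised and no links are found.
links_wrong = extract_links(raw.decode("latin-1"))

# Decoded with the right encoding: the link is found.
encoding = "utf-16-le" if looks_like_utf16le(raw) else "latin-1"
links_right = extract_links(raw.decode(encoding))

print(links_wrong, links_right)
```

Whether this is what happens in the lxml code path on the affected pipeline is unverified; the sketch only makes the suspected mechanism concrete.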
