Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade html5lib to 1.0.1 #421

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

PromyLOPh
Copy link
Contributor

The tokenizer module was never part of html5lib’s public interface and has consequently been made private in recent versions. This breaks wpull’s HTML scraper. This pull request contains a minimally invasive change, adding support for html5lib=1.0.1. A proper solution would extend TreeWalker to yield wpull’s tag stream directly.

Abusing html5lib to parse XML/RSS is not possible any more and yields incorrect tag streams (tag text is considered the tail of the closing tag). Thus, this change breaks test_rss_as_html and possibly other tests.

Please do not merge this proof of concept until the issues mentioned above are resolved.

Tokenizer was never part of the official API. Minimal changes to get
wpull running with html5lib==1.0.1.
@JustAnotherArchivist
Copy link
Contributor

How does this affect the performance? I believe the main reason for using only the tokenizer was to avoid building a tree, which isn't needed for wpull's link extraction purposes. html5lib is already pretty slow, so I think we should avoid making it even slower.

@PromyLOPh
Copy link
Contributor Author

I would not worry about that right now. The bigger issue is imo that this change breaks XML/RSS parsing and that the parsing code itself is quite fragile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants