How could html5lib be made faster? #314
Comments
In general terms, most of the low-hanging fruit was picked years ago. What's left is things that become possible with heavy refactoring (though the gains are hard to predict in some cases) and compiling parts with Cython (though to get decent gains there you'll likely have to refactor to remove some new bottlenecks, I think). |
I did a profile on the input
Observations:
Anyone have anything to add? |
Honestly if you are going to have a dependency like Cython a much, much more effective approach would be to write python bindings for html5ever. That will give you lxml-comparable speed with equal or better spec conformance compared to html5lib. |
@jgraham I did not know about html5ever. That's totally what I'm going to be using instead of html5lib. Thanks a lot! |
@tbodt the "oddly" small functions all spend the majority of their time doing dict lookups, and Cython won't help that. Gumbo has some means of building a Python tree through html5lib's treebuilders, but that's probably still needlessly expensive given the cost of constructing the tree (especially versus lxml and cElementTree, which don't iterate through the tree in any way in Python code). That might make it better than html5ever, but it doesn't support everything html5lib does (non-UTF-8 encodings are a big one), which html5ever has some support for. |
Also, to note: on some html5lib benchmarks, Python 3.7 is 25% quicker than Python 2.7. |
Is there a recommended way to profile |
For #493 I used a mixture of vmprof and py-spy, both of which are sampling profilers. Some of the profiling was based on the things in the benchmarks directory, and some came from profiling entire applications using html5lib (which, e.g., showed up the cost of constructing the parser objects). |
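As a rough illustration of profiling a whole parse run, here is a minimal sketch using the standard library's `cProfile`. Note that `cProfile` is a deterministic profiler, not a sampling one like vmprof or py-spy; it is swapped in here only because it needs no installation. The `profile_call` helper, the `stdlib_parse` stand-in, and the sample input are all hypothetical. The parser under test is passed in as a callable, so `html5lib.parse` could be substituted where html5lib is installed.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text).

    Hypothetical helper for illustration; not part of html5lib.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def stdlib_parse(markup):
    """Stand-in parser so the sketch runs without html5lib installed."""
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()
    return len(markup)


# Made-up input; a real investigation would use a representative document.
SAMPLE = "<p>hello <b>world</b></p>" * 2000

if __name__ == "__main__":
    _, report = profile_call(stdlib_parse, SAMPLE)
    print(report)
```

Sorting by cumulative time tends to surface costs such as object construction that sit above the hot inner loops, though a sampling profiler will usually distort timings less.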
Brilliant, thanks @gsnedders. Unlikely I'll find anything that hasn't been examined already, but keen to figure out a good process to investigate with. |
@jayaddison There's certainly plenty of places where there are potential gains: there's a bunch of places around SVG and MathML in HTML where we could do things quicker, but the comparative rarity of it means it's never been a priority; most of my attention in recent years has been kinda focused on the parser and etree treebuilder, and anything outside of that probably hasn't got much interest in a while. That all said, I hope to land something along the lines of #445, using Cython in Pure Python mode for the input stream and tokenizer, which should give a pretty significant speed-up. |
Anything I could use a few spare cycles to help with there? (I saw #24 and figured that might have decent impact, but as I'm fairly new to the codebase it might take a little while) |
#24 won't make much impact till it's using Cython, I suspect? Also, when the parser needs to be updated to be up-to-date with the spec, there's probably not too much point in spending effort there. But no, there's not a huge amount worthwhile doing IMO? |
Okie doke. Glad to help out with anything performance-related if & when that'd be useful. |
html5lib is nice, but it's pretty slow. On a fairly large test file, lxml took 50ms and html5lib took 5 seconds, which is 100 times slower.
Are there any particularly slow parts of html5lib that could be optimized? Would compiling it with Cython help?
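A minimal sketch of how such a comparison could be timed, assuming each parser is wrapped in a callable. The `best_time` and `slowdown` helpers are hypothetical names, and the last line only reproduces the arithmetic for the figures quoted above (5 seconds versus 50 ms).

```python
import time


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def slowdown(slow_seconds, fast_seconds):
    """How many times slower the first measurement is than the second."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Figures from the report above: html5lib 5 s, lxml 50 ms.
    print(slowdown(5.0, 0.050))
```

Taking the minimum over a few repeats is one common way to reduce noise from other processes; it is a design choice here, not something the thread prescribes.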