
How could html5lib be made faster? #314

Closed
tbodt opened this issue Feb 5, 2017 · 13 comments

@tbodt

tbodt commented Feb 5, 2017

html5lib is nice, but it's pretty slow. On a fairly large test file, lxml took 50ms and html5lib took 5 seconds, which is 100 times slower.

Are there any particularly slow parts of html5lib that could be optimized? Would compiling it with Cython help?
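
For context, here is a rough sketch of how such a comparison might be timed, assuming a local `page.html` test file (the actual timings above came from the reporter's own file):

```python
# Rough timing comparison of lxml vs. html5lib on the same document.
# Assumes a local "page.html" test file; numbers will vary by input.
import time

import html5lib
import lxml.html

with open("page.html", "rb") as f:
    data = f.read()

start = time.perf_counter()
lxml.html.fromstring(data)
print("lxml:     %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
html5lib.parse(data)
print("html5lib: %.3fs" % (time.perf_counter() - start))
```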

@gsnedders
Member

In general terms, most of the low-hanging fruit was picked years ago.

What's left are things that only become possible with heavy refactoring (though the gains are hard to predict in some cases) and compiling parts with Cython (though to get decent gains there you'll likely have to refactor to remove some new bottlenecks, I think).

@tbodt
Author

tbodt commented Feb 6, 2017

I did a profile on the input '<tag></tag>' * 1000000. Here are the top few lines of profiling output, which account for 99.7 percent of execution time:

 2000001    7.635    0.000   33.713    0.000 _tokenizer.py:50(__iter__)
 1000003    7.330    0.000    7.767    0.000 treebuilders/etree.py:63(_setAttributes)
       1    7.213    7.213   76.995   76.995 html5parser.py:152(mainLoop)
 6000000    6.995    0.000   15.973    0.000 _tokenizer.py:421(tagNameState)
11000001    6.556    0.000    7.475    0.000 _inputstream.py:240(char)
 1000002    3.641    0.000   17.209    0.000 treebuilders/base.py:292(insertElementNormal)
 1000000    3.468    0.000    6.190    0.000 html5parser.py:1578(endTagOther)
 2000000    3.420    0.000    3.737    0.000 html5parser.py:264(normalizeToken)
 2000000    3.218    0.000    5.107    0.000 _tokenizer.py:223(emitCurrentToken)
 1000005    2.503    0.000    3.606    0.000 treebuilders/etree.py:22(__init__)
 2000000    2.453    0.000    3.730    0.000 _tokenizer.py:362(tagOpenState)
 2000001    2.017    0.000    3.722    0.000 _tokenizer.py:243(dataState)
 1000000    1.832    0.000    2.002    0.000 treebuilders/base.py:359(generateImpliedEndTags)
 2000001    1.730    0.000   39.180    0.000 html5parser.py:219(normalizedTokens)
 1000000    1.719    0.000    2.342    0.000 _tokenizer.py:397(closeTagOpenState)
 2000003    1.710    0.000    1.710    0.000 _utils.py:67(__getitem__)
 2000000    1.564    0.000    1.564    0.000 {method 'translate' of 'str' objects}
 1000000    1.340    0.000    8.422    0.000 html5parser.py:423(processEndTag)
 1000003    1.331    0.000   20.941    0.000 html5parser.py:410(processStartTag)
 1000000    1.304    0.000   18.792    0.000 html5parser.py:1296(startTagOther)
 1000005    1.103    0.000    1.103    0.000 treebuilders/etree.py:35(_getETreeTag)
 5000006    1.093    0.000    1.093    0.000 treebuilders/etree.py:46(_getName)
 1000003    1.064    0.000    1.591    0.000 treebuilders/etree.py:92(appendChild)
    1075    0.761    0.001    0.761    0.001 {method 'findall' of '_sre.SRE_Pattern' objects}
 4009862    0.609    0.000    0.609    0.000 {built-in method builtins.len}
 2000003    0.469    0.000    0.469    0.000 treebuilders/etree.py:55(_getNamespace)
 2003626    0.339    0.000    0.339    0.000 {method 'append' of 'list' objects}
 2000000    0.326    0.000    0.326    0.000 {method 'append' of 'collections.deque' objects}
 1000003    0.314    0.000    0.314    0.000 {method 'append' of 'xml.etree.ElementTree.Element' objects}
 2000000    0.311    0.000    0.311    0.000 {method 'popleft' of 'collections.deque' objects}
 1000000    0.279    0.000    0.279    0.000 treebuilders/base.py:187(reconstructActiveFormattingElements)
 1000001    0.276    0.000    0.276    0.000 {method 'pop' of 'list' objects}
 1004853    0.261    0.000    0.261    0.000 {built-in method builtins.isinstance}
 1000015    0.253    0.000    0.253    0.000 {method 'keys' of 'dict' objects}
 1000244    0.219    0.000    0.219    0.000 {method 'get' of 'dict' objects}
 1000155    0.184    0.000    0.184    0.000 {method 'items' of 'collections.OrderedDict' objects}
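
For reference, a minimal sketch of how a profile like this could be reproduced with the standard-library cProfile; sorting by "tottime" to match the listing above is an assumption:

```python
# Reproduce a profile similar to the one above using cProfile,
# parsing the same synthetic input of one million empty tags.
import cProfile
import pstats

import html5lib

source = "<tag></tag>" * 1000000

profiler = cProfile.Profile()
profiler.enable()
html5lib.parse(source)
profiler.disable()

# Sort by internal time ("tottime"), matching the columns shown above.
pstats.Stats(profiler).sort_stats("tottime").print_stats(35)
```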

Observations:

  • Not quite sure what to make of the fact that 7 seconds are spent inside HTMLTokenizer.__iter__. It doesn't seem to have very much to do beyond calling out to the state functions that do most of the work. Maybe it would help to cythonize this?

  • 7 seconds spent inside HTMLParser.mainLoop also seems a bit strange. I haven't done a whole lot of digging into that.

  • Another 7 seconds is spent inside HTMLTokenizer.tagNameState. Thinking that could also be flattened/cythonized.

  • The tokenizer represents tokens as dictionaries. If a Token class were created instead, it could be cythonized so that adding tokens to the queue could be done without calling anything (see the sketch after this list).

  • Yet another 7 seconds is spent in HTMLInputStream.char. It's pretty damn simple, so a good candidate for cythonization. (if that's a word)
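
As an illustration of the Token-class idea above, a hypothetical slotted class might look like the following. This is not html5lib's actual token format, just a sketch of what could replace the per-token dicts and later be compiled by Cython:

```python
# Hypothetical slotted Token class, as a stand-in for the dicts the
# tokenizer currently emits. Under Cython, a class like this could
# become a cdef class, making attribute access a C-level field lookup.
class Token:
    __slots__ = ("type", "name", "data", "self_closing")

    def __init__(self, type, name=None, data=None, self_closing=False):
        self.type = type
        self.name = name
        self.data = data if data is not None else []
        self.self_closing = self_closing

# e.g. instead of appending {"type": ..., "name": "tag", "data": []}
# to its queue, the tokenizer would append Token(START_TAG, "tag").
```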

Anyone have anything to add?

@jgraham
Member

jgraham commented Feb 6, 2017

Honestly, if you are going to take on a dependency like Cython, a much more effective approach would be to write Python bindings for html5ever. That would give you lxml-comparable speed with equal or better spec conformance compared to html5lib.

@tbodt
Author

tbodt commented Feb 6, 2017

@jgraham I did not know about html5ever. That's totally what I'm going to be using instead of html5lib. Thanks a lot!

@tbodt tbodt closed this as completed Feb 6, 2017
@gsnedders
Member

@tbodt the "oddly" small functions all spend the majority of their time doing dict lookups, and Cython won't help that.

Gumbo has some means of building a Python tree through html5lib's treebuilders, but that's probably still needlessly expensive given the cost of constructing the tree in Python (especially versus lxml and cElementTree, which don't iterate through the tree in Python code at all). That might make it better than html5ever, but it doesn't support everything html5lib does (non-UTF-8 encodings are a big one), which html5ever has some support for.

@gsnedders
Member

Also, to note: on some html5lib benchmarks, Python 3.7 is 25% quicker than Python 2.7.

@jayaddison
Contributor

Is there a recommended way to profile html5lib? The benchmarks are useful for confirming whether a change/refactor has made an improvement, but it's tricky to figure out where to focus without a recommended profiler.

@gsnedders
Member

> Is there a recommended way to profile html5lib? The benchmarks are useful for confirming whether a change/refactor has made an improvement, but it's tricky to figure out where to focus without a recommended profiler.

For #493 I used a mixture of vmprof and py-spy, both of which are sampling profilers, and did some profiling based on the things in the benchmarks directory and some by profiling entire applications that use html5lib (which, e.g., showed up the cost of constructing the parser objects).
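
As a sketch of the kind of measurement that exposes the parser-construction cost mentioned here, one could compare building a fresh HTMLParser per document against reusing one; the document and iteration counts below are arbitrary assumptions:

```python
# Compare constructing a new HTMLParser per document against reusing
# a single parser instance -- the kind of cost that only shows up when
# profiling whole applications rather than micro-benchmarks.
import timeit

import html5lib

doc = "<p>hello</p>" * 100

fresh = timeit.timeit(
    "html5lib.HTMLParser().parse(doc)",
    globals={"html5lib": html5lib, "doc": doc},
    number=200,
)

parser = html5lib.HTMLParser()
reused = timeit.timeit(
    "parser.parse(doc)",
    globals={"parser": parser, "doc": doc},
    number=200,
)

print("fresh parser each time: %.3fs" % fresh)
print("parser reused:          %.3fs" % reused)
```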

@jayaddison
Contributor

Brilliant, thanks @gsnedders. Unlikely I'll find anything that hasn't been examined already, but keen to figure out a good process to investigate with.

@gsnedders
Member

@jayaddison There are certainly plenty of places with potential gains: there's a bunch of places around SVG and MathML in HTML where we could do things quicker, but the comparative rarity of that content means it's never been a priority. Most of my attention in recent years has been kinda focused on the parser and etree treebuilder, and anything outside of that probably hasn't had much attention in a while. That all said, I hope to land something along the lines of #445, using Cython in pure Python mode for the input stream and tokenizer, which should give a pretty significant speed-up.
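
As an illustration only: Cython's pure Python mode keeps a module importable as ordinary Python while gaining C-typed fields when compiled. A hypothetical character-reading class in that style (not html5lib's actual input stream, and it needs the Cython package installed for the `cython` shadow module even when uncompiled) might look like this:

```python
# Hypothetical input-stream fragment in Cython "pure Python" mode.
# Uncompiled, the cython decorators are no-ops and this runs as plain
# Python; compiled, the class becomes a cdef class with typed fields.
import cython


@cython.cclass
class CharStream:
    data: str
    pos: cython.Py_ssize_t

    def __init__(self, data: str) -> None:
        self.data = data
        self.pos = 0

    @cython.ccall
    def char(self):
        # Return the next character, or None at end of input.
        if self.pos >= len(self.data):
            return None
        c = self.data[self.pos]
        self.pos += 1
        return c
```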

@jayaddison
Contributor

Anything I could use a few spare cycles to help with there? (I saw #24 and figured that might have decent impact, but as I'm fairly new to the codebase it might take a little while)

@gsnedders
Member

I suspect #24 won't make much impact until it's using Cython. Also, given the parser needs to be updated to be up-to-date with the spec, there's probably not too much point in spending effort there. But no, there's not a huge amount worth doing IMO.

@jayaddison
Contributor

Okie doke. Glad to help out with anything performance-related if & when that'd be useful.
