
How could html5lib be made faster? #314

Closed
tbodt opened this issue Feb 5, 2017 · 13 comments

@tbodt

tbodt commented Feb 5, 2017

html5lib is nice, but it's pretty slow. On a fairly large test file, lxml took 50ms and html5lib took 5 seconds, which is 100 times slower.

Are there any particularly slow parts of html5lib that could be optimized? Would compiling it with Cython help?
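
For context, here is a rough sketch of how such a comparison might be timed, assuming a local `page.html` test file (the actual timings above came from the reporter's own file):

```python
# Rough timing comparison of lxml vs. html5lib on the same document.
# Assumes a local "page.html" test file; numbers will vary by input.
import time

import html5lib
import lxml.html

with open("page.html", "rb") as f:
    data = f.read()

start = time.perf_counter()
lxml.html.fromstring(data)
print("lxml:     %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
html5lib.parse(data)
print("html5lib: %.3fs" % (time.perf_counter() - start))
```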

@gsnedders
Member

In general terms, most of the low-hanging fruit was picked years ago.

What's left are things that only become possible with heavy refactoring (though the gains are hard to predict in some cases) and compiling parts with Cython (though to get decent gains there you'll likely have to refactor to remove some new bottlenecks, I think).

@tbodt
Author

tbodt commented Feb 6, 2017

I did a profile on the input '<tag></tag>' * 1000000. Here are the top few lines of profiling output, which account for 99.7 percent of execution time:

 2000001    7.635    0.000   33.713    0.000 _tokenizer.py:50(__iter__)
 1000003    7.330    0.000    7.767    0.000 treebuilders/etree.py:63(_setAttributes)
       1    7.213    7.213   76.995   76.995 html5parser.py:152(mainLoop)
 6000000    6.995    0.000   15.973    0.000 _tokenizer.py:421(tagNameState)
11000001    6.556    0.000    7.475    0.000 _inputstream.py:240(char)
 1000002    3.641    0.000   17.209    0.000 treebuilders/base.py:292(insertElementNormal)
 1000000    3.468    0.000    6.190    0.000 html5parser.py:1578(endTagOther)
 2000000    3.420    0.000    3.737    0.000 html5parser.py:264(normalizeToken)
 2000000    3.218    0.000    5.107    0.000 _tokenizer.py:223(emitCurrentToken)
 1000005    2.503    0.000    3.606    0.000 treebuilders/etree.py:22(__init__)
 2000000    2.453    0.000    3.730    0.000 _tokenizer.py:362(tagOpenState)
 2000001    2.017    0.000    3.722    0.000 _tokenizer.py:243(dataState)
 1000000    1.832    0.000    2.002    0.000 treebuilders/base.py:359(generateImpliedEndTags)
 2000001    1.730    0.000   39.180    0.000 html5parser.py:219(normalizedTokens)
 1000000    1.719    0.000    2.342    0.000 _tokenizer.py:397(closeTagOpenState)
 2000003    1.710    0.000    1.710    0.000 _utils.py:67(__getitem__)
 2000000    1.564    0.000    1.564    0.000 {method 'translate' of 'str' objects}
 1000000    1.340    0.000    8.422    0.000 html5parser.py:423(processEndTag)
 1000003    1.331    0.000   20.941    0.000 html5parser.py:410(processStartTag)
 1000000    1.304    0.000   18.792    0.000 html5parser.py:1296(startTagOther)
 1000005    1.103    0.000    1.103    0.000 treebuilders/etree.py:35(_getETreeTag)
 5000006    1.093    0.000    1.093    0.000 treebuilders/etree.py:46(_getName)
 1000003    1.064    0.000    1.591    0.000 treebuilders/etree.py:92(appendChild)
    1075    0.761    0.001    0.761    0.001 {method 'findall' of '_sre.SRE_Pattern' objects}
 4009862    0.609    0.000    0.609    0.000 {built-in method builtins.len}
 2000003    0.469    0.000    0.469    0.000 treebuilders/etree.py:55(_getNamespace)
 2003626    0.339    0.000    0.339    0.000 {method 'append' of 'list' objects}
 2000000    0.326    0.000    0.326    0.000 {method 'append' of 'collections.deque' objects}
 1000003    0.314    0.000    0.314    0.000 {method 'append' of 'xml.etree.ElementTree.Element' objects}
 2000000    0.311    0.000    0.311    0.000 {method 'popleft' of 'collections.deque' objects}
 1000000    0.279    0.000    0.279    0.000 treebuilders/base.py:187(reconstructActiveFormattingElements)
 1000001    0.276    0.000    0.276    0.000 {method 'pop' of 'list' objects}
 1004853    0.261    0.000    0.261    0.000 {built-in method builtins.isinstance}
 1000015    0.253    0.000    0.253    0.000 {method 'keys' of 'dict' objects}
 1000244    0.219    0.000    0.219    0.000 {method 'get' of 'dict' objects}
 1000155    0.184    0.000    0.184    0.000 {method 'items' of 'collections.OrderedDict' objects}
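
For reference, a minimal sketch of how a profile like this could be reproduced with the standard-library cProfile; sorting by "tottime" to match the listing above is an assumption:

```python
# Reproduce a profile similar to the one above using cProfile,
# parsing the same synthetic input of one million empty tags.
import cProfile
import pstats

import html5lib

source = "<tag></tag>" * 1000000

profiler = cProfile.Profile()
profiler.enable()
html5lib.parse(source)
profiler.disable()

# Sort by internal time ("tottime"), matching the columns shown above.
pstats.Stats(profiler).sort_stats("tottime").print_stats(35)
```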

Observations:

  • Not quite sure what to make of the fact that 7 seconds are spent inside HTMLTokenizer.__iter__. It doesn't seem to have very much to do beyond calling out to the state functions that do most of the work. Maybe it would help to cythonize this?

  • 7 seconds spent inside HTMLParser.mainLoop also seems a bit strange. I haven't done a whole lot of digging into that.

  • Another 7 seconds is spent inside HTMLTokenizer.tagNameState. Thinking that could also be flattened/cythonized.

  • The tokenizer represents tokens as dictionaries. If a Token class were created instead, it could be cythonized so that adding tokens to the queue could be done without calling anything (see the sketch after this list).

  • Yet another 7 seconds is spent in HTMLInputStream.char. It's pretty damn simple, so a good candidate for cythonization. (if that's a word)
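
As an illustration of the Token-class idea above, a hypothetical slotted class might look like the following. This is not html5lib's actual token format, just a sketch of what could replace the per-token dicts and later be compiled by Cython:

```python
# Hypothetical slotted Token class, as a stand-in for the dicts the
# tokenizer currently emits. Under Cython, a class like this could
# become a cdef class, making attribute access a C-level field lookup.
class Token:
    __slots__ = ("type", "name", "data", "self_closing")

    def __init__(self, type, name=None, data=None, self_closing=False):
        self.type = type
        self.name = name
        self.data = data if data is not None else []
        self.self_closing = self_closing

# e.g. instead of appending {"type": ..., "name": "tag", "data": []}
# to its queue, the tokenizer would append Token(START_TAG, "tag").
```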

Anyone have anything to add?

@jgraham
Member

jgraham commented Feb 6, 2017

Honestly, if you are going to take on a dependency like Cython, a much more effective approach would be to write Python bindings for html5ever. That would give you lxml-comparable speed with equal or better spec conformance compared to html5lib.

@tbodt
Author

tbodt commented Feb 6, 2017

@jgraham I did not know about html5ever. That's totally what I'm going to be using instead of html5lib. Thanks a lot!

@tbodt tbodt closed this as completed Feb 6, 2017
@gsnedders
Member

@tbodt the "oddly" small functions all spend the majority of their time doing dict lookups, and Cython won't help that.

Gumbo has some means of building a Python tree through html5lib's treebuilders, but that's probably still needlessly expensive given the cost of constructing the tree in Python (especially versus lxml and cElementTree, which don't iterate through the tree in Python code at all). That might make it better than html5ever, but it doesn't support everything html5lib does (non-UTF-8 encodings are a big one), which html5ever has some support for.

@gsnedders
Member

Also, to note: on some html5lib benchmarks, Python 3.7 is 25% quicker than Python 2.7.

@jayaddison
Contributor

Is there a recommended way to profile html5lib? The benchmarks are useful for confirming whether a change/refactor has made an improvement, but it's tricky to figure out where to focus without a recommended profiler.

@gsnedders
Member

> Is there a recommended way to profile html5lib? The benchmarks are useful for confirming whether a change/refactor has made an improvement, but it's tricky to figure out where to focus without a recommended profiler.

For #493 I used a mixture of vmprof and py-spy, both of which are sampling profilers, and did some profiling based on the things in the benchmarks directory and some by profiling entire applications that use html5lib (which, e.g., showed up the cost of constructing the parser objects).
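
As a sketch of the kind of measurement that exposes the parser-construction cost mentioned here, one could compare building a fresh HTMLParser per document against reusing one; the document and iteration counts below are arbitrary assumptions:

```python
# Compare constructing a new HTMLParser per document against reusing
# a single parser instance -- the kind of cost that only shows up when
# profiling whole applications rather than micro-benchmarks.
import timeit

import html5lib

doc = "<p>hello</p>" * 100

fresh = timeit.timeit(
    "html5lib.HTMLParser().parse(doc)",
    globals={"html5lib": html5lib, "doc": doc},
    number=200,
)

parser = html5lib.HTMLParser()
reused = timeit.timeit(
    "parser.parse(doc)",
    globals={"parser": parser, "doc": doc},
    number=200,
)

print("fresh parser each time: %.3fs" % fresh)
print("parser reused:          %.3fs" % reused)
```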

@jayaddison
Contributor

Brilliant, thanks @gsnedders. Unlikely I'll find anything that hasn't been examined already, but keen to figure out a good process to investigate with.

@gsnedders
Member

@jayaddison There are certainly plenty of places with potential gains: there's a bunch of places around SVG and MathML in HTML where we could do things quicker, but the comparative rarity of that content means it's never been a priority. Most of my attention in recent years has been kinda focused on the parser and etree treebuilder, and anything outside of that probably hasn't had much attention in a while. That all said, I hope to land something along the lines of #445, using Cython in pure Python mode for the input stream and tokenizer, which should give a pretty significant speed-up.
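
As an illustration only: Cython's pure Python mode keeps a module importable as ordinary Python while gaining C-typed fields when compiled. A hypothetical character-reading class in that style (not html5lib's actual input stream, and it needs the Cython package installed for the `cython` shadow module even when uncompiled) might look like this:

```python
# Hypothetical input-stream fragment in Cython "pure Python" mode.
# Uncompiled, the cython decorators are no-ops and this runs as plain
# Python; compiled, the class becomes a cdef class with typed fields.
import cython


@cython.cclass
class CharStream:
    data: str
    pos: cython.Py_ssize_t

    def __init__(self, data: str) -> None:
        self.data = data
        self.pos = 0

    @cython.ccall
    def char(self):
        # Return the next character, or None at end of input.
        if self.pos >= len(self.data):
            return None
        c = self.data[self.pos]
        self.pos += 1
        return c
```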

@jayaddison
Contributor

Anything I could use a few spare cycles to help with there? (I saw #24 and figured that might have decent impact, but as I'm fairly new to the codebase it might take a little while)

@gsnedders
Member

I suspect #24 won't make much impact until it's using Cython. Also, given the parser needs to be updated to be up-to-date with the spec, there's probably not too much point in spending effort there. But no, there's not a huge amount worth doing IMO.

@jayaddison
Contributor

Okie doke. Glad to help out with anything performance-related if & when that'd be useful.
