Fixes #166 #182

bytestream · 2020-05-12T12:20:23Z

This PR aims to resolve #166. It adds any missing <html>, <head>, and <body> tags to the input HTML so that fragments are parsed correctly.

There are several different solutions:

This class resolves cases that don't work correctly in the above.

It wouldn't mimic the DOMDocument behaviour described here: #166 (comment). That would require adding some context so that it knows <title> should go in <head>. I figure the input HTML is already invalid, so it's an edge case that would require a lot more computation to handle.
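
For illustration only, here is a minimal sketch of the kind of wrapping described above. It is not the Normalizer.php added in this PR; the function names `hasTag` and `wrapFragment` are hypothetical, and the tag detection is a naive regular expression rather than a tokenizer. Like the approach described, it makes no attempt to move elements such as <title> into <head>, and it returns partially structured input unchanged.

```php
<?php
// Hypothetical sketch; NOT the Normalizer.php from this PR.

/**
 * Naively report whether an opening tag such as <body ...> appears in the input.
 * The character class after the name keeps "<head" from matching "<header>".
 */
function hasTag(string $html, string $tag): bool
{
    return preg_match('/<' . preg_quote($tag, '/') . '[\s\/>]/i', $html) === 1;
}

/**
 * Wrap a bare fragment in <html>, an empty <head>, and <body>.
 * Input that already contains <html> or <head> is returned unchanged,
 * because placing its content correctly would need real parsing context.
 */
function wrapFragment(string $html): string
{
    if (hasTag($html, 'html') || hasTag($html, 'head')) {
        return $html;
    }

    if (!hasTag($html, 'body')) {
        $html = '<body>' . $html . '</body>';
    }

    return '<html><head></head>' . $html . '</html>';
}

echo wrapFragment('Hello<br>world'), PHP_EOL;
// prints: <html><head></head><body>Hello<br>world</body></html>
```

A regex check like this can be fooled by tag names inside comments, attribute values, or script content, which is one reason a string-level pre-pass is weaker than handling the missing elements inside the parser itself.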

What do you think?

mundschenk-at · 2020-05-12T12:24:43Z

src/HTML5.php

@@ -152,6 +156,10 @@ public function hasErrors()
     */
    public function parse($input, array $options = array())
    {
+        if (isset($options['normalize']) && $options['normalize']) {


if (! empty($options['normalize'])) { would be more concise.
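
To back up the suggestion, the short sketch below checks that the two conditions agree for a range of option values, including a missing key and falsy values such as '0'. The helper names `normalizeEnabledVerbose` and `normalizeEnabledConcise` are hypothetical and exist only for this comparison; they are not part of the library.

```php
<?php
// Hypothetical helpers for comparing the two forms of the check.

/** The condition as written in the diff. */
function normalizeEnabledVerbose(array $options): bool
{
    return isset($options['normalize']) && $options['normalize'];
}

/** The shorter condition suggested in the review. */
function normalizeEnabledConcise(array $options): bool
{
    return !empty($options['normalize']);
}

$cases = [
    [],                        // key missing
    ['normalize' => null],
    ['normalize' => false],
    ['normalize' => 0],
    ['normalize' => ''],
    ['normalize' => '0'],
    ['normalize' => true],
    ['normalize' => 1],
    ['normalize' => 'yes'],
];

foreach ($cases as $options) {
    if (normalizeEnabledVerbose($options) !== normalizeEnabledConcise($options)) {
        throw new RuntimeException('Checks disagree for ' . var_export($options, true));
    }
}
```

Both forms treat a missing key, null, and any falsy value as "off" without raising a notice, so the change is purely stylistic.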

bytestream · 2020-05-17T20:06:28Z

@goetas any thoughts on this?

goetas · 2020-05-20T08:58:26Z

Hi and thanks for working on this.

I'm generally against the feature for the following reasons:

This library is not about repairing HTML, but about parsing decent HTML5 content into a DOM document.
It is already possible to parse a partial HTML5 document by using the loadHTMLFragment method.

goetas · 2020-05-20T09:01:13Z

https://github.com/tgalopin/html-sanitizer is built on top of this library and is meant to sanitize external html

bytestream · 2020-05-20T09:29:25Z

What's the status of the issue marked as a bug, if you're against fixing it?

I also think there are so many people building their own implementations to work around this issue that it would benefit from a solution built into the library. That ensures collaboration and a higher-quality implementation of the workaround.

I don't think that html-sanitizer is relevant in the context of the bug.

bytestream · 2020-05-20T11:46:45Z

@goetas the quotes below are copied from the issue, so I really think this should be reopened.

A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.

A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.

However I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PR are welcome.

goetas · 2020-05-21T11:57:22Z

I understand.

Given that the following string is a valid HTML5 document:

 <html>Hello, This is a test.<br />Does it work this time?</html>

and given this comment: Invalid parsing result when head/body tag is missing #166 (comment)

What we need is to hook into some class in https://github.com/Masterminds/html5-php/tree/master/src/HTML5/Parser and add the body element on the fly if it is not present.

I'm not against solving this issue if the input is a valid HTML5 document.

@bytestream one of the reasons for rejecting this PR is also that it builds a parser/tokenizer within a library that already has a parser.

Added Normalizer.php

7dd5657

mundschenk-at reviewed May 12, 2020

goetas closed this May 20, 2020

goetas mentioned this pull request May 21, 2020

Invalid parsing result when head/body tag is missing #166

Open
