Parser removes the single < (less-than) character from the given HTML string #250

touhidurabir · 2024-07-05T12:35:01Z

When parsing an HTML string that contains a single bare < character, the parser drops it from the returned value. For example:

<?php

use Masterminds\HTML5;

$html = '<img src="invalid-url" onerror="alert(\'XSS Attack prefix\')" /> 2 > 1 & 3 < 5 and some more text';

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

The output of $html5->saveHTML($dom) should be:

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3 &lt; 5 and some more text</html>

but instead it returns:

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3  5 and some more text</html>

Note the missing &lt;, the encoded form of the < character.

This is a continuation of symfony/symfony#57597, where the problem affects the sanitization process of html-sanitizer.

bsweeney · 2024-07-05T19:52:04Z

In an earlier investigation I noted that a small tweak to the library seems to fix the issue, though I haven't fully tested the change. From comments elsewhere:

...there does not appear to be any in-built way to avoid dropping the < character because the parser immediately flushes the buffer and moves to the next character upon encountering it (ref).

I did find that a slight tweak to the tokenizer captures the < and correctly encodes it in the output. It only requires that two lines be added following the line raising the syntax error in the HTML5 Tokenizer:
$this->scanner->unconsume();
$this->text($this->scanner->current());

goetas · 2024-07-17T13:44:04Z

Hi, if the solution mentioned there works, can you provide a PR?

bsweeney · 2024-07-21T15:18:42Z

I have not fully tested the proposed solution (and, honestly, only have a high-level understanding of the code), so I can't say whether there are any potential ill effects. But I'm happy to put together a PR.

benoit-waldmann · 2024-08-05T11:22:22Z

Hi, did you find a solution? Thanks.

Per the spec:

> Parse error. Switch to the data state. Emit a U+003C LESS-THAN SIGN character token. Reconsume the current input character.

https://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#tag-open-state

fixes Masterminds#250
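
For readers unfamiliar with the tag open state, here is a minimal, self-contained sketch of the rule quoted from the spec. It is a toy and not the Masterminds\HTML5 tokenizer: the function name toyTokenize() and the token shape are hypothetical, and it only recognizes simple start tags. When a < is not followed by a letter, the < is kept as character data and the following character is processed again as ordinary data.

```php
<?php

// Toy illustration only. This is NOT the Masterminds\HTML5 tokenizer;
// toyTokenize() and the ['type', 'value'] token shape are hypothetical.
function toyTokenize(string $input): array
{
    $tokens = [];
    $text = '';
    $len = strlen($input);
    $i = 0;

    while ($i < $len) {
        $ch = $input[$i];

        // Data state: anything other than "<" is plain character data.
        if ($ch !== '<') {
            $text .= $ch;
            $i++;
            continue;
        }

        // Tag open state: inspect the character after "<".
        $next = $i + 1 < $len ? $input[$i + 1] : '';

        if (preg_match('/^[A-Za-z]$/', $next) === 1) {
            // Looks like a start tag: flush pending text, then read up to ">".
            if ($text !== '') {
                $tokens[] = ['text', $text];
                $text = '';
            }
            $end = strpos($input, '>', $i);
            if ($end === false) {
                $end = $len - 1; // unterminated tag: take the rest of the input
            }
            $tokens[] = ['tag', substr($input, $i, $end - $i + 1)];
            $i = $end + 1;
            continue;
        }

        // Parse error: emit "<" as character data instead of dropping it.
        // Advancing by one means the next character is reconsumed in the
        // data state on the following loop iteration.
        $text .= '<';
        $i++;
    }

    if ($text !== '') {
        $tokens[] = ['text', $text];
    }

    return $tokens;
}

$tokens = toyTokenize('<img src="a"> 3 < 5');

// Escaping the text token on output keeps the character as an entity.
$escaped = htmlspecialchars($tokens[1][1]);
```

In this sketch the bare < survives as part of the text token, so escaping on output can turn it into an entity rather than losing it.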

bsweeney · 2024-08-07T04:09:20Z

I created a PR with a change I think will address this issue with regard to the data state (content parsing).

There is a similar problem with tag/attribute parsing, though I think that could reasonably be addressed through a separate issue.

touhidurabir mentioned this issue Jul 5, 2024

Parser remove the single < (less than) character #249

Closed

bsweeney mentioned this issue Jul 5, 2024

Unescaped Less-than character no longer handled by HTML5 parser dompdf/dompdf#3273

Open

touhidurabir mentioned this issue Jul 12, 2024

Replace HTMLPurifier with a maintained dependency pkp/pkp-lib#7916

Closed

bsweeney linked a pull request Aug 7, 2024 that will close this issue

Continue parsing as data after invalid opening tag #253

Open
