Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser remove the single < (less than) character from given html string #250

Open
touhidurabir opened this issue Jul 5, 2024 · 5 comments · May be fixed by #253
Open

Parser remove the single < (less than) character from given html string #250

touhidurabir opened this issue Jul 5, 2024 · 5 comments · May be fixed by #253

Comments

@touhidurabir
Copy link

When parsing a html string with single use of <, it removes it from the parsed value that being returned . For example

<?php

use Masterminds\HTML5;

$html = '<img src="invalid-url" onerror="alert(\'XSS Attack prefix\')" /> 2 > 1 & 3 < 5 and some more text';

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

the print of $html5->saveHTML($dom) should return as

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3 &lt; 5 and some more text</html>

but instead it return as

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3  5 and some more text</html>

see the missing encoded &lt; of < character .

This is a continuation of symfony/symfony#57597 where it is impacting the sanitization process of html-sanitizer

@bsweeney
Copy link

bsweeney commented Jul 5, 2024

In an earlier investigation I noted that a small tweak to the library seems to fix the issue, though I haven't fully tested the change. From comments elsewhere:

...there does not appear to be any in-built way to avoid dropping the < character because the parser immediately flushes the buffer and moves to the next character upon encountering it (ref).

I did find that a slight tweak to the tokenizer captures the < and correctly encodes it in the output. It only requires that two lines be added following the line raising the syntax error in the HTML5 Tokenizer:

$this->scanner->unconsume();
$this->text($this->scanner->current());

@goetas
Copy link
Member

goetas commented Jul 17, 2024

hi, if the solution mentioned there works, can you provide a PR?

@bsweeney
Copy link

I have not fully tested the proposed solution (and, honestly, only have a high-level understanding of the code) so can't say if there are any potential ill effects. But I'm happy to put together a PR.

@benoit-waldmann
Copy link

Hi, do you find a solution ? Tks

bsweeney added a commit to bsweeney/html5-php that referenced this issue Aug 7, 2024
Per the spec:

> Parse error. Switch to the data state. Emit a U+003C LESS-THAN SIGN character token. Reconsume the current input character.

https://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#tag-open-state

fixes Masterminds#250
@bsweeney bsweeney linked a pull request Aug 7, 2024 that will close this issue
@bsweeney
Copy link

bsweeney commented Aug 7, 2024

I created a PR with a change I think will address this issue with regard to the data state (content parsing).

There is a similar problem with tag/attribute parsing, though I think that could reasonably be addressed through a separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants