Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Open
Llbe opened this issue Dec 8, 2016 · 15 comments

Comments

@Llbe
Copy link

Llbe commented Dec 8, 2016

$urlLinker = new Youthweb\UrlLinker\UrlLinker();

var_dump ($urlLinker->linkUrlsInTrustedHtml ("http://foo.bar/baz?a=b&a=b"));

The result here is
string(56) "<a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&amp;a=b"

The expected result is
string(56) "<a href="http://foo.bar/baz?a=b&amp;a=b">foo.bar/baz</a>"

This is a bug since '&' must be encoded in HTML (to avoid conflict with the encoding) and linkUrlsInTrustedHtml() expects valid HTML.

Art4 added a commit that referenced this issue Dec 8, 2016
@Art4
Copy link
Member

Art4 commented Dec 8, 2016

Thanks for this issue but I can't identify this as a bug.

http://foo.bar/baz?a=b&amp;a=b isn't a valid url, because it is already encoded (i.e with htmlspecialchars()).
So this url must be either inside a html tag, not encoded as http://foo.bar/baz?a=b&a=b or urlencoded as http://foo.bar/baz?a=b&amp%3ba=b.

I've written tests for all these cases, see 8ad91f6

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

Thanks for your reply!

It's not a valid URL but it is perfectly valid HTML.

The HTML is http://foo.bar/baz?a=b&amp;a=b

The URL is http://foo.bar/baz?a=b&a=b

And the result should be <a href="http://foo.bar/baz?a=b&amp;a=b">foo.bar/baz</a>. You are avoding the double encoding already, in escapeHtml().

This is the only way if accepting valid HTML in linkUrlsInTrustedHtml().

Art4 added a commit that referenced this issue Dec 9, 2016
@Art4
Copy link
Member

Art4 commented Dec 9, 2016

This would be valid HTML if it was already inside a HTML tag like <a href="http://foo.bar/baz?a=b&amp;a=b">.... But it isn't, so it is neither valid HTML nor a valid url.

In this case you have to use $urlLinker->linkUrlsAndEscapeHtml('http://foo.bar/baz?a=b&amp;a=b') what works fine. See tests in 2492474.

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

So you are saying that you can only use &amp; inside an attribute in HTML?

Eg the following would not be valid HTML: <p>http://foo.bar/baz?a=b&amp;a=b</p>?

@Art4
Copy link
Member

Art4 commented Dec 9, 2016

Eg the following would not be valid HTML: <p>http://foo.bar/baz?a=b&amp;a=b</p>?

Sure, this is valid HTML, but without a valid url. (with the exception of the first part http://foo.bar/baz?a=b)

The question is less if this HTML is valid but if it is trusted. So if this <p>http://foo.bar/baz?a=b&amp;a=b</p> came from as a request from a user it is not trusted and you should use $urlLinker->linkUrlsAndEscapeHtml().

But if this came from a trusted source than you can use $urlLinker->linkUrlsInTrustedHtml() but you can see that it contains not a valid url.

This is a valid url: http://foo.bar/baz?a=b&a=b
This is not a valid url: http://foo.bar/baz?a=b&amp;a=b

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

The encoding are layers. The &amp; belongs to HTML as you know. It has nothing to do with the URL. This is why it still is a valid URL given that the HTML is interpreted.

I am not saying that http://foo.bar/baz?a=b&amp;a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

The linkUrlsInTrustedHtml() should understand HTML correctly and trusting.

@Art4
Copy link
Member

Art4 commented Dec 9, 2016

I am not saying that http://foo.bar/baz?a=b&amp;a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.

If you insert an invalid URL the library will try to recognize as much as possible and put it in a <a>-tag. You can't expect that the library will know the meaning of an invalid url.

The linkUrlsInTrustedHtml() should understand HTML correctly and trusting.

So how should the library decide how to handle an invalid url? The meanings of http://foo.bar/baz?a=b&amp;a=b could be:

valid url in <a href="
http://foo.bar/baz?a=b&a=b http://foo.bar/baz?a=b&amp;a=b
http://foo.bar/baz?a=b&amp%3ba=b http://foo.bar/baz?a=b&amp;amp%3ba=b
http://foo.bar/baz?a=b%26amp%3ba=b http://foo.bar/baz?a=b%26amp%3ba=b

The library will recognize an url until the first query part http://foo.bar/baz?a=b and ignore the rest. We can discuss if the library should recognize the url until http://foo.bar/baz?a=b&amp because this is also valid. But the rest ;a=b is invalid and is ignored.

I'd be glad if you would create a PR with a solution how to decide what the library should do.

@Art4
Copy link
Member

Art4 commented Dec 9, 2016

Another example: If you call the url http://foo.bar/baz?a=b&amp;c=d PHP will interpret the querystring as:

array(2) {
  ["a"]=>
  string(1) "b"
  ["amp;c"]=>
  string(1) "d"
}

This is because the url will be urlencoded in the browser and became http://foo.bar/baz?a=b&amp;amp%3ba=b.

But how should the library if it should use urlencode() or turn &amp; back to &? You provided "trusted HTML" to $urlLinker->linkUrlsInTrustedHtml() so it will not change it but search for valid urls.
But if you put this invalid url into $urlLinker->linkUrlsAndEscapeHtml() it will escape the HTML and will recognize the &amp; as &.

So you have two options:

  1. Make sure you insert valid urls
  2. Use $urlLinker->linkUrlsAndEscapeHtml()

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

I am not saying that http://foo.bar/baz?a=b&amp;a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.
If you insert an invalid URL the library will try to recognize as much as possible and put it in a -tag. You can't expect that the library will know the meaning of an invalid url.

I'm sorry but you are missing the point. I am not inserting an invalid URL. I am inserting valid HTML. When the HTML is interpreted, it becomes a valid URL.

The fundamentally important thing is to encode/decode per context. You must not handle HTML entities as a part of a URL. Currently you are converting HTML → text without a HTML decode.

Basically linkUrlsInTrustedHtml() should HTML decode before calling linkUrlsAndEscapeHtml() - because the latter method expects text and not HTML.

For this to work the regex in linkUrlsInTrustedHtml() must include HTML entities when finding the URLs.

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

You are mixing different contexts.

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

Use html_entity_decode().

@Art4
Copy link
Member

Art4 commented Dec 9, 2016

I'm sorry but I don't get your problem. Can you provide me a specific example of your problem?

If you have the input from above:

$urlLinker = new Youthweb\UrlLinker\UrlLinker();

var_dump($urlLinker->linkUrlsInTrustedHtml("http://foo.bar/baz?a=b&amp;a=b"));

you will get

input output from linkUrlsInTrustedHtml()
http://foo.bar/baz?a=b&amp;a=b <a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&amp;a=b
<a href="http://foo.bar/baz?a=b&amp;a=b">foo.bar</a> <a href="http://foo.bar/baz?a=b&amp;a=b">foo.bar</a>
<p>http://foo.bar/baz?a=b&amp;a=b</p> <p><a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&amp;a=b</p>

I don't see a problem here.

Use html_entity_decode().

Why should it use html_entity_decode? You inserted trustful HTML. But you can use html_entity_decode() on your own before insert it to linkUrlsInTrustedHtml().

@Llbe
Copy link
Author

Llbe commented Dec 9, 2016

I don't want to be rude but you should study how to work with encodings. I promise you it will prove beneficial.

The problem with linkUrlsInTrustedHtml() specifically is that links are not extracted including entities.

You cannot decode before inserting - because you cannot decode before "parsing"/regex.

Because regex is used (and not libxml) - you need to extract links including entities and

a) decode the entities before inserting to linkUrlsAndEscapeHtml()

or

b) as today: avoid double encoding in linkUrlsAndEscapeHtml()

..

As a side note, ideally you should only extract links from HTML including entities that are within a link and not at the end of a link (or at the beginning of a link).

This is in order to handle a link such as (text): <http://foo.bar/baz>, which in HTML would be &lt;http://foo.bar/baz&gt;, where the link should be http://foo.bar/baz.

However http://foo.bar/baz&gt;baz in HTML would be interpreted as link http://foo.bar/baz>baz.

('<' and '>' is allowed unencoded in URLs, see RFC 3986 - it is considered unsafe/unwise by RFC 1738).

I take this as an example because '<' and '>' is more obviously problematic in HTML. Technically, '&' is allowed unencoded (it is only so called ambiguous ampersands that are not allowed - ie ampersands that can be interpreted as an entity but that are not a valid entity).

The '<' '>' case are not as important as the &amp; case - since that WILL break common links in valid HTML.

I have a working and secure handling of user input in my application. I use HTML for rich text formating. I want to convert links on the fly because how links are converted depends (I have a vary on target and rel attributes). This library does not work because HTML is not handled correctly. Otherwise, it seems to be a good improvement on the kwi original.

@Art4
Copy link
Member

Art4 commented Dec 10, 2016

Thank you for your patience. 😊 I'm always interested in learning new things. I am far away from being an expert on encoding, but I try to understand things before I can improve them. So I am really glad for your issue and I will work through your comment and discuss this issue internally.

This library does not work because HTML is not handled correctly. Otherwise, it seems to be a good improvement on the kwi original.

Thank you. And we are interested in improving this library and are thankful for every contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants