Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Llbe · 2016-12-08T17:20:56Z

$urlLinker = new Youthweb\UrlLinker\UrlLinker();

var_dump ($urlLinker->linkUrlsInTrustedHtml ("http://foo.bar/baz?a=b&amp;a=b"));

The result here is
string(56) "<a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&a=b"

The expected result is
string(56) "<a href="http://foo.bar/baz?a=b&a=b">foo.bar/baz</a>"

This is a bug since '&' must be encoded in HTML (to avoid conflict with the encoding) and linkUrlsInTrustedHtml() expects valid HTML.

The text was updated successfully, but these errors were encountered:

refs #2

Art4 · 2016-12-08T23:39:46Z

Thanks for this issue but I can't identify this as a bug.

http://foo.bar/baz?a=b&a=b isn't a valid url, because it is already encoded (i.e with htmlspecialchars()).
So this url must be either inside a html tag, not encoded as http://foo.bar/baz?a=b&a=b or urlencoded as http://foo.bar/baz?a=b&amp%3ba=b.

I've written tests for all these cases, see 8ad91f6

Llbe · 2016-12-09T04:29:13Z

Thanks for your reply!

It's not a valid URL but it is perfectly valid HTML.

The HTML is http://foo.bar/baz?a=b&a=b

The URL is http://foo.bar/baz?a=b&a=b

And the result should be <a href="http://foo.bar/baz?a=b&a=b">foo.bar/baz</a>. You are avoding the double encoding already, in escapeHtml().

This is the only way if accepting valid HTML in linkUrlsInTrustedHtml().

refs #2

Art4 · 2016-12-09T06:50:58Z

This would be valid HTML if it was already inside a HTML tag like <a href="http://foo.bar/baz?a=b&a=b">.... But it isn't, so it is neither valid HTML nor a valid url.

In this case you have to use $urlLinker->linkUrlsAndEscapeHtml('http://foo.bar/baz?a=b&a=b') what works fine. See tests in 2492474.

Llbe · 2016-12-09T14:42:09Z

So you are saying that you can only use & inside an attribute in HTML?

Eg the following would not be valid HTML: http://foo.bar/baz?a=b&a=b?

Art4 · 2016-12-09T14:53:39Z

Eg the following would not be valid HTML: http://foo.bar/baz?a=b&a=b?

Sure, this is valid HTML, but without a valid url. (with the exception of the first part http://foo.bar/baz?a=b)

The question is less if this HTML is valid but if it is trusted. So if this http://foo.bar/baz?a=b&a=b came from as a request from a user it is not trusted and you should use $urlLinker->linkUrlsAndEscapeHtml().

But if this came from a trusted source than you can use $urlLinker->linkUrlsInTrustedHtml() but you can see that it contains not a valid url.

This is a valid url: http://foo.bar/baz?a=b&a=b
This is not a valid url: http://foo.bar/baz?a=b&a=b

Llbe · 2016-12-09T15:05:29Z

The encoding are layers. The & belongs to HTML as you know. It has nothing to do with the URL. This is why it still is a valid URL given that the HTML is interpreted.

I am not saying that http://foo.bar/baz?a=b&a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.

Llbe · 2016-12-09T15:06:29Z

The linkUrlsInTrustedHtml() should understand HTML correctly and trusting.

Art4 · 2016-12-09T15:30:41Z

I am not saying that http://foo.bar/baz?a=b&a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.

If you insert an invalid URL the library will try to recognize as much as possible and put it in a <a>-tag. You can't expect that the library will know the meaning of an invalid url.

The linkUrlsInTrustedHtml() should understand HTML correctly and trusting.

So how should the library decide how to handle an invalid url? The meanings of http://foo.bar/baz?a=b&a=b could be:

valid url	in `<a href="`
`http://foo.bar/baz?a=b&a=b`	`http://foo.bar/baz?a=b&a=b`
`http://foo.bar/baz?a=b&amp%3ba=b`	`http://foo.bar/baz?a=b&amp%3ba=b`
`http://foo.bar/baz?a=b%26amp%3ba=b`	`http://foo.bar/baz?a=b%26amp%3ba=b`

The library will recognize an url until the first query part http://foo.bar/baz?a=b and ignore the rest. We can discuss if the library should recognize the url until http://foo.bar/baz?a=b&amp because this is also valid. But the rest ;a=b is invalid and is ignored.

I'd be glad if you would create a PR with a solution how to decide what the library should do.

Art4 · 2016-12-09T15:46:16Z

Another example: If you call the url http://foo.bar/baz?a=b&c=d PHP will interpret the querystring as:

array(2) {
  ["a"]=>
  string(1) "b"
  ["amp;c"]=>
  string(1) "d"
}

This is because the url will be urlencoded in the browser and became http://foo.bar/baz?a=b&amp%3ba=b.

But how should the library if it should use urlencode() or turn & back to &? You provided "trusted HTML" to $urlLinker->linkUrlsInTrustedHtml() so it will not change it but search for valid urls.
But if you put this invalid url into $urlLinker->linkUrlsAndEscapeHtml() it will escape the HTML and will recognize the & as &.

So you have two options:

Make sure you insert valid urls
Use $urlLinker->linkUrlsAndEscapeHtml()

Llbe · 2016-12-09T15:46:39Z

I am not saying that http://foo.bar/baz?a=b&a=b is a valid URL. I am saying that it is valid HTML that contains a valid URL.
If you insert an invalid URL the library will try to recognize as much as possible and put it in a -tag. You can't expect that the library will know the meaning of an invalid url.

I'm sorry but you are missing the point. I am not inserting an invalid URL. I am inserting valid HTML. When the HTML is interpreted, it becomes a valid URL.

The fundamentally important thing is to encode/decode per context. You must not handle HTML entities as a part of a URL. Currently you are converting HTML → text without a HTML decode.

Basically linkUrlsInTrustedHtml() should HTML decode before calling linkUrlsAndEscapeHtml() - because the latter method expects text and not HTML.

For this to work the regex in linkUrlsInTrustedHtml() must include HTML entities when finding the URLs.

Llbe · 2016-12-09T15:47:50Z

You are mixing different contexts.

Llbe · 2016-12-09T15:49:12Z

Use html_entity_decode().

Art4 · 2016-12-09T16:10:29Z

I'm sorry but I don't get your problem. Can you provide me a specific example of your problem?

If you have the input from above:

$urlLinker = new Youthweb\UrlLinker\UrlLinker();

var_dump($urlLinker->linkUrlsInTrustedHtml("http://foo.bar/baz?a=b&amp;a=b"));

you will get

input	output from `linkUrlsInTrustedHtml()`
`http://foo.bar/baz?a=b&a=b`	`<a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&a=b`
`<a href="http://foo.bar/baz?a=b&a=b">foo.bar</a>`	`<a href="http://foo.bar/baz?a=b&a=b">foo.bar</a>`
`<p>http://foo.bar/baz?a=b&a=b</p>`	`<p><a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&a=b</p>`

I don't see a problem here.

Use html_entity_decode().

Why should it use html_entity_decode? You inserted trustful HTML. But you can use html_entity_decode() on your own before insert it to linkUrlsInTrustedHtml().

Llbe · 2016-12-09T18:59:25Z

I don't want to be rude but you should study how to work with encodings. I promise you it will prove beneficial.

The problem with linkUrlsInTrustedHtml() specifically is that links are not extracted including entities.

You cannot decode before inserting - because you cannot decode before "parsing"/regex.

Because regex is used (and not libxml) - you need to extract links including entities and

a) decode the entities before inserting to linkUrlsAndEscapeHtml()

or

b) as today: avoid double encoding in linkUrlsAndEscapeHtml()

..

As a side note, ideally you should only extract links from HTML including entities that are within a link and not at the end of a link (or at the beginning of a link).

This is in order to handle a link such as (text): <http://foo.bar/baz>, which in HTML would be <http://foo.bar/baz>, where the link should be http://foo.bar/baz.

However http://foo.bar/baz>baz in HTML would be interpreted as link http://foo.bar/baz>baz.

('<' and '>' is allowed unencoded in URLs, see RFC 3986 - it is considered unsafe/unwise by RFC 1738).

I take this as an example because '<' and '>' is more obviously problematic in HTML. Technically, '&' is allowed unencoded (it is only so called ambiguous ampersands that are not allowed - ie ampersands that can be interpreted as an entity but that are not a valid entity).

The '<' '>' case are not as important as the & case - since that WILL break common links in valid HTML.

I have a working and secure handling of user input in my application. I use HTML for rich text formating. I want to convert links on the fly because how links are converted depends (I have a vary on target and rel attributes). This library does not work because HTML is not handled correctly. Otherwise, it seems to be a good improvement on the kwi original.

Art4 · 2016-12-10T00:34:22Z

Thank you for your patience. 😊 I'm always interested in learning new things. I am far away from being an expert on encoding, but I try to understand things before I can improve them. So I am really glad for your issue and I will work through your comment and discuss this issue internally.

This library does not work because HTML is not handled correctly. Otherwise, it seems to be a good improvement on the kwi original.

Thank you. And we are interested in improving this library and are thankful for every contribution.

Art4 added a commit that referenced this issue Dec 8, 2016

Add tests for urls in HTML

8ad91f6

refs #2

Art4 added a commit that referenced this issue Dec 9, 2016

Add tests for urls with escaping HTML

2492474

refs #2

Art4 added enhancement help wanted labels Apr 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Llbe commented Dec 8, 2016 •

edited

Loading

Art4 commented Dec 8, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 •

edited

Loading

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 •

edited

Loading

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016

Art4 commented Dec 9, 2016

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 •

edited

Loading

Llbe commented Dec 9, 2016

Art4 commented Dec 10, 2016

Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2

Comments

Llbe commented Dec 8, 2016 • edited Loading

Art4 commented Dec 8, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 • edited Loading

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 • edited Loading

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016

Art4 commented Dec 9, 2016

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Llbe commented Dec 9, 2016

Art4 commented Dec 9, 2016 • edited Loading

Llbe commented Dec 9, 2016

Art4 commented Dec 10, 2016

Llbe commented Dec 8, 2016 •

edited

Loading

Art4 commented Dec 9, 2016 •

edited

Loading

Art4 commented Dec 9, 2016 •

edited

Loading

Art4 commented Dec 9, 2016 •

edited

Loading