-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug breaking links: linkUrlsInTrustedHtml does not handle entities correctly #2
Comments
Thanks for this issue but I can't identify this as a bug.
I've written tests for all these cases, see 8ad91f6 |
Thanks for your reply! It's not a valid URL but it is perfectly valid HTML. The HTML is The URL is And the result should be This is the only way if accepting valid HTML in linkUrlsInTrustedHtml(). |
This would be valid HTML if it was already inside a HTML tag like In this case you have to use |
So you are saying that you can only use Eg the following would not be valid HTML: |
Sure, this is valid HTML, but without a valid url. (with the exception of the first part The question is less if this HTML is valid but if it is trusted. So if this But if this came from a trusted source than you can use This is a valid url: |
The encoding are layers. The I am not saying that |
The linkUrlsInTrustedHtml() should understand HTML correctly and trusting. |
If you insert an invalid URL the library will try to recognize as much as possible and put it in a
So how should the library decide how to handle an invalid url? The meanings of
The library will recognize an url until the first query part I'd be glad if you would create a PR with a solution how to decide what the library should do. |
Another example: If you call the url
This is because the url will be urlencoded in the browser and became But how should the library if it should use So you have two options:
|
I'm sorry but you are missing the point. I am not inserting an invalid URL. I am inserting valid HTML. When the HTML is interpreted, it becomes a valid URL. The fundamentally important thing is to encode/decode per context. You must not handle HTML entities as a part of a URL. Currently you are converting HTML → text without a HTML decode. Basically linkUrlsInTrustedHtml() should HTML decode before calling linkUrlsAndEscapeHtml() - because the latter method expects text and not HTML. For this to work the regex in linkUrlsInTrustedHtml() must include HTML entities when finding the URLs. |
You are mixing different contexts. |
Use html_entity_decode(). |
I'm sorry but I don't get your problem. Can you provide me a specific example of your problem? If you have the input from above: $urlLinker = new Youthweb\UrlLinker\UrlLinker();
var_dump($urlLinker->linkUrlsInTrustedHtml("http://foo.bar/baz?a=b&a=b")); you will get
I don't see a problem here.
Why should it use |
I don't want to be rude but you should study how to work with encodings. I promise you it will prove beneficial. The problem with linkUrlsInTrustedHtml() specifically is that links are not extracted including entities. You cannot decode before inserting - because you cannot decode before "parsing"/regex. Because regex is used (and not libxml) - you need to extract links including entities and a) decode the entities before inserting to linkUrlsAndEscapeHtml() or b) as today: avoid double encoding in linkUrlsAndEscapeHtml() .. As a side note, ideally you should only extract links from HTML including entities that are within a link and not at the end of a link (or at the beginning of a link). This is in order to handle a link such as (text): However ('<' and '>' is allowed unencoded in URLs, see RFC 3986 - it is considered unsafe/unwise by RFC 1738). I take this as an example because '<' and '>' is more obviously problematic in HTML. Technically, '&' is allowed unencoded (it is only so called ambiguous ampersands that are not allowed - ie ampersands that can be interpreted as an entity but that are not a valid entity). The '<' '>' case are not as important as the I have a working and secure handling of user input in my application. I use HTML for rich text formating. I want to convert links on the fly because how links are converted depends (I have a vary on target and rel attributes). This library does not work because HTML is not handled correctly. Otherwise, it seems to be a good improvement on the kwi original. |
Thank you for your patience. 😊 I'm always interested in learning new things. I am far away from being an expert on encoding, but I try to understand things before I can improve them. So I am really glad for your issue and I will work through your comment and discuss this issue internally.
Thank you. And we are interested in improving this library and are thankful for every contribution. |
The result here is
string(56) "<a href="http://foo.bar/baz?a=b">foo.bar/baz</a>&a=b"
The expected result is
string(56) "<a href="http://foo.bar/baz?a=b&a=b">foo.bar/baz</a>"
This is a bug since '&' must be encoded in HTML (to avoid conflict with the encoding) and linkUrlsInTrustedHtml() expects valid HTML.
The text was updated successfully, but these errors were encountered: