Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain characters in markdown code incorrectly parsed (e.g., &) #394

Open
veenstrajelmer opened this issue Nov 1, 2024 · 2 comments
Open

Comments

@veenstrajelmer
Copy link

veenstrajelmer commented Nov 1, 2024

I encountered a similar issue as #372, originally found in omnilib/sphinx-mdinclude#19. I did some digging and found that the issue lays in mistune.util.escape_url indeed as @torokati44 points towards, but only because its behavior is different depending on whether html is installed (this happens in the import, at least of mistune 2.0.5):

mistune/mistune/util.py

Lines 22 to 31 in cb580e8

def escape_url(link):
safe = (
':/?#@' # gen-delims - '[]' (rfc3986)
'!$&()*+,;=' # sub-delims - "'" (rfc3986)
'%' # leave already-encoded octets alone
)
if html is None:
return quote(link.encode('utf-8'), safe=safe)
return html.escape(quote(html.unescape(link), safe=safe))

Some code to reproduce:

from urllib.parse import quote
import html

link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"

# code from: def escape_url(link):
safe = (
    ':/?#@'           # gen-delims - '[]' (rfc3986)
    '!$&()*+,;='      # sub-delims - "'" (rfc3986)
    '%'               # leave already-encoded octets alone
)
out_nonhtml = quote(link.encode('utf-8'), safe=safe)
out_withhtml = html.escape(quote(html.unescape(link), safe=safe))
out_withhtml_noescape = quote(html.unescape(link), safe=safe)

print(out_nonhtml)
print(out_withhtml)
print(out_withhtml_noescape)

This gives different results. The first one is returned if html is not installed, the second one if html is installed. The third one is correct again:

https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status

Therefore, what I think should fix the issue is to remove the html.escape() from the escape_url function so the behaviour is always consistent and according to expectations.

Update for newer mistune versions
The escape_url function looks different in the master branch (>3.0.0):

def escape_url(link: str) -> str:
"""Escape URL for safety."""
safe = (
':/?#@' # gen-delims - '[]' (rfc3986)
'!$&()*+,;=' # sub-delims - "'" (rfc3986)
'%' # leave already-encoded octets alone
)
return escape(quote(unescape(link), safe=safe))

In this case omitting escape does the trick.

Originally posted by @veenstrajelmer in #372 (comment)

@mentalisttraceur
Copy link
Contributor

Okay, I've taken a closer look. TL;DR: we need to move the call to escape out of escape_url and into the HTML renderers.

This is another case of HTML-escaping being applied in the parse stage, instead of during HTML rendering, resulting in HTML-specific escapes leaking into non-HTML outputs/uses:

>>> md = mistune.Markdown()
>>> md('[hi](https://example.com/xyz?a=b&foo=bar)')
[{'type': 'paragraph', 'children': [{'type': 'link', 'children': [{'type': 'text', 'raw': 'hi'}], 'attrs': {'url': 'https://example.com/xyz?a=b&foo=bar'}}]}]

(The same kind of issue as for inline code spans - different code paths, but same abstract "shape".)

The current logic of escape_url is correct for HTML, but so we don't want to just delete the call to escape from inside it, we want to move it to every spot that renders URLs to HTML. Funny enough, the include Image directive plugin is already doing this (meaning it's currently buggy and will also get fixed by this change).

PR soon.

@veenstrajelmer
Copy link
Author

veenstrajelmer commented Nov 1, 2024

Thanks a lot for this elaborate analysis and the reproducible example. I was indeed expecting the escape could not just be ommitted, but had no idea about how to further resolve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants