Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ASCII hostnames are not validated and so allow "@" and other reserved gen-delims characters #955

Open
mjpieters opened this issue Nov 20, 2023 · 4 comments
Labels

Comments

@mjpieters
Copy link
Contributor

mjpieters commented Nov 20, 2023

Describe the bug

For #880 we fixed handling of ASCII hostnames, rejecting hostnames that contain characters or sequences that are explicitly excluded (see RFC3986, section 3.2.2, the reg-name grammar rule.

When implementing the PR for this I had tested idna.encode(host, uts46=True) and verified that it correctly rejects hostnames that use ASCII characters outside of the reg-name rule, not realising that _idna_encode() catches the exception raised for this and then falls back to host.encode('idna'), which doesn't reject such hostnames.

The exception handling is there to allow for IDNA 2003 / 2008 compatibility (see #152). I suspect that we need to use idna.encode(host) instead of host.encode('idna') here to ensure that invalid characters are still rejected.

Because there is no validation it is trivial to create invalid URLs, e.g. by passing in a non-ascii authority value to host plus a user or password value.

To Reproduce

>>> import yarl
>>> yarl.URL.build(scheme="http", host="user:pass@историк.рф", user="therealusername")
URL('http://therealusername@xn--user:[email protected]')

Expected behavior

The user:pass@историк.рф host value contains invalid characters, : and @ and so a ValueError should have been raised.

@mjpieters mjpieters added the bug label Nov 20, 2023
@mjpieters
Copy link
Contributor Author

I suspect that we need to use idna.encode(host) instead of host.encode('idna') here to ensure that invalid characters are still rejected.

Nope, that's not correct. Instead, we should perhaps just reject IDNA 2003 support? I'm not versed in IDNA RFCs so it'll take me some time to figure this out.

@asvetlov: can you weigh in on this? Why do we need to be able to encode host strings using IDNA 2003 if IDNA 2008 encoding fails?

@mjpieters
Copy link
Contributor Author

mjpieters commented Nov 20, 2023

I have now found and read the Further Notes section of the Python idna project. I think that the IDNA encoder is implementing the recommendation there:

For now, applications that need to support these non-compliant labels may wish to consider trying the encode/decode operation in this library first, and then falling back to using encodings.idna

If this is the reason, then we'll need to explicitly handle non-compiant characters in the host separately.

Alternatively, we could decide to reject non-IDNA 2008 hostnames.

@webknjaz
Copy link
Member

The user:pass@историк.рф host

Oh boi. We should replace that with something adequate. It's a resource praising the r*ssian colonialism and related imperialist propaganda 🤮. I opened it just to make sure and immediately got a panic attack that will take days to recover from :(

@Dreamsorcerer
Copy link
Member

These problems are probably why requests appear to have just rejected it:
psf/requests#5845

Although, as that issue states, several TLDs choose not to enforce IDNA 2008 for their domain registries (and anybody could used them for subdomains), so it's not exactly forbidden globally either.

webknjaz added a commit that referenced this issue Dec 6, 2023
This change doesn't edit the test semantics, nor does it make any
functional changes. It removes references to the ruscist culture that
are as triggering to a lot of people as using "slave/master",
"blacklist/whitelist" terminology.

In particular, this gets rid of a link to the website that exists
to justify ethnic cleansing in Crimea and genocides in Ukraine, being
sponsored by the government of the terrorist state of muscovy.

Ref #955 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

3 participants