Define hosts' public suffix and registrable domain. #391
This patch is another attempt at #72, and defers most of the actual work to the algorithms defined at https://publicsuffix.org/list/.
Thanks, I mostly have nits.
Do we need public suffix as a concept in practice? I haven't seen the need for that so far, but if we add an API it would make sense.
url.bs
Outdated
obtain <var>host</var>'s <a for=host>public suffix</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.
A host is already parsed (otherwise it wouldn't be a host). You also need to introduce the host variable in the paragraph before the algorithm.
A host is already parsed (otherwise it wouldn't be a host).
Hrm. Yeah, I guess it's reasonable to assume that we'll only be using this algorithm on already-parsed hosts.
You also need to introduce the host variable in the paragraph before the algorithm.
Line 277 introduces <var>host</var>. Would you prefer to be more explicit, like "To obtain the <a for=host>public suffix</a> for a <a for=/>host</a> <var>A</var>:"?
Ah, I missed that on line 277. That's fine, but your alternative here works too.
url.bs
Outdated
<var>host</var>'s <a for=host>registrable domain</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.
Same comment as above.
url.bs
Outdated
<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.

<li><p>If <var>parsed</var> is not a <a>domain</a>, return the empty string.
This kind of implies that the public suffix is also a string. Perhaps it's cleaner to return null?
Is the public suffix a host? I guess it could be. I was assuming it was a string, but treating it as a host seems reasonable.
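To make the null-versus-empty-string question concrete, here is a minimal sketch in Python. It is not the spec text: `SUFFIXES` is a tiny hypothetical stand-in for the Public Suffix List, wildcard and exception rules are ignored, and the function names are illustrative only.

```python
import ipaddress
from typing import Optional

# Hypothetical stand-in for the Public Suffix List (no wildcard/exception rules).
SUFFIXES = {"com", "io", "github.io"}


def is_ip_address(host: str) -> bool:
    """True if host is an IPv4 address or a (possibly bracketed) IPv6 address."""
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def public_suffix(host: str) -> Optional[str]:
    """Return the longest matching suffix, or None when host is not a domain."""
    if is_ip_address(host):
        # None (null) instead of "" so the result is not mistaken for a host.
        return None
    labels = host.split(".")
    # Try the longest candidate first so "github.io" wins over "io".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            return candidate
    # PSL default rule "*": fall back to the last label.
    return labels[-1]
```

Returning `None` lets callers distinguish "not a domain" from any real suffix without special-casing the empty string.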
<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>EXAMPLE.COM</code>
This is not a host, but input to the host parser.
I think it's helpful to point out that no matter how folks spell the URL, it's going to be normalized. Perhaps shifting this table to include a URL rather than a host would make that point, especially for the punycode bits?
I think it's fine to just list hosts, but we should label it "host input" or some such, to not confuse it with host as a concept, which is already parsed and normalized.
<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>
<td><code>إختبار</code>
Same as above. And also applies below.
url.bs
Outdated
</div>

<p>Two <a for=/>hosts</a>, <var>A</var> and <var>B</var> are said to be
<dfn for=host export id=concept-host-same-site>same-site</dfn> with each other if either of the
We have it as "same origin". Should this be "same site"?
Meh. I think I'd have spelled it "same-origin" if you hadn't already spelled it "same origin". :)
I'm happy to follow suit with "same site"; I'm not dogmatic about hyphenation.
url.bs
Outdated
following statements are true:

<ul class=brief>
<li><p><var>A</var> is identical to <var>B</var>
This should use concept-host-equals.
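For readers following along, the same-site check under discussion can be sketched as below. Only the first condition (the two hosts are equal) is visible in the quoted diff; the second condition, equal and non-null registrable domains, is an assumption here, and `same_site` and `TOY_REGISTRABLE` are hypothetical names used purely for illustration.

```python
from typing import Callable, Optional


def same_site(
    a: str,
    b: str,
    registrable_domain: Callable[[str], Optional[str]],
) -> bool:
    """Sketch of a same-site comparison between two already-parsed hosts."""
    # Condition 1: the hosts are equal.
    if a == b:
        return True
    # Condition 2 (assumed): both share a registrable domain that is not null.
    domain_a = registrable_domain(a)
    return domain_a is not None and domain_a == registrable_domain(b)


# Toy lookup table standing in for a real registrable-domain algorithm.
TOY_REGISTRABLE = {
    "www.example.com": "example.com",
    "example.com": "example.com",
    "github.io": None,
    "whatwg.github.io": "whatwg.github.io",
    "w3c.github.io": "w3c.github.io",
}
```

With this table, `same_site("www.example.com", "example.com", TOY_REGISTRABLE.get)` holds, while `whatwg.github.io` and `w3c.github.io` are not same-site with each other or with `github.io`.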
As for the API, I think we could make it part of #288. We could even leave out |
Updated based on your feedback, WDYT?
I think we do, as it seems to be what we want to reference from HTML's |
From an API perspective, I remain opposed to adding it (not least because each browser carries its own notion of a PSL and, as it turns out, browsers implement slightly different algorithms)
I die a little inside that we’re speccing more of this, but I suppose that’s inevitable. CC’ing @weppos and @dnsguru as other PSL maintainers
Isn't detecting that kind of difference in platform restrictions a reason to add the API? Different browsers consider a different set of hosts to have different registrable domains over time: it seems reasonable to expose that to developers so they can make decisions about the environment in which their code is executing.
Not writing it down seems worse. :) As long as cookies, |
No, it's actually a reason for not exposing - that the platform does not provide any guarantees about that environment, and you shouldn't be relying on trying to detect the environment. If there is something you could do if you're not on the PSL (for example, using separate domains under a gTLD vs using separate 3LDs), you should do that regardless. Anything you could or would do if you're not on the PSL, you should do regardless. And anything you would or could do 'if' you're on the PSL is wrong. So it's sort of win/win - always treat it as if you're not on it, and you're fine :) I know it's a fairly extreme position, but the PSL isn't something we can or should be relying on, especially as the Internet scales on. The notion of having a static list of every hosting provider, CMS, and otherwise user-generated content platforms, which the PSL is, is, well, a very 1980s solution :)
Except no one (on the author side) should be building systems that rely on it, and no new specs should be depending on it. To the extent writing it down gets to make explicit that this is a legacy aspect and any spec that references it has security flaws and should be redesigned before being implemented/shipped because of that, sure 👍 |
@sleevi if you were actually successful in stopping new standards from relying on it I'd be more persuaded. But meanwhile WebAuthn relies on it, Token Binding relies on it, the new Cross-Origin-Resource-Policy header relies on it, and to a large extent because of pressure from accounts.google.com (at least for the first two) as I understand it. |
@annevk Token Binding is not yet shipping in any browser, and WebAuthN's use of facets remains problematic for security. If C-O-R-P relies on it, we should be fixing that in C-O-R-P before shipping.
url.bs
Outdated
@@ -272,6 +272,93 @@ for further processing.
U+0020 SPACE, U+0023 (#), U+0025 (%), U+002F (/), U+003A (:), U+003F (?), U+0040 (@), U+005B ([),
U+005C (\), or U+005D (]).

<p>A <a for=/>host</a>'s <dfn for=host export id=concept-host-public-suffix>public suffix</dfn> is
the portion of a <a for=/>host</a> which is controlled by a registrar, public or otherwise. To
pedantry: "which is controlled by a registrar" isn't necessarily correct. This implies control over the DNS, which isn't always passed on (e.g. in the case of hosting or DNS providers); in an example like appspot.com, that domain isn't controlled by a registrar.
That was the intent of the PSL originally - reflecting ccTLDs' registration policies - but that predates the advent of the PRIVATE section, where it all began the descent into hell :)
publicsuffix.org doesn't list 'what' a public suffix is, other than the result of running the algorithm. Logically, it represents the separation of domain boundaries indicating a change in administrative or technical control or security policy (which is why IETF called it DBOUND), but that's a bit of a mouthful... :/
an example like appspot.com, that domain isn't controlled by a registrar.
Aren't we calling Google the registrar in this example? Or GitHub the registrar of *.github.io?
Is there a term I could use that would be more accurate (and less than a sentence long :) )?
@mikewest Yeah, except neither GitHub nor Google is actually a registrar or acting as one. That was why it was sort of weird. They don't necessarily allow registration either (and may instead assign names, such as Amazon, based on project IDs)
For a given domain input, the PSL splits the labels on the first administrative boundary, with the registered domain being the set of labels that are operated according to a different set of domain policies than the public suffix (which itself may contain more domain splits).
Definitely a mouthful, and this is part of why we dance around it on publicsuffix.org, because we haven't found a pithy way of describing left/right except in their relationship to each other. :/
I was hoping your ability to condense these concepts would be better than mine.
This is definitely one of the longest-outstanding proper definitions that we eventually need to clarify on the PSL project as well, and it is connected to publicsuffix/publicsuffix.org#12
If we consider only the ICANN section (mistakenly named like that, as it should be IANA), then the definition is probably correct. If we also consider the PRIVATE section, and then the list as a whole, we must come up with a better definition of what effectively distinguishes a suffix from a host.
The "controlled" part is the key. In both cases, the common denominator is that an entity has control of a portion (a set of labels) of a host, and determines specific rules on how that portion of the name is operated. Everything beyond (to the left of) that label is basically not under direct control of that entity, and therefore each subzone should be considered independent from the others.
In the case of a registry, the controlled portion is for sure the TLD and perhaps extra lower levels (generally second, sometimes third). In that case, the "registerable" definition potentially applies, as there is a direct assumption that the registrar makes those domains available for registration. Again, this is actually a potentially incorrect assumption, as domains that belong to that zone may not be open for registration, but assigned explicitly.
While "registerable" may potentially fit the registrar use case, it definitely doesn't fit the PRIVATE use case, because the suffixes in this section may be there for a variety of reasons.
However, regardless of the use case, the common pattern is that the entity that controls the suffix declares that every subzone beyond that suffix should be considered an independent zone, potentially managed by different users.
@annevk The PSL fundamentally can't scale, unless our goal is to deliver a snapshot of the Internet domains to users in real time, which we sort of put to bed when RFC 952 was obsoleted. I don't want to derail this thread, so apologies if it comes off there - but definitely wanted to push back on the notion of exposing this as part of the platform. Anyone that is making security assumptions about the presence or non-presence of a domain on the PSL is making a flawed security decision. To the extent browsers are doing it, they're wrong - and while I understand they may be doing so for legacy reasons, we should push back. But as a concept for exposing it to/as part of the platform, as much as possible, we should be trying to hide it from the platform and developers, because it's a concept that should go away / should not be relied on. If there is anything authors would do differently (based on non-presence), they should do that, and if there's anything they would do based on presence, they should stop doing that :) Hopefully that would obviate the need for API exposure. |
@sleevi again, that'd be more convincing if Google didn't double down on relying upon it. Both server-side with accounts and in Chrome with site isolation. |
@annevk I don't think I can productively respond to that. It feels very much "You're employed by Google. Google misuses a project you maintain. Therefore I don't need to consider that feedback until you change Google first.". I agree that it is unfortunate, I agree that it is not ideal, but to the extent possible, we should push back and try to find better solutions, and push back on those organizations that ignore that feedback. |
Google is large, it contains multitudes. FWIW, Chrome's isolation folks are actively working on origin isolation ( I think it's also true that Google's sign-in team is enthusiastic about separating I think it's helpful to create primitives that help developers work within the model we've created for ourselves, on the one hand, and to use our other hand to poke at the model in the hopes of shifting it. Making |
Thanks @sleevi for bringing this discussion to my attention. For the record, I'm very bad at instantaneously following up on threads (I admire how Ryan can keep an eye on all the MLs he's involved in 😅). Anyway, before weighing in on one side or another, I actually have a question:
@sleevi just to better understand, are you critical of the concept of a public suffix, or of the specific implementation of the PSL per se? The reason I'm asking is that I concur that the PSL as it stands today is an obsolete project that may need some extra work, but I'm not totally against the concept of a public suffix (the hard part is to properly define it). For now, for the sake of simplicity, let's take the suffixes in the PSL and consider them a PSL. If the domain owner had a way, as was done for CAA, to somehow communicate (e.g. via DNS) the zone's preferred public suffix interpretation and policy, would your position on the concept of public suffix change? Please set aside for a second the possible implementation constraints or implications. I think the scope here is to try to clarify that a notion of public suffix exists, regardless of whether we use the PSL today to represent them. I don't completely disagree with that idea.
url.bs
Outdated
@@ -272,6 +272,94 @@ for further processing.
U+0020 SPACE, U+0023 (#), U+0025 (%), U+002F (/), U+003A (:), U+003F (?), U+0040 (@), U+005B ([),
U+005C (\), or U+005D (]).

<p>A <a for=/>host</a>'s <dfn for=host export>public suffix</dfn> is the portion of a
<a for=/>host</a> which is controlled by a registrar, public or otherwise. To obtain
Instead of "controlled by a registrar, public or otherwise" we could say "included on the Public Suffix List". This is boring, but factual and correct as I understand it.
I think that works (as boring as it is)
url.bs
Outdated
</ol>

<p>A <a for=/>host</a>'s <dfn for=host export>registrable domain</dfn> is a <a>domain</a> that could
be registered at a registry. To obtain <var>host</var>'s <a for=host>registrable domain</a>, run
"is its public suffix including one domain label preceding its public suffix". Again, boring, but factual?
So a given host may have multiple public suffixes expressed within it.
Perhaps:
The domain formed by the most specific public suffix for host, along with the domain label immediately preceding it?
From a spec perspective, what do you want this definition to entail for the appspot.com case?
That is,
- foo.bar.appspot.com is "obviously" going to return bar.appspot.com as the registerable domain (with appspot.com as the public suffix), and the same would be expected if just bar.appspot.com.
- What do you expect this machinery to return for appspot.com? appspot.com is on the PSL, so that is a public suffix, but appspot.com is also a registerable domain under the com PSL.
I seem to recall that different platform features interpret that differently (navigation vs cookies, for example)
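The appspot.com question can be illustrated with a small sketch. It assumes the "most specific public suffix plus the preceding label" wording suggested above, and it assumes a host that is itself a public suffix has no registrable domain (null). `ICANN_RULES`, `PRIVATE_RULES`, and the `include_private` flag are hypothetical stand-ins, not any browser's actual API.

```python
from typing import Optional

# Hypothetical rule sets; the real list is far larger and also has
# wildcard and exception rules, which this sketch ignores.
ICANN_RULES = {"com"}
PRIVATE_RULES = {"appspot.com"}


def public_suffix(host: str, include_private: bool = True) -> str:
    """Longest matching rule, falling back to the last label."""
    rules = (ICANN_RULES | PRIVATE_RULES) if include_private else ICANN_RULES
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    return labels[-1]


def registrable_domain(host: str, include_private: bool = True) -> Optional[str]:
    """Public suffix plus the one label immediately preceding it, else None."""
    suffix = public_suffix(host, include_private)
    if host == suffix:
        # The host is itself a public suffix: no label precedes it.
        return None
    suffix_length = suffix.count(".") + 1
    labels = host.split(".")
    return ".".join(labels[-(suffix_length + 1):])
```

Under these assumptions `appspot.com` yields null when private rules are included, but is its own registrable domain when only the ICANN-style rules are consulted, which is one way two callers could disagree.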
I wasn't really aware of this case. Do you know why they interpret it differently? I guess we want consistent answers with cookies, WebAuthn, etc. If by navigation you mean the address bar it seems consistency with that would not matter that much.
@mikewest wrote that the registrable domain would be null in such a case (we have github.io as example).
I’d need to reaudit the Chrome code to figure out which cases are web visible. The results differ in this case based on whether or not you include private suffixes and whether you treat wildcards as implicit entries of the parent. Chrome and FF differ on the latter, and the former is specified by the caller.
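As a rough illustration of the wildcard point, the sketch below shows how treating a wildcard rule as an implicit entry for its parent changes the answer. The rule set and the `wildcard_implies_parent` switch are hypothetical; this does not claim to reproduce either browser's behavior.

```python
# Hypothetical rules: one plain rule and one wildcard rule.
RULES = {"com", "*.platform.sh"}


def rule_matches(rule_labels: list, host_labels: list) -> bool:
    """True if the rule matches the rightmost labels of the host."""
    if len(rule_labels) > len(host_labels):
        return False
    tail = host_labels[-len(rule_labels):]
    return all(r == "*" or r == h for r, h in zip(rule_labels, tail))


def public_suffix(host: str, wildcard_implies_parent: bool = False) -> str:
    rules = set(RULES)
    if wildcard_implies_parent:
        # Treat "*.platform.sh" as if "platform.sh" were also listed.
        rules |= {rule[2:] for rule in RULES if rule.startswith("*.")}
    host_labels = host.split(".")
    best = 1  # default rule "*": the last label is always a suffix
    for rule in rules:
        rule_labels = rule.split(".")
        if rule_matches(rule_labels, host_labels):
            best = max(best, len(rule_labels))
    return ".".join(host_labels[-best:])
```

In this sketch `platform.sh` maps to `sh` without the implicit parent entry and to `platform.sh` with it, while `foo.platform.sh` is a public suffix either way.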
@weppos Thanks for chiming in. I realized the more I wrote, the more this should be its own issue, so I filed publicsuffix/list#671 to try and track some of my thoughts on this, and on the overall domain holder boundary use case. |
Based on the discussion I think we should go ahead, to ensure a consistent definition of all the standards that end up using these. Given @sleevi's legitimate concerns, let's add a Any API proposal should be discussed in a new issue on this repository, if there's still appetite. Discussion of scaling PSL can go in the issue raised by @sleevi above. |
url.bs
Outdated
<tr>
<td><code>com</code>
<td><code>com</code>
<td>
We should change this to <td>Null I suppose. (Though we could maybe add a paragraph that says that null values are omitted. Not sure what's nicer.)
I would suggest <td><i>null</i>
Why? We don't use that convention anywhere.
Example tables like this are special. We omit the quotes, substitute strings for structs, and use other conventions meant for visual clarity and not consistency. I certainly don't think we should capitalize "null" here, and I think italicizing it so that it's clear it's not just a registrable domain named null is helpful.
Shrug. Just a thought.
<td><code>whatwg.github.io</code>
<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>
Isn't this row duplicated? The previous one looks the same.
@sleevi @weppos I'll toss in some hopefully helpful but really wordy info here as color. For the last decade we have struggled with the focused definition of 'public suffix' term, 'eTLD', 'registerable domain' and other terms as being interchangeable. I agree that we need a glossary or something to help make these definitions more clear, and perhaps identify and expire the use of some of them if we can. But it gets tricky and nuanced. Different users, developers, integrators, and contributors define these in a variety of ways, sometimes as synonyms, sometimes not. This seems to result from the variation in how the PSL gets implemented within libraries or used in development. Sometimes there is a granular distinction that drove a given term's usage. Pardon if I go too deep on this one but it helps us out in the process of coming up with a good path forward. To avoid being drawn into the frothing energy around competing/alternative root TLD systems, the PSL maintainers opted to follow a document from ICANN called ICP-3, which defines a single authoritative root system for TLDs. The IANA maintains the listings of the TLDs listed in that root. The IANA does not go deeper levels than these initial entries (so it would include .UK but not CO.UK and include .AU but not COM.AU). A long time ago (in a galaxy far, far away ;)), it was discovered that one might be able to issue a 'super cookie' for CO.UK and slurp up all kinds of interesting data for the subdomains of CO.UK, and the PSL was born to dig deeper into these TLD structures in order to know what to treat as though it was 'effectively' (hence 'eTLD') a Top Level Domain when really it was a second level (or deeper) domain such as CO.UK. And thus was born use of a static list to identify these nuances in a more elegant manner than the IANA list, and go deeper into the effective namespaces. 
The benefit of such a list is that it is possible to cache it or incorporate it within one's software to understand how to treat entries, but a drawback is that having this update creates challenges because it is held in a centralized location (which defeats the benefits of the distributed nature of DNS that replaced the hosts.txt situation in the 80s). In the years since, the PSL has really become the only widely used, community maintained, frequently updated list of strings that might be expected to behave as-if they are a TLD, even if they are not at the top level. This evolved further. While CO.UK is operated by Nominet who oversee .UK, and COM.AU is overseen by AUDA who oversee .AU, there are some TLD-like systems that leap that direct and authoritative connection. Over the course of time, systems like Centralnic started offering subdomain registrations, Github offered subdomain hosting, and Dyn (now Oracle) started to offer DNS host naming, etc. US.COM, operated by Centralnic, is technically under .COM, but not operated directly by the .COM registry. So there is a change in the administrative horizon that begins at the root, and we opted to split the PSL into two sections, putting the IANA top down / ICANN delegated zones into the 'ICANN' section, and located this stuff (mostly, it still needs constant audit) into a section that designated that horizon, the 'PRIVATE' section. These lists seem at first like they are something that should be simple to compile, but the other maintainers and I would argue this is not the case. As a result, developers and integrators and security experts, software libraries, certificate authorities, and browsers and search engines (and I could riff for a while on this) have leveraged the PSL as a core list (sometimes authority) on handling this stuff. 
We as maintainers know what we do with it, but know we are not all knowing and have a spectrum of use-cases that get impacted by changes we might make to the file due to the processing that is done on it after it is downloaded. Maintaining entries is non-disruptive to the list and the derivative users. Renaming sections might be. Defining these terms in a glossary may be helpful for future integrations, but not as much for where there is 'set and forget' code or processes. I hope this is helpful - and not too "ivory tower" - as background. |
(FYI: I'm OOO until the 4th; I can almost certainly add the note that @annevk requested above sometime today, but if this needs more discussion than that, then I hope y'all feel free to move forward without me. :) ) |
@mikewest I think the main thing I need input on is the
|
To expand on that:
Doing an audit of Chromium for EXCLUDE_PRIVATE_REGISTRIES shows that the features are predominantly UI. It looks like we do expose some of the configurability to Blink, although it doesn't use it. The wildcard issue is enough to call out, since you'll get different results if you call getPublicSuffix(getPublicSuffix('foo.platform.sh')) depending on browser. |
I think we have two options:
I prefer 2 (and I'm pretty sure that's what the current text suggests, though the PSL's algorithm isn't exactly clear in step 7 of https://publicsuffix.org/list/ what we ought to do if there's no additional label). Running with 1 would require callers to understand that The bugs you linked, Ryan, are curious. I would have expected neither |
@mikewest so that means you can never be same site with |
This seems fine to me, at least partially because I'm not comfortable with actually serving anything from a public suffix, and I'm surprised every time I rediscover that we allow it. :) You also can't set |
cc03e7b addresses the concrete suggestions I picked out of the conversation above. WDYT? |
Thanks, instead of defining registrable domain in terms of a registry, we should use a variant of @sleevi's suggestion I think:
And I think we should add another paragraph to the warning, saying that when specifications nevertheless do rely on any of them for comparison purposes, they should carefully consider cross-scheme scenarios. I'm not entirely sure how to phrase that given that hosts don't have schemes, but you'll probably think of something? |
Yeah that looks great, I have a couple minor editorial nits, but from my perspective this is good to go otherwise. I'll give the others copied here until Wednesday before I merge it (at which point it'd be good to raise problems as new issues instead). |
<tr>
<td><code>example.إختبار</code>
<td><code>xn--kgbechtv</code>
<td><code>example.xn--kgbechtv</code>
So one of the things is the PSL doesn't specify whether it returns a U-Label or an A-Label (that's left to the implementation). I'm curious about the documentation here for the A-Label - is this an expectation of the contract?
That is, are you trying to show that either U-Label or A-Label can be returned regardless of U-Label or A-Label input, or are you trying to state that A-Labels should be the consistent return?
Currently we don't rely on this anywhere (assuming it's consistent to be one or the other, is that at least required?), but A-label seems preferable as that'd be consistent with how the platform exposes URLs and origins overall.
I suspect this will only matter if we add an API, but it really depends on whether PSL dependencies keep getting added or not.
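To show what "A-label as the consistent return" would look like for the table's inputs, here is a rough sketch using Python's built-in `punycode` codec. It is a simplification: the URL Standard's host parser relies on UTS #46 processing, which does far more mapping and validation than the lowercasing done here, and `to_a_label` / `to_a_labels` are illustrative names.

```python
def to_a_label(label: str) -> str:
    """Convert a single domain label to its ASCII (A-label) form."""
    label = label.lower()
    if label.isascii():
        return label
    # Non-ASCII labels get the "xn--" prefix followed by their Punycode form.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_a_labels(host_input: str) -> str:
    """Apply to_a_label to every dot-separated label of a host input."""
    return ".".join(to_a_label(label) for label in host_input.split("."))
```

With this sketch, the host input `EXAMPLE.COM` becomes `example.com`, and a U-label input ends up in the same A-label form that a pre-encoded input would already have.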
Thanks all! I filed #396 as follow-up for the remaining issue. |
I wonder if this is something we should expose on URL objects? I think @hillbrad was asking for it a looong time ago, but I don't know if he still has use cases.