Define hosts' public suffix and registrable domain. #391

Merged 7 commits on Jun 7, 2018
91 changes: 91 additions & 0 deletions url.bs
@@ -272,6 +272,97 @@ for further processing.
U+0020 SPACE, U+0023 (#), U+0025 (%), U+002F (/), U+003A (:), U+003F (?), U+0040 (@), U+005B ([),
U+005C (\), or U+005D (]).

<p>A <a for=/>host</a>'s <dfn for=host export id=concept-host-public-suffix>public suffix</dfn> is
the portion of a <a for=/>host</a> which is controlled by a registrar, public or otherwise. To

Comment:

pedantry:

which is controlled by a registrar

That isn't necessarily correct. It implies control over the DNS, which isn't always passed on (e.g. in the case of hosting or DNS providers). And in an example like appspot.com, the domain isn't controlled by a registrar.

That was the intent of the PSL originally (reflecting ccTLD registration policies), but that predates the advent of the PRIVATE section, where the descent into hell began :)

publicsuffix.org doesn't list 'what' a public suffix is, other than the result of running the algorithm. Logically, it represents the separation of domain boundaries indicating a change in administrative or technical control or security policy (which is why IETF called it DBOUND), but that's a bit of a mouthful... :/

Comment (Member Author):

an example like appspot.com, that domain isn't controlled by a registrar.

Aren't we calling Google the registrar in this example? Or GitHub the registrar of *.github.io?

Is there a term I could use that would be more accurate (and less than a sentence long :) )?

Comment:

@mikewest Yeah, except neither GitHub nor Google is a registrar or acting as one. That was why it was sort of weird. They don't necessarily allow registration either (and may instead assign names based on project IDs, as Amazon does).

For a given domain input, the PSL splits the labels on the first administrative boundary, with the registered domain being the set of labels that are operated according to a different set of domain policies than the public suffix (which itself may contain more domain splits).

Definitely a mouthful, and this is part of why we dance around it on publicsuffix.org, because we haven't found a pithy way of describing left/right except in their relationship to each other. :/

I was hoping your ability to condense these concepts would be better than mine.

Comment:

This is definitely one of the longest-outstanding proper definitions that we eventually need to clarify on the PSL project as well, and it is connected to publicsuffix/publicsuffix.org#12

If we consider only the ICANN section (mistakenly named like that, as it should be IANA), then the definition is probably correct. If we also consider the PRIVATE section, and then the list as a whole, we must come up with a better definition of what effectively distinguishes a suffix from a host.

The "controlled" part is the key. In both cases, the common denominator is that an entity has control of a portion (a set of labels) of a host and determines specific rules on how that portion of the name is operated. Everything beyond (to the left of) that portion is basically not under the direct control of that entity, and therefore each subzone should be considered independent from the others.

In the case of a registry, the controlled portion is for sure the TLD and perhaps extra lower levels (generally second, sometimes third). In that case, the "registerable" definition potentially applies, as there is a direct assumption that the registrar makes those domains available for registration. Again, this is potentially an incorrect assumption, as domains that belong to that zone may not be open for registration, but assigned explicitly.

While "registerable" may fit the registrar use case, it definitely doesn't fit the PRIVATE use case, because the suffixes in this section may be there for a variety of reasons.

However, regardless of the use case, the common pattern is that the entity that controls the suffix declares that every subzone beyond that suffix should be considered an independent zone, potentially managed by a different user.

obtain <var>host</var>'s <a for=host>public suffix</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.

Comment (Member):

A host is already parsed (otherwise it wouldn't be a host). You also need to introduce the host variable in the paragraph before the algorithm.

Comment (Member Author):

A host is already parsed (otherwise it wouldn't be a host).

Hrm. Yeah, I guess it's reasonable to assume that we'll only be using this algorithm on already-parsed hosts.

You also need to introduce the host variable in the paragraph before the algorithm.

Line 277 introduces <var>host</var>. Would you prefer to be more explicit, like "To obtain the <a for=host>public suffix</a> for a <a for=/>host</a> <var>A</var>:"?

Comment (Member):

Ah, I missed that on line 277. That's fine, but your alternative here works too.


<li><p>If <var>parsed</var> is not a <a>domain</a>, return the empty string.

Comment (Member):

This kind of implies that the public suffix is also a string. Perhaps it's cleaner to return null?

Comment (Member Author):

Is the public suffix a host? I guess it could be. I was assuming it was a string, but treating it as a host seems reasonable.


<li><p>Return the <a for=host>public suffix</a> obtained by executing the
<a href="https://publicsuffix.org/list/">algorithm</a> defined by the Public Suffix List. [[!PSL]].
</ol>
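
The steps above can be sketched roughly as follows. This is a minimal illustration, not the Public Suffix List algorithm itself: `TOY_SUFFIXES`, `is_ip_address`, and `public_suffix` are hypothetical names, the rule set is a three-entry stand-in for the real list, and wildcard and exception rules are ignored. It returns the empty string for non-domains, as this revision of the diff does.

```python
# Hypothetical stand-in for the Public Suffix List: plain rules only,
# with no wildcard ("*.ck") or exception ("!www.ck") rules.
TOY_SUFFIXES = {"com", "io", "github.io"}


def is_ip_address(host):
    """Crude check for this sketch only: bracketed IPv6 or dotted digits."""
    return host.startswith("[") or host.replace(".", "").isdigit()


def public_suffix(host):
    """Return the toy public suffix of an already-parsed host string."""
    if host == "" or is_ip_address(host):
        return ""  # not a domain; this revision of the diff returns ""
    labels = host.split(".")
    # Try the longest candidate first so "github.io" wins over "io".
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in TOY_SUFFIXES:
            return candidate
    # The PSL's default rule is "*": an unlisted last label is the suffix.
    return labels[-1]
```

With this toy rule set, `public_suffix("whatwg.github.io")` yields `"github.io"` rather than `"io"`, because the longest matching rule prevails.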

<p>A <a for=/>host</a>'s <dfn for=host export id=concept-host-registrable-domain>registrable
domain</dfn> is a formally valid domain name that could be registered at a registry. To obtain
<var>host</var>'s <a for=host>registrable domain</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.

Comment (Member):

Same comment as above.


<li><p>If <var>parsed</var> is not a <a>domain</a>, return the empty string.

<li><p>If <var>parsed</var>'s <a for=host>public suffix</a> is <var>host</var>, return the empty
string.

<li><p>Return the <a for=host>registrable domain</a> obtained by executing the
<a href="https://publicsuffix.org/list/">algorithm</a> defined by the Public Suffix List. [[!PSL]].
</ol>
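
A rough sketch of the registrable-domain steps, under the same caveats: the suffix set and the function names `toy_public_suffix` and `registrable_domain` are hypothetical, only plain rules are handled, and the input is assumed to be an already-parsed domain (so the non-domain check is omitted). The registrable domain is taken to be the public suffix plus one more label to its left.

```python
# Hypothetical stand-in for the Public Suffix List (plain rules only).
TOY_SUFFIXES = {"com", "io", "github.io"}


def toy_public_suffix(host):
    """Longest listed suffix of host, else the last label (default rule)."""
    labels = host.split(".")
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in TOY_SUFFIXES:
            return candidate
    return labels[-1]


def registrable_domain(host):
    """Public suffix plus one label, or "" when host is itself a suffix."""
    suffix = toy_public_suffix(host)
    if host == suffix:
        return ""  # nothing to the left of the suffix to register
    # Drop ".<suffix>" from the end and keep only the nearest label.
    prefix = host[: -len(suffix) - 1]
    return prefix.split(".")[-1] + "." + suffix
```

For instance, `registrable_domain("sub.www.example.com")` gives `"example.com"`, while `registrable_domain("github.io")` gives the empty string.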

<div class=example id=example-host-psl>
<table>
<tr>
<th>Host
<th>Public Suffix
<th>Registrable Domain
<tr>
<td><code>com</code>
<td><code>com</code>
<td>

Comment (Member, @annevk, May 25, 2018):

We should change this to <td>Null I suppose. (Though we could maybe add a paragraph that says that null values are omitted. Not sure what's nicer.)

Comment (Member):

I would suggest <td><i>null</i>

Comment (Member):

Why? We don't use that convention anywhere.

Comment (Member):

Example tables like this are special. We omit the quotes, substitute strings for structs, and use other conventions meant for visual clarity and not consistency. I certainly don't think we should capitalize "null" here, and I think italicizing it so that it's clear it's not just a registrable domain named null is helpful.

Shrug. Just a thought.

<tr>
<td><code>example.com</code>
<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>www.example.com</code>
<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>sub.www.example.com</code>
<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>EXAMPLE.COM</code>

Comment (Member):

This is not a host, but input to the host parser.

Comment (Member Author):

I think it's helpful to point out that no matter how folks spell the URL, it's going to be normalized. Perhaps shifting this table to include a URL rather than a host would make that point, especially for the punycode bits?

Comment (Member):

I think it's fine to just list hosts, but we should label it "host input" or some such, to not confuse it with host as a concept, which is already parsed and normalized.

<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>github.io</code>
<td><code>github.io</code>
<td>
<tr>
<td><code>whatwg.github.io</code>
<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>

Comment:

Isn't this row duplicated? The previous one looks the same.

<td><code>whatwg.github.io</code>
<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>
<td><code>إختبار</code>

Comment (Member):

Same as above. And also applies below.

<td><code>xn--kgbechtv</code>
<td>
<tr>
<td><code>example.إختبار</code>
<td><code>xn--kgbechtv</code>
<td><code>example.xn--kgbechtv</code>

Comment (@sleevi, Jun 4, 2018):

So one of the things is that the PSL doesn't specify whether it returns a U-Label or an A-Label (that's left to the implementation). I'm curious about the documentation here for the A-Label: is this an expectation of the contract?

That is, are you trying to show that either U-Label or A-Label can be returned regardless of U-Label or A-Label input, or are you trying to state that A-Labels should be the consistent return?

Comment (Member):

Currently we don't rely on this anywhere (assuming it's consistent to be one or the other, is that at least required?), but A-label seems preferable as that'd be consistent with how the platform exposes URLs and origins overall.

I suspect this will only matter if we add an API, but it really depends on whether PSL dependencies keep getting added or not.

<tr>
<td><code>sub.example.إختبار</code>
<td><code>xn--kgbechtv</code>
<td><code>example.xn--kgbechtv</code>
</table>
</div>
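
The host-input rows discussed in the threads above (`EXAMPLE.COM` and the Arabic-script labels) can be illustrated with a small normalization sketch. It relies on an assumption worth stating loudly: Python's built-in `idna` codec implements IDNA 2003 (RFC 3490), which is not the UTS #46 processing the URL Standard's host parser uses, so this only approximates the parser for simple inputs. `normalize_host_input` is a hypothetical helper name. Note that the A-label form carries the double-hyphen `xn--` ACE prefix.

```python
def normalize_host_input(host_input):
    """Approximate a parsed host: A-labels, lowercased ASCII.

    Assumption: Python's "idna" codec (IDNA 2003) stands in for the
    UTS #46 "domain to ASCII" step; the two differ on some inputs.
    """
    ascii_bytes = host_input.encode("idna")  # U-labels become A-labels
    return ascii_bytes.decode("ascii").lower()


# The Arabic test label from the table, written with escapes so the
# code points are unambiguous: U+0625 U+062E U+062A U+0628 U+0627 U+0631.
ARABIC_TEST_LABEL = "\u0625\u062e\u062a\u0628\u0627\u0631"

print(normalize_host_input("EXAMPLE.COM"))
print(normalize_host_input("example." + ARABIC_TEST_LABEL))
```

Returning A-labels consistently, as suggested in the thread, means the same string comes out whether the input was spelled with a U-label or an A-label.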

<p>Two <a for=/>hosts</a>, <var>A</var> and <var>B</var> are said to be
<dfn for=host export id=concept-host-same-site>same-site</dfn> with each other if either of the

Comment (Member):

We have it as "same origin". Should this be "same site"?

Comment (Member Author):

Meh. I think I'd have spelled it "same-origin" if you hadn't already spelled it "same origin". :)

I'm happy to follow suit with "same site"; I'm not dogmatic about hyphenation.

following statements are true:

<ul class=brief>
<li><p><var>A</var> is identical to <var>B</var>

Comment (Member):

This should use concept-host-equals.

<li><p><var>A</var>'s <a for=host>registrable domain</a> is not the empty string, and is identical
to <var>B</var>'s <a for=host>registrable domain</a>.
</ul>
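
The two same-site conditions above can be sketched as a small predicate. As a hedged illustration only: `TOY_SUFFIXES`, `toy_registrable_domain`, and `same_site` are hypothetical names, the suffix set is a tiny stand-in for the Public Suffix List, hosts are plain already-parsed domain strings, and "identical" is modelled as string equality.

```python
# Hypothetical stand-in for the Public Suffix List (plain rules only).
TOY_SUFFIXES = {"com", "io", "github.io"}


def toy_registrable_domain(host):
    """Public suffix plus one label, or "" if host is itself a suffix."""
    labels = host.split(".")
    for start in range(len(labels)):
        if ".".join(labels[start:]) in TOY_SUFFIXES:
            break
    else:
        start = len(labels) - 1  # default rule: last label is the suffix
    if start == 0:
        return ""
    return ".".join(labels[start - 1:])


def same_site(a, b):
    """True if the hosts are identical or share a non-empty registrable domain."""
    if a == b:
        return True
    reg_a = toy_registrable_domain(a)
    return reg_a != "" and reg_a == toy_registrable_domain(b)
```

Under these assumptions, `www.example.com` and `sub.example.com` are same-site, whereas `a.github.io` and `b.github.io` are not, because `github.io` is treated as a public suffix.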

<h3 id=idna>IDNA</h3>
