Add versioning information to "public_suffix_list.dat" file #1808

Closed
TurtleWilly opened this issue Jul 23, 2023 · 10 comments
Labels
❔❔ question Open question, please look / answer / respond

Comments

@TurtleWilly

It would be nice to have some sort of (automatic) versioning information directly inside the "public_suffix_list.dat" file. Currently it is practically impossible to determine which of several "public_suffix_list.dat" files on disk is the most recent. This could probably also be useful for libpsl to determine which one is the "latest".

With CVS or SVN we could add // $Id$ as the first line of the file and the problem would solve itself (svn may need a propset depending on the configuration). The source control system would then automatically insert the current version and/or date during checkout. (I'm not too familiar with git, so I don't know whether it has a similar feature.)

@eli-schwartz

You can do the same thing in git with $Format:%cs$ where %cs is the formatter code to embed a YYYY-MM-DD style timestamp of the commit date (not the checkout date).

There are no tags so git describe can't be used with any degree of accuracy.

@dnsguru
Member

dnsguru commented Jul 26, 2023

@smarnach is this possible?

@dnsguru dnsguru added the ❔❔ question Open question, please look / answer / respond label Jul 26, 2023
@weppos
Member

weppos commented Aug 1, 2023

Git doesn't ship with an $id$ equivalent feature. Instead, you are encouraged to leverage SHAs generated by Git itself.

To embed external information, like the SHA or any other ID, we would need to pre-process the file before it is committed. This is generally the responsibility of a CI/pipeline that we don't have.

I am not inclined to add such complexity in the file itself when this is within the repo, as it would be redundant since we can leverage git.

Ideally, the tagging should happen in the pipeline that processes the list for distribution at
https://publicsuffix.org/list/public_suffix_list.dat

Although these days I even question whether we still need such a distribution mechanism, or whether we should instead just rely on Git hosting.

For consumers that need/want version tagging the current solution would be to switch towards pulling the list directly from the repo. I've actually been doing it for years in the library I maintain, here's an example:

weppos/publicsuffix-go@a20f9ab

https://github.com/weppos/publicsuffix-go/blob/a20f9abcc222b049ef9b7a28845bac88e0155ae3/publicsuffix/generator/gen.go#L24-L49

@dnsguru
Member

dnsguru commented Aug 1, 2023 via email

@smarnach

smarnach commented Aug 3, 2023

Cloud Storage returns the date the list was last modified in the Last-Modified header, so anyone is free to post-process the file when downloading it via the CDN. It would also be easy to modify the deployment workflow to include the date in the file when uploading the data. From an operational point of view, I don't have any concerns about doing this, so it's up to you to make the call here, @weppos and @dnsguru. I'm happy to make the required changes if you want me to.
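As a rough illustration of the post-processing idea above, here is a minimal Python sketch that converts a Last-Modified header value into a sortable UTC string and prepends it to the downloaded list as a comment. The function names, the "// Last-Modified:" comment label, and the timestamp layout are assumptions made for this example; nothing here is part of the actual deployment workflow.

```python
import urllib.request
from datetime import timezone
from email.utils import parsedate_to_datetime

LIST_URL = "https://publicsuffix.org/list/public_suffix_list.dat"


def last_modified_to_version(header_value: str) -> str:
    """Turn an HTTP Last-Modified value into a sortable UTC timestamp string."""
    moment = parsedate_to_datetime(header_value).astimezone(timezone.utc)
    # Zero-padded fields, most significant first, so plain string order is time order.
    return moment.strftime("%Y-%m-%d_%H-%M-%S_UTC")


def stamp(list_text: str, header_value: str) -> str:
    """Prepend a (hypothetical) comment line carrying the Last-Modified date."""
    return f"// Last-Modified: {last_modified_to_version(header_value)}\n{list_text}"


def download_stamped(url: str = LIST_URL) -> str:
    """Download the list and stamp it, leaving it unchanged if the header is missing."""
    with urllib.request.urlopen(url) as response:
        header_value = response.headers.get("Last-Modified")
        text = response.read().decode("utf-8")
    return stamp(text, header_value) if header_value else text
```

Because the stamp is a "//" comment line, a parser that already skips comments should not be affected by it.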

@eli-schwartz

> Git doesn't ship with an $id$ equivalent feature. Instead, you are encouraged to leverage SHAs generated by Git itself.

I specifically pointed out that it does indeed do precisely this. It's part of the git-archive(1) machinery, for example the thing that github uses to generate https://github.com/publicsuffix/list/archive/refs/heads/master.tar.gz

It doesn't affect git clones, although you could invoke that machinery pretty easily:

git archive HEAD <filename> | bsdtar -x -C path/to/output/directory -f -

@dnsguru
Member

dnsguru commented Aug 14, 2023

Because the gTLD list from ICANN's JSON has a timestamp in it, and that's the most often updated element, I'd assert that "Solution Exists" if one were to track that as the last date. It does not account for deltas that occur between auto-pulls from ICANN, but due to the frequency of those, and their priority of processing ahead of subdomain projects, this works itself out relatively well.

This was referenced Sep 15, 2023
@dnsguru
Member

dnsguru commented Oct 2, 2023

> Cloud Storage returns the date the list was last modified in the Last-Modified header, so anyone is free to post-process the file when downloading it via the CDN. It would also be easy to modify the deployment workflow to include the date in the file when uploading the data. From an operational point of view, I don't have any concerns about doing this, so it's up to you to make the call here, @weppos and @dnsguru. I'm happy to make the required changes if you want me to.

In reviewing #1855 / #1856: to avoid confusion about versions in security reports, which would further drain volunteer resources spent hunting them down, we may want to tie these two things together:

  • Add Date in file
  • Implement Security Policy

I have seen salient arguments for doing both and also for doing neither, but it seems a datestamp would be a prerequisite if we do go ahead with a security policy.

@eli-schwartz

Would you be interested in an implementation of the git-archive side of this on the theory that it causes no harm to have this literal text in the file:

// this is not guaranteed to be updated, but will contain either "$Format" or else a YYYY-MM-DD timestamp
// Date updated: $Format:%cs$

and under some conditions, at least, it would be a benefit since it would actually contain:

// this is not guaranteed to be updated, but will contain either "$Format" or else a YYYY-MM-DD timestamp
// Date updated: 2023-10-02
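A consumer would have to handle both forms of that line, since git only expands $Format:...$ placeholders in git archive output, and only for files marked with the export-subst attribute. The following is a minimal Python sketch of such a reader, assuming the exact "// Date updated:" wording proposed above; the function and constant names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Matches only the expanded form, e.g. "// Date updated: 2023-10-02".
DATE_LINE = re.compile(r"^// Date updated: (\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def embedded_date(list_text: str) -> Optional[date]:
    """Return the embedded commit date, or None if it is absent or still "$Format:%cs$"."""
    match = DATE_LINE.search(list_text)
    if match is None:
        return None
    return date.fromisoformat(match.group(1))
```

Returning None for the unexpanded placeholder lets callers fall back to another signal, such as the file's modification time, when the list came from a plain git clone.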

@simon-friedberger
Contributor

We have updated the deployment pipeline to include version information like this:

// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

// Please pull this list from, and only from https://publicsuffix.org/list/public_suffix_list.dat,
// rather than any other VCS sites. Pulling from any other URL is not guaranteed to be supported.

// VERSION: 2024-10-31_18-14-42_UTC
// COMMIT: 783da2456c94cfd5bcb7f977ae229b8205d58556

// Instructions on pulling and using this list can be found at https://publicsuffix.org/list/.

// ===BEGIN ICANN DOMAINS===

// ac : http://nic.ac/rules.htm
ac
com.ac
edu.ac
gov.ac
net.ac

I think this should make string comparison on the version do the right thing (@danderson wdyt?) and if somebody wants to know the actual commit ID in the repo that is available as well.
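To illustrate the string-comparison point, here is a minimal Python sketch that reads the VERSION and COMMIT lines from several copies of the list and picks the newest one. It assumes the header layout shown above; the names read_info, newest, and ListInfo are hypothetical. Plain string comparison works because every field in the timestamp is zero-padded and ordered from most to least significant.

```python
import re
from typing import Iterable, NamedTuple, Optional

VERSION_RE = re.compile(r"^// VERSION: (\S+)\s*$", re.MULTILINE)
COMMIT_RE = re.compile(r"^// COMMIT: ([0-9a-f]+)\s*$", re.MULTILINE)


class ListInfo(NamedTuple):
    version: str
    commit: str


def read_info(list_text: str) -> Optional[ListInfo]:
    """Extract VERSION and COMMIT from the header, or None if either is missing."""
    version = VERSION_RE.search(list_text)
    commit = COMMIT_RE.search(list_text)
    if version is None or commit is None:
        return None
    return ListInfo(version.group(1), commit.group(1))


def newest(texts: Iterable[str]) -> Optional[ListInfo]:
    """Return the info with the greatest VERSION string, ignoring unversioned copies."""
    infos = [info for info in map(read_info, texts) if info is not None]
    return max(infos, key=lambda info: info.version, default=None)
```

Copies that predate the versioned header simply yield None from read_info and are skipped, so older files on disk do not break the comparison.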

Please let me know if you see issues with this or can think of use cases that this doesn't solve! If there is nothing, we will try to roll this out next week.

@github-project-automation github-project-automation bot moved this from awaiting feedback to Done or Won't in Meta Topics, Questions, Process Nov 7, 2024
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Nov 10, 2024
…ages

- log for processed public suffix list
  - MD5 and SHA-512
  - number of bytes, lines and rules
  - commit date and git hash, cf.
    publicsuffix/list#1808
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Nov 12, 2024
…ages

- log for processed public suffix list
  - MD5 and SHA-512
  - number of bytes, lines and rules
  - commit date and git hash, cf.
    publicsuffix/list#1808
sebastian-nagel added a commit to crawler-commons/crawler-commons that referenced this issue Nov 12, 2024
…ages

- log for processed public suffix list
  - MD5 and SHA-512
  - number of bytes, lines and rules
  - commit date and git hash, cf.
    publicsuffix/list#1808
Projects
Status: Done or Won't
6 participants