SBOM #403

Open
lackhove opened this issue Dec 4, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@lackhove

lackhove commented Dec 4, 2024

While the standalone builds already contain licensing info in the PYTHON.json file, it would be great if they could also contain a full SBOM in a standard format such as SPDX, similar to the binaries on python.org.

@Edward-Knight
Contributor

Second this. I'm currently working on manually writing a CycloneDX SBOM for a specific build of PBS we use downstream, following which I'll be working on some automated tooling. I'll share both as and when I can to hopefully get the ball rolling on upstreaming this 👍

@zanieb
Member

zanieb commented Dec 11, 2024

thank you!

@Edward-Knight
Contributor

I've attached an SBOM I've made for one of the past builds of PBS. It's the first SBOM I've written and I don't have particular expertise in this area, but I believe it is a good starting point.

Notable exclusions

The SBOM is mostly "complete", with a few exceptions I've outlined below. Of course one could always include more information and fill every possible field, but that way lies madness.

  • It does not include "dependency" information (under /dependencies)
    • This just allows you to add relationships to the already declared "components". It should be simple to say that all of the libraries are dependencies of CPython, and that CPython itself is a dependency of PBS. However, I'm not sure whether some of the libraries are subdependencies of other libraries (e.g. some of the X or tk related libs)
  • It does not include license or copyright information (either for the top-level component under /metadata/component/{licenses,copyright} or for dependencies under /components[]/{licenses,copyright})
    • This information is already tracked in this project, so shouldn't be too difficult to include
  • It does not include build information (e.g. compiler name and version, tooling versions, information about the CI system etc)
    • It isn't very clear where this information would be included. I believe the /formulation section would be most appropriate, but also it seems like this is excluded from most SBOMs, instead left to a different document (e.g. an MBOM)
      • This information isn’t included in https://github.com/CycloneDX/bom-examples
      • From what I can tell both the v1 cyclonedx-conan and v2 conan cyclonedx plugin don't include this information
      • Adoptium OpenJDK builds use metadata/tools and metadata/properties for this (incorrectly)
        • E.g. metadata/tools: {"name": "MacOS Compiler", "version": "clang (clang/LLVM from Xcode 15.2)"}
        • E.g. metadata/properties: {"name": "OS version", "value": "Darwin 23.6.0"}
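
For the first exclusion (the /dependencies section), a minimal sketch of the simple case described above might look like the following. It assumes a flat graph in which PBS depends on CPython and CPython depends on every library, and it deliberately does not try to model subdependencies between libraries. The function name and the short refs in the usage line are hypothetical; real refs would be the bom-ref values of the components.

```python
def dependency_graph(root_ref, cpython_ref, library_refs):
    """Build a flat CycloneDX "dependencies" list.

    Assumption: every library is a direct dependency of CPython.
    Library-to-library relationships (e.g. among X or tk libs) are not modelled.
    """
    libraries = sorted(library_refs)
    graph = [
        {"ref": root_ref, "dependsOn": [cpython_ref]},
        {"ref": cpython_ref, "dependsOn": libraries},
    ]
    # Leaf components are declared without a dependsOn list.
    graph.extend({"ref": ref} for ref in libraries)
    return graph


# Hypothetical short refs, for illustration only.
dependencies = dependency_graph("pbs", "cpython", ["zlib", "openssl"])
```

The result would be stored under the top-level "dependencies" key of the SBOM.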

Guide to the format

Since a large blob of JSON can be intimidating at first, I'll quickly go over the general structure. At the top level there is some generic boilerplate to do with the SBOM itself:

{
    "$schema": "https://cyclonedx.org/schema/bom-1.6.schema.json",
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "serialNumber": "urn:uuid:f6a24d2b-f989-426f-a9af-19324d8b6949",
    "version": 1,
    "metadata": {
        "timestamp": "2024-12-16T16:00:00+00:00",
        "component": {...}
    },
    "components": [...]
}
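
As a rough sketch, the same boilerplate could be generated in Python rather than written by hand. The function name new_bom is hypothetical, and the serial number and timestamp are generated fresh on each call, which is an assumption about how automated tooling might behave rather than something decided in this thread.

```python
import json
import uuid
from datetime import datetime, timezone


def new_bom(metadata_component, components):
    """Return a CycloneDX 1.6 skeleton wrapping the given components."""
    return {
        "$schema": "https://cyclonedx.org/schema/bom-1.6.schema.json",
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        # A random UUID per generated document.
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            # UTC timestamp in the same ISO 8601 form as the example above.
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "component": metadata_component,
        },
        "components": list(components),
    }


bom = new_bom(
    {"type": "application", "name": "python-build-standalone", "version": "20231002"},
    [],
)
print(json.dumps(bom, indent=4))
```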

There is a "metadata component" and another list of "components". These all follow the same format. The "metadata component" is the one the SBOM is about, i.e. a particular build of PBS:

{
    "type": "application",
    "mime-type": "application/zstd",
    "bom-ref": "pkg:generic/python-build-standalone@20231002?download_url=https://github.com/indygreg/python-build-standalone/releases/download/20231002/cpython-3.10.13+20231002-x86_64_v2-unknown-linux-gnu-pgo-full.tar.zst&checksum=sha256:f1121cc0fccb1c5e867923f39e3e7d6413720554ec079eac022f5fc69e7ee83a",
    "authors": [
        {
            "name": "Gregory Szorc",
            "email": "[email protected]"
        }
    ],
    "name": "python-build-standalone",
    "version": "20231002",
    "description": "This project produces self-contained, highly-portable Python distributions. These Python distributions contain a fully-usable, full-featured Python installation: most extension modules from the Python standard library are present and their library dependencies are either distributed with the distribution or are statically linked.",
    "hashes": [
        {
            "alg": "SHA-256",
            "content": "f1121cc0fccb1c5e867923f39e3e7d6413720554ec079eac022f5fc69e7ee83a"
        }
    ],
    "purl": "pkg:generic/python-build-standalone@20231002?download_url=https://github.com/indygreg/python-build-standalone/releases/download/20231002/cpython-3.10.13+20231002-x86_64_v2-unknown-linux-gnu-pgo-full.tar.zst&checksum=sha256:f1121cc0fccb1c5e867923f39e3e7d6413720554ec079eac022f5fc69e7ee83a",
    "externalReferences": [
        {
            "url": "https://github.com/indygreg/python-build-standalone",
            "type": "vcs"
        },
        {
            "url": "https://github.com/indygreg/python-build-standalone/releases/tag/20231002",
            "type": "release-notes"
        },
        {
            "url": "https://github.com/indygreg/python-build-standalone/releases/download/20231002/cpython-3.10.13+20231002-x86_64_v2-unknown-linux-gnu-pgo-full.tar.zst",
            "type": "distribution",
            "hashes": [
                {
                    "alg": "SHA-256",
                    "content": "f1121cc0fccb1c5e867923f39e3e7d6413720554ec079eac022f5fc69e7ee83a"
                }
            ]
        }
    ]
}

This SBOM is for a specific release tarball and the mime-type and hashes reflect this. There are some self-explanatory fields like name, version, description, and author (which I imagine will change to manufacturer and point to "Astral Software Inc." for future releases). The bom-ref, purl, and "distribution" type external reference all have duplicate information pointing to the exact build (as python-build-standalone@20231002 on its own does not uniquely define a release).
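
Since the bom-ref, purl, and "distribution" reference repeat the same two facts (download URL and SHA-256), one way to keep them in sync is to derive all of them from a single pair of inputs. The helper names below are hypothetical. The purl mirrors the unencoded form used in this SBOM; a strict reading of the purl spec may require percent-encoding the qualifier values, which this sketch does not do.

```python
def generic_purl(name, version, download_url, sha256):
    """Format a pkg:generic purl in the same shape as the SBOM above."""
    return (
        f"pkg:generic/{name}@{version}"
        f"?download_url={download_url}&checksum=sha256:{sha256}"
    )


def identify(component, download_url, sha256):
    """Return a copy of component with bom-ref, purl, hashes and a distribution reference."""
    purl = generic_purl(component["name"], component["version"], download_url, sha256)
    return {
        **component,
        "bom-ref": purl,
        "purl": purl,
        "hashes": [{"alg": "SHA-256", "content": sha256}],
        "externalReferences": [
            {
                "url": download_url,
                "type": "distribution",
                "hashes": [{"alg": "SHA-256", "content": sha256}],
            }
        ],
    }
```

Any other external references (e.g. "vcs" or "release-notes") would still need to be appended separately.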

The array of components is then a list of the dependencies. I wrote the CPython one by hand, and have included a short bit of Python code that I used to generate the rest from the downloads.py list. As an example, here is the one for CPython (which has more detail than the other dependencies):

{
    "type": "application",
    "mime-type": "application/x-xz",
    "bom-ref": "pkg:generic/cpython@3.10.13?download_url=https://www.python.org/ftp/python/3.10.13/Python-3.10.13.tar.xz&checksum=sha256:5c88848668640d3e152b35b4536ef1c23b2ca4bd2c957ef1ecbb053f571dd3f6",
    "manufacturer": {
        "name": "Python Software Foundation",
        "address": {
            "country": "US",
            "region": "Oregon",
            "locality": "Beaverton",
            "postalCode": "OR 97008",
            "streetAddress": "9450 SW Gemini Dr. ECM# 90772"
        },
        "url": [
            "https://www.python.org"
        ]
    },
    "name": "CPython",
    "version": "3.10.13",
    "hashes": [
        {
            "alg": "SHA-256",
            "content": "5c88848668640d3e152b35b4536ef1c23b2ca4bd2c957ef1ecbb053f571dd3f6"
        }
    ],
    "purl": "pkg:generic/cpython@3.10.13?download_url=https://www.python.org/ftp/python/3.10.13/Python-3.10.13.tar.xz&checksum=sha256:5c88848668640d3e152b35b4536ef1c23b2ca4bd2c957ef1ecbb053f571dd3f6",
    "externalReferences": [
        {
            "url": "https://www.python.org/ftp/python/3.10.13/Python-3.10.13.tar.xz",
            "type": "distribution",
            "hashes": [
                {
                    "alg": "SHA-256",
                    "content": "5c88848668640d3e152b35b4536ef1c23b2ca4bd2c957ef1ecbb053f571dd3f6"
                }
            ]
        }
    ]
}

As noted above, all the "components" have a similar structure. The only difference here is that I've set the manufacturer field instead of the authors field.
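
To give a flavour of how the remaining components could be generated, here is a minimal sketch. It is not the attached script. It assumes a DOWNLOADS mapping whose entries carry "url", "sha256", and "version" keys; the single entry below is illustrative, its digest is a placeholder rather than a real checksum, and the "library" component type is an assumption.

```python
# Illustrative stand-in for the downloads list; the digest is a placeholder.
DOWNLOADS = {
    "openssl": {
        "url": "https://www.openssl.org/source/openssl-3.0.11.tar.gz",
        "sha256": "0" * 64,  # placeholder, NOT the real digest
        "version": "3.0.11",
    },
}


def component_from_download(name, entry):
    """Turn one downloads entry into a CycloneDX component dict."""
    purl = (
        f"pkg:generic/{name}@{entry['version']}"
        f"?download_url={entry['url']}&checksum=sha256:{entry['sha256']}"
    )
    return {
        "type": "library",  # assumption; the CPython entry above uses "application"
        "bom-ref": purl,
        "name": name,
        "version": entry["version"],
        "hashes": [{"alg": "SHA-256", "content": entry["sha256"]}],
        "purl": purl,
        "externalReferences": [
            {
                "url": entry["url"],
                "type": "distribution",
                "hashes": [{"alg": "SHA-256", "content": entry["sha256"]}],
            }
        ],
    }


# Entries without a version are skipped, since a purl cannot be built for them.
components = [
    component_from_download(name, entry)
    for name, entry in sorted(DOWNLOADS.items())
    if "version" in entry
]
```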

Attachments and useful links

@Edward-Knight
Contributor

I've been looking at software that consumes SBOMs (in this case Dependency-Track), and it seems like the generic purls (package URLs) aren't used for matching against vulnerability databases. My testing shows that CPE (Common Platform Enumeration) IDs do work though: using a CPE of cpe:2.3:a:openssl:openssl:3.0.11:*:*:*:*:*:*:* for openssl correctly links to some vulnerabilities. We can't construct these automatically as we can with generic purls, but adding them to the metadata for a component and using them where available is probably a good idea.
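
A minimal sketch of the "use them where available" idea: keep a hand-maintained table of vendor/product pairs and attach a CPE only for components that appear in it. The table and function names are hypothetical, and the only entry is the openssl one tested above; other pairs would need to be looked up, not guessed.

```python
# Hand-maintained mapping of component name -> (vendor, product).
# Only openssl is listed because it is the only pair confirmed in this thread.
CPE_PRODUCTS = {
    "openssl": ("openssl", "openssl"),
}


def cpe23(vendor, product, version):
    """Format a CPE 2.3 string for an application, wildcarding the remaining fields."""
    return f"cpe:2.3:a:{vendor}:{product}:{version}:*:*:*:*:*:*:*"


def with_cpe(component):
    """Return a copy of component with a "cpe" field if a mapping is known."""
    known = CPE_PRODUCTS.get(component["name"])
    if known is None:
        return component  # no CPE available; leave the component untouched
    vendor, product = known
    return {**component, "cpe": cpe23(vendor, product, component["version"])}


print(with_cpe({"name": "openssl", "version": "3.0.11"})["cpe"])
```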

charliermarsh added the enhancement label Dec 18, 2024
@Edward-Knight
Contributor

Charlie et al., a few implementation questions:

  1. Would you prefer this functionality be implemented in the Python code or Rust code? (I personally haven't written Rust before)
  2. Where would you like to see code implementing this added (location in repo and time of execution in CI)?

From the looks of it we could maybe do it:

  1. At the end of cpython-unix/build.py::build_cpython() after PYTHON.json is created
  2. After compress_python_archive() is called in cpython-unix/build-main.py::main()
  3. As part of pythonbuild validate-distribution (or a new CLI action for pythonbuild)
  4. Later at release time (unsure if all the information we need is saved)

It looks like we need information available to the Python script at build time (the versions of dependencies). For us to make the SBOM specifically about the tarball we'd need to at least update it with the tarball hash after it has been created.

I presume this would form an extra build artefact alongside each tarball
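
To illustrate the "update it with the tarball hash" step, here is a rough sketch that takes an SBOM built earlier (without the archive hash) and stamps it once the tarball exists. The function names and the ".cdx.json" suffix for the output file are assumptions, not something the project has settled on.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stamp_sbom(bom, tarball):
    """Return a copy of bom whose metadata component carries the tarball's SHA-256."""
    component = dict(bom["metadata"]["component"])
    component["hashes"] = [{"alg": "SHA-256", "content": file_sha256(tarball)}]
    return {**bom, "metadata": {**bom["metadata"], "component": component}}


def write_sbom(bom, tarball):
    """Write the stamped SBOM next to the tarball and return its path."""
    out = Path(f"{tarball}.cdx.json")  # hypothetical naming scheme
    out.write_text(json.dumps(stamp_sbom(bom, tarball), indent=4) + "\n")
    return out
```

The bom-ref and purl also embed the checksum, so they would need the same treatment in a fuller implementation.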

@Edward-Knight
Contributor

@zanieb I've got some time coming up to work on this, would this be a contribution you would accept or should I implement this outside the project?

@zanieb
Member

zanieb commented Jan 30, 2025

I'm interested! Sorry for the lack of reply — got lost / I don't have great answers.

  1. Whichever is fine. I'm happy to clean up some Rust or review Python. I think this depends more on where it fits. Which brings us to (2)

After the PYTHON.json is created seems reasonable. We need the tarball from compress_python_archive though, right? Probably makes less sense to add this during validation. A new action for pythonbuild could be sensible too. Do we need build-time information?

Note we create derived artifacts at release time. This was relevant, e.g., for #343, and I presume it would be relevant here if you need the archive hash? Are you going to create an SBOM for every release artifact? I worry about doubling the number of release artifacts. I'm not sure when GitHub will start enforcing a limit, but we're already at more than 1000.

Also vaguely relevant, we were considering #284 to improve consumption of the download metadata outside of Python.

@Edward-Knight
Contributor

Thanks!

I'll keep in mind the "derived artifacts" 👍

I was planning on making one SBOM per release artifact tarball... which would increase the number of artifacts by another 50%.... Would you be open to getting rid of the .sha256 files, or perhaps having one big checksum file? I can't think of a better place to put SBOMs than adding them to the release artifacts
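
For reference, the "one big checksum file" option could be as small as the sketch below. It assumes each .sha256 sidecar holds the hex digest as its first whitespace-separated token, and it writes lines in the "digest, two spaces, filename" layout that sha256sum -c accepts. The function name and the SHA256SUMS output name are hypothetical.

```python
from pathlib import Path


def collapse_checksums(directory, output_name="SHA256SUMS"):
    """Merge every *.sha256 sidecar in directory into a single checksum file."""
    directory = Path(directory)
    lines = []
    for sidecar in sorted(directory.glob("*.sha256")):
        # Assumption: the digest is the first token in the sidecar file.
        digest = sidecar.read_text().split()[0]
        artifact = sidecar.name.removesuffix(".sha256")  # Python 3.9+
        lines.append(f"{digest}  {artifact}\n")
    output = directory / output_name
    output.write_text("".join(lines))
    return output
```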

@Edward-Knight
Contributor

RE: number of release assets, GitHub do specifically say there is "no limit" to the total size (although they don't talk about the number of assets directly). Of course I'm sure there is some limit, but we could "call their bluff", so to speak. From their documentation:

Storage and bandwidth quotas
Each file included in a release must be under 2 GiB. There is no limit on the total size of a release, nor bandwidth usage.

@zanieb
Member

zanieb commented Feb 7, 2025

I guess let's go for it and we can deal with it if there are problems.

We could collapse the checksums into a single file if we have to, yeah. They seem slightly easier to use as separate files though.
