Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pip] Don't record all dependencies as binary packages in the SBOM when allow_binary: true is used #813

Open
eskultety opened this issue Jan 31, 2025 · 13 comments
Labels
bug Something isn't working pip Pull requests/issues related to our pip handling module

Comments

@eskultety
Copy link
Member

Whenever a user provides the allow_binary input JSON option, all dependencies are then recorded in the SBOM as binary packages even though many/most of the dependencies are actually distributed as tarballs as well. In fact, for dependencies that distribute both wheels and sdists we download both, but prefer the former over the latter when it comes to SBOM information recording. That may cause unnecessary friction for users with a ton of Python dependencies where they only need a handful of wheels (AI/ML related stuff), leading to potential problems with further policy checks (i.e. usage of binary packages vs building from sources) based on the SBOM we provided.

When allow_binary is provided, our default behaviour should be to only record packages as binary ones when the resource isn't provided as an sdist (we should have that information at hand already). By doing this we can also clean up the prefetch cache significantly if we don't propagate a ton of wheels for packages that distribute as sdists as well.

Conversely, if a consumer wishes to build out of wheels only and they are okay with the way how these are recorded in our SBOM then we should introduce a new option enforce_binary that would skip ignore all sdists unless a wheel isn't available in which case sdist should be fetched.

@eskultety eskultety added bug Something isn't working pip Pull requests/issues related to our pip handling module labels Jan 31, 2025
@chmeliik
Copy link
Contributor

chmeliik commented Feb 3, 2025

Unless explicitly instructed otherwise, pip will prefer wheels over sdists during installation. If we have downloaded both the wheel(s) and a sdist for a package, reporting it as binary will be the right thing to do in most cases.

users with a ton of Python dependencies where they only need a handful of wheels

What would make more sense to me would be to let allow_binary be a list of package names to allow, e.g. {"type": "pip", "allow_binary": ["cryptography"]}. Which should make cachi2 download wheels for and only for cryptography.

@eskultety
Copy link
Member Author

If we have downloaded both the wheel(s) and a sdist for a package, reporting it as binary will be the right thing to do in most cases.

Except we wouldn't download both as mentioned in the description (admittedly, not the best wording):

By doing this we can also clean up the prefetch cache significantly if we don't propagate a ton of wheels for packages that distribute as sdists as well.

...

What would make more sense to me would be to let allow_binary be a list of package names to allow, e.g. {"type": "pip", "allow_binary": ["cryptography"]}. Which should make cachi2 download wheels for and only for cryptography.

This isn't backwards compatible though, because the field is a boolean. That's why I proposed to change the default internal implementation such that we would only download wheels for packages that don't distribute otherwise, hence recording the majority of packages as sdists which is a completely transparent change to the end user and still backwards compatible with our current JSON interface.

@slimreaper35
Copy link
Member

Is this issue related to #730 ?

@eskultety
Copy link
Member Author

Is this issue related to #730 ?

No, not really, unlike #730 this issue isn't trying to report more data in the SBOM, quite the contrary, it's trying to reduce the dependency payload that is propagated further to the build and change how that is recorded in the SBOM along the way.

@chmeliik
Copy link
Contributor

chmeliik commented Feb 3, 2025

This isn't backwards compatible though, because the field is a boolean.

Only if you drop support for booleans. We can support both. Code-wise, it could be a pydantic hook in the input model that converts False -> [] and True -> [":all:"] (taking inspiration from pip's use of :all: and :none: for similar CLI options: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-binary)

change the default internal implementation such that we would only download wheels for packages that don't distribute otherwise

This sounds like a bigger breaking change. Builds that previously worked with "allow_binary": true would stop working, because we would stop providing most of the wheels. Which would almost certainly make the user's requirements-build.txt file incomplete, or introduce the need for this file where previously it wouldn't be needed. Not to mention other challenges with compiling C-based dependencies (or other languages) that don't apply if you're using wheels.

@eskultety
Copy link
Member Author

eskultety commented Feb 4, 2025

Fair enough, that's a good argument.

Only if you drop support for booleans. We can support both. Code-wise, it could be a pydantic hook in the input model that converts False -> [] and True -> [":all:"]

I'd refrain from abusing pydantic validators to change field types on the go, but yeah, the idea behind the proposal is valid.

@a-ovchinnikov
Copy link
Contributor

This could be addressed by adding a new argument, something like allowed_binaries which would contain a list of allowed binaries. This argument should be considered only if allow_binary were set to True and it must default to an equivalent of :all:. This argument must be completely ignored if allow_binary is set to False. This way we would preserve current behavior and would not affect any of existing users of the feature while also providing a mechanism for those who wish to be more precise.

@chmeliik
Copy link
Contributor

This could be addressed by adding a new argument, something like allowed_binaries which would contain a list of allowed binaries. This argument should be considered only if allow_binary were set to True and it must default to an equivalent of :all:. This argument must be completely ignored if allow_binary is set to False. This way we would preserve current behavior and would not affect any of existing users of the feature while also providing a mechanism for those who wish to be more precise.

Separating the options would allow for nonsensical inputs:

  • allow_binary=false, allowed_binaries=[<non-empty>]
  • allow_binary=true, allowed_binaries=[]

What's worse, the first of these nonsensical inputs would be easy to encounter. As a user, the obvious way to tell hermeto to allow wheels for cryptography would be {"type": "pip", "allowed_binaries": ["cryptography"]}. I wouldn't expect to have to set a second option to make this work.

@eskultety
Copy link
Member Author

This could be addressed by adding a new argument, something like allowed_binaries which would contain a list of allowed binaries. This argument should be considered only if allow_binary were set to True and it must default to an equivalent of :all:. This argument must be completely ignored if allow_binary is set to False. This way we would preserve current behavior and would not affect any of existing users of the feature while also providing a mechanism for those who wish to be more precise.

Separating the options would allow for nonsensical inputs:

* `allow_binary=false, allowed_binaries=[<non-empty>]`

* `allow_binary=true, allowed_binaries=[]`

What's worse, the first of these nonsensical inputs would be easy to encounter. As a user, the obvious way to tell hermeto to allow wheels for cryptography would be {"type": "pip", "allowed_binaries": ["cryptography"]}. I wouldn't expect to have to set a second option to make this work.

When it comes to input options this is IMO a matter of clear documentation, it is common to have supplementary CLI options, so if this is properly documented how these options work together, I have no problem with this approach because it's cleaner and less work, less maintenance burden than the counterproposal IMO.

@chmeliik
Copy link
Contributor

When it comes to input options this is IMO a matter of clear documentation, it is common to have supplementary CLI options, so if this is properly documented how these options work together, I have no problem with this approach because it's cleaner and less work, less maintenance burden than the counterproposal IMO.

Yeah, with good docs it should be fine

@a-ovchinnikov
Copy link
Contributor

  • allow_binary=false, allowed_binaries=[<non-empty>]

Is addressed by the following rule:

This argument must be completely ignored if allow_binary is set to False

As for the second example it is a degenerate case, but exactly what a user wants (apparently) -- someone went out of their way to add an argument and then block everything with it.

@chmeliik
Copy link
Contributor

Is addressed by the following rule:

This argument must be completely ignored if allow_binary is set to False

Yeah, I wasn't worried about what the implementation would do but about the user experience. Maybe input validation could help here, to give users earlier feedback that they forgot to set the second option

@a-ovchinnikov
Copy link
Contributor

the user experience

Yes, these situations do warrant a warning as possible anomalies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pip Pull requests/issues related to our pip handling module
Development

No branches or pull requests

4 participants