Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear code in check around supporting -p and -l flags #119

Open
DamianBarabonkovQC opened this issue Jul 3, 2023 · 24 comments
Open

Unclear code in check around supporting -p and -l flags #119

DamianBarabonkovQC opened this issue Jul 3, 2023 · 24 comments

Comments

@DamianBarabonkovQC
Copy link

Hi there. I recently forked your project. I all looks really good. I am going through your source-code to see if it is something that will be actually useful for my use-case. I found these lines: https://github.com/ubersan/pylic/blob/main/pylic/cli/commands/check.py#L89-L98 which deal with handling the -p and -l arguments. This block is very confusing and not intuitive, with the if not all and then 3 different if clauses with various conditions.

Would you mind explaining the rationale behind this block? It would be greatly appreciated! I can add a comment in the upstream code once you've explained it to me.

@ubersan
Copy link
Owner

ubersan commented Jul 3, 2023

Hi @DamianBarabonkovQC

Just quickly glanced at it and indeed, this code block is very hard to understand! Best to refactor or at least comment after we uncovered its use again😄

I will find some time today to check if/what these lines do and then come back to you.

Are you planning to contribute/extend or do you have different use cases in mind that are currently not supported?

cheers!

@DamianBarabonkovQC
Copy link
Author

Thanks! I am currently modifying the project to pull license info from a conda environment directory rather than pip distributions.

@ubersan
Copy link
Owner

ubersan commented Jul 3, 2023

okay, this is something that exists locally only? dont know muxh about conda, habe not used it much and not for long time. Is thisnsomething that's required often?

@DamianBarabonkovQC
Copy link
Author

There are two main ways to install a Python package, from PyPI and from a conda repository. Every package installed by conda has a metadata file which includes all of the good stuff, including license -- yes, locally.

Conda is relatively popular. My company uses it over venv since you can install non-Python libraries as well. It's a generic package manager with a strong leaning towards Python.

@ubersan
Copy link
Owner

ubersan commented Jul 3, 2023

I see, thanks for the explanation! Would be nice then if we could already here also incorporate the license info in the metadata file. I'd see this as a very valuable extension to the current (very pypi-centric) feature set! Would you consider contributing this?

@DamianBarabonkovQC
Copy link
Author

Possibly, however first we'd need to open-source the code that I am writing, which will take time. Also, for the internal solution, I removed all PyPI support since we do not need it. So contributing would require having a plan on how to support both.

@ubersan
Copy link
Owner

ubersan commented Jul 5, 2023

yes we'll need to see the requirements and how this could be integrated. re the code, sorry it was very busy the last few days, I hope to come back to you today

@DamianBarabonkovQC
Copy link
Author

No rush!

@DamianBarabonkovQC
Copy link
Author

I believe that the line of code in question https://github.com/ubersan/pylic/blob/main/pylic/cli/commands/check.py#L89-L98 should be:

if not any(
    [
        unnecessary_ignore_packages,
        bad_unsafe_packages,
        missing_unsafe_packages,
        unsafe_licenses.found,
    ]
):

Notice the any instead of an all. Essentially, as I try to understand it, this code block should only be entered, where the CLI option flags might override a return 1 with a return 0 if something more serious such as bad_unsafe_packages or missing_unsafe_packages does not exist. If any one of those exist, we should return 1 regardless.

The described logic is very much an any operation instead of an all operation.

@ubersan
Copy link
Owner

ubersan commented Jul 8, 2023

sooo, finally found an hour to dig into this. First of all, the code is hugely complicated and of course there are bugs. I found a few examples where the return code is wrong. Yes you're wrong, this code should basically just deal with the scenario where there were extra/unnecessary packages or licenses listed, which would normally lead to an error (return code 1), but instead we'll allow it in case the corresponding option is set.

The snippet can be rewritten into something like this:

extra_packages_declared_and_allowed = (
    len(unnecessary_unsafe_packages) > 0 or len(unnecessary_ignore_packages) > 0
) and "allow_extra_packages" in options
if extra_packages_declared_and_allowed and not any(
    [
        bad_unsafe_packages,
        missing_unsafe_packages,
        unsafe_licenses.found,
        unnecessary_safe_licenses,
    ]
):
    console_writer.write_all_licenses_ok()
    return 0

extra_licenses_declared_and_allowed = len(unnecessary_safe_licenses) > 0 and "allow_extra_licenses" in options
if extra_licenses_declared_and_allowed and not any(
    [
        unnecessary_ignore_packages,
        bad_unsafe_packages,
        missing_unsafe_packages,
        unsafe_licenses.found,
        unnecessary_unsafe_packages,
    ]
):
    console_writer.write_all_licenses_ok()
    return 0

if (
    extra_licenses_declared_and_allowed
    and extra_packages_declared_and_allowed
    and not any([bad_unsafe_packages, missing_unsafe_packages, unsafe_licenses.found])
):
    console_writer.write_all_licenses_ok()
    return 0

which already is much easier to understand and it's clearer what is happening. Basically we test if the exception case is applicable and no other un-silenced error is present. This cross-checking makes it a bit harder, as we e.g. only check extra packages and then return 0, but it has to be extra packages and no other error. The original version also tried to accomplish this in a "smart" way, but it fully fails, for the case where extra packages are correctly applicable and ignored, but a bad license is detected, yet it still returns 0:

def test_temp(
    check_command: CheckCommand,
    package: str,
    license: str,
    version: str,
    read_config: MagicMock,
    read_all_installed_licenses_metadata: MagicMock,
) -> None:
    unsafe_packages = [f"{package}", "some-other-package-124"]
    ignore_packages = unsafe_packages
    installed_licenses = [
        {"package": f"{package}", "license": license, "version": version},
        {"package": "some-other-package", "license": "THIS-IS-IGNORED", "version": version},
    ]
    safe_licenses = [license]
    read_config.return_value = AppConfig(safe_licenses, unsafe_packages, ignore_packages)
    read_all_installed_licenses_metadata.return_value = installed_licenses
    return_code = check_command.handle(["allow_extra_packages"])
    assert return_code == 1 # This fails, it returns 0

Switching to not any would mean that it's required that all entries have to be false (i.e. empty lists - also this is quite implicit), not sure if that would improve things.

We have to fix this soon and rewrite this so it's much easier to understand.

Not sure if the extended version above fixes everything, at least the failing test now would pass. But we need to spend a bit more time on this and ensure all edge cases are covered.

Hope this helps for now.

cheers!

@DamianBarabonkovQC
Copy link
Author

Thanks so much for looking into this!

So how I understood unsafe_packages is that it is a "blacklist" of sorts. So if a package is present there, even if its license is deemed safe, it will error. The ignore_packages is a "whitelist" of sorts. So if a package is present there, even if its license is deemed unsafe, it will pass.

In your test case, you write:

unsafe_packages = [f"{package}", "some-other-package-124"]
ignore_packages = unsafe_packages

Which does not make sense in my understanding of the code.

The final version I came up with was:

            # If any of these more serious errors are not encountered, the return may
            # still be `0` depending on the command line options passed
            if not any(
                [
                    bad_unsafe_packages,
                    missing_unsafe_packages,
                    unsafe_licenses.found,
                ]
            ):
                # We are allowing a extra packages and there are extra packages
                extra_packages_declared_and_allowed = (
                    len(unnecessary_unsafe_packages) + len(unnecessary_ignore_packages)
                    > 0
                    and "allow_extra_packages" in options
                )
             .......

This makes sense to me. If no packages are unsafe, have unknown licenses or are unsafe_licenses.found, there is a chance that the return could be 0 if the appropriate exception is applied. Otherwise, if any of those cases is True, always return `.

@ubersan
Copy link
Owner

ubersan commented Jul 8, 2023

No problem, thanks for helping to uncover the problems! 😅

So unsafe_packages indeed could mean what you think, but it actually means: (see the readme) "If you rely on a package that does not come with a license you have to explicitly list it as such.". Perhaps a better name might be unlicensed_packages; that would most probably be easier to understand.

If a package with a license is marked as unsafe_package, then this will result in an error (-> will be in the list of bad_unsafe_packages).

The problem with not any If I understand your proposal correctly, is that then all of the list entries have to map to False, ie. be non-empty lists. Example with taking only two elements:

e1 | e2 | any | not any
T  | T  |  T  |   F
F  | T  |  T  |   F
F  | F  |  F  |   T

with not all also the mixed/cross cases are included

e1 | e2 | all | not all
T  | T  |  T  |   F
F  | T  |  F  |   T
F  | F  |  F  |   T

So in the not all case only if there are no problems, the extra checks would be skipped. But then of course all the possible combinations have to be checked (which currently is not the case - thus the bugs).

Does this make sense to you?

@DamianBarabonkovQC
Copy link
Author

DamianBarabonkovQC commented Jul 9, 2023

Thanks for the detailed answer once again.

To be entirely honest, I never understood the README when it said "If you rely on a package that does not come with a license you have to explicitly list it as such." regarding the unsafe packages. Regardless, it appeared to me that when a package was listed there and was installed, it would error if it was encountered. Therefore I treated that as a blacklist of packages. Please elaborate on the case that that is not true/correct?

Otherwise, let's assume my understanding where unsafe -> blacklist, ignore -> whitelist.

We have 3 conditions in the if statement in my code (one less than you because I made unnecessary_ignore_packages silent if allow_extra_packages is true).

bad_unsafe_packages,
missing_unsafe_packages,
unsafe_licenses.found,

My goal, as described in my comment, is that these conditions are considered "serious" errors -- when there is an unsafe package detected, when a package has unknown as a license and when a package exists with an unsafe license. There is no case where if at least one of them is non-empty that the program should return 0. It should always return 1 then.

Now, referring to your truth table for not any.

e1 | e2 | any | not any
T  | T  |  T  |   F
F  | T  |  T  |   F
F  | F  |  F  |   T

I only want to enter the if statement and potentially return 0 if a correct option flag was passed. Well then not any is perfect because it is T only if all of the values are F. As soon as only one of them is T, then it is F and the program will not enter the if statement and error. That is exactly the behavior I described above that I want.

For not all, that simply is not the logic:

e1 | e2 | all | not all
T  | T  |  T  |   F
F  | T  |  F  |   T
F  | F  |  F  |   T

As a matter of fact, when say only one of the errors bad_unsafe_packages, missing_unsafe_packages and unsafe_licenses.found is failing, the if statement will still be entered and potentially return 0. This is NOT what I want. If one of these errors is failing, I want it to always return 1.

Do you follow now the logic for why I believe not any is the correct choice, not not all?

@ubersan
Copy link
Owner

ubersan commented Jul 9, 2023

To be entirely honest, I never understood the README

That's so great to know! We need to improve the API and explanations in the readme and cli helpers!

The reasoning behind the unsafe_packages feature is that if a package is installed that does not provide a license (which can happen on pypi, but I guess is much more common in private/company package indexes), it has to be explicitly marked as such.

@ubersan
Copy link
Owner

ubersan commented Jul 9, 2023

Otherwise, let's assume my understanding where unsafe -> blacklist, ignore -> whitelist.

The setup with the unsafe/safe packages/licenses should be such, that there is no need for an explicit blacklist, because any license and or package (without license) is automatically on the blacklist, as anything which is not on one of the whitelists and installed will lead to an error.

@DamianBarabonkovQC
Copy link
Author

DamianBarabonkovQC commented Jul 9, 2023

that there is no need for an explicit blacklist

What if a Python package has an incorrect license? The PyPI info was not updated when the LICENSE was changed. We'd need to blacklist this specific package, even though the program thinks that its license is "safe".

@DamianBarabonkovQC
Copy link
Author

The reasoning behind the unsafe_packages feature is that if a package is installed that does not provide a license (which can happen on pypi, but I guess is much more common in private/company package indexes), it has to be explicitly marked as such.

How is the behavior then different when it is marked as unsafe vs when it is not. Both times, it errors, no?

@ubersan
Copy link
Owner

ubersan commented Jul 9, 2023

that there is no need for an explicit blacklist

What if a Python package has an incorrect license? The PyPI info was not updated when the LICENSE was changed. We'd need to blacklist this specific package, even though the program thinks that its license is "safe".

Do you mean if the e.g. pypi license classifiers or metadata are different to what LICENSE is actually checked into the repo? So far we trust the license data we get.

@ubersan
Copy link
Owner

ubersan commented Jul 9, 2023

The reasoning behind the unsafe_packages feature is that if a package is installed that does not provide a license (which can happen on pypi, but I guess is much more common in private/company package indexes), it has to be explicitly marked as such.

How is the behavior then different when it is marked as unsafe vs when it is not. Both times, it errors, no?

If you install a-cool-private-repo-package that does not ship with a license then:

  • not marking the package as unsafe: pylic check fails with an error ("Found unsafe packages: ...")
  • marking the package as unsafe: pylic check succeeds without an error or warning. The reasoning is that the developer acknowledges that the package is safe to use - despite not providing a license.

@ubersan
Copy link
Owner

ubersan commented Jul 9, 2023

An example of such an unsafe package is e.g. pyNakadi. It has a checked in MIT license, but does not expose it via classifier or metadata.

pylic check then returns:

Found unsafe packages:
  pyNakadi (0.2.14.8)

@DamianBarabonkovQC
Copy link
Author

I understand finally your whole point regarding unsafe_packages. I don't think that the behavior of what we want is exactly achieved by your definition of unsafe. In my code, I feel like retaining unsafe -> blacklist (we cannot trust the license info in the metadata). ignore -> whitelist.

What are your thoughts on the not any vs not all case in my longer post above? Regardless of how you define unsafe, I think my comments still apply.

@ubersan
Copy link
Owner

ubersan commented Jul 13, 2023

sorry busy days ...

about your longer post above and the not any logic. I think I do understand what you mean. I think we also do talk about the same thing - at least we both know that the current logic is flawed 🙂

My goal, as described in my comment, is that these conditions are considered "serious" errors -- when there is an unsafe package detected, when a package has unknown as a license and when a package exists with an unsafe license. There is no case where if at least one of them is non-empty that the program should return 0. It should always return 1 then.

Yes, but just bare in mind that we're already in the error/return code 1 branch and we only want to exceptionally instead return 0.

My goal, as described in my comment, is that these conditions are considered "serious" errors -- when there is an unsafe package detected, when a package has unknown as a license and when a package exists with an unsafe license. There is no case where if at least one of them is non-empty that the program should return 0. It should always return 1 then.

I agree, but wouldn't this then imply any, i.e. if any(list of all the potential serious problems):

I only want to enter the if statement and potentially return 0 if a correct option flag was passed.

Yes that's what I tried here

I only want to enter the if statement and potentially return 0 if a correct option flag was passed. Well then not any is perfect because it is T only if all of the values are F. As soon as only one of them is T, then it is F and the program will not enter the if statement and error. That is exactly the behavior I described above that I want.

Ah okay I think I now understand what you mean. So you want something along the lines of

if not any([packages_are_problem, licenses_are_a_problem, ...]):
  return 0

? Not sure if this would be understandable. Also I think you'd have to wrap this in if (package_option_given or license_option_given): and also make sure that other problems such as unnecessary_ignore_packages are not still failing the check.

I think the easiest would be to have this (similar to the code linked above):

if any([...]):
  ...

  if package option given and only packages are wrong:
    return 0

 if license option given and only licenses are wrong:
    return 0

 if package and license options are given and only packages and licenses are wrong:
    return 0

  return 1

I think this would cover all the cases and is easier to understand and read compared to these rather unusual boolean transformations.

wdyt?

@DamianBarabonkovQC
Copy link
Author

Hey Sandro. Unrelated to the topic, I am letting you know that I am on holiday for 2 weeks. So I won't be able to run stuff until I get back. Hope that is fine.

@ubersan
Copy link
Owner

ubersan commented Jul 13, 2023

Hey Sandro. Unrelated to the topic, I am letting you know that I am on holiday for 2 weeks. So I won't be able to run stuff until I get back. Hope that is fine.

Thanks for letting me know! Enjoy😊👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants