
Solver should report exact package hash that was used to install a package #5102

Open
fridex opened this issue Jan 17, 2022 · 15 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fridex
Contributor

fridex commented Jan 17, 2022

Is your feature request related to a problem? Please describe.

Currently, Thoth provides all the artifact hashes in the lockfile that were found on the index and it lets the pip installation procedure pick the suitable artifact. Instead, Thoth should point to an exact Python artifact that should be used during the installation process to make sure proper auditing is done.

Describe the solution you'd like

  • adjust solver logic to report artifact hash that was used during the installation and metadata extraction
  • the hash should be synced into Thoth's knowledge base specifically for the given OS and Python version
  • adviser should query hash when constructing the lock file
@fridex fridex added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 17, 2022
@Gregory-Pereira
Member

/assign @Gregory-Pereira

@goern
Member

goern commented Jan 19, 2022

/priority important-soon

@sesheta sesheta added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 19, 2022
@Gregory-Pereira
Member

Gregory-Pereira commented Jan 19, 2022

I am not too familiar with solver, so pardon my questions as I get up to speed. When you say Thoth should point to an exact Python artifact that should be used during the installation process, do you mean that a hash should be computed for the result of solver, linking the OS, Python version, and resulting dependency versions? Or would this Python artifact refer to each individual dependency?

@fridex
Contributor Author

fridex commented Jan 19, 2022

Check TensorFlow wheels published on PyPI as an example - https://pypi.org/project/tensorflow/2.7.0/#files

There are macOS, Windows, and manylinux builds specific to each Python version (e.g. Python 3.7, 3.8, 3.9). As of now we point users to tensorflow==2.7.0 from PyPI and provide all the artifact hashes (so that pip picks the right build on the client side). In an ideal scenario, Thoth should give back just one hash pointing to the specific artifact that should be used to install tensorflow==2.7.0. That can be, for example, tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl if users run Linux and use Python 3.9 (on x86_64).
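The environment a wheel targets can be read straight from its filename tags (PEP 427). As a minimal stdlib-only sketch (the helper name is hypothetical, and it assumes no build tag is present):

```python
def parse_wheel_tags(filename):
    """Split a PEP 427 wheel filename into its components.

    Format: {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    The last three dash-separated fields are always the python/abi/platform
    tags; name/version indexing below assumes there is no optional build tag.
    """
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

tags = parse_wheel_tags("tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl")
```

For production use, the `packaging` library's wheel-filename utilities would be more robust than this split-based sketch.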

@Gregory-Pereira
Member

Gregory-Pereira commented Jan 19, 2022

So we would save a bunch of these hashes that correspond to a specific version of a package, the OS, and Python version on the Thoth server / API side? I guess what's confusing me about this is how that fits in with the rest of resolver. For instance, my understanding is that solver allows you to pass in some package versions and constraints, e.g.:

cycler>=0.10.
kiwisolver==1.2.
matplotlib<=3.2.1
numpy==1.18.5

(Note: this example is completely made up; I don't know if there is a solution for these dependencies/versions.)

It will then recursively resolve all dependencies and transitive dependencies that would work for these rules. So I understand how that would work if we are looking at a specified package version, but what about cycler and matplotlib in this example? Would it then just recursively try all the versions of those two that meet the requirement, and then for each of those first look for this version-specific hash that we are discussing?

@fridex
Contributor Author

fridex commented Jan 20, 2022

So we would save a bunch of these hashes that correspond to a specific version of a package, the OS, and Python version on the Thoth server / API side?

We already have them on the Thoth server side (Thoth is a cloud/server-side resolver). The thing is that we miss the OS + Python version linkage.

I guess what's confusing me about this is how that fits in with the rest of resolver. For instance, my understanding is that solver allows you to pass in some package versions and constraints, e.g.:

cycler>=0.10.
kiwisolver==1.2.
matplotlib<=3.2.1
numpy==1.18.5

(Note: this example is completely made up; I don't know if there is a solution for these dependencies/versions.)

It will then recursively resolve all dependencies and transitive dependencies that would work for these rules. So I understand how that would work if we are looking at a specified package version, but what about cycler and matplotlib in this example? Would it then just recursively try all the versions of those two that meet the requirement, and then for each of those first look for this version-specific hash that we are discussing?

The resolver is using temporal difference learning (so no "recursive tries" per se). We use this "solver" component to aggregate information about packages for the resolver itself, so solver will just get corresponding hashes more accurately; these are subsequently used by the server-side resolver.

@Gregory-Pereira
Member

So for each dependency, as it is getting installed, I am able to grab its SHA256. However, the way pip does its hashes is per file in said package, and not all files in a package may have a SHA. I stuck with selinon as one of my examples; while in the pipenv shell I ran:
./thoth-solver --verbose python -r 'selinon==1.0.0' -o solver-output-selinon-1.0.0-darwin.json. Still in the shell, I navigated to where the site packages were installed: ~/.local/share/virtualenvs/solver-CEbDbFsW/lib/python3.8/site-packages/.

Located in this folder there were two folders related to selinon: selinon/ and the distribution information selinon-1.0.0.dist-info/. The first of these held all the files that made up the package, and the other held all the details related to its distribution. The RECORD file here showed the files installed in the package and their SHA values, if they had any, e.g.:

...
selinon/caches/__pycache__/lifo.cpython-38.pyc,,
selinon/caches/__pycache__/lru.cpython-38.pyc,,
selinon/caches/__pycache__/mru.cpython-38.pyc,,
selinon/caches/__pycache__/rr.cpython-38.pyc,,
selinon/caches/fifo.py,sha256=bZxu6sh_EelPUqSp6clYbTPjSrcc4Ok52AFHHc90aAA,2679
selinon/caches/lifo.py,sha256=-Db8LACEHNtO2-magPmdxWLDgNoy2_NXxAOWppRJecE,694
selinon/caches/lru.py,sha256=q6o2uyvMZoz8I0y1z0L3saHxzRKQDGbawnRPw8B3c5g,4338
selinon/caches/mru.py,sha256=6Y7MO1KX8rysPuFnsm8lbyxEvVQo3PlFxznHqrUVByQ,655
selinon/caches/rr.py,sha256=_trRSY5aHAE5AynRJERUAGTVBXoynzTAi8LuK8QyPRU,2458
selinon/celery.py,sha256=pkl7o7g-GyLMVvSN4OG7GDONM3RbpjT5-vaqf6GpnP0,1212
selinon/cli.py,sha256=8Ll6xGKaejY-YNe2Q2dt11vA6KDIxDt8xGwlpACLJBE,17332
selinon/codename.py,sha256=j6348ZPG6ml31-4v15Fj6jXneSjgneSSvkN3T_TQccs,38
selinon/config.py,sha256=EAEFVe0Is1iZZ7jFh2BEUhEttTF4E69WBYzTA73Plho,14363
selinon/data_storage.py,sha256=pFdIPwl5EJiVcFX1xo3XODmKNsSehMrz1d2mbHTYdzw,3147
selinon/dispatcher.py,sha256=StskSf2EbUq75Mtu-qJgvgOi6qZplrcuiXEnWwV7x2Q,9739
selinon/edge.py,sha256=bGZRrKdC1PKIsQ4LAYyuIuK6rlAeR0ticioZB-PBi_M,10162
selinon/errors.py,sha256=znvu-WKToPWa0UZbYHS_l3s3Xyxdd9qBbLS68iMm24c,5295
selinon/executor/__init__.py,sha256=4--nBjb69cDYdX9xvN9_maTXPq13Ki1zg1zOq-nmyS0,95
...

I thought there would be a way to grab a single hash for a package, but I'm not sure I am looking in the right place; maybe this would be located somewhere on PyPI, but I haven't found it yet. Maybe I will need to save all the individual hashes, or import some other library or package such as pip-compile, but I wanted to ask here first if there is a better strategy.
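For reference, those RECORD rows are CSV (PEP 376): path, hash as <algorithm>=<urlsafe-base64 digest>, and size, with the hash column left empty for files generated at install time (e.g. .pyc files). A small sketch of pulling the per-file hashes out (the helper name is hypothetical):

```python
import csv
import io

def parse_record(text):
    """Extract per-file hashes from a dist-info RECORD file (PEP 376 CSV rows)."""
    hashes = {}
    for path, digest, _size in csv.reader(io.StringIO(text)):
        if digest:  # generated files (e.g. *.pyc) have no recorded hash
            algorithm, _, value = digest.partition("=")
            hashes[path] = (algorithm, value)
    return hashes

# two rows lifted from the selinon RECORD excerpt above
record = (
    "selinon/caches/__pycache__/lifo.cpython-38.pyc,,\n"
    "selinon/caches/fifo.py,sha256=bZxu6sh_EelPUqSp6clYbTPjSrcc4Ok52AFHHc90aAA,2679\n"
)
hashes = parse_record(record)
```

Note these are hashes of individual installed files, not the single artifact hash the lockfile needs.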

I plan to use this to build this out on the result object:

"thoth-wheels": [
      {
        "pyperclip-1.8.2-darwin-21.2.0-x86_64": "105254a8b04934f0bc84e9c24eb360a591aaf6535c9def5f29d92af107a9bf57"
      }
]

It will have this format: an object with a key of <package_name>-<package_version>-<system_platform>-<system_release>-<platform_architecture>, and a value of the package SHA.
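A sketch of deriving such a key on the client side with Python's platform module (the helper name is hypothetical; the key layout follows the example above):

```python
import platform

def wheel_key(package_name, package_version):
    # Mirrors the proposed key layout:
    # <package_name>-<package_version>-<system_platform>-<system_release>-<arch>
    return "-".join([
        package_name,
        package_version,
        platform.system().lower(),  # e.g. "darwin" or "linux"
        platform.release(),         # e.g. "21.2.0"
        platform.machine(),         # e.g. "x86_64"
    ])

key = wheel_key("pyperclip", "1.8.2")
```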

Let me know if I am missing or misunderstanding anything.

@fridex
Contributor Author

fridex commented Jan 25, 2022

Nice research.

Sadly, these hashes will not be part of the artifacts, as the artifact hash is computed based on the artifact content, which makes it a chicken-and-egg problem.
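In other words, the hash PyPI publishes is simply the SHA-256 of the artifact file itself, so it cannot be stored inside the artifact. A minimal illustration:

```python
import hashlib
import os
import tempfile

def artifact_sha256(path, chunk_size=1 << 20):
    """SHA-256 of an artifact file, computed in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# demonstrate on a throwaway file standing in for a downloaded wheel
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name
digest = artifact_sha256(tmp_path)
os.unlink(tmp_path)
```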

As of now, we obtain all the artifact hashes in this function:

def _fill_hashes(source, package_name, package_version, extracted_metadata):
    # type: (Source, str, str, Dict[str, Any]) -> None
    extracted_metadata["sha256"] = []
    try:
        package_hashes = source.get_package_hashes(package_name, package_version)
    except NotFoundError:
        # Some older packages have different version on PyPI (considering simple API) than the ones
        # stated in metadata.
        package_hashes = source.get_package_hashes(package_name, extracted_metadata["version"])
    for item in package_hashes:
        extracted_metadata["sha256"].append(item["sha256"])

Ideally, thoth-solver could perform pip install for each artifact with hash:

python3 -m pip install --no-deps --no-cache-dir pyperclip==1.8.2 --hash=sha256:XYZ

here:

cmd = "{} -m pip install --force-reinstall --no-cache-dir --no-deps {}".format(python_bin, quote(package))

A brute-force approach would try all the artifacts and pip should report that the artifact is not suitable for the runtime environment. That fact can become part of the report. If the artifact is installable, thoth-solver can report its dependencies.

@Gregory-Pereira
Member

So, quick update. Firstly, the only way I could successfully use the hashes when installing a pip package was to put them into a requirements file (I am using temp-requirements.txt), provide the hash there, and then call it from thoth-solver with the --require-hashes flag. It currently works, but this may create a little bit more overhead.
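That flow can be sketched as follows, relying on the fact that pip accepts --hash only inside a requirements file (the function names here are hypothetical):

```python
import os
import subprocess
import sys
import tempfile

def hash_requirement(package, version, sha256):
    """One requirements-file line pinning a single artifact hash."""
    return "{}=={} --hash=sha256:{}".format(package, version, sha256)

def try_install(package, version, sha256):
    """Return True if pip can install exactly this artifact in this environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as req:
        req.write(hash_requirement(package, version, sha256) + "\n")
        req_path = req.name
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install",
             "--no-deps", "--no-cache-dir", "--force-reinstall",
             "--require-hashes", "-r", req_path],
            capture_output=True, text=True,
        )
        # a non-zero exit code means the artifact is not suitable here
        return result.returncode == 0
    finally:
        os.unlink(req_path)

line = hash_requirement("pyperclip", "1.8.2", "ab" * 32)
```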

Second, of the list of SHA package hashes, sometimes multiple can actually work, for example if the PyPI package provides both a .whl and a source .tar.gz distribution. Because both can be successfully installed with pip, both could be considered "the correct version", and so I think we should pass both to the resolver. However, if desired, we could use some other criterion (such as package size or version number) to pick the better artifact, if we only want one solution stored on the Thoth server side (per package / package version / system / system distribution). Currently, however, this is how the thoth-wheels are looking:

"thoth-wheels": {
      "pyroaring-0.3.3-darwin-21.2.0-x86_64": [
        {
          "pyroaring-0.3.3-cp39-cp39-macosx_10_14_x86_64.whl": "399730714584ec47b05978cc00b737478a10e2a6a8fed94d886fd0b25c522b05"
        },
        {
          "pyroaring-0.3.3.tar.gz": "232bf4cbdd7a1dad885171d9d7e59da5324b3d70c15a96a240f1319b870b46b7"
        }
      ]
    }

Is this acceptable, or should I try to resolve it to only one artifact, and if so, what criteria should be used?

For context on the next two points, these are the packages and respective versions I have been using to test thoth-solver:

selinon == 1.0.0
pyperclip == 1.8.2
pyroaring == 0.3.3
pytorch == 1.0.2
tensorflow == 2.7.0

With the brute-force approach to testing which SHAs work with which environment, I am running into issues for bigger packages (selinon and tensorflow). I ran my local feature-branch version of thoth-solver today in the background for selinon 1.0.0 (with transitive dependencies) and it didn't finish after about 40 minutes. I am going to see what I can do in the way of efficiency today.

Also, when testing I encountered a potential issue. This was specifically for the pytorch package, so I am not certain it will apply to other packages. Its install instructions listed on PyPI are pip install pytorch, or pip install pytorch==1.0.2 for the specific package version; however, it fails to install because, as it says on the PyPI package page, "You tried to install 'pytorch'. The package named for PyTorch is 'torch'". Is it normal for PyPI packages to be renamed in a manner such as this, and if so, is this something we should support in solver, or is it an edge case?

@KPostOffice
Member

Are the hashes available in the package index warehouse useful for this problem at all? See: https://pypi.org/pypi/tensorflow/json.
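That JSON API does expose per-file digests. A sketch of extracting them from an already-fetched response (the miniature dict below is made up to show the response shape, not real tensorflow data):

```python
def release_digests(pypi_json, version):
    """Map artifact filename -> sha256 from a pypi.org /pypi/<name>/json response."""
    return {
        entry["filename"]: entry["digests"]["sha256"]
        for entry in pypi_json["releases"][version]
    }

# made-up miniature of the real response shape
fake = {
    "releases": {
        "2.7.0": [
            {"filename": "tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl",
             "digests": {"sha256": "0" * 64}},
        ]
    }
}
digests = release_digests(fake, "2.7.0")
```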

@goern
Member

goern commented Feb 15, 2022

Moin all, any progress on this? Is #5110 (comment) the blocker? @fridex, could you work on it?

@Gregory-Pereira
Member

So I am not sure if this is a valid solution to address Frido's comment, but I was thinking about adding the -vv flag to the pip install command and parsing the hash directly out of the resulting stdout (see code). I would love to get others' opinions on this, as it doesn't seem very robust, but it would be the solution with the lowest overhead, since the resulting wheel information comes directly from the one and only install command. Is this what Frido was talking about when he said it would live in the part of the code "that does the actual dependency extraction"?

I also looked a bit into what Kevin was saying about the PyPI package index warehouse. I am not certain it would be useful to us, because we already store the hash of every artifact for the release we are using; what we are attempting to ascertain is which artifact is the best for a given package, package version, and environment (OS, distro, etc.), and to persist that on the Thoth side. We could take a pretty decent guess from the warehouse JSON, for instance that for release 2.0.0, artifact index 6 with the filename of "tensorflow-2.0.0-cp36-cp…anylinux2010_x86_64.whl" would work for any Linux distro with x86_64 architecture and Python 3.6, but we really should test that it installs properly and not just take an educated guess based on the filename. Since that is the case, this functionality really should come from the install command rather than an endpoint.

@goern
Member

goern commented Mar 2, 2022

@fridex is this something to move forward?

/sig stack-guidance

@sesheta sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Mar 2, 2022
@Gregory-Pereira
Member

I was told that Thoth-Station is making a priority of stabilizing the system before introducing new changes, and so this might hang for a bit.
/lifecycle frozen

@sesheta sesheta added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 15, 2022
@codificat
Member

Based on the history so far, my understanding is that this is
/triage accepted
but
/priority important-longterm
/remove-priority important-soon

@sesheta sesheta added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 29, 2022
@codificat codificat moved this to 📋 Backlog in Planning Board Sep 24, 2022