Skip dependencies in SOK installation #1138

Merged: 6 commits into main from ci/multi-gpu-horovod on Jun 11, 2023

Contributor

@edknv edknv commented Jun 6, 2023

gpu-multi / tensorflow (pull_request) in the CI fails with:

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/dist-packages/horovod/openmpi_dist/bin/mpirun'

in #1110 for example, because SOK requires horovod and reinstalls it, overwriting the horovod installation in the ci-runner. This PR fixes that by adding --no-deps so that SOK does not install dependencies. We run setup.py in development mode because python setup.py install does not support --no-deps.

@edknv edknv added the ci, examples, and enhancement labels Jun 6, 2023
github-actions bot commented Jun 6, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1138

@edknv edknv self-assigned this Jun 8, 2023
@@ -11,6 +11,6 @@ rm -rf hugectr/
git clone https://github.com/NVIDIA-Merlin/HugeCTR.git hugectr

cd hugectr/sparse_operation_kit/
-python setup.py install
+python setup.py develop --no-deps
Member

If horovod is already installed in the container, I wonder why this is trying to install horovod again. Does it require a higher version than what we already have?

Contributor Author

I wasn't accurate when I said it's trying to install horovod again. setup.py detects that horovod is already installed from site_packages and skips installing the core horovod library, so it doesn't re-install horovod per se. But it does install some entrypoint scripts:

Installing horovodrun script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin
Installing mpiexec script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin
Installing mpirun script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin
Installing ompi_info script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin
Installing orted script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin
Installing orterun script to /home/gha-user/gha3/models/models/.tox/py38-multi-gpu/bin

These are all installed into the tox environment at .tox/py38-multi-gpu/bin, while horovod itself comes from site_packages (i.e., /usr/local/bin), and that mismatch causes the issue. So this issue is due to tox mixing external commands into the tox environment, and it shouldn't happen in normal environments.

Member

thanks for the explanation @edknv

@edknv edknv marked this pull request as ready for review June 11, 2023 23:14
@edknv edknv merged commit dd16d0a into NVIDIA-Merlin:main Jun 11, 2023
@edknv edknv deleted the ci/multi-gpu-horovod branch June 11, 2023 23:14