
[Azure] Revert Azure images to address NCCL issues #4596

Open
romilbhardwaj wants to merge 3 commits into master

Conversation

romilbhardwaj (Collaborator)
Intermediate fix for #4448 by reverting to Azure's default images. We should fix our custom image to support NCCL + Azure accelerated networking before starting to use them again.

Tested:

  • sky launch -c azure --cloud azure --gpus A100-80GB:1 -- nvidia-smi
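
Since the underlying problem in #4448 is NCCL failing on the custom images, a quick way to validate a launched VM beyond nvidia-smi is a tiny NCCL all-reduce. The PyTorch-based sketch below is hypothetical and not part of this PR's test matrix; it assumes PyTorch with CUDA support is installed on the VM.

# nccl_check.py -- hypothetical sanity check, not part of this PR:
# runs a tiny all_reduce across all local GPUs with the NCCL backend.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device='cuda')
    dist.all_reduce(t)  # sum across ranks; should equal world_size
    assert t.item() == world_size, f'NCCL all_reduce failed: {t.item()}'
    dist.destroy_process_group()


if __name__ == '__main__':
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 1, 'No CUDA devices visible'
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)

Note this is a single-node check only; exercising the Azure accelerated-networking path that #4448 concerns would additionally require a multi-node run.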

Comment on lines 42 to 46
# TODO(romilb): Switch back to using our custom images after NCCL + Azure issues
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_V1_IMAGE_ID = 'skypilot:v1-ubuntu-2004'
Collaborator

Can we use the old Azure image as the base image for the Packer file, so we don't get a performance regression for A10 GPU instances?

romilbhardwaj (Collaborator, Author)

Yes, that's likely the fix for #4448. For A10, this PR still uses skypilot:custom-gpu-ubuntu-v2-grid to avoid the performance regression.

romilbhardwaj (Collaborator, Author)

/smoke-test azure

Michaelvll (Collaborator) left a comment

I am OK with this temporary solution, but we should figure out a good way to support this with our own images; otherwise this is a huge performance degradation on Azure.

sky/clouds/azure.py (outdated review comment, resolved)
romilbhardwaj (Collaborator, Author)

I think a better fix for this is to update the catalog instead of our code. That way we can push an update to the catalog when this issue is resolved, without needing users to upgrade to the latest version/nightly.

Michaelvll (Collaborator)

> I think a better fix for this is to update the catalog instead of our code. That way we can push an update to the catalog when this issue is resolved, without needing users to upgrade to the latest version/nightly.

This may be worth discussing. Directly updating the catalog causes an implicit behavior change for VMs launched on Azure, which can lead to surprises: a cluster launched today may have a different setup than one launched yesterday, e.g., specific packages may disappear.

I feel a more explicit tag change in this PR is a better way to do this.
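
Worth noting either way: a user who needs stable behavior can pin the image explicitly per task, which makes the launched cluster independent of both the code defaults and the catalog contents. Below is a minimal sketch using SkyPilot's Python API; the cluster name and resource values are illustrative only.

# Hypothetical illustration: pin the image explicitly so that neither
# a code-default nor a catalog change alters what the cluster boots.
import sky

task = sky.Task(run='nvidia-smi')
task.set_resources(
    sky.Resources(
        cloud=sky.Azure(),
        accelerators='A100-80GB:1',
        image_id='skypilot:gpu-ubuntu-2204',  # explicit image pin
    ))
sky.launch(task, cluster_name='azure-pinned')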

romilbhardwaj (Collaborator, Author)

/smoke-test --azure

zpoint (Collaborator) commented Feb 5, 2025

You need to merge the latest master for --azure to work; right now it triggers all the smoke tests, as before. Some smoke-test failures have been fixed on master.
cc @romilbhardwaj

romilbhardwaj (Collaborator, Author)

/smoke-test --azure

romilbhardwaj added the “do not merge” label on Feb 5, 2025
romilbhardwaj (Collaborator, Author)

This PR seems to have issues with TensorFlow + CUDA, which is causing test_cancel_azure --azure to fail. TensorFlow seems to require the [and-cuda] extras to detect GPUs:

python3 -m pip install 'tensorflow[and-cuda]'
# Verify the installation:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
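
A slightly stricter variant of that verification, which exits non-zero when no GPU is visible (a hypothetical helper, not the actual smoke test):

# Hypothetical standalone check, not the actual smoke test: fail
# loudly if TensorFlow cannot see a GPU.
import sys

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', gpus)
if not gpus:
    sys.exit('No GPU detected; was tensorflow[and-cuda] installed?')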

Do not merge until this is resolved.
