Add Nebius Cloud #4573

Open: wants to merge 43 commits into base: master

@SalikovAlex SalikovAlex commented Jan 16, 2025

Add support for Nebius Cloud.

Tested (run the relevant ones):

  • run 'sky launch -c test-single-instance --cloud nebius echo hi'

  • run sky stop test-single-instance; sky start test-single-instance

  • run sky down test-single-instance

  • run sky launch -c test-single-instance --cloud nebius echo hi; sky stop test-single-instance; sky down test-single-instance

  • [UNSUPPORTED] sky launch --cloud nebius -c test-autostop -i 1 echo hi

  • [UNSUPPORTED] sky launch --cloud nebius -c test-autodown -i 1 --down echo hi

  • sky launch examples/multi_hostname.yaml --cloud nebius;

  • Code formatting: bash format.sh

  • All smoke tests: pytest tests/test_smoke.py

  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for this amazing work! I think the PR is in good shape. Left some nits. At the same time, could you help test the basic functionality of this new cloud? Including but not limited to:

  • Launch CPU only instance
  • Launch GPU instance
  • Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat them after re-launch)
  • Autostop & Autodown
  • Launch on existing cluster
  • SSH to the cluster
  • Failover: make sure it can fail over from Nebius to other clouds and that the exceptions are printed correctly
  • Launch on other clouds without Nebius dependencies installed (make sure it does not introduce unnecessary dependencies when Nebius is not enabled)

examples/minimal.yaml (outdated, resolved)
sky/authentication.py (outdated, resolved)
sky/clouds/nebius.py (outdated, resolved)
sky/clouds/nebius.py (outdated, resolved)
sky/clouds/nebius.py (outdated, resolved)
sky/provision/nebius/utils.py (outdated, resolved)
sky/setup_files/MANIFEST.in (outdated, resolved)
sky/setup_files/dependencies.py (outdated, resolved)
sky/templates/nebius-ray.yml.j2 (outdated, resolved)
sky/templates/nebius-ray.yml.j2 (resolved)
@cblmemo
Collaborator

cblmemo commented Jan 22, 2025

Split instance launch configurations to use separate platform and preset values, enabling better resource distinction. Added support for ImageId configuration in templates and clarified GPU cluster fabric requirements in related functions. Enhanced unsupported feature descriptions for Nebius cloud.
Mark Nebius as unsupported for various test cases due to its lack of support for specific GPUs, features, or configurations. This ensures accurate test coverage and avoids unnecessary failures on incompatible platforms.
Replaced the specific Git-based Nebius SDK reference with a version constraint (`nebius>=0.2.0`). This simplifies dependency management and ensures compatibility with future updates of the SDK.
Implemented retry mechanisms for starting and stopping instances in Nebius. Updated file paths for IAM token and tenant ID to include the home directory for better user guidance. Adjusted dependency and testing configurations to disable autostop functionality.
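The retry behaviour described in this commit message can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `retry`, its parameters, and the linear backoff are all assumptions made for illustration, and real code would catch only the SDK's own error types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def retry(operation: Callable[[], T],
          max_attempts: int = 3,
          delay_seconds: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # Hypothetical: real code would catch SDK errors only.
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            sleep(delay_seconds * attempt)  # Linear backoff between attempts.
    raise ValueError('max_attempts must be at least 1')
```

A start or stop call would be wrapped as `retry(lambda: start_instance(instance_id))`, where `start_instance` stands in for whatever SDK call is being retried.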
Revised instance naming logic by appending UUIDs to ensure uniqueness and clarity. Adjusted maximum cluster name length limit to accommodate the updated naming convention. Updated smoke tests and added or revised `no_nebius` marks to reflect current Nebius capabilities.
Added a no-op `open_ports` method for Nebius instances, as all ports are open by default. Updated features and tests to reflect the support for open ports while maintaining Nebius's behavior. Removed restrictions on open ports in the supported features list.
@SalikovAlex
Author

SalikovAlex commented Jan 26, 2025

Unresolved problems:

  • Nebius requires grpcio>=1.56.2, which in turn requires ray>=2.6.1, so we need to fix the remote list in dependencies.py
  • Nebius requires python>=3.10

Feature request:
Split autoterminate into autostop and autodown

Enclosed the Python version in quotes to ensure proper YAML parsing. This change prevents potential issues with version interpretation in the CI pipeline.
Added error handling for missing IAM token and tenant ID files in `get_iam_token` and `get_tenant_id` functions, returning `None` if the files are not found. Updated `check_credentials` to provide clearer error messages with setup instructions when credentials are incomplete.
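A minimal sketch of the behaviour this commit message describes: a missing credential file yields `None`, and the credential check reports which files are absent. The function names `read_credential` and `check_credentials` and the path parameters are assumptions for illustration, not the PR's actual `get_iam_token` and `get_tenant_id` implementations.

```python
import os
from typing import Optional, Tuple


def read_credential(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the file is missing."""
    try:
        with open(os.path.expanduser(path), encoding='utf-8') as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def check_credentials(token_path: str,
                      tenant_path: str) -> Tuple[bool, Optional[str]]:
    """Return (ok, hint); the hint names whichever files are missing."""
    missing = [p for p in (token_path, tenant_path)
               if read_credential(p) is None]
    if missing:
        return False, 'Missing credential files: ' + ', '.join(missing)
    return True, None
```

Returning `None` instead of raising lets the caller build one message that lists every missing file, instead of failing on the first.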
Reorganized initialization steps to include configurable setup commands and handled potential apt lock issues. Disabled unnecessary unattended-upgrades to prevent conflicts and ensured smoother package management during setup.
Simplify instance filtering logic for Nebius by removing redundant name checks. Add consistent `@pytest.mark.no_nebius` annotations in tests to reflect unsupported Autodown and Autostop features. Clean up unused Nebius-specific test case.
The `no_nebius` marker was added to skip the test where Autodown and Autostop are not supported. This ensures the test suite runs correctly on environments where these features are unavailable.
@SalikovAlex
Author

All tests are marked, now pytest tests/test_smoke.py --nebius is green.

@romilbhardwaj
Collaborator

Hey @SalikovAlex thanks for the amazing work!

Nebius requires grpcio>=1.56.2, which in turn requires ray>=2.6.1, so we need to fix the remote list in dependencies.py

Perhaps we can pin grpcio>=1.56.2 as a remote dependency requirement. From my very primitive manual testing, it seems to work without requiring us to bump the remote ray version. cc @Michaelvll @cblmemo wdyt?

Nebius requires python>=3.10

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python >=3.10; 0.2.1 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement nebius>=0.2.0; extra == "nebius" (from skypilot[nebius]) (from versions: none)
ERROR: No matching distribution found for nebius>=0.2.0; extra == "nebius"

@romilbhardwaj
Collaborator

BTW, we may need to fix the nebius adaptor to lazy import nebius only when required. E.g., on this branch, if I run pip install -e .[aws] without the nebius Python package in my environment, I get this error:

>>> import sky
Traceback (most recent call last):
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 37, in load_module
    self._module = importlib.import_module(self._module_name)
  File "/Users/romilb/tools/anaconda3/envs/py39/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'nebius'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/__init__.py", line 82, in <module>
    from sky import backends
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/__init__.py", line 4, in <module>
    from sky.backends.cloud_vm_ray_backend import CloudVmRayBackend
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 30, in <module>
    from sky import check as sky_check
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/check.py", line 11, in <module>
    from sky import clouds as sky_clouds
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/__init__.py", line 15, in <module>
    from sky.clouds.aws import AWS
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/aws.py", line 16, in <module>
    from sky import provision as provision_lib
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 23, in <module>
    from sky.provision import nebius
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/__init__.py", line 4, in <module>
    from sky.provision.nebius.instance import cleanup_ports
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/instance.py", line 9, in <module>
    from sky.provision.nebius import utils
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/utils.py", line 12, in <module>
    sdk = nebius.sdk(credentials=nebius.get_iam_token())
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/nebius.py", line 73, in sdk
    return nebius.sdk.SDK(credentials=credentials)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 52, in __getattr__
    return getattr(self.load_module(), name)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 42, in load_module
    raise ImportError(self._import_error_message) from e
ImportError: Failed to import dependencies for Nebius AI Cloud. Try running: pip install "skypilot[nebius]"
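The traceback shows the SDK client being constructed at module import time (`sdk = nebius.sdk(...)` in `utils.py`), which forces the import of the `nebius` package. A minimal sketch of the lazy-import pattern being asked for follows. It is modelled loosely on the `load_module` and `__getattr__` frames visible in the traceback, but the class below is an illustrative assumption and not SkyPilot's actual adaptor code.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyImport:
    """Defer importing a module until one of its attributes is accessed."""

    def __init__(self, module_name: str, error_message: str):
        self._module_name = module_name
        self._error_message = error_message
        self._module: Optional[ModuleType] = None

    def load_module(self) -> ModuleType:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                raise ImportError(self._error_message) from e
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes that were not set in __init__.
        return getattr(self.load_module(), name)


# Creating the proxy at import time is safe even if the package is absent.
nebius = LazyImport('nebius', 'Try running: pip install "skypilot[nebius]"')


def sdk(credentials: str) -> Any:
    # The client is built on demand inside a function, never at module
    # import time, so importing this module does not touch `nebius`.
    return nebius.sdk.SDK(credentials=credentials)
```

With this shape, the `ImportError` is raised only when a Nebius code path actually runs, so installing with only the `aws` extra no longer breaks `import sky`.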

Centralize Nebius SDK initialization by removing global SDK variables and replacing them with direct calls to `nebius.sdk()`. Introduce caching for IAM tokens to reduce redundant file reads. Also, update GRPC logging settings to suppress unnecessary console output.
@SalikovAlex
Author

SalikovAlex commented Jan 30, 2025

BTW, we may need to fix the nebius adaptor to lazy import nebius only when required. E.g., on this branch, if I run pip install -e .[aws]

Fixed

Simplified grpcio dependency to version >=1.56.2 and removed redundant version constraints and outdated comments. Also removed unnecessary ray dependency under 'nebius', ensuring a cleaner dependency structure.
@SalikovAlex
Author

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

Yes, it's OK. But we need to change the Python version in the GitHub Actions workflows for pep, lint, etc.

This fixes a syntax issue in the `remote` list by adding a missing comma after `grpcio>=1.56.2`. This ensures the dependencies are parsed correctly.
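For context on why a missing comma in a list of requirement strings matters: Python does not reject it, it silently concatenates adjacent string literals. The sketch below demonstrates this; `'some-package'` is a hypothetical entry used only for illustration, not the actual neighbour of `grpcio>=1.56.2` in `dependencies.py`.

```python
# Adjacent string literals are concatenated at compile time, so a missing
# comma merges two requirements into one malformed entry.
broken = [
    'grpcio>=1.56.2'  # <- missing comma
    'some-package',   # hypothetical next entry, for illustration only
]
fixed = [
    'grpcio>=1.56.2',
    'some-package',
]

print(len(broken), broken)  # a single merged string
print(len(fixed), fixed)    # two separate requirements
```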
@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

Hey @SalikovAlex thanks for the amazing work!

Nebius requires grpcio>=1.56.2, which in turn requires ray>=2.6.1, so we need to fix the remote list in dependencies.py

Perhaps we can pin grpcio>=1.56.2 as a remote dependency requirement. From my very primitive manual testing, it seems to work without requiring us to bump the remote ray version. cc @Michaelvll @cblmemo wdyt?

Nebius requires python>=3.10

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python >=3.10; 0.2.1 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement nebius>=0.2.0; extra == "nebius" (from skypilot[nebius]) (from versions: none)
ERROR: No matching distribution found for nebius>=0.2.0; extra == "nebius"

This looks good to me!

@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

Yes, it's OK. But we need to change the Python version in the GitHub Actions workflows for pep, lint, etc.

I think the GH actions do not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?

The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

Adjust GitHub Actions workflows to use Python 3.8 instead of 3.10 for consistency. Simplify formatting in `dependencies.py` by condensing the 'nebius' dependency list.
@SalikovAlex
Author

SalikovAlex commented Jan 30, 2025

I think the GH actions do not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?

The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

pylint, pytest, doc build run uv pip install ".[all]"

@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

I think the GH actions do not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?
The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

pylint, pytest, doc build run uv pip install ".[all]"

Humm, good point! cc @romilbhardwaj @Michaelvll for a look here

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix @SalikovAlex! Mostly looks good to me. Left some discussions ;)

sky/adaptors/nebius.py (outdated, resolved)
sky/adaptors/nebius.py (outdated, resolved)
Comment on lines +41 to +45
# Nebius limits instance names to 63 characters (the hostname length limit).
# 63 - 8 - 5 = 50 characters, since:
# - we append 4 UUID characters (`-xxxx`) to make the name unique;
# - our provisioner appends an additional `-worker` suffix.
_MAX_CLUSTER_NAME_LEN_LIMIT = 50
Collaborator

The UUID characters should be included in this limit. Check the AWS and GCP implementations for details.

Author

It's another UUID
#4573 (comment)

sky/clouds/nebius.py (outdated, resolved)
sky/clouds/nebius.py (outdated, resolved)
sky/provision/nebius/instance.py (outdated, resolved)
sky/provision/nebius/instance.py (outdated, resolved)
sky/provision/nebius/instance.py (outdated, resolved)
sky/provision/nebius/instance.py (resolved)
sky/provision/nebius/instance.py (outdated, resolved)
.github/workflows/pylint.yml (outdated, resolved)
sky/setup_files/dependencies.py (outdated, resolved)
# Original issue: https://github.com/ray-project/ray/issues/33833
'grpcio >= 1.32.0, <= 1.51.3, != 1.48.0; python_version < \'3.10\' and sys_platform != \'darwin\'', # noqa:E501 pylint: disable=line-too-long
'grpcio >= 1.42.0, <= 1.51.3, != 1.48.0; python_version >= \'3.10\' and sys_platform != \'darwin\'', # noqa:E501 pylint: disable=line-too-long
'grpcio>=1.56.2',
Collaborator

@cblmemo Could you help run smoke tests for this change on other clouds as well? Also would be great if you can verify

Replaced duplicated logic with assertions, simplified filename management, and unified token/tenant ID handling. Fixed typos (e.g., "creat" to "create"), improved logging, and added Python version checks for dependencies. These changes enhance readability, maintainability, and robustness.
Simplified assertions and cleaned up parameter handling in provisioning logic. Improved variable naming and formatting in dependencies setup for clarity. Removed extra whitespace and adjusted line splitting for better readability.
Updated the loop to properly iterate over dictionary items when processing stopped instances. This ensures correct functionality when starting instances and aligns with standard dictionary iteration practices.
Updated the Python version in `pytest`, `pylint`, and `test-doc-build` workflows from 3.10 to 3.8. This ensures consistency across workflows and accommodates compatibility with Python 3.8.
Switched the Python version from 3.8 to 3.10 in the GitHub Actions workflow for building documentation. This ensures compatibility with newer Python features and aligns with the project's current supported versions.
Add checks to ensure cluster states align with instance statuses during the wait operation, raising errors on mismatches. Enhance comments to explain how the region is inferred from project IDs due to the lack of direct region information retrieval in Nebius.
@SalikovAlex
Author

SalikovAlex commented Jan 31, 2025

Also, why is there a project metadata ID limitation? Is this project a per-user thing? IIUC, if that is the case, different users would have different IDs?

Added this to the comments.

    # To find a project in a specific region, we rely on the project ID to
    # deduce the region, since there is currently no method to retrieve region
    # information directly from the project. Additionally, there is only one
    # project per region, and projects cannot be created at this time.
    # The region is determined from the project ID using a region-specific
    # identifier embedded in it.
    # https://docs.nebius.com/overview/regions

@cblmemo
Collaborator

cblmemo commented Jan 31, 2025

https://docs.nebius.com/overview/regions

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

@SalikovAlex
Author

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

No. I request them via the API. A user's project ID looks like "project-e00xxxxxxxxxxxxxx", where "e00" is the region code.

Where did you find a hardcoded project ID?

@cblmemo
Collaborator

cblmemo commented Feb 1, 2025

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

No. I request them via the API. A user's project ID looks like "project-e00xxxxxxxxxxxxxx", where "e00" is the region code.

Where did you find a hardcoded project ID?

Oh ic! that makes sense. I thought the e00 is some unique value related to your project id.

If that is the case, can we add a comment explaining what the 8-11 is for (iiuc it is the length of project-?) and what e00 & e01 stand for, to reduce confusion? :)) Thanks!
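The slice under discussion can be sketched as follows, assuming the ID format quoted in this thread ("project-" followed by a three-character region code such as "e00"). The function name and constants are hypothetical and not taken from the PR; the sketch only makes the origin of indices 8 and 11 explicit.

```python
_PROJECT_PREFIX = 'project-'  # 8 characters, so the code starts at index 8.
_REGION_CODE_LEN = 3          # e.g. 'e00', so the code ends before index 11.


def region_code_from_project_id(project_id: str) -> str:
    """Return the 3-character region code embedded after 'project-'."""
    if not project_id.startswith(_PROJECT_PREFIX):
        raise ValueError(f'Unexpected project ID format: {project_id!r}')
    start = len(_PROJECT_PREFIX)
    code = project_id[start:start + _REGION_CODE_LEN]
    if len(code) != _REGION_CODE_LEN:
        raise ValueError(f'Project ID too short: {project_id!r}')
    return code
```

Deriving the indices from named constants instead of writing `[8:11]` directly addresses the confusion raised here without needing a long comment.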


Clarified how project IDs correlate to regions and added a note to address improvements once region data is included in the project list. Also provided context on instance deletion flow, explaining the need to wait for disk unlock before retries.
Raise a RuntimeError if the total number of running and stopped instances exceeds the user's requested count. This helps prevent resource leaks and provides clear guidance to use "sky down" to clean up the cluster.
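A minimal sketch of the guard this commit message describes. The function name, parameters, and message wording are assumptions for illustration; only the condition (running plus stopped exceeding the requested count) and the "sky down" hint come from the commit message.

```python
def check_instance_count(num_running: int, num_stopped: int,
                         requested: int, cluster_name: str) -> None:
    """Raise if the cluster already has more instances than requested."""
    total = num_running + num_stopped
    if total > requested:
        raise RuntimeError(
            f'Cluster {cluster_name!r} has {total} instances '
            f'({num_running} running, {num_stopped} stopped) but only '
            f'{requested} were requested. Run `sky down {cluster_name}` '
            'to clean up the cluster.')
```

Failing early here avoids launching further instances on top of leftovers from a previous, partially failed provisioning attempt.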
@SalikovAlex
Author

If that is the case, can we add a comment explaining what the 8-11 is for (iiuc it is the length of project-?) and what e00 & e01 stand for, to reduce confusion? :)) Thanks!

added
