Add Nebius Cloud #4573
base: master
Thanks @SalikovAlex for this amazing work! I think the PR is in good shape. Left some nits. At the same time, could you help test the basic functionality of this new cloud? Including but not limited to:
- Launch CPU only instance
- Launch GPU instance
- Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat them after re-launch)
- Autostop & Autodown
- Launch on existing cluster
- SSH to the cluster
- Failover: make sure it can fail over from Nebius to other clouds and the exceptions are printed correctly
- Launch on other clouds without Nebius dependencies installed (make sure it does not introduce unnecessary dependencies when Nebius is not enabled)
Split instance launch configurations to use separate platform and preset values, enabling better resource distinction. Added support for ImageId configuration in templates and clarified GPU cluster fabric requirements in related functions. Enhanced unsupported feature descriptions for Nebius cloud.
Mark Nebius as unsupported for various test cases due to its lack of support for specific GPUs, features, or configurations. This ensures accurate test coverage and avoids unnecessary failures on incompatible platforms.
Replaced the specific Git-based Nebius SDK reference with a version constraint (`nebius>=0.2.0`). This simplifies dependency management and ensures compatibility with future updates of the SDK.
Implemented retry mechanisms for starting and stopping instances in Nebius. Updated file paths for IAM token and tenant ID to include the home directory for better user guidance. Adjusted dependency and testing configurations to disable autostop functionality.
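A minimal sketch of the retry behaviour this commit describes, not the code from the PR. The helper name `retry`, its parameters, and the fixed delay are assumptions for illustration; the real start/stop calls would go through the Nebius SDK.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar('T')


def retry(operation: Callable[[], T],
          max_attempts: int = 3,
          delay_seconds: float = 1.0,
          retry_on: Tuple[Type[BaseException], ...] = (Exception,),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or the attempts run out.

    Hypothetical helper: the name, defaults, and fixed delay are
    assumptions, not taken from the PR.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(delay_seconds)
    raise AssertionError('unreachable')
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.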
Revised instance naming logic by appending UUIDs to ensure uniqueness and clarity. Adjusted maximum cluster name length limit to accommodate the updated naming convention. Updated smoke tests and added or revised `no_nebius` marks to reflect current Nebius capabilities.
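As a rough illustration of the naming change, the sketch below appends four hex characters from a UUID and a role suffix, then checks the result against the 63-character hostname limit. The function name `make_instance_name` and the `<cluster>-<uuid4>-<role>` layout are assumptions; only the 63 and 50 limits come from the PR.

```python
import uuid

# Limits quoted in the PR; everything else here is hypothetical.
MAX_HOSTNAME_LEN = 63
MAX_CLUSTER_NAME_LEN = 50
ALLOWED_ROLES = ('head', 'worker')


def make_instance_name(cluster_name: str, role: str = 'worker') -> str:
    """Build `<cluster_name>-<4 hex chars>-<role>` within the hostname limit."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f'Unknown role: {role!r}')
    if len(cluster_name) > MAX_CLUSTER_NAME_LEN:
        raise ValueError(
            f'Cluster name is {len(cluster_name)} characters long; '
            f'the limit is {MAX_CLUSTER_NAME_LEN}.')
    # Four hex characters from a random UUID keep names unique enough
    # for this sketch.
    suffix = uuid.uuid4().hex[:4]
    name = f'{cluster_name}-{suffix}-{role}'
    if len(name) > MAX_HOSTNAME_LEN:
        raise ValueError(f'Instance name {name!r} exceeds the hostname limit.')
    return name
```

With a 50-character cluster name and the `worker` role, the result is 62 characters long, which stays under the hostname limit.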
Added a no-op `open_ports` method for Nebius instances, as all ports are open by default. Updated features and tests to reflect the support for open ports while maintaining Nebius's behavior. Removed restrictions on open ports in the supported features list.
Unresolved problems:
Feature request:
Enclosed the Python version in quotes to ensure proper YAML parsing. This change prevents potential issues with version interpretation in the CI pipeline.
Added error handling for missing IAM token and tenant ID files in `get_iam_token` and `get_tenant_id` functions, returning `None` if the files are not found. Updated `check_credentials` to provide clearer error messages with setup instructions when credentials are incomplete.
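The error handling described here could look roughly like the sketch below. It is not the PR's code: `read_credential` is a hypothetical stand-in for `get_iam_token` and `get_tenant_id`, the file paths are supplied by the caller, and the message wording is an assumption.

```python
import os
from typing import Optional, Tuple


def read_credential(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the file is missing."""
    try:
        with open(os.path.expanduser(path), encoding='utf-8') as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def check_credentials(token_path: str,
                      tenant_path: str) -> Tuple[bool, Optional[str]]:
    """Report whether both credential files exist and are non-empty."""
    missing = [
        path for path in (token_path, tenant_path)
        if not read_credential(path)
    ]
    if missing:
        # Hypothetical message; the PR's actual setup instructions differ.
        return False, ('Nebius credentials are incomplete. Missing or empty: '
                       + ', '.join(missing)
                       + '. Create these files and try again.')
    return True, None
```

Returning `None` instead of raising lets the credential check collect every missing file into one message.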
Reorganized initialization steps to include configurable setup commands and handled potential apt lock issues. Disabled unnecessary unattended-upgrades to prevent conflicts and ensured smoother package management during setup.
Simplify instance filtering logic for Nebius by removing redundant name checks. Add consistent `@pytest.mark.no_nebius` annotations in tests to reflect unsupported Autodown and Autostop features. Clean up unused Nebius-specific test case.
The `no_nebius` marker was added to skip the test where Autodown and Autostop are not supported. This ensures the test suite runs correctly on environments where these features are unavailable.
All tests are marked, now
Hey @SalikovAlex thanks for the amazing work!
Perhaps we can pin
This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.
BTW, we may need to fix the nebius adaptor to lazy import nebius only when required. E.g., on this branch if I install
Centralize Nebius SDK initialization by removing global SDK variables and replacing them with direct calls to `nebius.sdk()`. Introduce caching for IAM tokens to reduce redundant file reads. Also, update GRPC logging settings to suppress unnecessary console output.
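Two ideas from this part of the thread, lazy importing the SDK and caching the IAM token, can be sketched as follows. This is an illustration under assumptions: the class `LazyImport` and the function `get_iam_token` shown here are hypothetical and not the adaptor code in the PR.

```python
import functools
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyImport:
    """Defer importing a module until one of its attributes is used."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: Optional[ModuleType] = None

    def load(self) -> ModuleType:
        if self._module is None:
            # The import (and any ImportError) happens here, on first use.
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Only called for attributes not found on the instance itself,
        # so the private fields above never trigger an import.
        return getattr(self.load(), name)


@functools.lru_cache(maxsize=1)
def get_iam_token(path: str) -> str:
    """Read the token file once and reuse the result on later calls."""
    with open(path, encoding='utf-8') as f:
        return f.read().strip()
```

With this shape, creating `LazyImport('nebius')` at module level would not fail when the SDK is absent; the error would surface only when an attribute is first accessed. The cached token stays in memory until `get_iam_token.cache_clear()` is called.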
Fixed
Simplified grpcio dependency to version >=1.56.2 and removed redundant version constraints and outdated comments. Also removed unnecessary ray dependency under 'nebius', ensuring a cleaner dependency structure.
Yes, it's ok. But we need to change the version in GitHub Actions for pep, lint, etc.
This fixes a syntax issue in the `remote` list by adding a missing comma after `grpcio>=1.56.2`. This ensures the dependencies are parsed correctly.
This looks good to me!
I think the GH Actions do not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed? The tests that provision VMs and run workloads are located in the
Adjust GitHub Actions workflows to use Python 3.8 instead of 3.10 for consistency. Simplify formatting in `dependencies.py` by condensing the 'nebius' dependency list.
pylint, pytest, doc build run
Hmm, good point! cc @romilbhardwaj @Michaelvll for a look here
Thanks for the prompt fix @SalikovAlex! Mostly looks good to me. Left some discussions ;)
# Nebius maximum instance name length defined as <= 63 as a hostname length
# 63 - 8 - 5 = 50 characters since
# we add 4 character from UUID to make uniq `-xxxx`
# our provisioner adds additional `-worker`.
_MAX_CLUSTER_NAME_LEN_LIMIT = 50
The UUID characters should be included in this limit. Check the AWS and GCP implementations for details.
It's another UUID
#4573 (comment)
# Original issue: https://github.com/ray-project/ray/issues/33833
'grpcio >= 1.32.0, <= 1.51.3, != 1.48.0; python_version < \'3.10\' and sys_platform != \'darwin\'', # noqa:E501 pylint: disable=line-too-long
'grpcio >= 1.42.0, <= 1.51.3, != 1.48.0; python_version >= \'3.10\' and sys_platform != \'darwin\'', # noqa:E501 pylint: disable=line-too-long
'grpcio>=1.56.2',
@cblmemo Could you help run smoke tests for this change on other clouds as well? Also would be great if you can verify
Replaced duplicated logic with assertions, simplified filename management, and unified token/tenant ID handling. Fixed typos (e.g., "creat" to "create"), improved logging, and added Python version checks for dependencies. These changes enhance readability, maintainability, and robustness.
Simplified assertions and cleaned up parameter handling in provisioning logic. Improved variable naming and formatting in dependencies setup for clarity. Removed extra whitespace and adjusted line splitting for better readability.
Updated the loop to properly iterate over dictionary items when processing stopped instances. This ensures correct functionality when starting instances and aligns with standard dictionary iteration practices.
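For context on why this matters: iterating over a dictionary directly yields only its keys, so a loop that needs both the instance ID and its details has to go through `.items()`. The sketch below assumes a hypothetical mapping of instance IDs to info dictionaries with a `status` field; the actual data structure in the PR may differ.

```python
from typing import Dict, List


def instances_to_start(
        stopped_instances: Dict[str, Dict[str, str]]) -> List[str]:
    """Collect the IDs of instances whose recorded status is STOPPED."""
    to_start = []
    # `.items()` yields (key, value) pairs; a bare `for x in dict` loop
    # would yield the keys only.
    for instance_id, info in stopped_instances.items():
        if info.get('status') == 'STOPPED':
            to_start.append(instance_id)
    return to_start
```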
Updated the Python version in `pytest`, `pylint`, and `test-doc-build` workflows from 3.10 to 3.8. This ensures consistency across workflows and accommodates compatibility with Python 3.8.
Switched the Python version from 3.8 to 3.10 in the GitHub Actions workflow for building documentation. This ensures compatibility with newer Python features and aligns with the project's current supported versions.
Add checks to ensure cluster states align with instance statuses during the wait operation, raising errors on mismatches. Enhance comments to explain how the region is inferred from project IDs due to the lack of direct region information retrieval in Nebius.
Added this to the comments.
Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?
No. I'm requesting them via the API. The user's project id looks like "project-e00xxxxxxxxxxxxxx", where "e00" is the code of the region. Where did you find a hardcoded project id?
Clarified how project IDs correlate to regions and added a note to address improvements once region data is included in the project list. Also provided context on instance deletion flow, explaining the need to wait for disk unlock before retries.
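The project-ID convention discussed in this thread ("project-e00xxxxxxxxxxxxxx", where "e00" is the region code) can be sketched as below. The function names and the code-to-region mapping passed in by the caller are assumptions; the region name used in the usage note is a placeholder, not a real Nebius region.

```python
from typing import Mapping

PROJECT_PREFIX = 'project-'
REGION_CODE_LEN = 3  # e.g. "e00" in "project-e00xxxxxxxxxxxxxx"


def region_code_from_project_id(project_id: str) -> str:
    """Extract the three-character region code that follows `project-`."""
    if not project_id.startswith(PROJECT_PREFIX):
        raise ValueError(f'Unexpected project id format: {project_id!r}')
    start = len(PROJECT_PREFIX)
    code = project_id[start:start + REGION_CODE_LEN]
    if len(code) != REGION_CODE_LEN:
        raise ValueError(f'Project id is too short: {project_id!r}')
    return code


def region_for_project(project_id: str,
                       code_to_region: Mapping[str, str]) -> str:
    """Look up the region name for a project id (mapping is hypothetical)."""
    code = region_code_from_project_id(project_id)
    if code not in code_to_region:
        raise ValueError(f'Unknown region code {code!r} in {project_id!r}')
    return code_to_region[code]
```

For example, `region_for_project('project-e00abc', {'e00': 'example-region'})` returns `'example-region'`.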
Raise a RuntimeError if the total number of running and stopped instances exceeds the user's requested count. This helps prevent resource leaks and provides clear guidance to use "sky down" to clean up the cluster.
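A minimal sketch of this guard, assuming the instance counts are already known. The function name, parameters, and message wording are hypothetical; only the `RuntimeError` and the `sky down` hint come from the commit description.

```python
def check_instance_count(running: int, stopped: int, requested: int,
                         cluster_name: str) -> None:
    """Raise if the cluster already has more instances than requested."""
    total = running + stopped
    if total > requested:
        raise RuntimeError(
            f'Cluster {cluster_name!r} has {total} instances '
            f'({running} running, {stopped} stopped), more than the '
            f'{requested} requested. Run `sky down {cluster_name}` to '
            'clean up the cluster.')
```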
Added
Add support for Nebius Cloud
Tested (run the relevant ones):
- run `sky launch -c test-single-instance --cloud nebius echo hi`
- run `sky stop test-single-instance; sky start test-single-instance`
- run `sky down test-single-instance`
- run `sky launch -c test-single-instance --cloud nebius echo hi; sky stop test-single-instance; sky down test-single-instance`
- [UNSUPPORTED] `sky launch --cloud nebius -c test-autostop -i 1 echo hi`
- [UNSUPPORTED] `sky launch --cloud nebius -c test-autodown -i 1 --down echo hi`
- `sky launch examples/multi_hostname.yaml --cloud nebius`
- Code formatting: `bash format.sh`
- All smoke tests: `pytest tests/test_smoke.py`
- Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`