[FluidStack] Add NVLINK GPUs #3954

Merged 2 commits on Mar 3, 2025

6 changes: 4 additions & 2 deletions sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
@@ -168,8 +168,8 @@

 GPU_MAP = {
     'H100_PCIE_80GB': 'H100',
-    'H100_NVLINK_80GB': 'H100',
-    'A100_NVLINK_80GB': 'A100-80GB',
+    'H100_NVLINK_80GB': 'H100-NVLINK',
+    'A100_NVLINK_80GB': 'A100-80GB-NVLINK',
     'A100_SXM4_80GB': 'A100-80GB-SXM',
     'H100_SXM5_80GB': 'H100-SXM',
     'A100_PCIE_80GB': 'A100-80GB',
@@ -206,6 +206,8 @@ def create_catalog(output_dir: str) -> None:
     with open(DEFAULT_FLUIDSTACK_API_KEY_PATH, 'r', encoding='UTF-8') as f:
         api_key = f.read().strip()
     response = requests.get(ENDPOINT, headers={'api-key': api_key})
+    if not response.ok:
+        raise Exception(response.text)
     plans = response.json()

     with open(os.path.join(output_dir, 'vms.csv'), mode='w',
2 changes: 1 addition & 1 deletion sky/provision/fluidstack/instance.py
@@ -298,7 +298,7 @@ def query_instances(
         'pending': status_lib.ClusterStatus.INIT,
         'stopped': status_lib.ClusterStatus.STOPPED,
         'running': status_lib.ClusterStatus.UP,
-        'unhealthy': status_lib.ClusterStatus.INIT,
+        'failed': status_lib.ClusterStatus.INIT,
Collaborator

Why change this? Should we add a new entry instead?

Contributor Author

"unhealthy" is no longer used; it was refactored to "failed".
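
To make the effect of the rename concrete, here is a minimal sketch of the lookup, assuming the provider state is resolved through a plain dict. `ClusterStatus` below is a local stand-in for `status_lib.ClusterStatus`, and `STATUS_MAP` and `to_cluster_status` are hypothetical names used only for illustration, not SkyPilot's own.

```python
import enum
from typing import Dict, Optional


class ClusterStatus(enum.Enum):
    """Local stand-in for status_lib.ClusterStatus (assumption)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


# Mirrors only the entries visible in this hunk, after the rename.
STATUS_MAP: Dict[str, Optional[ClusterStatus]] = {
    'pending': ClusterStatus.INIT,
    'stopped': ClusterStatus.STOPPED,
    'running': ClusterStatus.UP,
    'failed': ClusterStatus.INIT,
    'terminated': None,
}


def to_cluster_status(provider_state: str) -> Optional[ClusterStatus]:
    """Hypothetical helper: translate a provider state, rejecting unmapped ones."""
    if provider_state not in STATUS_MAP:
        raise ValueError(f'Unmapped FluidStack state: {provider_state!r}')
    return STATUS_MAP[provider_state]
```

In this sketch, `to_cluster_status('failed')` resolves to `ClusterStatus.INIT`, while the retired `'unhealthy'` state is rejected unless a separate entry is kept for it.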

Contributor Author

@mjibril, Oct 17, 2024

Thanks for the review, @cblmemo!

H100-NVLINK does not appear in the list of GPUs because the catalog is fetched from the SkyPilot catalog repository, which is itself generated from the code currently on SkyPilot's main branch. That code does not yet contain the new mapping, so the GPU is not listed.

To view the new GPU locally, fetch the catalog from FluidStack using the code from the forked repo:

python3 sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
cp fluidstack/vms.csv ~/.sky/catalogs/v5/fluidstack/vms.csv 
sky show-gpus --cloud fluidstack -a

Before fetching the catalog from the FluidStack API, the API key (obtainable from the dashboard) also needs to be saved at ~/.fluidstack/api_key.
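
The local setup described in this comment could be wrapped in two small helpers. This is only a sketch: `read_api_key` and `install_catalog` are hypothetical names that do not exist in SkyPilot, and the default paths are simply the ones quoted in the comment. The first helper checks that the key file is present before the fetcher runs; the second performs the `cp` step and creates the destination directory if it is missing.

```python
import os
import shutil


def read_api_key(path: str = '~/.fluidstack/api_key') -> str:
    """Hypothetical helper: return the stored API key or raise a clear error."""
    full_path = os.path.expanduser(path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f'FluidStack API key not found at {full_path}')
    with open(full_path, 'r', encoding='UTF-8') as f:
        api_key = f.read().strip()
    if not api_key:
        raise ValueError(f'FluidStack API key file {full_path} is empty')
    return api_key


def install_catalog(
        src: str,
        dst: str = '~/.sky/catalogs/v5/fluidstack/vms.csv') -> str:
    """Hypothetical helper: copy a freshly fetched vms.csv into the local catalog."""
    dst_full = os.path.expanduser(dst)
    # Create the catalog directory if it does not exist yet.
    os.makedirs(os.path.dirname(dst_full), exist_ok=True)
    shutil.copyfile(src, dst_full)
    return dst_full
```

Calling `read_api_key()` before running the fetcher surfaces a missing key early, and `install_catalog('fluidstack/vms.csv')` would stand in for the manual copy, after which `sky show-gpus --cloud fluidstack -a` can be run as above.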

         'terminated': None,
     }
     statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}