[k8s] Unexpected error when relaunching an INIT cluster on k8s which failed due to capacity error #4625

Open
Michaelvll opened this issue Jan 30, 2025 · 0 comments

To reproduce:

  1. Launch a managed job with the controller on Kubernetes, using the following ~/.sky/config.yaml:
jobs:
  controller:
    resources:
      cpus: 2
      cloud: kubernetes
$ sky jobs launch test.yaml --cloud aws --cpus 2 -n test-mount-bucket
Task from YAML spec: test.yaml
Managed job 'test-mount-bucket' will be launched on (estimated):
Considered resources (1 node):
----------------------------------------------------------------------------------------
 CLOUD   INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
----------------------------------------------------------------------------------------
 AWS     m6i.large   2       8         -              us-east-1     0.10          ✔     
----------------------------------------------------------------------------------------
Launching a managed job 'test-mount-bucket'. Proceed? [Y/n]: 
⚙︎ Translating workdir and file_mounts with local source paths to SkyPilot Storage...
  Workdir: 'examples' -> storage: 'skypilot-filemounts-vscode-904d206c'.
  Folder : 'examples' -> storage: 'skypilot-filemounts-vscode-904d206c'.
  Created S3 bucket 'skypilot-filemounts-vscode-904d206c' in us-east-1
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: examples -> s3://skypilot-filemounts-vscode-904d206c/  View logs at: ~/sky_logs/sky-2025-01-30-23-19-02-003572/storage_sync.log
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: examples -> s3://skypilot-filemounts-vscode-904d206c/  View logs at: ~/sky_logs/sky-2025-01-30-23-19-09-895566/storage_sync.log
✓ Uploaded local files/folders.
Launching managed job 'test-mount-bucket' from jobs controller...
Warning: Credentials used for [GCP, AWS] may expire. Clusters may be leaked if the credentials expire while jobs are running. It is recommended to use credentials that never expire or a service account.
⚙︎ Launching managed jobs controller on Kubernetes.
W 01-30 23:19:33 instance.py:863] run_instances: Error occurred when creating pods: sky.provision.kubernetes.config.KubernetesError: Insufficient memory capacity on the cluster. Required resources (cpu=4, memory=34359738368) were not found in a single node. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`. Full error: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

sky.provision.kubernetes.config.KubernetesError: Insufficient memory capacity on the cluster. Required resources (cpu=4, memory=34359738368) were not found in a single node. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`.
Full error: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

During handling of the above exception, another exception occurred:

NotImplementedError

The above exception was the direct cause of the following exception:

sky.provision.common.StopFailoverError: During provisioner's failover, stopping 'sky-jobs-controller-11d9a692' failed. We cannot stop the resources launched, as it is not supported by Kubernetes. Please try launching the cluster again, or terminate it with: sky down sky-jobs-controller-11d9a692
  2. Launch again:
$ sky jobs launch test.yaml --cloud aws --cpus 2 -n test-mount-bucket
Task from YAML spec: test.yaml
Managed job 'test-mount-bucket' will be launched on (estimated):
Considered resources (1 node):
----------------------------------------------------------------------------------------
 CLOUD   INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
----------------------------------------------------------------------------------------
 AWS     m6i.large   2       8         -              us-east-1     0.10          ✔     
----------------------------------------------------------------------------------------
Launching a managed job 'test-mount-bucket'. Proceed? [Y/n]: 
⚙︎ Translating workdir and file_mounts with local source paths to SkyPilot Storage...
  Workdir: 'examples' -> storage: 'skypilot-filemounts-vscode-b7ba6a41'.
  Folder : 'examples' -> storage: 'skypilot-filemounts-vscode-b7ba6a41'.
  Created S3 bucket 'skypilot-filemounts-vscode-b7ba6a41' in us-east-1
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: examples -> s3://skypilot-filemounts-vscode-b7ba6a41/  View logs at: ~/sky_logs/sky-2025-01-30-23-20-51-067815/storage_sync.log
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: examples -> s3://skypilot-filemounts-vscode-b7ba6a41/  View logs at: ~/sky_logs/sky-2025-01-30-23-20-58-164407/storage_sync.log
✓ Uploaded local files/folders.
Launching managed job 'test-mount-bucket' from jobs controller...
Warning: Credentials used for [AWS, GCP] may expire. Clusters may be leaked if the credentials expire while jobs are running. It is recommended to use credentials that never expire or a service account.
Cluster 'sky-jobs-controller-11d9a692' (status: INIT) was previously in Kubernetes (gke_sky-dev-465_us-central1-c_skypilotalpha). Restarting.
⚙︎ Launching managed jobs controller on Kubernetes.
⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2025-01-30-23-21-05-243052/provision.log

AssertionError: cpu_request should not be None
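The task YAML referenced in the launch commands above (test.yaml) is not attached. A minimal sketch that is consistent with the log output (a workdir of 'examples' and an 'examples' file mount translated to an S3 bucket) could look like the following; the mount destination and run command are hypothetical placeholders, not the actual task used:

# Hypothetical reconstruction of test.yaml; the real task file was not included in this report.
workdir: examples

file_mounts:
  /mnt/examples: examples  # placeholder destination for the local 'examples' folder

run: |
  ls /mnt/examples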
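Until the relaunch path handles this case, a likely workaround is the one already suggested by the StopFailoverError message: tear the INIT controller down before retrying. This is an assumption and has not been verified here:

$ sky down sky-jobs-controller-11d9a692
$ sky jobs launch test.yaml --cloud aws --cpus 2 -n test-mount-bucket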