Investigate scale-up failure on GCP to learn from it #2185

Closed
3 tasks done
consideRatio opened this issue Feb 9, 2023 · 1 comment

consideRatio (Contributor) commented Feb 9, 2023

The following was observed in the main 2i2c GCP cluster (project: two-eye-two-see, cluster: pilot-hubs-cluster).

Server requested
2023-02-08T18:24:48Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up
2023-02-08T18:27:02Z [Warning] 0/3 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
Spawn failed: pod aup/jupyter-shaolintl did not start in 600 seconds!

Why was a node group in a "1 in backoff after failed scale-up" state? The time was 2023-02-08T18:24:48Z.
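GKE runs the cluster-autoscaler on the managed control plane, so its decision events end up in Cloud Logging rather than in the cluster itself and can be read after the fact. A minimal sketch of pulling them for the failure window, assuming the gcloud CLI is authenticated against the project and that the autoscaler visibility logs are enabled for this cluster (the log-name filter is an assumption based on GKE's "cluster-autoscaler-visibility" log):

import subprocess

# Cloud Logging filter for autoscaler decision events around the failed scale-up.
log_filter = " AND ".join([
    'resource.type="k8s_cluster"',
    'resource.labels.cluster_name="pilot-hubs-cluster"',
    'logName:"cluster-autoscaler-visibility"',
    'timestamp>="2023-02-08T18:00:00Z"',
    'timestamp<="2023-02-08T19:00:00Z"',
])

# Print the matching log entries oldest-first as JSON.
subprocess.run(
    [
        "gcloud", "logging", "read", log_filter,
        "--project=two-eye-two-see",
        "--order=asc",
        "--format=json",
    ],
    check=True,
)

The noScaleUp events in that log should name the rejected node groups and the reason they were in backoff.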

Action points

Overall, investigate this and learn from it.

  • Verify whether this was really intermittent by triggering a new scale-up via aup.pilot.2i2c.cloud, where this occurred.
    This was intermittent. I triggered a new scale-up by starting a few servers, and the scale-up operation was fast and smooth.
     Server requested
     2023-02-09T08:51:30Z [Warning] 0/4 nodes are available: 2 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
     2023-02-09T08:51:37Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/two-eye-two-see/zones/us-central1-b/instanceGroups/gke-pilot-hubs-cluster-nb-user-05192705-grp 2->3 (max: 20)}]
     2023-02-09T08:52:22Z [Normal] Successfully assigned aup/jupyter-consideratio to gke-pilot-hubs-cluster-nb-user-05192705-z2l8
     2023-02-09T08:52:23Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "03d377c0372fa1036b3ff5d4a20cdf9fcacbc2f0c488199574a1b53d272edb84": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
     2023-02-09T08:52:26Z [Normal] Cancelling deletion of Pod aup/jupyter-consideratio
     2023-02-09T08:52:34Z [Normal] Pulling image "busybox"
    
  • If this doesn't reproduce to give us a fresh error to understand, consider whether we can retroactively get cluster-autoscaler logs from GCP, or similar, to learn more (see the Cloud Logging sketch above).
  • Get node pools upgraded to the control plane's version (a sketch of a quick version check follows this list).
    I've upgraded the dask-worker node pool and the core node pool to k8s 1.22 to match the control plane version. I've not upgraded the user node pool from 1.20 to 1.22, as I think that requires us to declare a maintenance window ahead of time: it currently has roughly 5-10 active users, and they would be fully kicked out of their servers.
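A rough sketch of the version check referenced above, assuming gcloud access to the two-eye-two-see project; the --zone flag is a guess based on the instance group URL in the events above and would need adjusting if the cluster is regional:

import json
import subprocess

def gcloud_json(*args):
    """Run a gcloud command with JSON output and return the parsed result."""
    out = subprocess.run(
        ["gcloud", *args, "--project=two-eye-two-see", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

# Control plane version of the cluster.
cluster = gcloud_json(
    "container", "clusters", "describe", "pilot-hubs-cluster", "--zone=us-central1-b",
)
print("control plane:", cluster["currentMasterVersion"])

# Version of each node pool, to compare against the control plane.
for pool in gcloud_json(
    "container", "node-pools", "list",
    "--cluster=pilot-hubs-cluster", "--zone=us-central1-b",
):
    print(f"{pool['name']}: {pool['version']}")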

Future

Work related to this will be tracked in #2157.

consideRatio (Contributor, Author) commented Feb 9, 2023

It seems the k8s cluster (control plane) has been upgraded but the node pools haven't been. Let's settle on that as the issue. They should be at most two minor versions out of sync, and now we have a core pool three minor versions out of sync, which breaks k8s version-skew assumptions.

(screenshots omitted)
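For the record, the skew can also be checked from the cluster side by comparing each node's kubelet version against the API server's minor version, for example with the kubernetes Python client. A sketch, assuming a kubeconfig for this cluster is available locally (the two-minor-version threshold reflects the version-skew policy at the time):

from kubernetes import client, config

config.load_kube_config()

# API server minor version, e.g. "22+" on GKE -> 22.
server_minor = int(client.VersionApi().get_code().minor.rstrip("+"))

for node in client.CoreV1Api().list_node().items:
    # kubelet_version looks like "v1.20.15-gke.3600".
    kubelet_minor = int(node.status.node_info.kubelet_version.split(".")[1])
    skew = server_minor - kubelet_minor
    marker = "  <-- out of skew policy" if skew > 2 else ""
    print(f"{node.metadata.name}: kubelet minor {kubelet_minor}, skew {skew}{marker}")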
