Investigate scale-up failure on GCP to learn from it #2185

Closed
3 tasks done
consideRatio opened this issue Feb 9, 2023 · 1 comment

consideRatio (Contributor) commented Feb 9, 2023

The following was observed in the main 2i2c GCP cluster (project: two-eye-two-see, cluster: pilot-hubs-cluster).

Server requested
2023-02-08T18:24:48Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up
2023-02-08T18:27:02Z [Warning] 0/3 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
Spawn failed: pod aup/jupyter-shaolintl did not start in 600 seconds!

Why was a node group in a "1 in backoff after failed scale-up" state? The time was 2023-02-08T18:24:48Z.
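GKE runs the cluster-autoscaler on the managed control plane, so its decision events end up in Cloud Logging rather than in the cluster itself and can be read after the fact. A minimal sketch of pulling them for the failure window, assuming the gcloud CLI is authenticated against the project and that the autoscaler visibility logs are enabled for this cluster (the log-name filter is an assumption based on GKE's "cluster-autoscaler-visibility" log):

import subprocess

# Cloud Logging filter for autoscaler decision events around the failed scale-up.
log_filter = " AND ".join([
    'resource.type="k8s_cluster"',
    'resource.labels.cluster_name="pilot-hubs-cluster"',
    'logName:"cluster-autoscaler-visibility"',
    'timestamp>="2023-02-08T18:00:00Z"',
    'timestamp<="2023-02-08T19:00:00Z"',
])

# Print the matching log entries oldest-first as JSON.
subprocess.run(
    [
        "gcloud", "logging", "read", log_filter,
        "--project=two-eye-two-see",
        "--order=asc",
        "--format=json",
    ],
    check=True,
)

The noScaleUp events in that log should name the rejected node groups and the reason they were in backoff.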

Action points

Overall, investigate this and learn from it.

  • Verify whether this was really intermittent by triggering a new scale-up via aup.pilot.2i2c.cloud, where this occurred.
    This was intermittent. I triggered a new scale-up by starting a few servers, and the scale-up operation was fast and smooth.
     Server requested
     2023-02-09T08:51:30Z [Warning] 0/4 nodes are available: 2 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
     2023-02-09T08:51:37Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/two-eye-two-see/zones/us-central1-b/instanceGroups/gke-pilot-hubs-cluster-nb-user-05192705-grp 2->3 (max: 20)}]
     2023-02-09T08:52:22Z [Normal] Successfully assigned aup/jupyter-consideratio to gke-pilot-hubs-cluster-nb-user-05192705-z2l8
     2023-02-09T08:52:23Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "03d377c0372fa1036b3ff5d4a20cdf9fcacbc2f0c488199574a1b53d272edb84": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
     2023-02-09T08:52:26Z [Normal] Cancelling deletion of Pod aup/jupyter-consideratio
     2023-02-09T08:52:34Z [Normal] Pulling image "busybox"
    
  • If this doesn't reproduce to give us a fresh error to understand, consider whether we can retroactively get cluster-autoscaler logs from GCP, or similar, to learn more (see the Cloud Logging sketch above).
  • Get node pools upgraded to the control plane's version (a sketch of a quick version check follows this list).
    I've upgraded the dask-worker node pool and the core node pool to k8s 1.22 to match the control plane version. I've not upgraded the user node pool from 1.20 to 1.22, as I think that requires us to declare a maintenance window ahead of time: it currently has roughly 5-10 active users, and they would be fully kicked out of their servers.
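A rough sketch of the version check referenced above, assuming gcloud access to the two-eye-two-see project; the --zone flag is a guess based on the instance group URL in the events above and would need adjusting if the cluster is regional:

import json
import subprocess

def gcloud_json(*args):
    """Run a gcloud command with JSON output and return the parsed result."""
    out = subprocess.run(
        ["gcloud", *args, "--project=two-eye-two-see", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

# Control plane version of the cluster.
cluster = gcloud_json(
    "container", "clusters", "describe", "pilot-hubs-cluster", "--zone=us-central1-b",
)
print("control plane:", cluster["currentMasterVersion"])

# Version of each node pool, to compare against the control plane.
for pool in gcloud_json(
    "container", "node-pools", "list",
    "--cluster=pilot-hubs-cluster", "--zone=us-central1-b",
):
    print(f"{pool['name']}: {pool['version']}")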

Future

Work related to this will be tracked in #2157.

consideRatio (Contributor, Author) commented Feb 9, 2023

It seems the k8s cluster (control plane) has been upgraded but the node pools haven't been. Let's settle on that as the issue. They should be at most two minor versions out of sync, and now we have a core pool three minor versions out of sync, which breaks k8s version-skew assumptions.

(screenshots omitted)
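For the record, the skew can also be checked from the cluster side by comparing each node's kubelet version against the API server's minor version, for example with the kubernetes Python client. A sketch, assuming a kubeconfig for this cluster is available locally (the two-minor-version threshold reflects the version-skew policy at the time):

from kubernetes import client, config

config.load_kube_config()

# API server minor version, e.g. "22+" on GKE -> 22.
server_minor = int(client.VersionApi().get_code().minor.rstrip("+"))

for node in client.CoreV1Api().list_node().items:
    # kubelet_version looks like "v1.20.15-gke.3600".
    kubelet_minor = int(node.status.node_info.kubelet_version.split(".")[1])
    skew = server_minor - kubelet_minor
    marker = "  <-- out of skew policy" if skew > 2 else ""
    print(f"{node.metadata.name}: kubelet minor {kubelet_minor}, skew {skew}{marker}")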
