The following was observed in the main 2i2c GCP cluster (project: two-eye-two-see, cluster: pilot-hubs-cluster).
Server requested
2023-02-08T18:24:48Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up
2023-02-08T18:27:02Z [Warning] 0/3 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
Spawn failed: pod aup/jupyter-shaolintl did not start in 600 seconds!
Why was a node in the "1 in backoff after failed scale-up" state? The event was logged at 2023-02-08T18:24:48Z.
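To make the "didn't trigger scale-up" event easier to read, here is a minimal sketch that splits the message into its individual count/reason pairs. The message is copied from the log above; the helper name `split_reasons` is hypothetical, and the parsing assumes each reason starts with a number, which holds for this message but is not a documented format.

```python
import re

# Event message copied verbatim from the 2023-02-08T18:24:48Z log line.
MESSAGE = (
    "pod didn't trigger scale-up: 1 node(s) didn't match Pod's node "
    "affinity/selector, 1 node(s) had taint {k8s.dask.org_dedicated: worker}, "
    "that the pod didn't tolerate, 1 in backoff after failed scale-up"
)


def split_reasons(message):
    """Return (count, reason) pairs from a "didn't trigger scale-up" message.

    Hypothetical helper: assumes every reason begins with an integer count.
    """
    # Keep only the part after the "pod didn't trigger scale-up: " prefix.
    _, _, details = message.partition(": ")
    # Split on commas that are followed by a count, so the comma inside the
    # taint reason ("..., that the pod didn't tolerate") is left alone.
    parts = re.split(r",\s+(?=\d+\s)", details)
    reasons = []
    for part in parts:
        count, _, reason = part.partition(" ")
        reasons.append((int(count), reason))
    return reasons


if __name__ == "__main__":
    for count, reason in split_reasons(MESSAGE):
        print(count, "|", reason)
```

Split this way, the backoff entry stands out as its own reason, separate from the affinity/selector mismatch and the dask worker taint.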
Action points
Overall investigate and learn.
Verify whether this was really intermittent by triggering a scale-up in the 2i2c cluster, using aup.pilot.2i2c.cloud (where this occurred) to start pods.
This was intermittent. I've triggered a new scale-up by starting a few servers, and the scale-up went quickly and smoothly.
Server requested
2023-02-09T08:51:30Z [Warning] 0/4 nodes are available: 2 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
2023-02-09T08:51:37Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/two-eye-two-see/zones/us-central1-b/instanceGroups/gke-pilot-hubs-cluster-nb-user-05192705-grp 2->3 (max: 20)}]
2023-02-09T08:52:22Z [Normal] Successfully assigned aup/jupyter-consideratio to gke-pilot-hubs-cluster-nb-user-05192705-z2l8
2023-02-09T08:52:23Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "03d377c0372fa1036b3ff5d4a20cdf9fcacbc2f0c488199574a1b53d272edb84": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2023-02-09T08:52:26Z [Normal] Cancelling deletion of Pod aup/jupyter-consideratio
2023-02-09T08:52:34Z [Normal] Pulling image "busybox"
If this doesn't reproduce and give us a fresh error to study, consider whether we can retroactively get logs from the cluster-autoscaler on GCP, or similar, to learn more.
Get node pools upgraded to the control plane's version
I've upgraded the dask-worker and core node pools to k8s 1.22 to match the control plane version. I've not upgraded the user node pool from 1.20 to 1.22, as I think that requires us to declare a maintenance window ahead of time: it currently has ~5-10 active users, and they would be fully kicked out of their servers.
It seems the k8s cluster has been upgraded but the node pools haven't been. Let's settle on that as the issue. Node pools should be at most two minor versions out of sync with the control plane, and now we have a core pool three minor versions out of sync, which breaks k8s assumptions.
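The version skew reasoning above can be sketched as a small check. This is an illustration only: the function names and the `pools` mapping are hypothetical, the versions are the ones mentioned in this issue after the dask-worker and core upgrades, and `MAX_MINOR_SKEW = 2` encodes the two-minor-version limit stated above rather than anything read from the cluster.

```python
# Limit stated in this issue: node pools may lag the control plane by at most
# two minor versions.
MAX_MINOR_SKEW = 2


def minor(version):
    """Extract the minor number from a version such as '1.22' or 'v1.22.15'."""
    return int(version.lstrip("v").split(".")[1])


def minor_skew(control_plane, node_pool):
    """Number of minor versions the node pool is behind the control plane."""
    return minor(control_plane) - minor(node_pool)


def within_skew(control_plane, node_pool):
    """True if the pool is no newer than the control plane and within the limit."""
    return 0 <= minor_skew(control_plane, node_pool) <= MAX_MINOR_SKEW


if __name__ == "__main__":
    control_plane = "1.22"
    # Hypothetical mapping of pool label to version, using the versions
    # mentioned in this issue.
    pools = {"dask-worker": "1.22", "core": "1.22", "user": "1.20"}
    for name, version in pools.items():
        status = "ok" if within_skew(control_plane, version) else "out of range"
        print(name, version, "skew", minor_skew(control_plane, version), status)
```

With these numbers the user pool at 1.20 sits exactly at the limit, so it is still within range but has no headroom left for another control plane upgrade.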
Future
Work related to this will be tracked in #2157.