What happened?
During an AKS node upgrade, the node running the -0 pod did not wait for the pod to stop, leaving its data corrupted. The other 2 nodes in our cluster were then restarted and were unable to come back up, because they were waiting for the -0 node to start.
What did you expect to happen?
The cluster should start anyway. I'm aware that it can be hard to detect whether the failure is caused by corruption, but one solution could be to have either a timeout or a maximum number of retries before trying a different node as the initial seed.
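To make the suggestion concrete, here is a minimal sketch of the retry-then-fall-back idea. It is not how cass-operator selects a seed today. The function `pick_initial_seed`, the `try_start` callback, and the pod names are all hypothetical and used only for illustration.

```python
from typing import Callable, Optional, Sequence


def pick_initial_seed(
    candidates: Sequence[str],
    try_start: Callable[[str], bool],
    max_retries: int = 3,
) -> Optional[str]:
    """Return the first candidate that starts within max_retries attempts.

    candidates: pod names in preference order (hypothetical; -0 comes first).
    try_start: hypothetical callback that attempts to start Cassandra on a
        pod and returns True once it is up. A real implementation would
        also need to bound each attempt with a timeout.
    """
    for pod in candidates:
        for _attempt in range(max_retries):
            if try_start(pod):
                return pod
        # This pod never came up (for example, a corrupted commitlog),
        # so move on to the next candidate instead of waiting forever.
    return None


# Usage: the -0 pod is corrupted and never starts; the -1 pod starts at once.
attempts = []


def fake_start(pod: str) -> bool:
    attempts.append(pod)
    return pod != "example-sts-0"


pods = ["example-sts-0", "example-sts-1", "example-sts-2"]
seed = pick_initial_seed(pods, fake_start, max_retries=2)
print(seed)
```

With a bounded number of attempts per pod, a single corrupted -0 pod delays startup but no longer blocks the whole cluster.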
How can we reproduce it (as minimally and precisely as possible)?
Stop all nodes, corrupt data on -0 (for example, a commitlog file), and then restart all nodes.
cass-operator version
1.22.4
Kubernetes version
1.30.7
Method of installation
Helm / FluxCD
Anything else we need to know?
Original Discord discussion:
https://discord.com/channels/836217371453685760/836217371453685763/1336286221495959552
Workaround
curl -XPOST http://localhost:8080/api/v0/lifecycle/start
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-88