Cluster fails to start when the entire cluster has been shut down and the first "seed" node is corrupted #750

qvistgaard · 2025-02-04T11:07:33Z

What happened?

During an AKS node upgrade, the node running the -0 pod did not wait for the pod to stop, leaving it corrupted. The other 2 nodes in our cluster were then restarted and were unable to come back up, because they were waiting for the -0 node to start.

What did you expect to happen?

The cluster should start anyway. I'm aware that it can be hard to detect whether the failure is caused by corruption, but one solution could be to have either a timeout or a maximum number of retries before trying to use a different node as the initial seed.
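To make the suggestion concrete, here is a minimal sketch of the "maximum number of retries, then fall back" idea. It is not how cass-operator works today. The function `pick_initial_seed`, the `try_start` callback and the pod names are all hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional, Sequence


def pick_initial_seed(
    candidates: Sequence[str],
    try_start: Callable[[str], bool],
    max_retries: int = 3,
) -> Optional[str]:
    """Return the first candidate that starts within max_retries attempts.

    Candidates are tried in order, so the usual -0 pod is still preferred.
    A later pod is only considered once an earlier one has used up its
    retry budget. Returns None if no candidate starts.
    """
    for pod in candidates:
        for _attempt in range(max_retries):
            if try_start(pod):
                return pod
    return None


# Illustration only: the -0 pod never starts (as if its data were
# corrupted), so the next pod is picked after three failed attempts.
attempts: List[str] = []


def fake_try_start(pod: str) -> bool:
    # Hypothetical stand-in for "ask the node to start and wait for it".
    attempts.append(pod)
    return not pod.endswith("-0")


chosen = pick_initial_seed(
    ["dc1-rack1-sts-0", "dc1-rack1-sts-1", "dc1-rack1-sts-2"],
    fake_try_start,
)
print(chosen)
```

A timeout-based variant would look the same, with the inner loop replaced by a deadline check.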

How can we reproduce it (as minimally and precisely as possible)?

Stop all nodes, corrupt data on -0 (for example, a commitlog file), and then restart all nodes.

cass-operator version

1.22.4

Kubernetes version

1.30.7

Method of installation

Helm / FluxCD

Anything else we need to know?

Original Discord discussion:

https://discord.com/channels/836217371453685760/836217371453685763/1336286221495959552

Workaround

Manually start the node using the API: curl -XPOST http://localhost:8080/api/v0/lifecycle/start
Patch the pod with the following labels:

cassandra.datastax.com/seed-node=true
cassandra.datastax.com/node-state=Started
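
As a rough sketch, the two workaround steps could be scripted as below. Several things here are assumptions not stated above: that the curl call is run inside the pod through `kubectl exec`, that the container is named `cassandra` and has curl available, and the pod and namespace names. The script only prints the commands, so they can be reviewed before being run by hand.

```python
import shlex
from typing import List

# Endpoint and labels are taken from the workaround above.
START_URL = "http://localhost:8080/api/v0/lifecycle/start"
LABELS = {
    "cassandra.datastax.com/seed-node": "true",
    "cassandra.datastax.com/node-state": "Started",
}


def workaround_commands(
    pod: str, namespace: str, container: str = "cassandra"
) -> List[List[str]]:
    """Build the two kubectl invocations for the manual workaround.

    The container name is an assumption; adjust it to your deployment.
    """
    # Step 1: call the lifecycle start endpoint from inside the pod.
    start = [
        "kubectl", "exec", "-n", namespace, pod, "-c", container, "--",
        "curl", "-XPOST", START_URL,
    ]
    # Step 2: set both labels on the pod, replacing any existing values.
    label = ["kubectl", "label", "pod", pod, "-n", namespace, "--overwrite"]
    label += [f"{key}={value}" for key, value in LABELS.items()]
    return [start, label]


# Hypothetical pod and namespace names, for illustration only.
for cmd in workaround_commands("my-cluster-dc1-default-sts-1", "cassandra"):
    print(shlex.join(cmd))
```

Which pod to target is left to the operator; the issue does not specify it.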

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-88

qvistgaard added the bug label Feb 4, 2025

sync-by-unito bot assigned burmanm Feb 10, 2025

burmanm linked a pull request Feb 10, 2025 that will close this issue

Start sequence #754

Draft

5 tasks
