Cluster fails to start when the entire cluster has been shut down and the first "seed" node is corrupted #750

Open
qvistgaard opened this issue Feb 4, 2025 · 0 comments · May be fixed by #754
Labels
bug Something isn't working

qvistgaard commented Feb 4, 2025

What happened?

During an AKS node upgrade, the node running the -0 pod did not wait for the pod to stop, leaving its data corrupted. The other 2 nodes in our cluster were restarted as well and were unable to come back up, because they were waiting for the -0 node to start.

What did you expect to happen?

The cluster should start anyway. I'm aware that it can be hard to detect whether the failure is caused by corruption, but one solution could be to have either a timeout or a maximum number of retries before trying to use a different node as the initial seed.
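
To make the suggestion concrete, here is a minimal sketch of the "max number of retries" idea. It is not how cass-operator selects seeds, and not necessarily what #754 does; the function name, the pod names, and the `max_retries` default are all hypothetical. It assumes the operator could track failed start attempts per pod.

```python
from typing import Dict, List, Optional


def pick_seed_candidate(
    pods: List[str],
    failed_attempts: Dict[str, int],
    max_retries: int = 3,  # hypothetical default, not a cass-operator setting
) -> Optional[str]:
    """Return the first pod (in the given order) that has not yet used up
    its retry budget, or None if every pod has been exhausted."""
    for pod in pods:
        if failed_attempts.get(pod, 0) < max_retries:
            return pod
    return None


if __name__ == "__main__":
    # Hypothetical pod names, listed in ordinal order.
    pods = ["dc1-sts-0", "dc1-sts-1", "dc1-sts-2"]
    # -0 has failed three times (for example, a corrupt commitlog), so -1 is tried next.
    print(pick_seed_candidate(pods, {"dc1-sts-0": 3}))
```

A timeout-based variant would have the same shape, comparing the time elapsed since the first start attempt against a deadline instead of counting failures.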

How can we reproduce it (as minimally and precisely as possible)?

Stop all nodes, corrupt data on -0 (for example, a commitlog file), and then restart all nodes.

cass-operator version

1.22.4

Kubernetes version

1.30.7

Method of installation

Helm / FluxCD

Anything else we need to know?

Original Discord discussion:

https://discord.com/channels/836217371453685760/836217371453685763/1336286221495959552

Workaround

  1. Manually start the node using the API: curl -XPOST http://localhost:8080/api/v0/lifecycle/start
  2. Patch the pod with the following labels:
     cassandra.datastax.com/seed-node=true
     cassandra.datastax.com/node-state=Started
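
The two workaround steps could be scripted roughly as follows. This is a sketch under assumptions: the pod name and namespace are placeholders, the container is assumed to be named `cassandra`, and the start call is assumed to be made from inside that container via `kubectl exec` (the issue only shows the `curl` against localhost). It defaults to a dry run that prints the commands rather than executing them.

```python
import subprocess
from typing import List

# Values taken from the workaround in this issue.
START_URL = "http://localhost:8080/api/v0/lifecycle/start"
SEED_LABEL = "cassandra.datastax.com/seed-node=true"
STATE_LABEL = "cassandra.datastax.com/node-state=Started"


def workaround_commands(
    pod: str, namespace: str, container: str = "cassandra"  # container name is an assumption
) -> List[List[str]]:
    """Build the kubectl invocations for the two workaround steps."""
    # Step 1: start the node through the API from inside the pod.
    start = ["kubectl", "-n", namespace, "exec", pod, "-c", container, "--",
             "curl", "-XPOST", START_URL]
    # Step 2: set both labels on the pod, overwriting any existing values.
    label = ["kubectl", "-n", namespace, "label", "pod", pod,
             SEED_LABEL, STATE_LABEL, "--overwrite"]
    return [start, label]


def apply_workaround(pod: str, namespace: str, dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for cmd in workaround_commands(pod, namespace):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical pod and namespace; substitute the pod you want to act as seed.
    apply_workaround("cluster1-dc1-default-sts-1", "cass-operator")
```

Keeping the dry run as the default makes it easy to review the exact commands before touching a cluster that is already in a degraded state.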

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-88

@qvistgaard qvistgaard added the bug Something isn't working label Feb 4, 2025
@burmanm burmanm linked a pull request Feb 10, 2025 that will close this issue