Etcdctl endpoint health command fails with error "Unable to fetch the alarm list Error: unhealthy cluster" #16398
Unanswered
rahulbapumore
asked this question in
Q&A
Replies: 2 comments
-
Hi @ahrtr, Thanks
-
And sometimes the revisions match and become equal on all members. But we are not able to understand what triggers the revisions to become equal, or how we can trigger that ourselves. Thanks
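For anyone wanting to check this, here is a rough sketch of how the revisions could be compared across members. It assumes that `etcdctl endpoint status --cluster -w json` prints compact JSON with one `"revision":<number>` field per endpoint; `revisions_in_sync` is a hypothetical helper name, not an etcdctl feature.

```shell
# Hypothetical helper: reads endpoint-status JSON on stdin and succeeds
# only if every endpoint reports the same revision.
# Assumption: each endpoint entry contains a compact "revision":<number> field.
revisions_in_sync() {
  # Pull out every revision field, collapse duplicates, count distinct values.
  n=$(grep -o '"revision":[0-9]*' | sort -u | grep -c '')
  [ "$n" -eq 1 ]
}
```

It could be used as `etcdctl endpoint status --cluster -w json | revisions_in_sync && echo "revisions equal"`. Note that a member which does not answer is simply missing from the input, so this only compares the members that responded.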
-
Hi Amigos,
We have an etcd-based microservice with 3 replicas in a cloud-native environment. Because the PVC of pod-0 crashed, the configuration data and db related to etcd stored on that PVC were lost, leaving the etcd cluster unhealthy, and 2 clusters were formed.
So we tried executing the procedure below to recover the cluster.
Recovery Procedure -
Steps -
1. Go inside dced pod-2 using the command below, delete the /data/member/wal folder as follows, and exit -
# kubectl exec -it dced-2 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit
2. Scale down the dced statefulset to replicas=2 using the command below -
# kubectl scale sts dced --replicas=2 -n <NAMESPACE>
statefulset.apps/dced scaled
3. Go inside dced pod-1 using the command below, delete the /data/member/wal folder as follows, and exit -
# kubectl exec -it dced-1 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit
4. Scale down the dced statefulset to replicas=1 using the command below -
# kubectl scale sts dced --replicas=1 -n <NAMESPACE>
statefulset.apps/dced scaled
5. Go inside dced pod-0 using the command below, delete the /data/member/wal folder as follows, and exit -
# kubectl exec -it dced-0 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit
6. Scale down the dced statefulset to replicas=0 using the command below -
# kubectl scale sts dced --replicas=0 -n <NAMESPACE>
statefulset.apps/dced scaled
7. Scale the dced statefulset back up to replicas=3 using the command below -
# kubectl scale sts dced --replicas=3 -n <NAMESPACE>
statefulset.apps/dced scaled
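For reference, the same sequence can be written as a small script. This is only a sketch of the steps exactly as we ran them, not a verified recovery method. The `run` and `recover` helpers and the `DRY_RUN` and `NAMESPACE` variables are names made up for this sketch, and it runs `rm` directly through `kubectl exec` instead of opening an interactive bash session.

```shell
# Assumptions of this sketch: NAMESPACE and DRY_RUN are hypothetical variables.
# DRY_RUN defaults to 1, so commands are only printed unless DRY_RUN=0 is set.
NAMESPACE="${NAMESPACE:-default}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Walk the pods from highest ordinal to lowest: delete the WAL folder in
# dced-<i>, then scale the statefulset down to <i> replicas. Finally scale
# back up to 3 replicas.
recover() {
  for i in 2 1 0; do
    run kubectl exec "dced-$i" -c dced -n "$NAMESPACE" -- rm -rf /data/member/wal
    run kubectl scale sts dced --replicas="$i" -n "$NAMESPACE"
  done
  run kubectl scale sts dced --replicas=3 -n "$NAMESPACE"
}
```

Calling `NAMESPACE=<NAMESPACE> recover` prints the seven commands in order; nothing is executed unless `DRY_RUN=0` is set. The sketch does not wait for pods to terminate between steps.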
After applying the above procedure, the pods came up and running, and the members formed a cluster again.
etcdctl member list works fine.
But the etcdctl endpoint health command fails with the error below -
{"level":"warn","ts":"2023-08-10T07:56:24.469Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004268c0/eric-data-distributed-coordinator-ed.zmorrah:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
eric-data-distributed-coordinator-ed.zmorrah:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster
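When there are many of these warning lines, the gRPC status can be pulled out with a small filter. A minimal sketch, assuming each log entry is a single JSON line with an `"error":"rpc error: code = ... desc = ..."` field like the one above; `grpc_error` is a hypothetical helper name.

```shell
# Hypothetical helper: reads etcdctl client log lines on stdin and prints
# "<gRPC code>: <description>" for each line that carries an rpc error.
# Lines without a matching "error" field are skipped.
grpc_error() {
  sed -n 's/.*"error":"rpc error: code = \([A-Za-z]*\) desc = \([^"]*\)".*/\1: \2/p'
}
```

For the warning line above, `grpc_error` prints `Unavailable: etcdserver: request timed out`.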
Can you please help us understand whether the above procedure is fine or not,
and why the etcdctl endpoint health command is not working even though other commands are working?
Thanks