Etcdctl endpoint health command fails with error "Unable to fetch the alarm list Error: unhealthy cluster" #16398

rahulbapumore · 2023-08-10T08:37:08Z


Hi Amigos,

We run an etcd-based microservice with 3 replicas in a cloud-native environment. The PVC of pod-0 crashed, so the etcd configuration data and database stored on that PVC were lost. This left the etcd cluster unhealthy, and two separate clusters were formed.
We then tried the procedure below to recover the cluster.

Recovery Procedure -

Steps -

1. Go inside dced pod-2 using the command below, delete the /data/member/wal folder, and exit -

# kubectl exec -it dced-2 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit

2. Scale down the dced statefulset to replicas=2 using the command below -

# kubectl scale sts dced --replicas=2 -n <NAMESPACE>
statefulset.apps/dced scaled

3. Go inside dced pod-1 using the command below, delete the /data/member/wal folder, and exit -

# kubectl exec -it dced-1 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit

4. Scale down the dced statefulset to replicas=1 using the command below -

# kubectl scale sts dced --replicas=1 -n <NAMESPACE>
statefulset.apps/dced scaled

5. Go inside dced pod-0 using the command below, delete the /data/member/wal folder, and exit -

# kubectl exec -it dced-0 -c dced -n <NAMESPACE> -- bash
bash-4.4$ rm -rf /data/member/wal
bash-4.4$ exit

6. Scale down the dced statefulset to replicas=0 using the command below -

# kubectl scale sts dced --replicas=0 -n <NAMESPACE>
statefulset.apps/dced scaled

7. Finally, scale up the dced statefulset to replicas=3 using the command below -

# kubectl scale sts dced --replicas=3 -n <NAMESPACE>
statefulset.apps/dced scaled
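
For reference, the seven steps above can be collected into one small script. This is only a sketch of the steps as posted, not a verified recovery method. The names `NAMESPACE`, `DRY_RUN`, `run` and `recover_plan` are illustrative and not part of our deployment, and the `rm` is passed directly to `kubectl exec` instead of being typed in an interactive shell. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the recovery steps above. NAMESPACE and DRY_RUN are
# assumed, illustrative variables; adjust them for a real cluster.
NAMESPACE="${NAMESPACE:-default}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Walk pods 2, 1, 0: delete the WAL folder, then scale down by one.
# Afterwards scale the statefulset back up to 3 replicas.
recover_plan() {
  local i
  for i in 2 1 0; do
    run kubectl exec "dced-${i}" -c dced -n "$NAMESPACE" -- rm -rf /data/member/wal
    run kubectl scale sts dced --replicas="${i}" -n "$NAMESPACE"
  done
  run kubectl scale sts dced --replicas=3 -n "$NAMESPACE"
}
```

Calling `recover_plan` in dry-run mode prints the seven commands in order, which makes it easy to review the sequence before running anything against the cluster.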

After applying the above procedure, the pods came up and were running, and the members formed a cluster again.
etcdctl member list works fine.
But the etcdctl endpoint health command fails with the error below -

{"level":"warn","ts":"2023-08-10T07:56:24.469Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004268c0/eric-data-distributed-coordinator-ed.zmorrah:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
eric-data-distributed-coordinator-ed.zmorrah:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster

Can you please help us understand whether the above procedure is correct,
and why the etcdctl endpoint health command fails even though other commands work?
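
To narrow this down, the health check, the endpoint status and the alarm list can be queried separately for each member. The helper below is a minimal sketch. `check_member` is a hypothetical name, the endpoint is passed in by the caller, and any TLS flags the cluster needs are left out and would have to be added. It uses only the standard `etcdctl` subcommands `endpoint health`, `endpoint status` and `alarm list`.

```shell
# Hypothetical helper: run the three checks against one endpoint and
# report failure if any of them fails. TLS flags are omitted here.
check_member() {
  local endpoint="$1"
  local rc=0
  # A longer timeout than the default helps tell "slow" from "failing".
  etcdctl --endpoints="$endpoint" --command-timeout=10s endpoint health || rc=1
  etcdctl --endpoints="$endpoint" endpoint status -w table || rc=1
  etcdctl --endpoints="$endpoint" alarm list || rc=1
  return "$rc"
}
```

For example, `check_member eric-data-distributed-coordinator-ed.zmorrah:2379` runs all three checks even when the first one fails, so it shows whether `alarm list` on its own also times out.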

Thanks

rahulbapumore · 2023-08-14T06:39:26Z


Hi @ahrtr,
The procedure is also introducing a data inconsistency issue in the cluster.
Do you know how to prevent it with the above procedure?
I did not expect the database size to differ across the 3 members of the etcd cluster.
Can you please guide us on this?

Thanks


rahulbapumore · 2023-08-14T09:45:50Z


Sometimes the revisions converge and become equal on all members, but we cannot tell what triggers this or how we could trigger it ourselves.
Could you please help with that?
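
To observe when the revisions line up, the `revision` field reported by each member can be compared. The sketch below assumes the compact JSON printed by `etcdctl endpoint status --cluster -w json`, where each entry carries a `"revision":N` field in its header. `revisions_match` is an illustrative name. On a cluster that is receiving writes, members can briefly report different revisions, so a single mismatch is not conclusive.

```shell
# Reads `etcdctl endpoint status --cluster -w json` output on stdin.
# Succeeds only if every member reports the same revision.
revisions_match() {
  local distinct
  # Pull out each "revision":N pair, keep the number, count unique values.
  distinct=$(grep -o '"revision":[0-9]*' | cut -d: -f2 | sort -u | wc -l | tr -d ' ')
  [ "$distinct" -eq 1 ]
}
```

It could be used as `etcdctl endpoint status --cluster -w json | revisions_match && echo "revisions equal"`, for example in a loop, to record the moment the members agree.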

Thanks
