From d7a38f5469ead9cc01b69d1a1973d884894202f4 Mon Sep 17 00:00:00 2001
From: fabriziopandini
Date: Wed, 15 May 2024 17:34:39 +0200
Subject: [PATCH] Document KCP limitation

---
 .../src/tasks/automated-machine-management/healthchecking.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/book/src/tasks/automated-machine-management/healthchecking.md b/docs/book/src/tasks/automated-machine-management/healthchecking.md
index 117b3bdb0744..5b2afa8763e0 100644
--- a/docs/book/src/tasks/automated-machine-management/healthchecking.md
+++ b/docs/book/src/tasks/automated-machine-management/healthchecking.md
@@ -235,6 +235,9 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
 - If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
 - If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
 - If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately
+- Important: if the kubelet on the node hosting the etcd leader member is not working, KCP cannot perform some of the checks it expects to run on the leader (and specifically on the leader).