Fleet Agent Supports Node Failover [SURE-9419] #3096

manno · 2024-11-25T14:19:18Z

Fleet-agent is deployed as a StatefulSet with a single replica. In this setup there is no automatic failover if the node hosting that pod fails (this is by design in StatefulSets). An administrator has to delete the pod manually to get Fleet running again. This is unacceptable because recovery is manual rather than automatic.

We have to deploy StatefulSets with a replica count >1 to have fault tolerance, or we have to use Deployments.
A Deployment with a replica count of 1 can take a long time to migrate to another node. We should make the replica count configurable.

Business impact: High, as it causes downtime

Repro steps:

Deploy fleet-agent
Power off the node hosting the fleet agent

Acceptance Criteria

The Fleet controller replica count and the fleet-agent replica count are configurable via the Helm chart. We default to one.
The fleet-agent init container and containers use leader election
Optional: merge the clusterstatus ticker into the controller container and migrate it to controller-runtime (c-r). That way we have one fewer leader election loop. Also, clusterstatus is tiny nowadays.
Fleet agent is converted to a deployment
It might be necessary to adapt https://github.com/rancher/backup-restore-operator/blob/release/v5.0/charts/rancher-backup/files/default-resourceset-contents/fleet.yaml#L44-L55

manno added this to Fleet Nov 25, 2024

manno converted this from a draft issue Nov 25, 2024

manno added this to the v2.11.0 milestone Nov 25, 2024

manno added kind/enhancement JIRA Must shout labels Nov 25, 2024

manno moved this from To Triage to 📋 Backlog in Fleet Nov 27, 2024
