Drift Detection and Correction for Cross-Cluster State Management #732

DinaBelova · 2024-12-09T16:46:07Z

Goals

Problem Statement: With the completion of Cross-Cluster State Management Templates, HMC now has centralized ServiceTemplate deployment across clusters. However, without drift detection and correction, clusters risk deviating from intended configurations over time, causing inconsistency and requiring manual interventions to restore alignment.
Epic Goal: Build upon the Cross-Cluster State Management Templates by integrating automated drift detection and correction, utilizing Sveltos to ensure clusters continuously conform to centrally defined ServiceTemplates. This system will enable automated detection, reporting, and correction of configuration drift, increasing consistency and minimizing operational load.

Major deliverables

Drift Detection and Correction System for ServiceTemplates:
- Integrate with Sveltos to enable ContinuousWithDriftDetection on ServiceTemplates, with automated re-syncing to central configurations.

Who it benefits
Customer Business:

Ensures clusters remain consistent with central configurations, reducing unplanned downtime and operational instability.
Automates manual drift correction efforts, lowering resource requirements for ongoing infrastructure management.

Platform Engineering Teams:

Provides real-time visibility into drift events, allowing for faster resolution and reduced monitoring effort.
Ensures clusters stay aligned with the intended state, improving system reliability and predictability.

Mirantis:

Expands the HMC offering with critical drift detection and correction functionality, increasing value for customers managing complex, large-scale Kubernetes estates.
Reduces support demands by enabling proactive configuration management across clusters.

Acceptance criteria

Automated Drift Detection on All ServiceTemplate Deployments:
- Ensure that all ServiceTemplate deployments with syncMode: ContinuousWithDriftDetection detect configuration drift and notify the central management cluster.
Automated and Manual Drift Correction Capabilities:
- Verify that configuration drift can be automatically corrected, with an option for manual override on select templates.
Profile-Based Drift Monitoring:
- Drift detection scope can be limited to specific clusters and namespaces through labels, ensuring targeted monitoring.

Assumptions

Platform Leads / Engineers will manage drift detection configurations by creating CRs in the management cluster.
Sveltos will be the primary tool for fulfilling drift detection and correction.
Customers may optionally use other systems for application configuration management, with the alternative drift detection system supporting those workflows/pipelines.

Limitations

Sveltos drift detection capabilities and limitations
Performance overhead of continuous drift monitoring at scale
Network latency impact on drift detection accuracy
Limited ability to detect configuration changes made outside ServiceTemplate system
Resource constraints for continuous monitoring across large cluster fleets

Out of scope

API or UI development for new configuration interfaces.
Performance optimizations for large-scale drift detection.
Integration with 2A observability platform until those epics are developed.
- Centralized Reporting and Alerting of Drift Events
- Audit Logging for Compliance Tracking

User stories
As a Platform Lead:

I want automated drift detection for ServiceTemplates so I can ensure configuration consistency
I want to receive notifications when drift occurs so I can investigate root causes
I want automatic correction of detected drift so configurations stay aligned with templates
I want to override automatic correction for specific templates when manual control is needed

As a Platform Engineer:

I want to view drift status across my clusters so I can identify problematic deployments
I want to manually trigger drift correction so I can control timing of changes
I want to configure drift detection scope so I can focus on critical services
I want to exclude certain resources from drift detection so I can allow authorized local changes

DinaBelova · 2024-12-23T17:33:33Z

@wahabmk can you please add the most recent research results that you're running atm through sveltos docs?

wahabmk · 2024-12-30T20:50:13Z

I couldn't find a way in the Sveltos docs currently that would indicate how to watch for drift changes and trigger a custom notification, so I studied the code to understand how drift detection is implemented and how we could possibly implement a notification mechanism.

Sveltos Drift Detection & Correction

There are 2 ways to run the "drift-detection-manager" in Sveltos:

In the managed cluster if agent-in-mgmt-cluster==false (default).
In the management cluster if agent-in-mgmt-cluster==true.

In both of these methods, the drift detection CRDs are installed in the managed cluster. The ResourceSummary object (which is part of these CRDs) contains a list of objects to watch for drift and is also created in the managed cluster.

The high-level flow of how the drift correction actually happens is as below:

1. The "drift-detection-manager" watches resources and, if it detects changes in them (based on hash values), updates the status of the ResourceSummary object.
2. The change in the ResourceSummary object triggers the processResourceSummary() function in the "addon-controller", which updates the status of the associated ClusterSummary object. The processResourceSummary() function runs in a separate goroutine that is started when setting up the ClusterSummaryReconciler.
- 2a. During this process, the "addon-controller" sets ClusterSummary.status.featureSummaries[].hash = nil. This is important to note because the ClusterSummary object exists in the management cluster, so we could potentially watch it for drift rather than ResourceSummary.
3. The update to ClusterSummary triggers the ClusterSummary.Reconcile function, which then re-deploys the drifted resources using the feature handler for each feature. The feature handlers are defined in the createFeatureHandlerMaps() function.
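
To make the "based on hash values" step concrete, here is a minimal sketch of hash-based change detection. It is not Sveltos code: the function names (resourceHash, hasDrifted) and the choice of SHA-256 over JSON are assumptions made purely for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// resourceHash returns a digest of a resource's content.
// encoding/json sorts map keys, so equal content yields equal digests.
// (Hypothetical helper; Sveltos may hash resources differently.)
func resourceHash(resource map[string]any) (string, error) {
	raw, err := json.Marshal(resource)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// hasDrifted reports whether the live resource no longer matches the
// hash recorded when the resource was deployed.
func hasDrifted(deployedHash string, live map[string]any) (bool, error) {
	liveHash, err := resourceHash(live)
	if err != nil {
		return false, err
	}
	return liveHash != deployedHash, nil
}

func main() {
	desired := map[string]any{"replicas": 2, "image": "nginx:1.27"}
	deployedHash, err := resourceHash(desired)
	if err != nil {
		fmt.Println("hash error:", err)
		return
	}

	// Someone edits the live object out of band.
	live := map[string]any{"replicas": 5, "image": "nginx:1.27"}
	drifted, err := hasDrifted(deployedHash, live)
	if err != nil {
		fmt.Println("hash error:", err)
		return
	}
	fmt.Println("drifted:", drifted)
}
```

Comparing hashes instead of full objects keeps the watcher's stored state small, at the cost of not knowing which field changed.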

How we can detect drift for notifications

We could use either the ResourceSummary or ClusterSummary object to detect drift, but keep in mind that the source of truth for determining that drift occurred is ResourceSummary.status.helmResourcesChanged=true.

Using `ResourceSummary` to detect drift

HMC will create a watcher for each cluster where we want to check for drift.
The watcher will get the kubeconfig for the cluster and watch for changes in ResourceSummary object.
When ResourceSummary.status.helmResourcesChanged=true, the watcher can trigger a notification.
PROS:
- Using ResourceSummary is better as it is the source of truth for determining if drift has occurred.
CONS:
- More work to implement in HMC.
- Watcher might miss the change in status, depending on how it's implemented and how quickly Sveltos corrects the drift.
NOTE: We may be able to achieve this without making significant changes to HMC by using Sveltos HealthCheck object with a Lua script to detect changes to ResourceSummary. This object already has pre-defined notification mechanisms. See https://projectsveltos.github.io/sveltos/observability/example_crashloopbackoff_notification/ for more details.
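
As a rough illustration of the check a per-cluster watcher would run on each ResourceSummary it receives, the sketch below decodes a JSON object and reads status.helmResourcesChanged. It uses only the standard library; a real watcher would use a Kubernetes client against the managed cluster's kubeconfig, which is omitted here. The struct, the driftReported helper, and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceSummary models only the fields this sketch needs. The real CRD
// has more; helmResourcesChanged is the field named in this comment.
type resourceSummary struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status struct {
		HelmResourcesChanged bool `json:"helmResourcesChanged"`
	} `json:"status"`
}

// driftReported decodes one ResourceSummary JSON document and reports its
// name and whether its status flags drift.
func driftReported(raw []byte) (name string, drifted bool, err error) {
	var rs resourceSummary
	if err = json.Unmarshal(raw, &rs); err != nil {
		return "", false, err
	}
	return rs.Metadata.Name, rs.Status.HelmResourcesChanged, nil
}

func main() {
	// Hypothetical payload, shaped like the object carried by a watch event.
	event := []byte(`{"metadata":{"name":"example"},"status":{"helmResourcesChanged":true}}`)

	name, drifted, err := driftReported(event)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if drifted {
		// A real implementation would send the notification here.
		fmt.Printf("drift detected in ResourceSummary %q\n", name)
	}
}
```

Because Sveltos may correct the drift and reset the flag quickly, a watcher built this way has to process every update event rather than poll, which is the risk noted in the CONS above.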

Using `ClusterSummary` to detect drift

We are already watching for changes to ClusterSummary in the HMC controller.
Based on the findings in 2a) above, we can determine if a drift occurred with:

prevHash := nil
if hash == nil:
  if isReady == false:
    - can't be sure if there is a drift since cluster is not ready so ignore
  else:
    if prevHash == nil:
      - this means that it is the 1st time that resource is provisioned
      - or that the HMC has (re)started so it is observing the resource for the 1st time
      - so in either case we ignore
    else:
      if status == Provisioning:
        - drift occurred so trigger notification
prevHash = hash

We check for isReady == false because, when the cluster is not ready, Sveltos sets hash = nil along with status = Failed.
We can use the IsClusterReadyToBeConfigured() function to check if cluster is ready.
PROS:
- Potentially less work to implement in HMC.
- Also no need to access the managed cluster as the ClusterSummary object already exists in the management cluster.
CONS:
- Might be a bit hacky because it depends on how Sveltos implements correction for the detected drift.
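
The ClusterSummary decision logic above could look roughly like the following in Go. This is a sketch under assumptions: the observation and driftTracker types are hypothetical, ClusterReady stands in for the result of a readiness check such as IsClusterReadyToBeConfigured(), and the status strings ("Provisioning", "Provisioned", "Failed") are assumed to match what Sveltos reports.

```go
package main

import "fmt"

// observation is a hypothetical snapshot of one ClusterSummary feature.
type observation struct {
	Hash         []byte // nil when Sveltos has cleared the hash
	Status       string // assumed values: "Provisioning", "Provisioned", "Failed"
	ClusterReady bool   // stand-in for a readiness check on the cluster
}

// driftTracker remembers the previously observed hash per key.
// It starts empty after every (re)start of the controller.
type driftTracker struct {
	prevHash map[string][]byte
}

func newDriftTracker() *driftTracker {
	return &driftTracker{prevHash: make(map[string][]byte)}
}

// Observe returns true when the observation looks like a drift correction:
// the hash was cleared, the cluster is ready, a hash had been seen before,
// and the feature is being re-provisioned.
func (t *driftTracker) Observe(key string, o observation) bool {
	prev := t.prevHash[key] // nil on first observation
	drift := o.Hash == nil &&
		o.ClusterReady &&
		prev != nil &&
		o.Status == "Provisioning"
	t.prevHash[key] = o.Hash // always remember the latest hash
	return drift
}

func main() {
	tracker := newDriftTracker()
	key := "cluster-a/helm" // hypothetical key: cluster plus feature

	// Initial deployment: hash is set, nothing to report.
	fmt.Println(tracker.Observe(key, observation{Hash: []byte("h1"), Status: "Provisioned", ClusterReady: true}))

	// Sveltos clears the hash and re-provisions: treated as drift.
	fmt.Println(tracker.Observe(key, observation{Hash: nil, Status: "Provisioning", ClusterReady: true}))
}
```

As in the pseudocode, a cleared hash seen right after a controller restart is ignored, so a drift that happens while HMC is down would go unreported with this approach.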

Using metrics exposed by Sveltos

A third option is that we can use the projectsveltos_total_drifts metric exposed by Sveltos to have observability over drifts. See: https://projectsveltos.github.io/sveltos/getting_started/install/grafanadashboard/#12-drifts.
NOTE: This might be suitable to do as part of the Observability Epic.
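
For a sense of what consuming the metric involves, here is a minimal sketch that sums the projectsveltos_total_drifts samples from a Prometheus text-format scrape. Only the metric name comes from the Sveltos docs; the label names in the sample input and the totalDrifts helper are assumptions, and in practice a Prometheus server or client library would do this parsing.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

const driftMetric = "projectsveltos_total_drifts"

// totalDrifts sums every sample of the drift metric found in a
// Prometheus text-format exposition, across all label sets.
func totalDrifts(exposition string) (float64, error) {
	var total float64
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, driftMetric) {
			continue // comment lines, blank lines, and other metrics
		}
		rest := line[len(driftMetric):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue // a different metric that merely shares the prefix
		}
		// Skip the optional {label="value",...} set.
		if end := strings.LastIndex(rest, "}"); end >= 0 {
			rest = rest[end+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("no sample value in line %q", line)
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad sample value in line %q: %w", line, err)
		}
		total += value
	}
	return total, scanner.Err()
}

func main() {
	// Hypothetical scrape output; the label names are made up.
	scrape := `# TYPE projectsveltos_total_drifts counter
projectsveltos_total_drifts{cluster="a"} 2
projectsveltos_total_drifts{cluster="b"} 1
`
	total, err := totalDrifts(scrape)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("total drifts:", total)
}
```

A counter like this supports dashboards and threshold alerts, but it does not by itself identify which resource drifted.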

UPDATE

Sveltos might implement a knob to send notification on detected drift as part of its drift detection mechanism. See: https://projectsveltos.slack.com/archives/C046P825BBL/p1735586788580649.

wahabmk · 2025-01-01T17:49:45Z

Acceptance criteria

Automated Drift Detection on All ServiceTemplate Deployments:

Ensure that all ServiceTemplate deployments with syncMode: ContinuousWithDriftDetection detect configuration drift and notify the central management cluster.

Configuring drift detection/correction and applying it to clusters will be implemented in #834.

Sveltos does not include drift notification out of the box. #732 (comment) describes possible approaches we could take to implement it, but we need to try them out as part of working on #835.

Automated and Manual Drift Correction Capabilities:

Verify that configuration drift can be automatically corrected, with an option for manual override on select templates.

I couldn't find any mechanism in Sveltos to manually trigger drift correction. Based on how drift correction is implemented in Sveltos (summarized in #732 (comment)), the way to manually trigger correction would be to trigger the ClusterSummaryReconciler, but it's not ideal. TODO: Asking on Sveltos slack might give us some suggestions.

Profile-Based Drift Monitoring:

Drift detection scope can be limited to specific clusters and namespaces through labels, ensuring targeted monitoring.

Currently, labels target clusters for service deployment as follows: a ClusterDeployment matches exactly 1 cluster, whereas a MultiClusterService may match multiple clusters using labels. So if both match a particular cluster, the services and drift configuration of the one with higher priority will be applied to that cluster.

wahabmk · 2025-01-01T17:49:58Z

As a Platform Lead:

I want automated drift detection for ServiceTemplates so I can ensure configuration consistency #834

I want to receive notifications when drift occurs so I can investigate root causes #835

The options for notification or observability of detected drift are discussed in #732 (comment).

I want automatic correction of detected drift so configurations stay aligned with templates #834

I want to override automatic correction for specific templates when manual control is needed

The same comment about manually triggering drift correction applies as in #732 (comment).

wahabmk · 2025-01-01T17:52:48Z

As a Platform Engineer:

I want to view drift status across my clusters so I can identify problematic deployments #835

The options for notification or observability of detected drift are discussed in #732 (comment).

I want to manually trigger drift correction so I can control timing of changes

The same comment about manually triggering drift correction applies as in #732 (comment).

I want to configure drift detection scope so I can focus on critical services

If this is referring to scope of drift detection as determined by labels, then same comment applies as in the last point in #732 (comment). But if this is referring to opting certain services out of drift detection, then see the comment below.

I want to exclude certain resources from drift detection so I can allow authorized local changes #834

In the current implementation, we have no way to opt out of drift detection for, let's say, 1 out of 3 deployed services. This is because we map ClusterDeployment -> (Sveltos) Profile and MultiClusterService -> (Sveltos) ClusterProfile, and the syncMode: ContinuousWithDriftDetection option is applied to all services (which get translated to helm charts on Sveltos objects) as described in https://projectsveltos.github.io/sveltos/features/configuration_drift/#configuration-drift.
We can, however, exclude certain Kubernetes objects deployed by these helm charts with "Ignore Annotation" and "Ignore Fields", if that is what this use case intends, but there is currently no option to opt a helm chart as a whole out of drift detection if syncMode: ContinuousWithDriftDetection is set.
One possible workaround is that within the Mirantis official helm chart (which is used by the ServiceTemplate) we can add the projectsveltos.io/driftDetectionIgnore annotation to all Kubernetes objects if .Values.ignoreDrift=true. Then, if we want to exclude a particular service from drift detection, we can create the ClusterDeployment as in the YAML below. All Kubernetes objects deployed for ingress-nginx would then be ignored for drift detection. However, this workaround relies on pre-creating the helm chart with .Values.ignoreDrift=true.

apiVersion: hmc.mirantis.com/v1alpha1
kind: ClusterDeployment
metadata:
  name: wali-dev-1
  namespace: hmc-system
spec:
  . . .
  services:
    - template: kyverno-3-2-6
      name: kyverno
      namespace: kyverno
    - template: ingress-nginx-4-11-0
      name: ingress-nginx
      namespace: ingress-nginx
      values: |
        ignoreDrift: true
    - template: cert-manager-1-16-2
      name: cert-manager
      namespace: cert-manager
  syncMode: ContinuousWithDriftDetection
. . .

Yet another option to achieve this would be not to map ClusterDeployment -> (Sveltos) Profile but to create a separate Sveltos Profile object for each of the services defined in ClusterDeployment. This is a large change though as it will be a fundamental change in design.

DinaBelova added the epic Large body of work, can be broken down into individual issues label Dec 9, 2024

DinaBelova assigned wahabmk Dec 9, 2024

DinaBelova added this to Project 2A Dec 9, 2024

github-project-automation bot moved this to Todo in Project 2A Dec 9, 2024

DinaBelova changed the title ~~[placeholder] Drift Detection and Correction for Cross-Cluster State Management~~ Drift Detection and Correction for Cross-Cluster State Management Dec 9, 2024

DinaBelova moved this from Todo to In Progress in Project 2A Dec 19, 2024

wahabmk mentioned this issue Dec 31, 2024

Notification based on detected drift in Sveltos #835

Open

alex-shl added this to K0rdent Jan 3, 2025

alex-shl moved this to In Progress in K0rdent Jan 3, 2025
