Automatic SLO Cleanup Mechanism #198776

Open
framsouza opened this issue Nov 4, 2024 · 5 comments
Labels
Feature:SLO Team:obs-ux-management Observability Management User Experience Team

Comments

framsouza commented Nov 4, 2024

Description

Currently, there is no automated cleanup feature for SLOs, and as a result, our existing SLOs may not accurately reflect the true reliability of our services. We propose a solution to introduce an automated cleanup mechanism for SLOs to ensure that only relevant and up-to-date SLOs are maintained in the production environment.

Today, to clean up SLOs, we run an update_by_query against the SLO indices. However, we need a more straightforward way for users and customers to clean up their SLOs without added hassle.
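For illustration only, here is a rough sketch of the kind of query body such a by-query request might carry. It uses standard Elasticsearch query DSL (a `bool` filter with `term` and `range` clauses and date math), but the field names and status value passed in are placeholders, not the actual SLO index schema, and the target index is deliberately left out.

```python
def build_stale_instance_query(status_field: str, timestamp_field: str,
                               status: str, max_age_hours: int) -> dict:
    """Build a query body matching documents stuck in `status` for too long.

    All field names are supplied by the caller; nothing here assumes the
    real SLO mapping.
    """
    if max_age_hours <= 0:
        raise ValueError("max_age_hours must be positive")
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on the status value.
                    {"term": {status_field: status}},
                    # Last update older than the cutoff (Elasticsearch date math).
                    {"range": {timestamp_field: {"lt": f"now-{max_age_hours}h"}}},
                ]
            }
        }
    }


# Hypothetical field names, used only to show the shape of the result.
body = build_stale_instance_query("status", "updated_at", "NO_DATA", 48)
```

Wrapping this in a supported API or UI action would spare users from needing to know the index layout at all.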

Problem Statement:

  • Outdated/Irrelevant SLOs: We have observed cases where SLOs are violated due to missing or irrelevant group_by fields, resulting in inaccurate reliability metrics.
  • SLOs with No Data: There are instances where the status of an SLO remains as no_data for extended periods. An automatic removal of SLOs with a no_data status for more than X hours would help maintain only meaningful and actionable SLOs.
  • Stale/Unused SLOs: Over time, some SLOs become obsolete or are no longer needed. Currently, these must be manually deleted, which is time-consuming and may lead to cluttered monitoring setups.

Ideas/Solutions:

  • Automatic Removal Based on no_data Status: Allow SLOs with a no_data status to be automatically removed if this condition persists for more than a configurable duration (e.g., X hours).
  • Cleanup by Tags: Introduce a "Cleanup" button or functionality within the UI that enables the bulk deletion of SLOs based on specified tags, allowing users to easily identify and remove outdated SLOs.
  • Enhanced Validation for group_by Fields: Implement checks to ensure that SLOs referencing non-existent group_by fields are either flagged for review or automatically removed, depending on the configuration.
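To make the three ideas above concrete, here is a minimal sketch of how a cleanup policy could decide what to do with each SLO. The `SloRecord` shape, the `CleanupPolicy` options, and the status strings are all hypothetical, chosen for illustration; they are not an existing Kibana API or schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    FLAG = "flag"      # surface for manual review
    REMOVE = "remove"


@dataclass
class SloRecord:
    # Hypothetical shape, not the real SLO document.
    slo_id: str
    status: str                 # e.g. "no_data", "healthy", "violated"
    status_since: datetime      # when the current status began
    tags: set = field(default_factory=set)
    group_by: list = field(default_factory=list)


@dataclass
class CleanupPolicy:
    no_data_max_age: timedelta                    # the "X hours" threshold
    cleanup_tags: set = field(default_factory=set)
    remove_on_missing_group_by: bool = False      # False means flag only


def decide(slo: SloRecord, policy: CleanupPolicy,
           known_fields: set, now: datetime) -> Action:
    # 1. no_data for longer than the configured duration.
    if slo.status == "no_data" and now - slo.status_since > policy.no_data_max_age:
        return Action.REMOVE
    # 2. Carries at least one tag selected for bulk cleanup.
    if slo.tags & policy.cleanup_tags:
        return Action.REMOVE
    # 3. References a group_by field that no longer exists.
    if any(f not in known_fields for f in slo.group_by):
        return Action.REMOVE if policy.remove_on_missing_group_by else Action.FLAG
    return Action.KEEP


now = datetime(2024, 11, 4, 12, 0, tzinfo=timezone.utc)
policy = CleanupPolicy(no_data_max_age=timedelta(hours=24), cleanup_tags={"deprecated"})
stale = SloRecord("a", "no_data", now - timedelta(hours=30))
print(decide(stale, policy, known_fields=set(), now=now))
```

Keeping "flag" separate from "remove" lets the group_by check default to the safer option while the no_data and tag rules delete outright.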

Benefits

This feature would help maintain a cleaner, more accurate set of SLOs that reflects only those that actually matter and work. By reducing the need for manual cleanup, it would also let engineers focus on other critical tasks, improving overall productivity.

@framsouza framsouza added Feature:SLO Team:obs-ux-management Observability Management User Experience Team labels Nov 4, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

drewpost commented Nov 4, 2024

Thanks for writing this up. In your use case, what is the scale we're talking about here? How often do you have an SLO that needs deleting vs updating?

@neiljbrookes

From Slack (https://elastic.slack.com/archives/C044PV8EJ4X/p1730729974044599?thread_ts=1730725339.130429&cid=C044PV8EJ4X)

I'd just like to clarify that it's not the SLO that needs removing, it's the instance of an SLO that needs to be cleaned up. When using group_by aggs, an instance of the SLO is made for every unique value in the selected group_by field. We use it a lot for project_id, which is a field with high cardinality, and it is perfectly possible for a value to be removed (on project deletion).
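A small sketch of that situation, assuming plain dictionaries as source documents and a set of currently live project IDs (both hypothetical stand-ins for the real data): one instance exists per distinct group_by value, and any instance whose value is no longer live is a cleanup candidate.

```python
def instances_for(docs: list, group_by_field: str) -> set:
    """One SLO instance per distinct value of the group_by field."""
    return {doc[group_by_field] for doc in docs if group_by_field in doc}


def orphaned(instances: set, live_values: set) -> list:
    """Instances whose group_by value no longer exists, sorted for stable output."""
    return sorted(instances - live_values)


docs = [
    {"project_id": "p1"},
    {"project_id": "p2"},
    {"project_id": "p1"},   # same value, same instance
    {"project_id": "p3"},
]
instances = instances_for(docs, "project_id")
# p2 was deleted, so its instance lingers with nothing feeding it.
print(orphaned(instances, live_values={"p1", "p3"}))
```

The SLO definition itself stays untouched; only the per-value instances are pruned.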

@framsouza
Author

Thanks for following up, @drewpost! In our case, the scale is quite large: we're managing thousands of SLOs, and over time quite a few become outdated or irrelevant. We usually find that deletions are more common than updates, especially as services evolve or get deprecated. It's not uncommon for large batches of SLOs to need periodic cleanup.

@jasonrhodes
Member

Related: #195266
