Update scrape and remote_write libs for generic HostHealth rules #660
Merged
… host health rules
sed-i
approved these changes
Dec 20, 2024
Abuelodelanada
approved these changes
Dec 21, 2024
* integrate with new cosl Rules class
* remove lib-juju pins in CI
* support generic alert rules in MetricsEndpointAggregator
* fix: update cos-tool permissions to adhere to cis hardening rules
* add remote-write and bump versions
* fix cos-tool permissions for this charm as well
* try to fix itests
* remove chmod from library
* fix unit tests
* fix unit tests
sed-i
approved these changes
Feb 4, 2025
Issue
The tandem cosl PR gives us a centralized way (in Prometheus rather than in individual charms) to "inject" alerts on the fly, so we can extend Prometheus with generic up/absent rules.
Solution
The prometheus_scrape and prometheus_remote_write libs were updated to inject generic HostHealth rules into the alert rules they handle.
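As a rough illustration of the idea (not the libs' actual code), injecting a generic group into an existing rule set could look like the sketch below. The function name, the constant, the `up < 1` and `absent(up)` expressions, the `for` durations, and the HostDown severity are all assumptions inferred from the alert output in the testing instructions; only the group name and the two alert names come from this PR.

```python
import copy

# Hypothetical generic rule group. Expressions and durations are assumptions.
GENERIC_HOST_HEALTH_GROUP = {
    "name": "HostHealth",
    "rules": [
        {
            "alert": "HostDown",
            "expr": "up < 1",  # assumed expression
            "for": "5m",  # assumed duration
            "labels": {"severity": "critical"},  # assumed severity
            "annotations": {"summary": "Host '{{ $labels.instance }}' is down."},
        },
        {
            "alert": "HostUnavailable",
            "expr": "absent(up)",  # assumed expression
            "for": "5m",  # assumed duration
            "labels": {"severity": "critical"},
            "annotations": {
                "summary": "Metrics not received from host '{{ $labels.instance }}'."
            },
        },
    ],
}


def inject_generic_rules(alert_rules: dict) -> dict:
    """Return a copy of alert_rules with the HostHealth group appended.

    The input is left untouched, and the group is added only once, so
    calling this repeatedly does not duplicate the generic rules.
    """
    merged = copy.deepcopy(alert_rules)
    groups = merged.setdefault("groups", [])
    already_present = any(
        group.get("name") == GENERIC_HOST_HEALTH_GROUP["name"] for group in groups
    )
    if not already_present:
        groups.append(copy.deepcopy(GENERIC_HOST_HEALTH_GROUP))
    return merged
```

Returning a copy rather than mutating the input keeps charm-provided rules and injected rules easy to tell apart in tests.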
Documentation for implementation and testing
Context
In tandem with:
Testing Instructions
Without Grafana Agent
prom:metrics-endpoint avalanche and alertmanager prom:alertmanager
juju add-unit avalanche -n 1
HostHealth rules or query Prometheus with juju show-unit prom/0 | yq -r '."prom/0"."relation-info"[1]."application-data"."alert_rules"' | jq
HostDown alert firing (showing the specific unit avalanche/0)
Host 'prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0' is down. VALUE = 0 LABELS = map[__name__:up instance:prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0 job:juju_prom_8b073ff8_avalanche_prometheus_scrape_avalanche-0 juju_application:avalanche juju_charm:avalanche-k8s juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc juju_unit:avalanche/0]
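To check the relation data without reading the raw JSON by eye, a small helper along these lines may be handy. It is only a sketch: it assumes the alert_rules value is JSON in the usual Prometheus rule-file shape (a top-level groups list whose entries have name and rules), and the function names are made up for illustration.

```python
import json


def alerts_by_group(alert_rules_json: str) -> dict:
    """Map each rule group name to the alert names it contains.

    Recording rules (entries without an "alert" key) are skipped.
    """
    data = json.loads(alert_rules_json)
    return {
        group["name"]: [
            rule["alert"] for rule in group.get("rules", []) if "alert" in rule
        ]
        for group in data.get("groups", [])
    }


def has_host_health(alert_rules_json: str) -> bool:
    """True if any group name contains "HostHealth".

    A substring match is used on the assumption that group names may
    carry a topology prefix.
    """
    return any("HostHealth" in name for name in alerts_by_group(alert_rules_json))
```

The string passed in would be the output of the juju show-unit ... | yq ... command above, captured however is convenient.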
With Grafana Agent
Scrape
gagent:metrics-endpoint avalanche, gagent:send-remote-write prom, and alertmanager prom:alertmanager
juju add-unit avalanche -n 1
HostHealth and AggregatorHostHealth rules or query Grafana Agent with juju show-unit gagent/0 | yq -r '."gagent/0"."relation-info"[0]."application-data"."alert_rules"' | jq
HostDown alert firing (showing the specific unit avalanche/0)
Remote write
HostUnavailable alert firing in both the HostHealth and AggregatorHostHealth groups
Note
This does not show each unit. With 2 avalanche units, the alert labels show once per app:
alertname=HostUnavailable juju_application=avalanche juju_charm=avalanche-k8s juju_model=prom juju_model_uuid=8b073ff8-5456-492a-804b-7f9f15c996dc severity=critical
Metrics not received from host ''. VALUE = 1 LABELS = map[juju_application:avalanche juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc]
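One plausible reading of the once-per-app behaviour: if HostUnavailable is built on PromQL's absent() (an assumption, the rule expression is not shown here), the resulting series only carries labels taken from the selector's equality matchers. With app-level matchers and no juju_unit, every unit collapses to the same label set. The toy model below illustrates that collapse; the function name and data are hypothetical.

```python
def alert_label_sets(units: list, matcher_keys: tuple) -> set:
    """Distinct label sets left once each unit is reduced to matcher_keys.

    This mimics absent(), whose result keeps only labels derived from
    the equality matchers of its selector.
    """
    return {
        tuple(sorted((key, unit[key]) for key in matcher_keys)) for unit in units
    }


# Two units of the same application (values taken from the output above).
UNITS = [
    {
        "juju_application": "avalanche",
        "juju_model": "prom",
        "juju_model_uuid": "8b073ff8-5456-492a-804b-7f9f15c996dc",
        "juju_unit": f"avalanche/{index}",
    }
    for index in range(2)
]

APP_KEYS = ("juju_application", "juju_model", "juju_model_uuid")
per_app = alert_label_sets(UNITS, APP_KEYS)
per_unit = alert_label_sets(UNITS, APP_KEYS + ("juju_unit",))
```

Here per_app holds one label set while per_unit holds two, which matches a single app-level alert versus one alert per unit.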
Cos-proxy
lib/charms/prometheus_k8s/v0/prometheus_scrape.py into cos-proxy and pack the charm.
k8s (in a model named "prom")
lxd
juju exec --unit cp/0 "sudo systemctl stop vector"
Upgrade Notes
Fetching the new libs gives you a set of new alerts automatically. If a charm already ships its own up/absent alerts, those alerts and rules will now be duplicated.
up/absent alerts are ubiquitous and are handled by the libs modified in this PR. Any custom alerts duplicating this behaviour can be removed.
With the new design introduced in this PR, you would get a separate HostUnavailable alert for Grafana Agent itself and each unit that is aggregated by it.
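When auditing a charm for custom rules that the generic ones now cover, a heuristic scan like the following sketch can help shortlist candidates. It is not part of the libs. The regex only catches simple expressions such as up == 0, up < 1, or absent(up{...}), and the function name is hypothetical, so every hit should still be reviewed by hand.

```python
import re

# Heuristic: the whole expression is either absent(up...) or a comparison
# of the bare `up` metric against 0 or 1. Compound expressions do not match.
_UP_OR_ABSENT = re.compile(
    r"^\s*(absent\(\s*up\b.*\)|up\b(\{.*\})?\s*(==|<|!=)\s*[01])\s*$"
)


def find_duplicate_candidates(rule_groups: dict) -> list:
    """Return (group name, alert name) pairs that look like up/absent alerts."""
    candidates = []
    for group in rule_groups.get("groups", []):
        for rule in group.get("rules", []):
            expr = str(rule.get("expr", ""))
            if "alert" in rule and _UP_OR_ABSENT.match(expr):
                candidates.append((group.get("name", ""), rule["alert"]))
    return candidates
```

The input is expected to be a parsed rule file (a dict with a groups list), for example the result of loading a charm's alert rule YAML.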