
Prometheus alert rules are created only for one of the two (or more) applications related over the cos_agent interface #17

Closed
przemeklal opened this issue Oct 24, 2023 · 6 comments · Fixed by #47

Comments

@przemeklal
Member

Bug Description

After relating grafana-agent to two principal applications, in our case, zookeeper and kafka, the generic grafana-agent host alert rules (e.g. HostCpuHighIowait) are generated only for one of the apps (zookeeper).

Notably, the grafana-agent leader unit is the one related to the zookeeper app (as noticed by @dstathis on our live debugging call).

To Reproduce

  1. Deploy grafana-agent.
  2. Deploy two principal applications, in our case zookeeper and kafka.
  3. Relate grafana-agent to both applications over the cos-agent interface.
  4. Relate grafana-agent to COS Prometheus:
     prometheus-receive-remote-write:receive-remote-write  grafana-agent:send-remote-write
  5. Observe that alert rules are created only for one app, for example (see the juju CLI sketch after this section for a rough reproduction of steps 1-4):
name: HostCpuHighIowait
expr: avg by (instance) (rate(node_cpu_seconds_total{juju_application="zookeeper",juju_charm="grafana-agent",juju_model="redacted",juju_model_uuid="redacted",mode="iowait"}[5m])) * 100 > 10
labels:
  juju_application: zookeeper
  juju_charm: grafana-agent
  juju_model: redacted
  juju_model_uuid: redacted
  severity: warning
annotations:
  description: CPU iowait > 10%. A high iowait means that you are disk or network bound.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Host CPU high iowait (instance {{ $labels.instance }})

The same alert rule for kafka is missing.

Notably, the grafana-agent leader unit is the one related to the zookeeper app.
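
For completeness, the reproduction steps map roughly onto the following juju commands. This is a sketch only: the channels, unit counts, and the prometheus-receive-remote-write offer name are assumptions based on the status output in this report, not an exact record of the affected deployment.

# Rough CLI equivalent of steps 1-4 (channels and scale are assumptions)
juju deploy zookeeper --channel 3/edge -n 3
juju deploy kafka --channel 3/edge -n 3
juju deploy grafana-agent --channel latest/edge
juju relate zookeeper:cos-agent grafana-agent:cos-agent
juju relate kafka:cos-agent grafana-agent:cos-agent
# cross-model relation to COS Prometheus (the consumed offer name is assumed)
juju relate grafana-agent:send-remote-write prometheus-receive-remote-write:receive-remote-write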

Environment

Monitored model:

grafana-agent                       active      6  grafana-agent              latest/edge   20  no
kafka                               active      3  kafka                      3/edge       121  no
zookeeper                           active      3  zookeeper                  3/edge       100  no

Juju status:

kafka/0*                      active    idle   0        redacted_subnet.4
  grafana-agent/10            active    idle            redacted_subnet.4
  ubuntu-advantage/2          active    idle            redacted_subnet.4            Attached (esm-infra,livepatch)
kafka/1                       active    idle   1        redacted_subnet.5
  grafana-agent/9             active    idle            redacted_subnet.5
  ubuntu-advantage/1          active    idle            redacted_subnet.5            Attached (esm-infra,livepatch)
kafka/2                       active    idle   2        redacted_subnet.6
  grafana-agent/11            active    idle            redacted_subnet.6
  ubuntu-advantage/0          active    idle            redacted_subnet.6            Attached (esm-infra,livepatch)
tls-certificates-operator/0*  active    idle   3        redacted_subnet.7
zookeeper/0                   active    idle   3        redacted_subnet.7
  grafana-agent/6             active    idle            redacted_subnet.7
  ubuntu-advantage/3*         active    idle            redacted_subnet.7            Attached (esm-infra,livepatch)
zookeeper/1                   active    idle   4        redacted_subnet.8
  grafana-agent/8             active    idle            redacted_subnet.8
  ubuntu-advantage/4          active    idle            redacted_subnet.8            Attached (esm-infra,livepatch)
zookeeper/2*                  active    idle   5        redacted_subnet.9
  grafana-agent/7*            active    idle            redacted_subnet.9
  ubuntu-advantage/5          active    idle            redacted_subnet.9            Attached (esm-infra,livepatch)
$ juju status --relations | grep grafana-agent
grafana-agent:grafana-dashboards-provider             grafana-dashboards:grafana-dashboard   grafana_dashboard         regular
grafana-agent:peers                                   grafana-agent:peers                    grafana_agent_replica     peer
kafka:cos-agent                                       grafana-agent:cos-agent                cos_agent                 subordinate
loki-logging:logging                                  grafana-agent:logging-consumer         loki_push_api             regular
prometheus-receive-remote-write:receive-remote-write  grafana-agent:send-remote-write        prometheus_remote_write   regular
zookeeper:cos-agent                                   grafana-agent:cos-agent                cos_agent                 subordinate

Prometheus version in COS model:

prometheus             2.46.0   active      1  prometheus-k8s         edge     133  10.152.183.234  no       

Relevant log output

CLI command output is included above; let me know which logs would help.

Additional context

No response

@przemeklal
Member Author

Confirmed in another model that the alert rule is created only for the app that has the grafana-agent's juju leader as its subordinate:

name: HostCpuHighIowait
expr: avg by (instance) (rate(node_cpu_seconds_total{juju_application="aodh",juju_charm="grafana-agent",juju_model="openstack",juju_model_uuid="redacted",mode="iowait"}[5m])) * 100 > 10
labels:
  juju_application: aodh
  juju_charm: grafana-agent
  juju_model: openstack
  juju_model_uuid: redacted
  severity: warning
annotations:
  description: CPU iowait > 10%. A high iowait means that you are disk or network bound.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Host CPU high iowait (instance {{ $labels.instance }})
aodh/2                        active    idle   26/lxd/0  redacted     8042/tcp       Unit is ready
  grafana-agent-container/0*  active    idle             redacted                    grafana-cloud-config: off, logging-consumer: off

@dstathis
Contributor

dstathis commented Nov 3, 2023

When I reproduce this, it looks like the kafka alerts made it into Prometheus, but they are labeled with juju_application="zookeeper". Can you check if, for example, there is an alert called "Kafka Missing" in your deployment?
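
If it helps, one way to check which juju_application label the kafka rules ended up with is to query the Prometheus rules API directly. A minimal sketch, assuming Prometheus is reachable at $PROM_ADDR on port 9090 and jq is installed:

# List every rule whose name contains "Kafka" together with its labels
curl -s "http://$PROM_ADDR:9090/api/v1/rules" \
  | jq '.data.groups[].rules[] | select(.name | test("Kafka")) | {name, labels}'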

@przemeklal
Member Author

@dstathis Confirmed:

name: Kafka Missing
expr: up{juju_application="zookeeper",juju_charm!=".*",juju_model="redacted",juju_model_uuid="redacted"} == 0
labels:
  juju_application: zookeeper
  juju_charm: kafka
  juju_model: redacted
  juju_model_uuid: redacted
  severity: critical
annotations:
  description: Kafka target has disappeared. An exporter might be crashed.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Prometheus target missing (instance {{ $labels.instance }})

After splitting grafana-agent into two separate apps, and relating them to kafka and zookeeper respectively, I also see Kafka Missing with the correct label: juju_application: kafka.

However, Kafka Missing with the label juju_application: zookeeper is still there.
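
For reference, the split into two grafana-agent applications can be reproduced roughly as below. The grafana-agent-kafka and grafana-agent-zookeeper application names are placeholders for illustration, not the names used in the actual model:

# Deploy two separately named grafana-agent applications and relate each to one principal
juju deploy grafana-agent grafana-agent-kafka --channel latest/edge
juju deploy grafana-agent grafana-agent-zookeeper --channel latest/edge
juju relate kafka:cos-agent grafana-agent-kafka:cos-agent
juju relate zookeeper:cos-agent grafana-agent-zookeeper:cos-agent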

@dstathis
Contributor

dstathis commented Nov 3, 2023

Created #29 for the alerts sticking around.

@nobuto-m

@simskij This issue becomes critical once we sort out canonical/prometheus-k8s-operator#551. Without fixing this, the host metrics alert rules for all but one application will be missing.

@dstathis
Contributor

After investigating this issue further, we have determined that the implementation of alert and metric labels will have to change significantly. All alerts and metrics will now be labeled with the topology labels of the charm they came from rather than with the topology of the principal.

This change will likely be completed after the winter break.

This change will require updating the cos_agent library in client charms and may require changes to any git-based alert rules.
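
As a rough illustration of the client-side update, refreshing the charm library from a client charm's root directory would look something like the following. The library path is an assumption based on the cos_agent interface name and should be confirmed against Charmhub before use:

# Fetch the updated cos_agent charm library (library path assumed, verify on Charmhub)
charmcraft fetch-lib charms.grafana_agent.v0.cos_agent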
