Invalid prometheus configuration, `lint error 45 duplicate rule(s) found.` after deploying hardware-observer #152

przemeklal · 2024-01-26T12:22:18Z

Versions:

hardware-observer rev 27
prometheus-k8s rev 159
grafana-agent rev 16

Relating hw-observer to a grafana-agent that is itself related to COS Prometheus left Prometheus in a blocked state:

prometheus/0*                       blocked   idle   10.1.100.135         Invalid prometheus configuration; see debug logs

with the following errors in the debug-log output:

unit-prometheus-0: 12:09:42 ERROR unit.prometheus/0.juju-log Invalid prometheus configuration. Stdout: Checking /etc/prometheus/prometheus.yml
  SUCCESS: 11 rule files found
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

Checking /etc/prometheus/rules/juju_controller_21d95645_prometheus-juju-exporter_metrics-endpoint_48.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_alertmanager_metrics-endpoint_17.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_grafana_metrics-endpoint_19.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_loki_metrics-endpoint_18.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_traefik_metrics-endpoint_16.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_juju_openstack_c2eab20_redacted-manila_0_alert_rules_metrics-endpoint_34.rules
  SUCCESS: 1889 rules found

Checking /etc/prometheus/rules/juju_maas-infra_1e82964a_infra-node.rules
  SUCCESS: 80 rules found

Checking /etc/prometheus/rules/juju_microk8s_300c24ad_microk8s.rules
  SUCCESS: 168 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_ceph-osd.rules
  SUCCESS: 80 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_manila-ganesha.rules
  SUCCESS: 35 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_nova-compute.rules

 Stderr:   FAILED:
lint error 45 duplicate rule(s) found.
Metric: CollectorFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: error
Metric: IPMICurrentStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIDCMICommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMIDCMIPowerConsumptionPercentageOutstanding
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: IPMIFanSpeedStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIMonitoringCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMIPowerStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMISELCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMISELStateCritical
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMISELStateWarning
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: IPMISensorStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMITemperatureStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIVoltageStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: LSISASControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: LSISASIRVolumeNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: LSISASIRVolumeUnready
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: LSISASPhysicalDiskUnready
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: MegaRAIDControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: MegaRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: PerccliCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: PowerEdgeRAIDControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: PowerEdgeRAIDControllerSuccess
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: PowerEdgeRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishCallFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishChassisHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishChassisHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishMemoryDimmHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishMemoryDimmHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishProcessorHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishProcessorHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishSensorHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishSensorHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishServiceUnavailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishSmartStorageHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishStorageControllerHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishStorageControllerHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishStorageDriveHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishStorageDriveHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SasircuCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLICommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLIControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: SsaCLIControllerNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLILogicalDriveNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLIPhysicalDriveNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: StorcliCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Might cause inconsistency while recording expressions

unit-prometheus-0: 12:09:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

We have multiple hardware-observer and grafana-agent applications running on this cluster, which might be the cause, though I believe such deployments should be supported.

przemeklal · 2024-01-29T08:47:13Z

Removing hw-obs <-> g-agent <-> cos prometheus/loki relations and recreating loki and prometheus pods results in active/idle status of prometheus and loki.

After removing all relations between g-agent/hw-observer and COS, switching to a single hw-observer application, and re-adding the relations, I now have this again:

loki/0*                             blocked   idle       x.x.x.x          Errors in alert rule groups. Check juju debug-log
prometheus/0*                       blocked   idle       x.x.x.y          Invalid prometheus configuration; see debug logs

with the same errors in debug-log. This might be related to old alert rules not being removed/updated after removing and re-adding relations.

Pjack · 2024-01-29T08:54:19Z

This may be related to #127, which we are currently handling. We will take a look.

przemeklal · 2024-01-29T09:08:24Z

Thanks, it sounds like this is the case. I managed to work around it by manually removing the duplicated rules from /etc/prometheus/rules/juju_openstack_c2eab20b_nova-compute.rules inside the Prometheus pod. The alert groups were duplicated because nova-compute was previously related to hardware-observer-compute. After redeployment, the app name changed to hardware-observer, so I ended up with two alert groups. That said, there would have been no issue if the original alert group had been removed after un-relating hardware-observer-compute from nova-compute. It might be the same issue with Loki; I'll report back once I figure something out.
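
For anyone hitting the same thing, here is a minimal Python sketch of how the duplicates could be spotted before editing the file by hand. It assumes the rules file has already been parsed into a dict (for example with a YAML loader) and that a duplicate means the same alert or record name with an identical label set, which is what the lint output above appears to report. The function names, group names, and the placeholder `expr` are hypothetical and not taken from any charm or from promtool.

```python
from collections import Counter
from typing import Any


def rule_key(rule: dict[str, Any]) -> tuple[str, tuple[tuple[str, str], ...]]:
    """Identify a rule by its alert/record name plus its sorted label set."""
    name = rule.get("alert") or rule.get("record") or ""
    labels = tuple(sorted((rule.get("labels") or {}).items()))
    return name, labels


def find_duplicate_rules(rule_file: dict[str, Any]) -> list[tuple[str, dict[str, str]]]:
    """Return (name, labels) for every rule identity that appears more than once."""
    counts = Counter(
        rule_key(rule)
        for group in rule_file.get("groups", [])
        for rule in group.get("rules", [])
    )
    return [(name, dict(labels)) for (name, labels), n in counts.items() if n > 1]


# Hypothetical, trimmed-down stand-in for the parsed nova-compute rules file:
# a stale group left behind by the old relation plus the group from the new one.
critical_labels = {
    "juju_application": "nova-compute",
    "juju_charm": "hardware-observer",
    "juju_model": "openstack",
    "severity": "critical",
}
warning_labels = {**critical_labels, "severity": "warning"}

example = {
    "groups": [
        {
            "name": "stale_group_from_old_relation",  # assumed name
            "rules": [
                # "vector(1)" is a placeholder expression, not the real rule
                {"alert": "StorcliCommandFailed", "expr": "vector(1)", "labels": dict(critical_labels)},
            ],
        },
        {
            "name": "group_from_current_relation",  # assumed name
            "rules": [
                {"alert": "StorcliCommandFailed", "expr": "vector(1)", "labels": dict(critical_labels)},
                {"alert": "RedfishCallFailed", "expr": "vector(1)", "labels": dict(warning_labels)},
            ],
        },
    ]
}

duplicates = find_duplicate_rules(example)
for name, dup_labels in duplicates:
    print(f"duplicate rule: {name} {dup_labels}")
```

In this sketch the two groups have different names, but the rule inside them carries the same alert name and the same labels, so it is flagged. That mirrors the situation described above, where the labels point at nova-compute and hardware-observer in both the old and the new group.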

przemeklal · 2024-01-29T09:35:32Z

Update: I don't see any alert rules in Loki specific to hw-observer so I believe it is an issue with multiple grafana-agent instances.

Pjack · 2024-01-29T09:59:05Z

Multiple hardware-observer applications in the same cluster are not supported. Would you mind sharing the reason for deploying more than one? I think it's not supported in grafana-agent either.

przemeklal · 2024-01-29T10:03:55Z

Examples of why multiple hw-observer instances might need to be deployed:

- hw-observer is deployed in separate models (e.g. openstack and maas-infra)
- non-uniform hardware in the same model (different vendors with different RAID controllers)
- non-uniform redfish credentials
- as a workaround for "Prometheus alert rules are created only for one of the two (or more) applications related over the cos_agent interface" (grafana-agent-operator#17)

Pjack · 2024-02-02T09:37:15Z

After discussing with @przemeklal, this issue seems to lie between grafana-agent and prometheus.
#127 has also been fixed, so I will close this issue.

Pjack modified the milestones: 23.10.3, 23.10.4 Jan 30, 2024

Pjack closed this as completed Feb 2, 2024
