Invalid prometheus configuration, lint error 45 duplicate rule(s) found. after deploying hardware-observer #152

Closed
przemeklal opened this issue Jan 26, 2024 · 7 comments

@przemeklal
Member

Versions:

hardware-observer rev 27
prometheus-k8s rev 159
grafana-agent rev 16

Relating hw-observer to a grafana-agent that is itself related to COS Prometheus left Prometheus in a blocked state:

prometheus/0*                       blocked   idle   10.1.100.135         Invalid prometheus configuration; see debug logs

with the following errors in the debug-log output:

unit-prometheus-0: 12:09:42 ERROR unit.prometheus/0.juju-log Invalid prometheus configuration. Stdout: Checking /etc/prometheus/prometheus.yml
  SUCCESS: 11 rule files found
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

Checking /etc/prometheus/rules/juju_controller_21d95645_prometheus-juju-exporter_metrics-endpoint_48.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_alertmanager_metrics-endpoint_17.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_grafana_metrics-endpoint_19.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_loki_metrics-endpoint_18.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_cos_ba76c9c5_traefik_metrics-endpoint_16.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_juju_openstack_c2eab20_redacted-manila_0_alert_rules_metrics-endpoint_34.rules
  SUCCESS: 1889 rules found

Checking /etc/prometheus/rules/juju_maas-infra_1e82964a_infra-node.rules
  SUCCESS: 80 rules found

Checking /etc/prometheus/rules/juju_microk8s_300c24ad_microk8s.rules
  SUCCESS: 168 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_ceph-osd.rules
  SUCCESS: 80 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_manila-ganesha.rules
  SUCCESS: 35 rules found

Checking /etc/prometheus/rules/juju_openstack_c2eab20b_nova-compute.rules

 Stderr:   FAILED:
lint error 45 duplicate rule(s) found.
Metric: CollectorFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: error
Metric: IPMICurrentStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIDCMICommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMIDCMIPowerConsumptionPercentageOutstanding
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: IPMIFanSpeedStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIMonitoringCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMIPowerStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMISELCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMISELStateCritical
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: IPMISELStateWarning
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: IPMISensorStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMITemperatureStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: IPMIVoltageStateNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: {{ toLower $labels.state }}
Metric: LSISASControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: LSISASIRVolumeNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: LSISASIRVolumeUnready
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: LSISASPhysicalDiskUnready
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: MegaRAIDControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: MegaRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: PerccliCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: PowerEdgeRAIDControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: PowerEdgeRAIDControllerSuccess
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: PowerEdgeRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishCallFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishChassisHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishChassisHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishMemoryDimmHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishMemoryDimmHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishProcessorHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishProcessorHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishSensorHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishSensorHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishServiceUnavailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishSmartStorageHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishStorageControllerHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishStorageControllerHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: RedfishStorageDriveHealthNotAvailable
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: RedfishStorageDriveHealthNotOk
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SasircuCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLICommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLIControllerNotFound
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: warning
Metric: SsaCLIControllerNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLILogicalDriveNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: SsaCLIPhysicalDriveNotOK
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Metric: StorcliCommandFailed
Label(s):
	juju_application: nova-compute
	juju_charm: hardware-observer
	juju_model: openstack
	juju_model_uuid: c2eab20b-21b7-4f8e-813b-5ae038e9cbb4
	severity: critical
Might cause inconsistency while recording expressions

unit-prometheus-0: 12:09:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

We run multiple hardware-observer and grafana-agent applications on this cluster, which may be the cause, though I believe such deployments should be supported.
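To make the lint output above easier to read: each reported duplicate is a rule name together with a label set, so the check seems to flag any two rules in one file that share both. Below is a minimal sketch of that comparison, not promtool's actual implementation. It assumes the rules file has already been parsed into a Python dict (for example with a YAML library); the group names and the `vector(1)` expressions are hypothetical placeholders, and the labels are trimmed to two for brevity.

```python
from collections import Counter


def rule_key(rule):
    """Identity used for comparison: rule name plus its sorted label set."""
    name = rule.get("alert") or rule.get("record") or ""
    labels = tuple(sorted((rule.get("labels") or {}).items()))
    return (name, labels)


def find_duplicate_rules(rules_doc):
    """Return the (name, labels) keys that occur more than once in one parsed rules file."""
    counts = Counter(
        rule_key(rule)
        for group in rules_doc.get("groups", [])
        for rule in group.get("rules", [])
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical, heavily trimmed stand-in for the nova-compute rules file.
example = {
    "groups": [
        {
            "name": "hypothetical_old_group",
            "rules": [
                {
                    "alert": "CollectorFailed",
                    "expr": "vector(1)",  # placeholder expression
                    "labels": {"juju_application": "nova-compute", "severity": "error"},
                },
            ],
        },
        {
            "name": "hypothetical_new_group",
            "rules": [
                {
                    "alert": "CollectorFailed",
                    "expr": "vector(1)",  # placeholder expression
                    "labels": {"juju_application": "nova-compute", "severity": "error"},
                },
                {
                    "alert": "RedfishCallFailed",
                    "expr": "vector(1)",  # placeholder expression
                    "labels": {"juju_application": "nova-compute", "severity": "warning"},
                },
            ],
        },
    ]
}

for name, labels in find_duplicate_rules(example):
    print(name, dict(labels))
```

Note that the labels shown in the log carry `juju_application` and `juju_charm` but nothing that identifies the hardware-observer application itself, which would explain why two groups from differently named hardware-observer applications collide under this kind of comparison.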

@przemeklal
Member Author

Removing the hw-obs <-> g-agent <-> COS prometheus/loki relations and recreating the loki and prometheus pods brings both prometheus and loki back to active/idle.

After removing all relations between g-agent/hw-observer and COS, switching to a single hw-observer application, and re-adding the relations, I see this again:

loki/0*                             blocked   idle       x.x.x.x          Errors in alert rule groups. Check juju debug-log
prometheus/0*                       blocked   idle       x.x.x.y          Invalid prometheus configuration; see debug logs

with the same errors in debug-log. This might be related to old alert rules not being removed/updated after removing and re-adding relations.

@Pjack

Pjack commented Jan 29, 2024

This may be related to #127, which we are already handling. We will take a look.

@przemeklal
Member Author

Thanks, that sounds like the cause. I managed to work around it by manually removing the duplicated rules from /etc/prometheus/rules/juju_openstack_c2eab20b_nova-compute.rules inside the Prometheus pod. The alert groups were duplicated because nova-compute was previously related to hardware-observer-compute; after redeployment the app was renamed to hardware-observer, so I ended up with two alert groups. That said, there would have been no issue if the original alert group had been removed when hardware-observer-compute was un-related from nova-compute. Loki might have the same issue; I'll report back once I figure something out.
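For reference, the manual cleanup described above can be sketched as a small filter over a parsed rules file. This is only an illustration under stated assumptions: it assumes the file has been loaded into a Python dict, and that the stale alert group can be recognised because its group name contains the old application name. The group names and the `vector(1)` expression below are hypothetical; check the real names in the rules file before relying on a substring match.

```python
def drop_stale_groups(rules_doc, stale_marker):
    """Return a copy of a parsed rules file without groups whose name contains stale_marker."""
    kept = [
        group
        for group in rules_doc.get("groups", [])
        if stale_marker not in group.get("name", "")
    ]
    return {**rules_doc, "groups": kept}


# Hypothetical group names: one left over from the old application name,
# one created after the application was renamed.
example = {
    "groups": [
        {
            "name": "hypothetical_hardware-observer-compute_alerts",
            "rules": [{"alert": "CollectorFailed", "expr": "vector(1)"}],
        },
        {
            "name": "hypothetical_hardware-observer_alerts",
            "rules": [{"alert": "CollectorFailed", "expr": "vector(1)"}],
        },
    ]
}

cleaned = drop_stale_groups(example, "hardware-observer-compute")
print([group["name"] for group in cleaned["groups"]])
```

The marker has to be the longer, old name: `hardware-observer` is a substring of `hardware-observer-compute`, so matching on the new name would remove both groups.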

@przemeklal
Member Author

Update: I don't see any alert rules in Loki specific to hw-observer, so I believe it is an issue with multiple grafana-agent instances.

@Pjack

Pjack commented Jan 29, 2024

Running multiple hardware-observer applications in the same cluster is not supported. Would you mind sharing the reason you need this? I don't think grafana-agent supports it either.

@przemeklal
Member Author

Examples of why multiple hw-observer instances might need to be deployed:

@Pjack Pjack modified the milestones: 23.10.3, 23.10.4 Jan 30, 2024
@Pjack

Pjack commented Feb 2, 2024

After discussing with @przemeklal, this issue appears to be between grafana-agent and Prometheus.
#127 has also been fixed, so I will close this issue.

@Pjack Pjack closed this as completed Feb 2, 2024