This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

add and emit pool owner metadata for alerting #327

Merged: merged 11 commits into master on Oct 13, 2023

Conversation

@gmdfalk (Contributor) commented on Jun 27, 2023

We'd like to have pool_owner metadata on each pool for both informational and alerting purposes.
The new attribute on the pool config, pool_owner, defaults to compute_infra.
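
As a minimal sketch of the intended behaviour (the function and config shape below are hypothetical, not clusterman's actual config API; only the attribute name and its default come from this PR):

    # hypothetical sketch: resolve a pool's owner from its pool config
    DEFAULT_POOL_OWNER = "compute_infra"

    def get_pool_owner(pool_config: dict) -> str:
        # pools that do not declare pool_owner are attributed to the compute_infra team
        return pool_config.get("pool_owner", DEFAULT_POOL_OWNER)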

Signed-off-by: Max Falk [email protected]

Description

Please fill out!

Testing Done

Please fill out! Generally speaking any new features should include
additional unit or integration tests to ensure the behaviour is
working correctly.

@nemacysts (Member) left a comment:

minor nits, but nothing blocking :)

@@ -181,7 +181,7 @@ def run(self, dry_run: bool = False, timestamp: Optional[arrow.Arrow] = None) ->
         self.target_capacity_gauge.set(new_target_capacity, {"dry_run": dry_run})
         self.max_capacity_gauge.set(
             self.pool_manager.max_capacity,
-            {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity},
+            {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity, "team": self.pool_manager.pool_owner},
Member

should we s/team/pool_owner to keep things consistent between config and metric labeling?

Contributor Author

I knowingly set it to team because that's what our alertmanager automatically resolves for ticketing/alerting.
If we don't set it to team here, we'll have to relabel in the prometheus query for the max_pool_capacity alerting.

Still want to keep it at pool_owner? @nemacysts @jfongatyelp

Member

oh, that's fine then - might be good to add that in a comment here tho since i'm not sure how many people are aware of that (or maybe i'm just the only one that didn't know this :p)
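
As an illustration of what such a comment could say (the wording below is a hypothetical sketch, not the comment that actually landed; the reasoning is taken from the discussion above):

    # NOTE: this metric label is intentionally "team" rather than "pool_owner": our
    # alertmanager automatically resolves the "team" label for ticketing/alerting, so
    # emitting "pool_owner" instead would require a relabel in the Prometheus query
    # behind the max capacity alert.
    {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity, "team": self.pool_manager.pool_owner},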

gmdfalk and others added 3 commits June 29, 2023 10:29
            {
                "dry_run": dry_run,
                "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity,
                "team": self.pool_manager.pool_owner,
Member

@gmdfalk actually, I was just talking to @EmanElsaban about another alert we wanted to use internally and we realized that having a team label on all of the clusterman metrics would be useful - thoughts on adding the label to everything else in this PR?

@gmdfalk (Contributor, PR author) commented on Jul 6, 2023

@nemacysts Yeah, probably makes sense!

In that case, i think we should be able to add the team label to autoscaler._emit_requested_resource_metrics and drainer._emit_draining_metrics to tag onto most if not all metrics emitted?
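
As a rough sketch of that centralization (the helper and call site below are hypothetical, not code from this PR; only the "team" dimension and the pool_owner attribute come from the discussion above):

    # hypothetical sketch: a single helper that metric emitters funnel their dimensions through
    def with_team_dimension(dimensions: dict, pool_owner: str) -> dict:
        # merge the owning team into whatever dimensions a gauge already emits
        return {**dimensions, "team": pool_owner}

    # illustrative call site, e.g. inside a requested-resource metric emitter:
    # gauge.set(value, with_team_dimension({"dry_run": dry_run}, pool_manager.pool_owner))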

Member

++ - if we have a centralized place that sounds even better!

I'm a little torn about whether or not this would be of use for the drainer metrics: is there any case where we're not at fault for the drainer not working and we'd want pool owners to deal with drainer pages?

Member

although, i suppose it would be useful for filtering regardless :)

* master:
  CLUSTERMAN-812: upgrade k8s client library (#334)
  Revert "Revert "Add support to use in-cluster service account (#338)"" (#340)
  Revert "Add support to use in-cluster service account (#338)"
  Add support to use in-cluster service account (#338)
  Updated PATH variable in supervisod.conf
  Grant home dir permission to user nobody
  Failing on signal errors
  pin cryptography for acceptance venv
  bump yelp-batch to 11.2.7
  using right targz
  Adding new targz with metrics fix
  Adding old acceptance test
  Revert "Revert "Upgrading clusterman image to Ubuntu Jammy""
Signed-off-by: Max Falk <[email protected]>
@gmdfalk merged commit 2595b5e into master on Oct 13, 2023
2 of 5 checks passed
nemacysts added a commit that referenced this pull request Oct 18, 2023
* Revert "add and emit pool owner metadata for alerting (#327)"

This reverts commit 2595b5e.

* Revert "CLUSTERMAN-812: upgrade k8s client library (#334)"

This reverts commit 6c4b8bb.