Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hub metrics collector feature #1340

Merged
merged 55 commits into from
Feb 27, 2024
Merged

Hub metrics collector feature #1340

merged 55 commits into from
Feb 27, 2024

Conversation

coleenquadros
Copy link
Contributor

@coleenquadros coleenquadros commented Feb 20, 2024

Ticket - https://issues.redhat.com/browse/ACM-8509
This PR refactors the deployment of the metrics-collector in the hub/local-cluster. We want to ensure that the metrics-collector on the hub runs independent of hubselfmanagement toggle.

  • The placement controller of MCO operator treats the hub cluster as a special case, and directly creates the endpoint operator and the related resources (mTLS certificates, metrics lists, hub-info-secret that drives CMO configuration changes), in the open-cluster-management-observability namespace, instead of creating them as ManifestWork objects. It further ensures that it has no dependencies on the add-on framework. There will no longer be an Observability add-on on the hub cluster. In this respect, the hub cluster (local-cluster managed cluster) will be treated as a special case. In turn, the endpoint operator:
  • Creates a metrics collector deployment always in open-cluster-management-observability namespace on the hub. This namespace will be different from open-cluster-management-addon-observability used in all other spokes
  • Makes modifications to CMO operator configuration to ensure that hub cluster Prometheus continues to forward alerts to ACM Alert Manager
  • The MCO operator placement controller ignores managed cluster create/update/delete events for local-cluster, so observability add-on is not creating managed cluster components in the old way. There will be only one way the endpoint operator is created on the hub.
  • The MCO operator takes ownership of refreshing mTLS certificates for the collector on the hub, and restarts metrics collector deployment when the certificates expire/change/refreshed.
  • The MCO operator will be responsible for installing these hub components when ACM is installed and Observability is enabled.
  • The MCO operator will be responsible for upgrading these components when ACM is upgraded.
  • The MCO operator will be responsible for uninstalling these components when ACM is uninstalled or Observability is disabled.

Since we no longer use observability add-on on the hub, we lose the ability to report failure to send metrics as via observability addon status. To compensate for this:

  • Metrics collector should publish metrics related to failed metrics upload count.
  • A local alert is raised when the upload metrics count exceeds a defined threshold
  • The count should be reset after the error is cleared and metrics are propagating again.

Known issues (to be tracked as defects)

  1. Certificate rotation for hub metrics collector after it expires in a year
  2. Image replacement annotations don't work for hub endpoint operator and metrics collector (dev feature)
  3. Some scenarios where "local-cluster" is still not available in grafana
  4. Retrofit e2e tests (dev feature)

Copy link

openshift-ci bot commented Feb 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: coleenquadros

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coleenquadros
Copy link
Contributor Author

/retest

@subbarao-meduri subbarao-meduri changed the title First drop: Hub metrics collector feature Hub metrics collector feature Feb 21, 2024
@coleenquadros
Copy link
Contributor Author

/retest

return err
}

func createCSR() ([]byte, []byte) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it's not a correct choice to use CSR flow here. Have you check the possibility to use the functions in https://github.com/stolostron/multicluster-observability-operator/blob/main/operators/multiclusterobservability/pkg/certificates/certificates.go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just generate a private key and then sign it to have the client cert.

Copy link
Contributor

@clyang82 clyang82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check if we create the obsaddon in open-cluster-management-observability namespace or not? I think no. but just double check. thanks.

return err
}

func createCSR() ([]byte, []byte) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just generate a private key and then sign it to have the client cert.

return csr, privateKey
}

func createMtlsCertSecretForHubCollector(c client.Client) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall procedure to create client certificate looks good. The client cert may expire, cert rotation is required. You can find some example code from registration agent: https://github.com/open-cluster-management-io/ocm/blob/1c3cb033b096c0e95575dfbf6d5d12126f5d7e57/pkg/registration/clientcert/cert_controller.go#L290

coleenquadros and others added 5 commits February 26, 2024 09:11
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
coleenquadros and others added 9 commits February 26, 2024 18:01
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Subbarao Meduri <[email protected]>
@coleenquadros
Copy link
Contributor Author

/retest

Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
41.3% Coverage on New Code (required ≥ 70%)

See analysis details on SonarCloud

Copy link

openshift-ci bot commented Feb 27, 2024

@coleenquadros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/sonarcloud d7af621 link true /test sonarcloud
ci/prow/e2e-kind d7af621 link true /test e2e-kind
ci/prow/test-e2e d7af621 link true /test test-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

// default:
// obj.SetNamespace(config.GetDefaultNamespace())
// }
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we ignoring other types of errors? Isn't risky?

// switch obj.GetObjectKind().GroupVersionKind().Kind {
// case "Deployment":
// currentObj = &appsv1.Deployment{}
// case "Secret":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

// if needsUpdate {
// err = c.Update(context.TODO(), obj)
// if err != nil {
// log.Error(err, "Failed to update resource", "kind", obj.GetObjectKind().GroupVersionKind().Kind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this?

@@ -360,11 +362,15 @@ func (r *MultiClusterObservabilityReconciler) initFinalization(
if mco.GetDeletionTimestamp() != nil && slices.Contains(mco.GetFinalizers(), resFinalizer) {
log.Info("To delete resources across namespaces")
// clean up the cluster resources, eg. clusterrole, clusterrolebinding, etc
operatorconfig.IsMCOTerminating = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need that aren't we? The caller sets it using returned boolean.

}
foundClusterMonitoringConfigurationJSON, err := yaml.YAMLToJSON([]byte(foundClusterMonitoringConfigurationYAML))
if err != nil {
log.Error(err, "failed to transform YAML to JSON", "YAML", foundClusterMonitoringConfigurationYAML)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a global remark regarding the logs again that doesn't need to be changed here.
We are logging a lot errors at nested functions level. This function is called 3 levels down the Reconcile function and we are logging returned errors at every one of them (except in the Reconcile function...). So we are generating 3 logs for the same error. We could instead log only in the Reconcile function, while wrapping errors with the additional info in sub functions (except in some rare cases were details cannot be wrapped in the error).

@@ -615,6 +863,9 @@ func generateMetricsListCM(client client.Client) (*corev1.ConfigMap, *corev1.Con

func getObservabilityAddon(c client.Client, namespace string,
mco *mcov1beta2.MultiClusterObservability) (*mcov1beta1.ObservabilityAddon, error) {
if namespace == config.GetDefaultNamespace() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment explaining this check?

@@ -57,6 +59,9 @@ func deleteObsAddon(c client.Client, namespace string) error {
}

func createObsAddon(c client.Client, namespace string) error {
if namespace == config.GetDefaultNamespace() {
Copy link
Contributor

@thibaultmg thibaultmg Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate a comment here also explaining the reason for this check.

// In the case when hubSelfManagement is enabled, we will delete it from the list and modify the object
// to cater to the use case of deploying in open-cluster-management-observability namespace
delete(managedClusterList, "local-cluster")
if _, ok := managedClusterList["local-cluster"]; !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check? I understand we just removed the key before.

@subbarao-meduri subbarao-meduri merged commit 634f766 into main Feb 27, 2024
10 of 15 checks passed
@subbarao-meduri subbarao-meduri deleted the hub-metrics-collector branch February 27, 2024 15:27
sradco pushed a commit to sradco/multicluster-observability-operator that referenced this pull request Mar 17, 2024
* Hub Metrics Collection (stolostron#1339)

* install Metrics Without Addon

Signed-off-by: clyang82 <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>

* install Metrics Without Addon

Signed-off-by: clyang82 <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add reconcile check for hub collector resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* make sure mco reconciles when resources for hub metrics collection are missing

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test failures

Signed-off-by: Coleen Iona Quadros <[email protected]>

* fix test

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup logs

Signed-off-by: Coleen Iona Quadros <[email protected]>

* fixt test

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint errors

Signed-off-by: Coleen Iona Quadros <[email protected]>

* delete endpoint when mco is deleted

Signed-off-by: Coleen Iona Quadros <[email protected]>

* fixt test

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup logs

Signed-off-by: Coleen Iona Quadros <[email protected]>

* delete metrics collector when mco is deleted

Signed-off-by: Coleen Iona Quadros <[email protected]>

* collector naming

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test cert

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test cert

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test metrics push

Signed-off-by: Coleen Iona Quadros <[email protected]>

* create correct CSR

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cert

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add User and Organization to CSR

Signed-off-by: Subbarao Meduri <[email protected]>

* set Subject Alternative Name to observability-controller.addon.open-cluster-management.io

Signed-off-by: Subbarao Meduri <[email protected]>

* fix build

Signed-off-by: Coleen Iona Quadros <[email protected]>

* comment

Signed-off-by: Coleen Iona Quadros <[email protected]>

* rebase

Signed-off-by: Coleen Iona Quadros <[email protected]>

* change key usage to UsageDigitalSignature

Signed-off-by: Subbarao Meduri <[email protected]>

* add watch logic for hubs metric collection resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add reconcile check for hub collector resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* watch delete of hub metrics collection resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* watch delete of hub metrics collection resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add watch for server ca when deleted

Signed-off-by: Coleen Iona Quadros <[email protected]>

* remove logs

Signed-off-by: Coleen Iona Quadros <[email protected]>

---------

Signed-off-by: clyang82 <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Subbarao Meduri <[email protected]>
Co-authored-by: clyang82 <[email protected]>
Co-authored-by: Subbarao Meduri <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>

* Update placementrule_controller.go

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint (stolostron#1341)

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* change server ca cert

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* make sure hub metrics collection resources are properly cleaned up

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* remove obs add on for local-cluster

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup when mco cr is removed

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* remove additional watch for hub endpoint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* typo

Signed-off-by: Coleen Iona Quadros <[email protected]>

* delete resources refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* typo

Signed-off-by: Coleen Iona Quadros <[email protected]>

* support local-cluster labels in rbac-query-proxy

Signed-off-by: Subbarao Meduri <[email protected]>

* metrics collector replica

Signed-off-by: Coleen Iona Quadros <[email protected]>

* comment proxy chnages for local-cluster

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add back proxy changes

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup ocp configs

Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* remove unused var

Signed-off-by: Coleen Iona Quadros <[email protected]>

* fix resource requests setting for hub metrics collector

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* get resource requirements refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* test clean

Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor hub metrics collector resource

Signed-off-by: Coleen Iona Quadros <[email protected]>

* resource refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* resource refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* remove proxy changes

Signed-off-by: Coleen Iona Quadros <[email protected]>

* only watch endpoint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor hub metrics resource logic

Signed-off-by: Coleen Iona Quadros <[email protected]>

* clean resources

Signed-off-by: Coleen Iona Quadros <[email protected]>

* Do not access mco resource after delete is initiated

Signed-off-by: Subbarao Meduri <[email protected]>

* RHOBS-1009: Instrument metrics-collector properly and add ServiceMonitor and alerts for it (feature branch) (stolostron#1343)

* [RHOBS-1009/ACM-8509]: Add ServiceMonitor for metrics-collector and instrument it properly

Signed-off-by: Saswata Mukherjee <[email protected]>

* fix test

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add status code label to metrics and clean up

Signed-off-by: Saswata Mukherjee <[email protected]>

* Ensure ServiceMonitor is properly configured and authenticated according to https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix lint

Signed-off-by: Saswata Mukherjee <[email protected]>

* Use correct image for kube-rbac-proxy

Signed-off-by: Saswata Mukherjee <[email protected]>

* Correct indentation for kube-rbac-proxy secret

Signed-off-by: Saswata Mukherjee <[email protected]>

* Correct servicemonitor and add ns monitoring label

Signed-off-by: Saswata Mukherjee <[email protected]>

* Correct deployment label and test

Signed-off-by: Saswata Mukherjee <[email protected]>

* Skip KRP

Signed-off-by: Saswata Mukherjee <[email protected]>

* Final port/scheme fix

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add local alerting rules

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add apiextensionsv1 to scheme

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add resource version to crd update

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix formatting

Signed-off-by: Saswata Mukherjee <[email protected]>

* Only import once

Signed-off-by: Saswata Mukherjee <[email protected]>

* Don't create krp resources

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix lint

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix lint

Signed-off-by: Saswata Mukherjee <[email protected]>

* Comment unused

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>

* cleanup refactor

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* lint

Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor obsadd on delete

Signed-off-by: Coleen Iona Quadros <[email protected]>

* refactor obsadd on delete

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* Add roles for monitoring addon-obs ns

Signed-off-by: Saswata Mukherjee <[email protected]>

* Cleanup and some fixes

Signed-off-by: Saswata Mukherjee <[email protected]>

* ensure MCO terminates with proper cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* ensure MCO terminates with proper cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* add special case for collector

Signed-off-by: Coleen Iona Quadros <[email protected]>

* ensure hub metrics collection's resources are updated when spec differs

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* cleanup

Signed-off-by: Coleen Iona Quadros <[email protected]>

* fix e2e test failures

Signed-off-by: Subbarao Meduri <[email protected]>

---------

Signed-off-by: clyang82 <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Subbarao Meduri <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Co-authored-by: clyang82 <[email protected]>
Co-authored-by: Subbarao Meduri <[email protected]>
Co-authored-by: Saswata Mukherjee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants