-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hub metrics collector feature #1340
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: coleenquadros The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
222a411
to
0d25eba
Compare
/retest |
8b99b9d
to
a1cb5ef
Compare
operators/endpointmetrics/controllers/observabilityendpoint/observabilityaddon_controller.go
Outdated
Show resolved
Hide resolved
operators/endpointmetrics/controllers/observabilityendpoint/observabilityaddon_controller.go
Outdated
Show resolved
Hide resolved
return err | ||
} | ||
|
||
func createCSR() ([]byte, []byte) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it's not a correct choice to use CSR flow here. Have you check the possibility to use the functions in https://github.com/stolostron/multicluster-observability-operator/blob/main/operators/multiclusterobservability/pkg/certificates/certificates.go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just generate a private key and then sign it to have the client cert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check if we create the obsaddon in open-cluster-management-observability namespace or not? I think no. but just double check. thanks.
operators/endpointmetrics/controllers/observabilityendpoint/metrics_collector.go
Outdated
Show resolved
Hide resolved
return err | ||
} | ||
|
||
func createCSR() ([]byte, []byte) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just generate a private key and then sign it to have the client cert.
operators/multiclusterobservability/controllers/placementrule/placementrule_controller.go
Show resolved
Hide resolved
return csr, privateKey | ||
} | ||
|
||
func createMtlsCertSecretForHubCollector(c client.Client) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall procedure to create client certificate looks good. The client cert may expire, cert rotation is required. You can find some example code from registration agent: https://github.com/open-cluster-management-io/ocm/blob/1c3cb033b096c0e95575dfbf6d5d12126f5d7e57/pkg/registration/clientcert/cert_controller.go#L290
eaf0f12
to
4520002
Compare
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
7ef9549
to
f7ea1c2
Compare
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Coleen Iona Quadros <[email protected]>
Signed-off-by: Subbarao Meduri <[email protected]>
/retest |
Quality Gate failedFailed conditions |
@coleenquadros: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
// default: | ||
// obj.SetNamespace(config.GetDefaultNamespace()) | ||
// } | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we ignoring other types of errors? Isn't risky?
// switch obj.GetObjectKind().GroupVersionKind().Kind { | ||
// case "Deployment": | ||
// currentObj = &appsv1.Deployment{} | ||
// case "Secret": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
// if needsUpdate { | ||
// err = c.Update(context.TODO(), obj) | ||
// if err != nil { | ||
// log.Error(err, "Failed to update resource", "kind", obj.GetObjectKind().GroupVersionKind().Kind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delete this?
@@ -360,11 +362,15 @@ func (r *MultiClusterObservabilityReconciler) initFinalization( | |||
if mco.GetDeletionTimestamp() != nil && slices.Contains(mco.GetFinalizers(), resFinalizer) { | |||
log.Info("To delete resources across namespaces") | |||
// clean up the cluster resources, eg. clusterrole, clusterrolebinding, etc | |||
operatorconfig.IsMCOTerminating = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need that aren't we? The caller sets it using returned boolean.
} | ||
foundClusterMonitoringConfigurationJSON, err := yaml.YAMLToJSON([]byte(foundClusterMonitoringConfigurationYAML)) | ||
if err != nil { | ||
log.Error(err, "failed to transform YAML to JSON", "YAML", foundClusterMonitoringConfigurationYAML) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a global remark regarding the logs again that doesn't need to be changed here.
We are logging a lot errors at nested functions level. This function is called 3 levels down the Reconcile function and we are logging returned errors at every one of them (except in the Reconcile function...). So we are generating 3 logs for the same error. We could instead log only in the Reconcile
function, while wrapping errors with the additional info in sub functions (except in some rare cases were details cannot be wrapped in the error).
@@ -615,6 +863,9 @@ func generateMetricsListCM(client client.Client) (*corev1.ConfigMap, *corev1.Con | |||
|
|||
func getObservabilityAddon(c client.Client, namespace string, | |||
mco *mcov1beta2.MultiClusterObservability) (*mcov1beta1.ObservabilityAddon, error) { | |||
if namespace == config.GetDefaultNamespace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment explaining this check?
@@ -57,6 +59,9 @@ func deleteObsAddon(c client.Client, namespace string) error { | |||
} | |||
|
|||
func createObsAddon(c client.Client, namespace string) error { | |||
if namespace == config.GetDefaultNamespace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate a comment here also explaining the reason for this check.
// In the case when hubSelfManagement is enabled, we will delete it from the list and modify the object | ||
// to cater to the use case of deploying in open-cluster-management-observability namespace | ||
delete(managedClusterList, "local-cluster") | ||
if _, ok := managedClusterList["local-cluster"]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check? I understand we just removed the key before.
* Hub Metrics Collection (stolostron#1339) * install Metrics Without Addon Signed-off-by: clyang82 <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> * install Metrics Without Addon Signed-off-by: clyang82 <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * add reconcile check for hub collector resources Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * make sure mco reconciles when resources for hub metrics collection are missing * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * test failures Signed-off-by: Coleen Iona Quadros <[email protected]> * fix test Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup logs Signed-off-by: Coleen Iona Quadros <[email protected]> * fixt test Signed-off-by: Coleen Iona Quadros <[email protected]> * lint errors Signed-off-by: Coleen Iona Quadros <[email protected]> * delete endpoint when mco is deleted Signed-off-by: Coleen Iona Quadros <[email protected]> * fixt test Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup logs Signed-off-by: Coleen Iona Quadros <[email protected]> * delete metrics collector when mco is deleted Signed-off-by: Coleen Iona Quadros <[email protected]> * collector naming Signed-off-by: Coleen Iona Quadros <[email protected]> * test cert Signed-off-by: Coleen Iona Quadros <[email protected]> * test cert Signed-off-by: Coleen Iona Quadros <[email protected]> * test metrics push Signed-off-by: Coleen Iona Quadros <[email protected]> * create correct CSR Signed-off-by: Coleen Iona Quadros <[email protected]> * cert Signed-off-by: Coleen Iona Quadros <[email protected]> * test Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * add User and Organization to CSR Signed-off-by: Subbarao Meduri <[email protected]> * set Subject Alternative Name to observability-controller.addon.open-cluster-management.io Signed-off-by: Subbarao Meduri <[email protected]> * fix build Signed-off-by: Coleen Iona Quadros <[email protected]> * comment Signed-off-by: Coleen Iona Quadros <[email protected]> * rebase Signed-off-by: Coleen Iona Quadros <[email protected]> * change key usage to UsageDigitalSignature Signed-off-by: Subbarao Meduri <[email protected]> * add watch logic for hubs metric collection resources Signed-off-by: Coleen Iona Quadros <[email protected]> * add reconcile check for hub collector resources Signed-off-by: Coleen Iona Quadros <[email protected]> * watch delete of hub metrics collection resources Signed-off-by: Coleen Iona Quadros <[email protected]> * watch delete of hub metrics collection resources Signed-off-by: Coleen Iona Quadros <[email protected]> * add watch for server ca when deleted Signed-off-by: Coleen Iona Quadros <[email protected]> * remove logs Signed-off-by: Coleen Iona Quadros <[email protected]> --------- Signed-off-by: clyang82 <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> Signed-off-by: Subbarao Meduri <[email protected]> Co-authored-by: clyang82 <[email protected]> Co-authored-by: Subbarao Meduri <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> * Update placementrule_controller.go Signed-off-by: Coleen Iona Quadros <[email protected]> * lint (stolostron#1341) Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * change server ca cert Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * make sure hub metrics collection resources are properly cleaned up Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * remove obs add on for local-cluster Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup when mco cr is removed Signed-off-by: Coleen Iona Quadros <[email protected]> * test cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * remove additional watch for hub endpoint Signed-off-by: Coleen Iona Quadros <[email protected]> * typo Signed-off-by: Coleen Iona Quadros <[email protected]> * delete resources refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * typo Signed-off-by: Coleen Iona Quadros <[email protected]> * support local-cluster labels in rbac-query-proxy Signed-off-by: Subbarao Meduri <[email protected]> * metrics collector replica Signed-off-by: Coleen Iona Quadros <[email protected]> * comment proxy chnages for local-cluster Signed-off-by: Coleen Iona Quadros <[email protected]> * add back proxy changes Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup ocp configs Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * remove unused var Signed-off-by: Coleen Iona Quadros <[email protected]> * fix resource requests setting for hub metrics collector Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * get resource requirements refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * test clean Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor hub metrics collector resource Signed-off-by: Coleen Iona Quadros <[email protected]> * resource refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * resource refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * remove proxy changes Signed-off-by: Coleen Iona Quadros <[email protected]> * only watch endpoint Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor hub metrics resource logic Signed-off-by: Coleen Iona Quadros <[email protected]> * clean resources Signed-off-by: Coleen Iona Quadros <[email protected]> * Do not access mco resource after delete is initiated Signed-off-by: Subbarao Meduri <[email protected]> * RHOBS-1009: Instrument metrics-collector properly and add ServiceMonitor and alerts for it (feature branch) (stolostron#1343) * [RHOBS-1009/ACM-8509]: Add ServiceMonitor for metrics-collector and instrument it properly Signed-off-by: Saswata Mukherjee <[email protected]> * fix test Signed-off-by: Saswata Mukherjee <[email protected]> * Add status code label to metrics and clean up Signed-off-by: Saswata Mukherjee <[email protected]> * Ensure ServiceMonitor is properly configured and authenticated according to https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/ Signed-off-by: Saswata Mukherjee <[email protected]> * Fix lint Signed-off-by: Saswata Mukherjee <[email protected]> * Use correct image for kube-rbac-proxy Signed-off-by: Saswata Mukherjee <[email protected]> * Correct indentation for kube-rbac-proxy secret Signed-off-by: Saswata Mukherjee <[email protected]> * Correct servicemonitor and add ns monitoring label Signed-off-by: Saswata Mukherjee <[email protected]> * Correct deployment label and test Signed-off-by: Saswata Mukherjee <[email protected]> * Skip KRP Signed-off-by: Saswata Mukherjee <[email protected]> * Final port/scheme fix Signed-off-by: Saswata Mukherjee <[email protected]> * Add local alerting rules Signed-off-by: Saswata Mukherjee <[email protected]> * Add apiextensionsv1 to scheme Signed-off-by: Saswata Mukherjee <[email protected]> * Add resource version to crd update Signed-off-by: Saswata Mukherjee <[email protected]> * Fix formatting Signed-off-by: Saswata Mukherjee <[email protected]> * Only import once Signed-off-by: Saswata Mukherjee <[email protected]> * Don't create krp resources Signed-off-by: Saswata Mukherjee <[email protected]> * Fix lint Signed-off-by: Saswata Mukherjee <[email protected]> * Fix lint Signed-off-by: Saswata Mukherjee <[email protected]> * Comment unused Signed-off-by: Saswata Mukherjee <[email protected]> --------- Signed-off-by: Saswata Mukherjee <[email protected]> * cleanup refactor Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor obsadd on delete Signed-off-by: Coleen Iona Quadros <[email protected]> * refactor obsadd on delete Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * Add roles for monitoring addon-obs ns Signed-off-by: Saswata Mukherjee <[email protected]> * Cleanup and some fixes Signed-off-by: Saswata Mukherjee <[email protected]> * ensure MCO terminates with proper cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * ensure MCO terminates with proper cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * add special case for collector Signed-off-by: Coleen Iona Quadros <[email protected]> * ensure hub metrics collection's resources are updated when spec differs Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * cleanup Signed-off-by: Coleen Iona Quadros <[email protected]> * fix e2e test failures Signed-off-by: Subbarao Meduri <[email protected]> --------- Signed-off-by: clyang82 <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> Signed-off-by: Subbarao Meduri <[email protected]> Signed-off-by: Saswata Mukherjee <[email protected]> Co-authored-by: clyang82 <[email protected]> Co-authored-by: Subbarao Meduri <[email protected]> Co-authored-by: Saswata Mukherjee <[email protected]>
Ticket - https://issues.redhat.com/browse/ACM-8509
This PR refactors the deployment of the metrics-collector in the hub/local-cluster. We want to ensure that the metrics-collector on the hub runs independent of hubselfmanagement toggle.
Since we no longer use observability add-on on the hub, we lose the ability to report failure to send metrics as via observability addon status. To compensate for this:
Known issues (to be tracked as defects)