Centralised Logging and Monitoring
DiSSCo uses OpenTelemetry, an open source observability framework, to collect logs and metrics from the DiSSCo Kubernetes cluster.
Kubernetes exposes telemetry in many different ways, including logs, metrics, and events. We use the Prometheus Node Exporter and the Kube State Metrics service to capture and expose this telemetry data to the OpenTelemetry collector, which exports it to the Naturalis Observability stack. The data is then visualized in Grafana.
This document describes the observability stack used to monitor the DiSSCo Kubernetes cluster. It is designed to complement the observability documentation developed by the Naturalis Infra team.
This document references the following:
- Observability at Naturalis
- DiSSCo OTel Kubernetes configuration
- Kubernetes Events Receiver
- Sending Kubernetes Metrics to Grafana
- Prometheus Receiver GitHub
- Filelog Receiver GitHub
- Debug Exporter
The OpenTelemetry Collector is a proxy that can receive, process, and export telemetry data. To gather metrics and logs in Kubernetes, two OTel Collector workloads are required:
- A Deployment collector, running as a single Pod, to collect cluster-level metrics
- A DaemonSet collector, running on every Node, to gather Pod logs
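As an illustration, both collectors can be declared through the OpenTelemetry Operator's OpenTelemetryCollector custom resource (see the Operator section later in this document). The sketch below uses placeholder names, a trimmed-down config, and omits the volume mounts and RBAC covered further down; the real manifests live in the DiSSCo OTel Kubernetes configuration.

```yaml
# Illustrative sketch only: names, namespace, and the minimal configs are placeholders.
apiVersion: opentelemetry.io/v1alpha1   # newer operator versions also serve v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-deployment                 # single Pod: cluster metrics and events
  namespace: otel
spec:
  mode: deployment
  config: |
    receivers:
      k8s_events: {}                    # real receivers are sketched below
    exporters:
      debug: {}                         # console output while testing
    service:
      pipelines:
        logs:
          receivers: [k8s_events]
          exporters: [debug]
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-daemonset                  # one Pod per Node: gathers Pod logs
  namespace: otel
spec:
  mode: daemonset
  config: |
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
    exporters:
      debug: {}
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [debug]
```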
Receivers are collector components that collect telemetry data. While OpenTelemetry offers some out-of-the-box Kubernetes configurations, they do not export metrics that Grafana can use effectively. Instead of the default OTel Kubernetes receivers, we use a Prometheus receiver to collect data from the node-exporter and kube-state-metrics deployments. Prometheus is an open-source monitoring and alerting toolkit.
DaemonSet Receivers
- Filelog
  - See the Grafana tutorial to configure the receiver to capture the appropriate logs
  - See GitHub for an overview of the receiver
- (Optional) Debug exporter to output results to the console
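For reference, a minimal filelog receiver configuration for the DaemonSet collector might look like the sketch below; the paths follow common defaults from the Grafana tutorial and are not necessarily the exact DiSSCo values.

```yaml
receivers:
  filelog:
    # Container logs on the Node's filesystem; requires the log volumes to be
    # mounted into the DaemonSet Pods (see the volume mounts section below).
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Avoid ingesting the collector's own logs.
      - /var/log/pods/otel_*/*/*.log
    start_at: end
    include_file_path: true
```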
Deployment Receivers
- Kubernetes Events Receiver
- Prometheus Receiver
  - See the Grafana tutorial to configure the receiver to capture the appropriate metrics
  - See GitHub for an overview of the receiver
- (Optional) Debug exporter to output results to the console
The Prometheus receiver relies on dedicated deployments, the Prometheus Node Exporter and kube-state-metrics, that expose the metrics it needs; the Kubernetes Events receiver reads events directly from the API server. The metrics deployments can be installed on a Kubernetes cluster via their respective Helm charts:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics -n "otel"
helm install nodeexporter prometheus-community/prometheus-node-exporter -n "default"
For more information on Helm charts, see section 3.2.
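Assuming the release names used above and the charts' default Service naming, the Deployment collector's receivers might be configured as in the sketch below; the target addresses, ports, and scrape interval are illustrative rather than the exact DiSSCo values.

```yaml
receivers:
  # Collects Kubernetes Events from the API server (all namespaces by default).
  k8s_events: {}
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          scrape_interval: 30s
          static_configs:
            # Default Service name for a release called "ksm" in the "otel" namespace.
            - targets: ["ksm-kube-state-metrics.otel.svc.cluster.local:8080"]
        - job_name: node-exporter
          scrape_interval: 30s
          # node-exporter runs as a DaemonSet; in practice a kubernetes_sd_configs
          # block with the endpoints role is often used so every Node is scraped.
          static_configs:
            - targets: ["nodeexporter-prometheus-node-exporter.default.svc.cluster.local:9100"]
```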
Processors are collector components that process data in intermediate steps between receivers and exporters. Extensions add functionality outside the data pipeline, such as health checks and authentication.
Extensions
- health_check
- oauth2client -> authenticates the collector with the Naturalis Observability stack; requires an injected secret
Processors
- batch {} -> including it is best practice
- memory_limiter -> prevents out-of-memory situations on the collector
- transform/host -> adds metadata to resources; useful for locating the appropriate telemetry data downstream
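Put together, the extensions and processors blocks might look like the sketch below; the OAuth client details and memory limits are placeholders, with the real values coming from the injected secret and the DiSSCo configuration.

```yaml
extensions:
  health_check: {}
  oauth2client:
    # Placeholder values; the real credentials are injected from the secret store.
    client_id: ${env:OTEL_OAUTH_CLIENT_ID}
    client_secret: ${env:OTEL_OAUTH_CLIENT_SECRET}
    token_url: https://auth.example.org/oauth2/token

processors:
  batch: {}
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  # transform/host is sketched after the attribute table below

service:
  extensions: [health_check, oauth2client]
```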
In Grafana, all Naturalis data is aggregated in one place. In order to identify the telemetry data of your service, you need to configure the service namespace correctly. The following attributes are set using the transform/host processor:
Attribute Name | Value |
---|---|
service_namespace | naturalis.bii.dissco |
deployment_environment | test / acceptance / production |
service_name | dissco-deploy / dissco-node |
service_owner | [email protected] |
service_team | dissco |
cluster | test-cluster / acc-cluster / prod-cluster |
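A sketch of the transform/host processor for a test environment is shown below. Note that the collector sets dotted resource attributes (e.g. service.namespace); once stored as Prometheus labels they typically appear with underscores (service_namespace), as listed in the table above. The owner and team attributes are set in the same way and are omitted here for brevity.

```yaml
processors:
  transform/host:
    # Shown for the Deployment collector; the DaemonSet collector sets
    # service.name to "dissco-node" and uses log_statements for the filelog data.
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.namespace"], "naturalis.bii.dissco")
          - set(attributes["deployment.environment"], "test")
          - set(attributes["service.name"], "dissco-deploy")
          - set(attributes["cluster"], "test-cluster")
```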
Exporters send telemetry data to a given destination. In our case, we want to send data to the Naturalis observability gateways. There are two gateways: one for metrics, one for logs. Once data is sent, it is pushed through the Naturalis Observability Stack and made available through Grafana.
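For illustration, the exporter block and pipelines might look like the sketch below; the gateway URLs are placeholders, and the exact endpoints and protocol used for each gateway are defined in the Naturalis observability documentation.

```yaml
exporters:
  otlphttp/metrics:
    endpoint: https://metrics-gateway.example.org   # placeholder gateway URL
    auth:
      authenticator: oauth2client
  otlphttp/logs:
    endpoint: https://logs-gateway.example.org      # placeholder gateway URL
    auth:
      authenticator: oauth2client

service:
  extensions: [health_check, oauth2client]
  pipelines:
    metrics:                                        # Deployment collector
      receivers: [prometheus]
      processors: [memory_limiter, transform/host, batch]
      exporters: [otlphttp/metrics]
    logs:                                           # k8s_events here; filelog on the DaemonSet
      receivers: [k8s_events]
      processors: [memory_limiter, transform/host, batch]
      exporters: [otlphttp/logs]
```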
The OTel collectors need specific permissions. A SecretProviderClass needs to be configured with the exporter secret, and a Kubernetes ServiceAccount needs permission to get that secret. In a separate Role, the ServiceAccount needs permission to get, list, and watch various resources so the collectors can read their endpoints. These Roles then need RoleBindings to the ServiceAccount.
Both the DaemonSet and Deployment collectors need the secret store volumes mounted in their configuration. The DaemonSet collector also needs the log volumes mounted.
See the DiSSCo OTel Kubernetes configuration for the specific configurations and role permissions needed.
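As an illustration, the ServiceAccount and RBAC objects could be sketched as below; the names are placeholders, the exact resources and verbs depend on the receivers in use, and the secret-access and resource-read rules are split into separate Roles in the actual DiSSCo configuration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector            # placeholder name
  namespace: otel
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-collector-read
  namespace: otel
rules:
  # Read the exporter secret created from the SecretProviderClass.
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
  # Read resources and their endpoints for the receivers; cluster-scoped
  # resources (e.g. nodes) need a ClusterRole instead of a namespaced Role.
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-collector-read
  namespace: otel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-collector-read
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: otel
```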
Operators are software extensions to Kubernetes that use custom resources (such as OpenTelemetry Collectors) to manage applications. We use the OpenTelemetry Operator to manage our OpenTelemetry Collectors (the Deployment and the DaemonSet).
The Operator is deployed as a Helm chart. Helm is a tool for managing Kubernetes resources.
In DiSSCo, we manage our Helm charts using ArgoCD, a Continuous Delivery tool for Kubernetes. ArgoCD configurations are deployed as Applications on Kubernetes.
To summarize: OTel Collectors -> managed by -> OTel Operator -> deployed as -> Helm chart -> managed by -> ArgoCD
Additional ArgoCD applications manage the Prometheus Node Exporter and kube-state-metrics deployments described in section 2.2.a.
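For example, an ArgoCD Application installing the OpenTelemetry Operator Helm chart could look like the sketch below; the project, destination namespace, and chart version are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opentelemetry-operator
  namespace: argocd
spec:
  project: default                     # placeholder ArgoCD project
  source:
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    chart: opentelemetry-operator
    targetRevision: "*"                # pin a specific chart version in practice
    helm:
      values: |
        # chart value overrides go here
  destination:
    server: https://kubernetes.default.svc
    namespace: otel
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```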
Telemetry data across Naturalis is aggregated in Grafana, a data visualization platform.
Metrics are stored in Mimir, a storage solution for Prometheus metrics, and are available in Grafana as a data source. Mimir is queried using PromQL, the Prometheus query language.
To get metrics received in a given namespace, you can use the following query:
group by(__name__) ({__name__!="", service_namespace="${service_namespace}"})
There are a few Prometheus data sources in Grafana, but DiSSCo data goes directly to the main Mimir data source.
There are many Grafana dashboard templates publicly available. We use dashboards from this repository.
Logs are stored in ClickHouse, a column-oriented SQL database for analytics. DiSSCo logs are available through the main ClickHouse data source in Grafana.