
Centralised Logging and Monitoring


1. Introduction

DiSSCo uses OpenTelemetry, an open source observability framework, to collect logs and metrics from the DiSSCo Kubernetes cluster.
Kubernetes exposes telemetry in many different ways, including logs, metrics, and events. We use the Prometheus Node Exporter and the Kube State Metrics service to capture and expose this telemetry data to the OpenTelemetry collector, which exports it to the Naturalis Observability stack. The data is then visualized in Grafana.

This document describes the observability stack used to monitor the DiSSCo Kubernetes cluster. It is designed to complement the observability documentation developed by the Naturalis Infra team.


2. Collecting Data with OpenTelemetry (OTel)

The OpenTelemetry Collector is a proxy that can receive, process, and export telemetry data. To gather metrics and logs in Kubernetes, two deployments of the OTel Collector are required (a sketch of both follows the list):

  1. A Deployment collector running on a single Pod, which gathers cluster metrics
  2. A DaemonSet collector running on every Node, which gathers Pod logs
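
With the OpenTelemetry Operator (section 3.2), both collectors can be declared as OpenTelemetryCollector custom resources. The following is a minimal sketch, not the actual DiSSCo manifests; the names and namespace are illustrative:

```yaml
# Cluster metrics collector: a single-replica Deployment.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-deployment      # illustrative name
  namespace: otel
spec:
  mode: deployment
  replicas: 1
  config: {}                 # receivers, processors, and exporters (sections 2.1-2.3)
---
# Log collector: one Pod per Node, so every Node's container logs can be read.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-daemonset       # illustrative name
  namespace: otel
spec:
  mode: daemonset
  config: {}                 # log receivers plus the log exporter
```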

2.1 Receivers

Receivers are collector components that collect telemetry data. While OpenTelemetry offers some out-of-the-box Kubernetes configurations, they do not export metrics that Grafana can use effectively. Instead of the default OTel Kubernetes receivers, we use a Prometheus receiver to collect data from the node-exporter and kube-state-metrics deployments. Prometheus is an open-source monitoring and alerting toolkit.
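
As an indication only, a Prometheus receiver scraping the two deployments from section 2.1.a could look roughly like this; the job names, service addresses, and ports are assumptions based on the default Helm release names:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Node-level metrics (CPU, memory, disk, network) from the node exporter.
        - job_name: node-exporter
          scrape_interval: 30s
          static_configs:
            - targets: ["nodeexporter-prometheus-node-exporter.default.svc:9100"]
        # Cluster object state (Deployments, Pods, ...) from kube-state-metrics.
        - job_name: kube-state-metrics
          scrape_interval: 30s
          static_configs:
            - targets: ["ksm-kube-state-metrics.otel.svc:8080"]
```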

Daemonset Receivers

Deployment Receivers

2.1.a Deployment Collector Prerequisites

The Kubernetes events receiver and the Prometheus receiver rely on dedicated deployments to expose the metrics they need. These can be deployed on a Kubernetes cluster via their respective Helm charts:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics -n "otel"  
helm install nodeexporter prometheus-community/prometheus-node-exporter -n "default"

For more information on Helm charts, see section 3.2.

2.2. Processors and Extensions

Processors are collector components that process telemetry data in intermediate steps between receiving and exporting. Extensions add functionality that is not directly tied to the telemetry data, such as health checks and authentication.

Extensions

  • health_check -> exposes an endpoint that reports whether the collector is up
  • oauth2client -> authenticates the collector with the Naturalis Observability stack; this requires an injected secret

Processors

  • batch {} -> batches telemetry before export; including it is best practice
  • memory_limiter -> prevents out-of-memory situations on the collector
  • transform/host -> adds metadata to resources, which helps locate the right telemetry data downstream
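
Taken together, these extensions and processors might be configured roughly as follows; the client credentials, token URL, and memory limits are placeholders rather than the actual DiSSCo values:

```yaml
extensions:
  health_check: {}
  oauth2client:
    client_id: ${env:OAUTH_CLIENT_ID}          # injected secret, see section 3.1
    client_secret: ${env:OAUTH_CLIENT_SECRET}  # injected secret, see section 3.1
    token_url: https://observability.example.org/oauth2/token   # placeholder URL

processors:
  batch: {}
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80           # illustrative limits
    spike_limit_percentage: 25
  transform/host: {}               # attribute statements are shown in section 2.2.a

service:
  extensions: [health_check, oauth2client]
```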

2.2.a Annotating Data Appropriately

In Grafana, all Naturalis data is aggregated in one place. In order to identify the telemetry data of your service, you need to configure the service namespace correctly. The following attributes are set using the transform/host processor:

| Attribute Name | Value |
| --- | --- |
| service_namespace | naturalis.bii.dissco |
| deployment_environment | test / acceptance / production |
| service_name | dissco-deploy / dissco-node |
| service_owner | [email protected] |
| service_team | dissco |
| cluster | test-cluster / acc-cluster / prod-cluster |
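
A possible shape for the transform/host processor that applies these attributes to every resource is sketched below for metrics on the test cluster. The attribute keys are assumed to match the label names shown above, and the owner address is left as a placeholder because it is redacted in the table:

```yaml
processors:
  transform/host:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service_namespace"], "naturalis.bii.dissco")
          - set(attributes["deployment_environment"], "test")
          - set(attributes["service_name"], "dissco-deploy")
          - set(attributes["service_owner"], "<owner email>")   # redacted in the table above
          - set(attributes["service_team"], "dissco")
          - set(attributes["cluster"], "test-cluster")
```

The same statements can be repeated under log_statements so that the log pipeline carries the same metadata.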

2.3 Exporters

Exporters send telemetry data to a given destination. In our case, we want to send data to the Naturalis observability gateways: one for metrics and one for logs. Once data is sent, it is pushed through the Naturalis Observability Stack and made available in Grafana.
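
Assuming the gateways accept OTLP over HTTP, the exporters and pipelines could be wired up roughly as follows; the endpoint URLs are placeholders, and the filelog receiver stands in for whatever log receiver the DaemonSet collector actually uses:

```yaml
exporters:
  otlphttp/metrics:
    endpoint: https://metrics-gateway.example.org    # placeholder endpoint
    auth:
      authenticator: oauth2client
  otlphttp/logs:
    endpoint: https://logs-gateway.example.org       # placeholder endpoint
    auth:
      authenticator: oauth2client

service:
  pipelines:
    metrics:                        # deployment collector
      receivers: [prometheus]
      processors: [memory_limiter, transform/host, batch]
      exporters: [otlphttp/metrics]
    logs:                           # daemonset collector
      receivers: [filelog]
      processors: [memory_limiter, transform/host, batch]
      exporters: [otlphttp/logs]
```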

3. Deploying Components

3.1 Configuring OpenTelemetry Collectors on Kubernetes

The OTel collectors need specific permissions. A SecretProviderClass needs to be configured with the exporter secret, and a Kubernetes ServiceAccount needs permission to get the secrets. In a separate Role, the ServiceAccount needs permission to get, list, and watch various resources so it can read their endpoints. These Roles then need a RoleBinding to the ServiceAccount.

Both the daemonset and deployment collectors need the secret store volumes mounted in their configuration. The daemonset collector will also need the log volumes mounted.

See the DiSSCo OTel Kubernetes configuration for the specific configurations and role permissions needed.
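
For illustration only, the Role and RoleBinding for the collector's ServiceAccount might take roughly this shape; the resource list and names are indicative, and the authoritative version is in the DiSSCo OTel Kubernetes configuration referenced above:

```yaml
# Read-only access to the resources the collector needs to discover and scrape.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-collector-read         # illustrative name
  namespace: otel
rules:
  - apiGroups: [""]
    resources: [pods, services, endpoints]   # indicative list
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-collector-read
  namespace: otel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-collector-read
subjects:
  - kind: ServiceAccount
    name: otel-collector            # illustrative ServiceAccount name
    namespace: otel
```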

3.2 Helm and ArgoCD

Operators are software extensions to Kubernetes that use custom resources (such as OpenTelemetry Collectors) to manage applications. We use the OpenTelemetry Operator to manage our OpenTelemetry Collectors (the Deployment and the DaemonSet).

The Operator is deployed as a Helm chart. Helm is a tool for managing Kubernetes resources.

In DiSSCo, we manage our Helm charts using ArgoCD, a Continuous Delivery tool for Kubernetes. ArgoCD configurations are deployed as Applications on Kubernetes.

To summarize: OTel Collectors -> managed by -> OTel Operator -> deployed as -> Helm chart -> managed by -> ArgoCD
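
As an example of the last link in that chain, an ArgoCD Application that deploys the OpenTelemetry Operator Helm chart could be sketched like this; the destination namespace, project, and chart version are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opentelemetry-operator
  namespace: argocd
spec:
  project: default
  source:
    # Public OpenTelemetry Helm chart repository.
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    chart: opentelemetry-operator
    targetRevision: 0.x.x          # pin a real chart version here
  destination:
    server: https://kubernetes.default.svc
    namespace: otel
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```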

Additional ArgoCD applications manage the Prometheus node exporter and the kube-state-metrics deployments described in section 2.1.a.

4. Visualizing Data With Grafana

Telemetry data across Naturalis is aggregated in Grafana, a data visualization platform.

4.1 Visualizing Metrics Data

Metrics are stored in Mimir, a storage solution for Prometheus metrics, and are available in Grafana as a data source. Mimir is queried using PromQL, the Prometheus query language.

To get metrics received in a given namespace, you can use the following query:

group by(__name__) ({__name__!="", service_namespace="${service_namespace}"})

There are a few Prometheus data sources in Grafana, but DiSSCo data goes directly to the main Mimir data source.

There are many Grafana dashboard templates publicly available. We use dashboards from this repository.

4.2 Visualizing Log Data

Logs are stored in ClickHouse, a column-oriented SQL database for analytics. DiSSCo data is stored in the main ClickHouse data source in Grafana.
