Observability stack #11

Open
wants to merge 2 commits into
base: main
28 changes: 28 additions & 0 deletions observability/README.md
@@ -0,0 +1,28 @@

# Observability Stack

This directory contains an observability implementation based on Grafana tooling.

## Caveats
1) Reliance on the ref-implementation for SSO
- To work around this, remove the `auth.generic_oauth` section from `prometheus.yaml` and delete the `grafana-config.yaml` and `grafana-external-secret.yaml` files.
2) Use of `tls_skip_verify_insecure` for OAuth
- This is due to the use of the ingress certificate. Once this is addressed, the setting can be removed.
3) Larger memory requirement for the kind cluster
- Because a more robust Loki deployment is used, the memory limits have been increased. 16 GB seems to work while leaving ample room in the cluster.

## Components
The observability stack is built upon:
- Prometheus - metrics
- Loki - logging
- Promtail - log delivery

I'm interested in why Promtail is used rather than something like Fluent Bit or the OpenTelemetry Collector.

Contributor Author

Promtail is really easy to use if only Loki is being used. That being said, I may look at switching to fluent bit as I've used it in another project recently.

Is there a compelling reason to move from promtail to either? Grafana Agent could be another option as well.

Fluent Bit is more popular among EKS end users and is also supported when using Fargate.

OpenTelemetry for logs is very new, and not many end users have adopted it.

Contributor Author

It should be a pretty simple swap to Fluent Bit. Let me take a look at it.

- Opencost - cost accounting
- Grafana - visualization
- Alertmanager - alerting

## Installation
Note: The stack is configured to use Keycloak for SSO; therefore, the ref-implementation is required for this to work.

`idpbuilder create --use-path-routing --package-dir ./ref-implementation --package-dir ./observability`

A `grafana-config` job will be deployed into the `keycloak` namespace to create or patch some of the Keycloak components. If deployed at the same time as the `ref-implementation`, this job will fail until the `config` job succeeds. This is normal.
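
As a rough sketch (not part of this PR), one way to wait for the job instead of watching it fail and retry is `kubectl wait`. The job name and namespace come from `grafana-config.yaml`; the function name `wait_for_grafana_config` and the default timeout of `600s` are assumptions chosen for illustration, and `kubectl` is assumed to point at the cluster created by `idpbuilder`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Block until the grafana-config Job in the keycloak namespace reports
# the "complete" condition, or until the timeout expires.
# The function name and the default timeout are illustrative assumptions.
wait_for_grafana_config() {
  local timeout="${1:-600s}"
  kubectl wait --for=condition=complete job/grafana-config \
    --namespace keycloak --timeout="${timeout}"
}

# Usage (requires a running cluster):
#   wait_for_grafana_config 900s
```

If the wait times out, `kubectl get pods -n keycloak` and the job's pod logs are the usual places to look next.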
50 changes: 50 additions & 0 deletions observability/loki.yaml
@@ -0,0 +1,50 @@
apiVersion: argoproj.io/v1alpha1
kind: Application

I wonder if we could move these to ApplicationSets, to make it easier for folks to adopt them in production, now that idpbuilder supports ApplicationSets?

Contributor Author

That is definitely something Manabu and I spoke about. I was waiting for the example to see how this should conform to it.

I don't know if this was already discussed, but I would question Loki too. Loki is AGPL; I would prefer OpenSearch, which is Apache 2.0.

A typical fully open source stack that I see for logs and traces is Fluent Bit + OpenSearch.

For metrics I see an OpenTelemetry Collector DaemonSet + Prometheus.

Contributor Author

@csantanapr thanks for raising this.

I am not a lawyer, so please correct me if I'm wrong: my understanding of the AGPL is that we can't modify the application's source without the copyleft terms applying. I believe many have built observability stacks on top of the Grafana stack, whose core components (Loki, Grafana, Tempo, Mimir) are AGPL. As we are not modifying the source, we should be OK to use it. Again, please please please correct me if I'm wrong.

This is a valid concern and I believe the flexibility of working in stacks allows us to create another implementation that relies on other tooling. What I do believe is that we need to come to an agreement on what our opinionated stack is. If this is not it, I'm ok with that, but let's discuss this during the next community meeting so we can figure out how we want to proceed.

Contributor

We discussed this, and yes: since Grafana is already AGPL, choosing Loki as another AGPL project is less of a concern. That said, I agree with the discussion above that we should also think about using OpenSearch, given its popularity. Publishing it as an alternative observability stack sounds good.

I will say we do use the OTel Collector as a DaemonSet for logs and Prometheus metrics, and eventually for traces.

I do understand that the community is much more invested in Fluent Bit for logging, so a standard of OpenSearch + Fluent Bit, with an OpenTelemetry Collector DaemonSet for Prometheus metrics and OTel traces, seems like a good pattern to me.

Contributor Author

@omrishiv omrishiv Jun 18, 2024

If it makes sense, I suggest:

  1. Swapping Promtail for Fluent Bit and keeping the rest of this stack the same. This gives us a Grafana-based stack to work with (though there's an argument to be made for swapping Promtail for Grafana Agent so we are closer to the LGTM implementation).
  2. Creating another stack based on OpenSearch + Fluent Bit, with OTel/Prometheus, as an alternative. This gives us the opportunity to test how we handle substitutable stacks.

We can have both live under /observability and have /observability/grafana-stack and /observability/otel-stack

I'm ok working on all of this and would be happy to put the otel stack together as well. Does that make sense to do?

@omrishiv
I agree with having separate stacks like /observability/grafana-stack and /observability/otel-stack.

@csantanapr @blakeromano What are your thoughts on this?

metadata:
  name: loki
  namespace: argocd
  labels:
    env: dev
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    - repoURL: 'https://grafana.github.io/helm-charts'
      targetRevision: 6.6.3
      helm:
        releaseName: loki
        values: |
          deploymentMode: SingleBinary
          loki:
            commonConfig:
              replication_factor: 1
            storage:
              type: 'filesystem'
            schemaConfig:
              configs:
                - from: "2024-01-01"
                  store: tsdb
                  index:
                    prefix: loki_index_
                    period: 24h
                  object_store: filesystem # we're storing on filesystem so there's no real persistence here.
                  schema: v13
          singleBinary:
            replicas: 1
          read:
            replicas: 0
          backend:
            replicas: 0
          write:
            replicas: 0
      chart: loki
  destination:
    server: "https://kubernetes.default.svc"
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    automated:
      selfHeal: true
33 changes: 33 additions & 0 deletions observability/opencost.yaml
@@ -0,0 +1,33 @@
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opencost
  namespace: argocd
  labels:
    env: dev
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    - repoURL: 'https://opencost.github.io/opencost-helm-chart'
      targetRevision: 1.38.1
      helm:
        releaseName: opencost
        values: |
          opencost:
            prometheus:
              internal:
                serviceName: prometheus-kube-prometheus-prometheus
                namespaceName: monitoring
                port: 9090
      chart: opencost
  destination:
    server: "https://kubernetes.default.svc"
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    automated:
      selfHeal: true
59 changes: 59 additions & 0 deletions observability/prometheus.yaml
@@ -0,0 +1,59 @@
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
  labels:
    env: dev
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    - repoURL: 'https://prometheus-community.github.io/helm-charts'
      targetRevision: 57.2.0
      helm:
        releaseName: prometheus
        values: |
          grafana:
            envFromSecret: grafana-oidc
            additionalDataSources:
              - name: loki
                access: proxy
                orgId: 1
                type: loki
                url: http://loki-gateway
                jsonData:
                  httpHeaderName1: X-Scope-OrgID
                secureJsonData:
                  httpHeaderValue1: '1'
            grafana.ini:
              server:
                root_url: https://cnoe.localtest.me:8443/grafana
                serve_from_sub_path: true
              auth.generic_oauth:
                enabled: true
                name: grafana
                allow_sign_up: true
                auth_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/auth
                token_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/token
                api_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/userinfo
                scopes: openid email profile offline_access roles
                role_attribute_path: contains(resource_access.grafana.roles[*], 'admin') && 'GrafanaAdmin' || contains(resource_access.grafana.roles[*], 'admin') && 'Admin' || contains(resource_access.grafana.roles[*], 'editor') && 'Editor' || 'Viewer'
                allow_assign_grafana_admin: true
                role_attribute_strict: true
                auto_login: true
                tls_skip_verify_insecure: true
      chart: kube-prometheus-stack
    - repoURL: cnoe://prometheus
      targetRevision: HEAD
      path: "manifests"
  destination:
    server: "https://kubernetes.default.svc"
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    automated:
      selfHeal: true
200 changes: 200 additions & 0 deletions observability/prometheus/manifests/grafana-config.yaml
@@ -0,0 +1,200 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config-job
  namespace: keycloak
data:
  client-role-admin-payload.json: |
    {"name": "admin"}
  client-role-editor-payload.json: |
    {"name": "editor"}
  client-role-viewer-payload.json: |
    {"name": "viewer"}
  admin-role-assignment-payload.json: |
    [
      {
        "id": "$ADMIN_ROLE_ID",
        "name": "admin"
      }
    ]
  roles-mapper-payload.json: |
    {
      "id":"$CLIENT_ROLES_MAPPER_ID",
      "name": "client roles",
      "protocol":"openid-connect",
      "protocolMapper":"oidc-usermodel-client-role-mapper",
      "config": {
        "access.token.claim":"true",
        "claim.name":"resource_access.${client_id}.roles",
        "jsonType.label":"String",
        "multivalued":"true",
        "id.token.claim": "true",
        "userinfo.token.claim": "true"
      }
    }
  grafana-client-payload.json: |
    {
      "protocol": "openid-connect",
      "clientId": "grafana",
      "name": "Grafana Client",
      "description": "Used for Grafana SSO",
      "publicClient": false,
      "authorizationServicesEnabled": false,
      "serviceAccountsEnabled": false,
      "implicitFlowEnabled": false,
      "directAccessGrantsEnabled": true,
      "standardFlowEnabled": true,
      "frontchannelLogout": true,
      "attributes": {
        "saml_idp_initiated_sso_url_name": "",
        "oauth2.device.authorization.grant.enabled": false,
        "oidc.ciba.grant.enabled": false
      },
      "alwaysDisplayInConsole": false,
      "rootUrl": "",
      "baseUrl": "",
      "redirectUris": [
        "https://cnoe.localtest.me:8443/grafana/login/generic_oauth"
      ],
      "webOrigins": [
        "/*"
      ]
    }

---
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-config
  namespace: keycloak
spec:
  template:
    metadata:
      generateName: grafana-config
    spec:
      serviceAccountName: keycloak-config
      restartPolicy: Never
      volumes:
        - name: keycloak-config
          secret:
            secretName: keycloak-config
        - name: config-payloads
          configMap:
            name: grafana-config-job
      containers:
        - name: kubectl
          image: docker.io/library/ubuntu:22.04
          volumeMounts:
            - name: keycloak-config
              readOnly: true
              mountPath: "/var/secrets/"
            - name: config-payloads
              readOnly: true
              mountPath: "/var/config/"
          command: ["/bin/bash", "-c"]
          args:
            - |
              #! /bin/bash
              set -ex -o pipefail
              apt -qq update && apt -qq install curl jq gettext-base -y

              curl -sS -LO "https://dl.k8s.io/release/v1.28.3/bin/linux/amd64/kubectl"
              chmod +x kubectl

              # Fail (so the Job retries) until the keycloak-clients secret exists
              echo "checking if we're ready to start"
              set +e
              ./kubectl get secret -n keycloak keycloak-clients &> /dev/null
              if [ $? -ne 0 ]; then
                exit 1
              fi
              set -e

              ADMIN_PASSWORD=$(cat /var/secrets/KEYCLOAK_ADMIN_PASSWORD)

              KEYCLOAK_URL=http://keycloak.keycloak.svc.cluster.local:8080/keycloak

              KEYCLOAK_TOKEN=$(curl -sS --fail-with-body -X POST -H "Content-Type: application/x-www-form-urlencoded" \
                --data-urlencode "username=cnoe-admin" \
                --data-urlencode "password=${ADMIN_PASSWORD}" \
                --data-urlencode "grant_type=password" \
                --data-urlencode "client_id=admin-cli" \
                ${KEYCLOAK_URL}/realms/master/protocol/openid-connect/token | jq -e -r '.access_token')

              # Skip the rest (exit 0) if the cnoe realm cannot be read
              set +e

              curl --fail-with-body -H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe" &> /dev/null
              if [ $? -ne 0 ]; then
                exit 0
              fi
              set -e

              echo "creating Grafana client"
              curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X POST --data @/var/config/grafana-client-payload.json \
                ${KEYCLOAK_URL}/admin/realms/cnoe/clients

              CLIENT_ID=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/clients | jq -e -r '.[] | select(.clientId == "grafana") | .id')

              CLIENT_SCOPE_GROUPS_ID=$(curl -sS -H "Content-Type: application/json" -H "Authorization: bearer ${KEYCLOAK_TOKEN}" -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes | jq -e -r '.[] | select(.name == "groups") | .id')
              curl -sS -H "Content-Type: application/json" -H "Authorization: bearer ${KEYCLOAK_TOKEN}" -X PUT ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/default-client-scopes/${CLIENT_SCOPE_GROUPS_ID}

              GRAFANA_CLIENT_SECRET=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID} | jq -e -r '.secret')

              # Add Grafana roles to client
              curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X POST --data @/var/config/client-role-admin-payload.json \
                ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles

              curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X POST --data @/var/config/client-role-editor-payload.json \
                ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles

              curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X POST --data @/var/config/client-role-viewer-payload.json \
                ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles

              export ADMIN_ROLE_ID=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles/admin" | jq -r '.id')

              # Assign admin role to user1
              USER1_USERID=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe/users?lastName=one" | jq -r '.[0].id')

              envsubst < /var/config/admin-role-assignment-payload.json | curl -k -sS -H 'Content-Type: application/json' \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X POST --data @- \
                ${KEYCLOAK_URL}/admin/realms/cnoe/users/${USER1_USERID}/role-mappings/clients/${CLIENT_ID}

              # Add role to token
              CLIENT_SCOPE_ROLES_ID=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes | jq -e -r '.[] | select(.name == "roles") | .id')

              export CLIENT_ROLES_MAPPER_ID=$(curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes/${CLIENT_SCOPE_ROLES_ID}/protocol-mappers/models | jq -e -r '.[] | select(.name == "client roles") | .id')

              cat /var/config/roles-mapper-payload.json | envsubst '$CLIENT_ROLES_MAPPER_ID' | curl -sS -H "Content-Type: application/json" \
                -H "Authorization: bearer ${KEYCLOAK_TOKEN}" \
                -X PUT --data @- \
                ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes/${CLIENT_SCOPE_ROLES_ID}/protocol-mappers/models/${CLIENT_ROLES_MAPPER_ID}

              ./kubectl patch secret -n keycloak keycloak-clients --type=json \
                -p='[{
                  "op" : "add" ,
                  "path" : "/data/GRAFANA_CLIENT_SECRET" ,
                  "value" : "'$(echo -n "$GRAFANA_CLIENT_SECRET" | base64 -w 0)'"
                },{
                  "op" : "add" ,
                  "path" : "/data/GRAFANA_CLIENT_ID" ,
                  "value" : "'$(echo -n "grafana" | base64 -w 0)'"
                }]'
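
The final step of the job works because values under a Secret's `data` field must be base64-encoded, so the script encodes each value before placing it in the JSON Patch. As a minimal sketch of that step in isolation (not the code in this PR), the helpers below build the same two-operation patch. The function names `b64` and `build_patch` are hypothetical, and `base64 -w 0` assumes GNU coreutils, as in the Ubuntu image used by the job.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Encode a value the way Secret .data expects: base64 with no line
# wrapping and no trailing newline in the input.
b64() {
  printf '%s' "$1" | base64 -w 0
}

# Print a JSON Patch that adds the Grafana client ID and secret to a
# Secret. Base64 output contains only characters that are safe inside a
# JSON string, so no extra escaping is needed.
build_patch() {
  local client_id="$1" client_secret="$2"
  printf '[{"op":"add","path":"/data/GRAFANA_CLIENT_SECRET","value":"%s"},{"op":"add","path":"/data/GRAFANA_CLIENT_ID","value":"%s"}]\n' \
    "$(b64 "${client_secret}")" "$(b64 "${client_id}")"
}

# Usage (requires cluster access):
#   kubectl patch secret -n keycloak keycloak-clients --type=json \
#     -p "$(build_patch grafana "${GRAFANA_CLIENT_SECRET}")"
```

Keeping the encoding in one helper avoids the nested quoting in the inline version and makes the patch easy to inspect before it is applied.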
20 changes: 20 additions & 0 deletions observability/prometheus/manifests/grafana-external-secret.yaml
@@ -0,0 +1,20 @@
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: keycloak-oidc
  namespace: monitoring
spec:
  secretStoreRef:
    name: keycloak
    kind: ClusterSecretStore
  target:
    name: grafana-oidc
  data:
    - secretKey: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
      remoteRef:
        key: keycloak-clients
        property: GRAFANA_CLIENT_ID
    - secretKey: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
      remoteRef:
        key: keycloak-clients
        property: GRAFANA_CLIENT_SECRET
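
A quick way to check that this ExternalSecret has synced is to read a key back from the target Secret. The sketch below is an illustration only: the Secret name `grafana-oidc` and the namespace `monitoring` come from the manifest above, while the function name `read_grafana_oidc_key` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the decoded value of one key from the grafana-oidc Secret in the
# monitoring namespace. Fails if the Secret has not been created yet.
read_grafana_oidc_key() {
  local key="$1"
  kubectl get secret grafana-oidc --namespace monitoring \
    -o "jsonpath={.data.${key}}" | base64 -d
}

# Usage (requires cluster access):
#   read_grafana_oidc_key GF_AUTH_GENERIC_OAUTH_CLIENT_ID
```

Once the `grafana-config` job has patched `keycloak-clients`, the client ID key should decode to `grafana`.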