We use Prometheus to collect metrics, and can trigger alerts based on these metrics. Alerts are specified in their own Kubernetes resource called `alerts`, as we have made our own operator for them called Alerterator. This means that you can manage your alerts with `kubectl`, and that alerts are not tied to a specific app or namespace, giving you more freedom to set up the necessary alerts for one or several apps. You can even make your own personal alert profile.
Using `kubectl` you can add, update, and remove the alert resource. Adding and updating is done with `kubectl apply -f alerts.yaml`, while deleting is done either with `kubectl delete alert <alert-name>` or `kubectl delete -f alerts.yaml`.
You can list alerts in the cluster with `kubectl get alerts` (singular: `alert`, short name: `al`), and describe a specific alert resource with `kubectl describe alert <alert-name>`.
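For reference, a minimal workflow with these commands could look like this (using the example alert `nais-testapp` defined below):

```bash
# create or update the alert resource
kubectl apply -f alerts.yaml

# list all alert resources (short name: al)
kubectl get alerts

# inspect a specific alert resource
kubectl describe alert nais-testapp

# remove the alert again
kubectl delete alert nais-testapp
```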
To get started using alerts, we need a YAML file describing our rules. You can start off with our recommended alerts (made by contributions from our users!). Usually it's simplest to call this file `alerts.yaml`, which is what we will do in this example. The content of the file is shown below; note that we only notify Slack here, but you can add more receivers.
```yaml
apiVersion: "nais.io/v1"
kind: "Alert"
metadata:
  name: nais-testapp
  namespace: aura
  labels:
    team: aura
spec:
  receivers: # receivers for all alerts below
    slack:
      channel: '#nais-alerts-dev'
      prependText: '<!here> | ' # this text will be prepended to the Slack alert title
  alerts:
    - alert: Nais-testapp unavailable
      expr: 'kube_deployment_status_replicas_unavailable{deployment="nais-testapp"} > 0'
      for: 2m
      action: Read app logs (kubectl logs appname). Read Application events (kubectl describe deployment appname)
      description: The app might crash sometimes due to startup errors
      documentation: https://github.com/navikt/aura-doc/naisvakt/alerts.md#app_unavailable
      sla: respond within 1h, during office hours
      severity: danger
    - alert: CoreDNS unavailable
      description: CoreDNS unavailable, there are zero replicas
      expr: 'kube_deployment_status_replicas_available{namespace="kube-system", deployment="coredns"} == 0'
      for: 1m
      action: kubectl describe pod -l app=nais-testapp
      documentation: https://github.com/navikt/aura-doc/naisvakt/alerts.md#coredns
      sla: respond within 1h, solve within 4h, around the clock
      severity: danger
```
When you have written your alert file, you can apply it with `kubectl`:

```bash
kubectl apply -f alerts.yaml
```
If you already use nais/deploy, adding a new deploy action is rather simple. Create a file called `.github/workflows/alert-deploy.yaml`:
```yaml
name: Deploy alerts to NAIS
on:
  push:
    branches:
      - master
    paths:
      - 'alerts.yaml'
      - '.github/workflows/alert-deploy.yaml'
jobs:
  apply-alerts:
    name: Apply alerts to cluster
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v1
      - name: deploy to dev
        uses: nais/deploy/actions/deploy@v1
        env:
          APIKEY: ${{ secrets.NAIS_DEPLOY_APIKEY }}
          CLUSTER: dev-fss
          RESOURCE: /path/to/alerts.yaml
      - name: deploy to prod
        uses: nais/deploy/actions/deploy@v1
        env:
          APIKEY: ${{ secrets.NAIS_DEPLOY_APIKEY }}
          CLUSTER: prod-fss
          RESOURCE: /path/to/alerts.yaml
```
Alerterator supports three different types of notification: Slack, [E-mail](spec#spec-receivers-email), and SMS. In our spec we follow the naming convention of Alertmanager and call them `receivers`.
Take a look below to see exactly how each receiver is configured.
```yaml
spec:
  receivers:
    slack:
      channel: '#teamname-alerts'
      prependText: '<!here>'
```
Slack will always notify you when an alert is resolved.
```yaml
spec:
  receivers:
    email:
      to: '[email protected]'
      send_resolved: 'true'
```
`send_resolved` can be set to `true` if you want to receive a notification when an alert is resolved.
```yaml
spec:
  receivers:
    sms:
      recipients: '12345678,8764321'
      send_resolved: 'false'
```
`send_resolved` can be set to `true` if you want to receive a notification when an alert is resolved.
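The receivers can also be combined in a single `Alert` resource; a minimal sketch using all three (the channel, address, and phone numbers are placeholders):

```yaml
spec:
  receivers:
    slack:
      channel: '#teamname-alerts'
    email:
      to: '[email protected]'
      send_resolved: 'true'
    sms:
      recipients: '12345678,8764321'
      send_resolved: 'false'
```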
In order to minimize the feedback loop, we suggest experimenting on the Prometheus server to find the right metric for your alert and the notification threshold. The Prometheus server can be found in each cluster at `https://prometheus.{cluster.ingress}` (e.g. https://prometheus.nais.preprod.local).
You can also visit the Alertmanager at `https://alertmanager.{cluster.ingress}` (e.g. https://alertmanager.nais.preprod.local) to see which alerts are currently triggered (you can also silence already triggered alerts).
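For instance, you can paste the expression from the first example into the Prometheus expression browser and adjust the threshold or the `for` duration until the alert behaves the way you want:

```
kube_deployment_status_replicas_unavailable{deployment="nais-testapp"} > 0
```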
You can also use labels in your notification by referencing them with `{{ $labels.<field> }}`. Run your query on the Prometheus server to see which labels are available for your alert.
For example:

```
{{ $labels.node }} is marked as unschedulable
```

turns into the following when notifying:

```
b27apvl00178.preprod.local is marked as unschedulable
```

You can read more about this over at the Prometheus documentation and at the Kubernetes documentation of exposed metrics.
Using regular expressions, you can target multiple apps or namespaces with one query. This saves you from repeating the same alert for each of your apps.

```
absent(up{app=~"myapp|otherapp|thirdapp"})
```

Here we use `=~` to select labels that match the provided string (or substring) using a regular expression. Use `!~` to negate the regular expression.
When doing this, it can be smart to use labels (mentioned above) to list which namespace the alert is tied to. Use the `kubernetes_namespace` label in your `action` or `description` by adding `{{ $labels.kubernetes_namespace }}`, and it will write the namespace of the app that is having problems.
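As a minimal sketch combining the two ideas, assuming your app metrics carry the `app` and `kubernetes_namespace` labels (the app names and channel are placeholders):

```yaml
spec:
  receivers:
    slack:
      channel: '#teamname-alerts'
  alerts:
    - alert: App unavailable
      # =~ matches several apps with one regular expression
      expr: 'up{app=~"myapp|otherapp|thirdapp"} == 0'
      for: 5m
      # labels from the query can be referenced in the notification text
      description: '{{ $labels.app }} in {{ $labels.kubernetes_namespace }} is down'
      action: 'kubectl describe pod -l app={{ $labels.app }} -n {{ $labels.kubernetes_namespace }}'
      severity: danger
```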
Slack has its own syntax for notifying `@channel` or `@here`: respectively `<!channel>` and `<!here>`.
Notifying a user group, on the other hand, is a bit more complicated. The user group `@nais-vakt` is written `<!subteam^SB8KS4WAV|nais-vakt>` in a Slack alert message, where `SB8KS4WAV` is the id of the specific user group and `nais-vakt` is the name of the user group.
You can find the id by right-clicking on the name in the user group list; the last part of the URL is the id. The URL will look something like the one below:
https://nav-it.slack.com/usergroups/SB8KS4WAV
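For example, to ping this user group in the alerts from the example at the top, `prependText` could be set like this:

```yaml
spec:
  receivers:
    slack:
      channel: '#nais-alerts-dev'
      prependText: '<!subteam^SB8KS4WAV|nais-vakt> | '
```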
Slack alerts support colors. The `severity` field defines the color (`good` results in a green message, `warning` in a yellow one, and `danger` in a red one). If you want another color, specify the desired hex code as the severity instead of one of the pre-defined words.
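For example, one of the alerts from the first example could use a custom color like this (the hex value is chosen arbitrarily):

```yaml
spec:
  alerts:
    - alert: Nais-testapp unavailable
      expr: 'kube_deployment_status_replicas_unavailable{deployment="nais-testapp"} > 0'
      for: 2m
      description: The app might crash sometimes due to startup errors
      action: Read app logs (kubectl logs appname)
      # any hex code works in place of good/warning/danger
      severity: '#5b34eb'
```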