Monitoring and alerts for Montagu and supporting services.
We should consider separating out the Montagu-specific bits.
This repo contains a Docker Compose configuration that spins up a Prometheus instance with an accompanying Alertmanager. These instances are configured by:
- `prometheus.yml` - Main config (see docs)
- `alert-rules.yml` - What conditions should trigger alerts (see docs)
- `alertmanager.yml` - Alertmanager config. This controls where alerts get posted to (see docs)
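For example, an entry in `alert-rules.yml` might look like the following (the alert name, duration, and labels here are illustrative; the `bb8` job and expression are taken from the example later in this README):

```yaml
groups:
  - name: example
    rules:
      - alert: BB8Down                  # hypothetical alert name
        expr: up{job="bb8"} == 0        # fires when the bb8 target stops responding
        for: 5m                         # must be failing for 5 minutes before alerting
        labels:
          severity: critical
        annotations:
          summary: "bb8 has been unreachable for 5 minutes"
```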
To start the monitor and external metric exporters (see below) use:
git submodule init
git submodule update
pip3 install -r requirements.txt --user
./run
To reload Prometheus and the alert manager after a config change, run
./reload
To run locally and have the alert manager notify a test Slack channel rather than creating noise in the real monitor channel, use
./run --dev
and for reloading
./reload --dev
To force alerts to fire, temporarily invert the rules in prometheus/alert-rules.yml, e.g. change a rule expression like `up{job="bb8"} == 0` to `up{job="bb8"} == 1`.
Connect as the vagrant user on bots.dide.ic.ac.uk, then
# git clone --recursive https://github.com/vimc/montagu-monitor monitor
cd ~/monitor
git pull
pip3 install --user -r requirements.txt
And then either call ./run (if there are code changes) or ./reload (to refresh the config).
Prometheus relies on the services it is monitoring serving up a text file that
exports values to monitor. By convention, these are served at
SERVICE_URL/metrics, and each line follows this syntax:
<metric name>{<label name>=<label value>, ...} <metric value>
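For example, a metrics page might contain lines like these (the metric names and values here are illustrative, not taken from any of our actual services):

```
# HELP montagu_api_requests_total Total HTTP requests served
# TYPE montagu_api_requests_total counter
montagu_api_requests_total{method="post",code="200"} 1027
process_cpu_seconds_total 12.47
```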
The intention is that we will add /metrics endpoints to our various apps, in one of these ways:
- Using existing metrics endpoints built into things like the Docker daemon (see list)
- Using existing "exporters" that sit alongside the service in a separate Docker container, like the one for Postgres (see list)
- Integrating directly into the app (using one of the client libraries)
- Writing our own exporter to sit alongside the service as a small Flask app in a separate container
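A minimal sketch of what such a custom exporter could look like, using only the Python standard library rather than Flask so it is self-contained (the metric name, labels, and collected value are illustrative, not from any real exporter of ours):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render (name, labels, value) triples in the Prometheus text format."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            label_str = ",".join('{}="{}"'.format(k, v)
                                 for k, v in sorted(labels.items()))
            lines.append("{}{{{}}} {}".format(name, label_str, value))
        else:
            lines.append("{} {}".format(name, value))
    return "\n".join(lines) + "\n"


def collect():
    # Hypothetical check of an external service; a real exporter would
    # query S3, a database, etc. here.
    return [("backup_age_seconds", {"bucket": "montagu-backup"}, 3600)]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    # Serve /metrics until interrupted; Prometheus would scrape this port.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

A real exporter in this repo would run in its own container and be added as a scrape target in prometheus.yml.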
For monitoring external services (like S3) there's no need to deploy exporters separately; instead we can run them alongside Prometheus. So far we have one: aws_metrics. When you run ./run it will also build and start the exporter.
See machine-metrics for turning on the Prometheus Node Exporter, which publishes machine metrics from a system. This makes the metrics accessible on localhost:9100. You then need to add a new job to prometheus.yml to pull the metrics; they can then be used to build alerts or graphs.
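The new job might look something like this (the job name is illustrative, and the target host should be whichever machine is running the Node Exporter):

```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["bots.dide.ic.ac.uk:9100"]  # hypothetical target host
```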