A monitoring solution for hosting a graph node on a single Docker host with Prometheus, Grafana, cAdvisor, NodeExporter and alerting with AlertManager.
The monitoring configuration is adapted from the template published by the Graph team in the mission control repository.
The main difference is cost. The full deployment on Google Cloud costs $500 per month, while this simple Docker Compose setup can be hosted on any bare-metal server with more than 10 cores at about the same performance. The drawback is that no backups are created.
The data is stored in named volumes on the Docker host and can be exported or copied over to a bigger machine once more performance is needed.
I would assume the minimum competitive configuration to be the CPX51 VPS at Hetzner. By signing up using my referral link you can save 20€, and I get a 10€ bonus for more experiments.
You will need an archive node to complete the testnet challenge. For testing purposes I can offer mine, but I make no guarantees regarding performance.
To enable SSL on your host you should get a domain.
You can use any domain and any registrar that allows you to edit DNS records to point subdomains to your IP address.
For a free option, go to dotTK and find a free domain name. Create an account and complete the registration.
Hint: the shop is a bit broken. On my first attempt to check out, the shopping cart was empty, but there is a link back to the search results. Click it and put the domain in the shopping cart again; on the second attempt it worked for me.
In the last step choose "use dns" and enter the IP address of your server for 2 subdomains like in the picture. You can choose up to 12 months for free.
Under "Service > My Domains > Manage Domain > Manage Freenom DNS" you can add more subdomains later for e.g. the Grafana dashboard.
Prerequisites:
- Docker Engine >= 1.13
- Docker Compose >= 1.11
On a fresh Ubuntu server, log in via ssh and execute the following commands:
apt update -y
apt install docker.io docker-compose httpie
Clone this repository on your Docker host, cd into the graphprotocol-infrastructure directory and run compose up:
git clone https://github.com/butterfly-academy/graphprotocol-infrastructure.git
cd graphprotocol-infrastructure
EMAIL=my@email \
INDEX_HOST=index.mydomain.tk \
QUERY_HOST=query.mydomain.tk \
ADMIN_USER=admin \
ADMIN_PASSWORD=change_me \
ETHEREUM="mainnet:<ETH_RPC_URL>" \
ETHEREUM_START_BLOCK=7710671 \
docker-compose up -d
The ADMIN_USER and ADMIN_PASSWORD will be used by Grafana, Prometheus and AlertManager. QUERY_HOST and INDEX_HOST should point to the subdomains created earlier.
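docker-compose only sees the variables that are actually set when it starts, so a typo in one name can leave a container running with an empty value. A minimal sketch of a pre-flight check follows. The helper name `require_env` is hypothetical and not part of this repository, and the variable list simply mirrors the command above.

```shell
# Hypothetical helper: fail early if any of the named variables is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (export the variables first so docker-compose inherits them):
# export EMAIL=my@email INDEX_HOST=index.mydomain.tk QUERY_HOST=query.mydomain.tk
# require_env EMAIL INDEX_HOST QUERY_HOST ADMIN_USER ADMIN_PASSWORD ETHEREUM \
#   && docker-compose up -d
```

With this approach the stack is only started when every listed variable has a value. The names of the missing ones are printed to stderr.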
Containers:
- Graph Node (indexer / query node): http://<host-ip>:8000
- Postgres Database
- Prometheus (metrics database): http://<host-ip>:9090
- Prometheus-Pushgateway (push acceptor for ephemeral and batch jobs): http://<host-ip>:9091
- AlertManager (alerts management): http://<host-ip>:9093
- Grafana (visualize metrics): http://<host-ip>:3000
- NodeExporter (host metrics collector)
- cAdvisor (containers metrics collector)
- Caddy (reverse proxy and basic auth provider for Prometheus and AlertManager)
Connect via ssh to the server and issue the following commands to index the subgraphs required for phase 0 of the testnet challenge.
http post 127.0.0.1:8020 jsonrpc="2.0" method="subgraph_create" id="2" params:='{"name": "synthetixio-team/synthetix"}'
http post 127.0.0.1:8020 jsonrpc="2.0" id="2" method="subgraph_deploy" params:='{"name": "synthetixio-team/synthetix", "ipfs_hash": "Qme2hDXrkBpuXAYEuwGPAjr6zwiMZV4FHLLBa3BHzatBWx"}'
http post 127.0.0.1:8020 jsonrpc="2.0" method="subgraph_create" id="2" params:='{"name": "uniswap/uniswap-v2"}'
http post 127.0.0.1:8020 jsonrpc="2.0" id="2" method="subgraph_deploy" params:='{"name": "uniswap/uniswap-v2", "ipfs_hash": "QmXKwSEMirgWVn41nRzkT3hpUBw29cp619Gx58XW6mPhZP"}'
http post 127.0.0.1:8020 jsonrpc="2.0" method="subgraph_create" id="1" params:='{"name": "molochventures/moloch"}'
http post 127.0.0.1:8020 jsonrpc="2.0" id="1" method="subgraph_deploy" params:='{"name": "molochventures/moloch", "ipfs_hash": "QmTXzATwNfgGVukV1fX2T6xw9f6LAYRVWpsdXyRWzUR2H9"}'
http post 127.0.0.1:8020 jsonrpc="2.0" method="subgraph_create" id="4" params:='{"name": "jannis/gravity"}'
http post 127.0.0.1:8020 jsonrpc="2.0" id="4" method="subgraph_deploy" params:='{"name": "jannis/gravity", "ipfs_hash": "QmbeDC4G8iPAUJ6tRBu99vwyYkaSiFwtXWKwwYkoNphV4X"}'
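Each subgraph above needs the same pair of calls, so the pattern can be wrapped in a small function. The sketch below is one possible way to do that with curl instead of httpie. The function names `subgraph_rpc_body` and `deploy_subgraph` are hypothetical, curl is assumed to be installed, and the request body is modeled on the httpie commands above.

```shell
# Build the JSON-RPC body that the httpie calls above send to the admin port.
# $1 = method, $2 = subgraph name, $3 = optional IPFS hash
subgraph_rpc_body() {
  if [ -n "${3:-}" ]; then
    params="{\"name\": \"$2\", \"ipfs_hash\": \"$3\"}"
  else
    params="{\"name\": \"$2\"}"
  fi
  printf '{"jsonrpc": "2.0", "id": "1", "method": "%s", "params": %s}\n' "$1" "$params"
}

# Hypothetical wrapper: create and then deploy one subgraph.
# $1 = subgraph name, $2 = IPFS hash
deploy_subgraph() {
  for body in "$(subgraph_rpc_body subgraph_create "$1")" \
              "$(subgraph_rpc_body subgraph_deploy "$1" "$2")"; do
    curl -s -H 'Content-Type: application/json' -d "$body" http://127.0.0.1:8020 || return 1
    echo
  done
}

# Usage:
# deploy_subgraph jannis/gravity QmbeDC4G8iPAUJ6tRBu99vwyYkaSiFwtXWKwwYkoNphV4X
```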
In case of problems you can access the log output of each container (e.g. graph-node) with the command
docker logs <container> --follow --tail 100
Navigate to http://<host-ip>:3000 and log in with user admin and password admin. You can change the credentials in the compose file or by supplying the ADMIN_USER and ADMIN_PASSWORD environment variables on compose up. The config file can be added directly to the grafana section like this:
grafana:
  image: grafana/grafana:5.2.4
  env_file:
    - config
The config file should have this content:
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=changeme
GF_USERS_ALLOW_SIGN_UP=false
If you want to change the password, you have to remove this entry; otherwise the change will not take effect:
- grafana_data:/var/lib/grafana
Grafana is preconfigured with dashboards and Prometheus plus Postgres as the default data source:
Docker Host Dashboard
The Docker Host Dashboard shows key metrics for monitoring the resource usage of your server:
- Server uptime, CPU idle percent, number of CPU cores, available memory, swap and storage
- System load average graph, running and blocked by IO processes graph, interrupts graph
- CPU usage graph by mode (guest, idle, iowait, irq, nice, softirq, steal, system, user)
- Memory usage graph by distribution (used, free, buffers, cached)
- IO usage graph (read Bps, write Bps and IO time)
- Network usage graph by device (inbound Bps, Outbound Bps)
- Swap usage and activity graphs
For storage, and particularly the Free Storage graph, you have to specify the fstype in the Grafana graph request.
You can find it in grafana/dashboards/docker_host.json, at line 480:
"expr": "sum(node_filesystem_free_bytes{fstype=\"btrfs\"})",
I work on BTRFS, so I need to change aufs to btrfs.
You can find the right value for your system in Prometheus at http://<host-ip>:9090 by launching this request:
node_filesystem_free_bytes
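As a cross-check from the shell, the filesystem type of a mount point can also be read directly on the host. This is a minimal sketch that assumes GNU coreutils `df` (as shipped with Ubuntu). The helper name `mount_fstype` is hypothetical.

```shell
# Hypothetical helper: print the filesystem type of a mount point (default: /).
mount_fstype() {
  # GNU df: --output=fstype prints only the type column; tail drops the header.
  df --output=fstype "${1:-/}" | tail -n 1
}

# Usage on the Docker host:
# mount_fstype /
# mount_fstype /var/lib/docker
```

The value printed here should match one of the fstype labels returned by the Prometheus query above.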
Docker Containers Dashboard
The Docker Containers Dashboard shows key metrics for monitoring running containers:
- Total containers CPU load, memory and storage usage
- Running containers graph, system load graph, IO usage graph
- Container CPU usage graph
- Container memory usage graph
- Container cached memory usage graph
- Container network inbound usage graph
- Container network outbound usage graph
Note that this dashboard doesn't show the containers that are part of the monitoring stack.
Monitor Services Dashboard
The Monitor Services Dashboard shows key metrics for monitoring the containers that make up the monitoring stack:
- Prometheus container uptime, monitoring stack total memory usage, Prometheus local storage memory chunks and series
- Container CPU usage graph
- Container memory usage graph
- Prometheus chunks to persist and persistence urgency graphs
- Prometheus chunks ops and checkpoint duration graphs
- Prometheus samples ingested rate, target scrapes and scrape duration graphs
- Prometheus HTTP requests graph
- Prometheus alerts graph
Three alert groups have been set up within the alert.rules configuration file:
- Monitoring services alerts (targets)
- Docker Host alerts (host)
- Docker Containers alerts (containers)
You can modify the alert rules and reload them by making an HTTP POST call to Prometheus:
curl -X POST http://admin:admin@<host-ip>:9090/-/reload
Monitoring services alerts
Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) are down for more than 30 seconds:
- alert: monitor_service_down
  expr: up == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Monitor service non-operational"
    description: "Service {{ $labels.instance }} is down."
Docker Host alerts
Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:
- alert: high_cpu_load
  expr: node_load1 > 1.5
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server under high load"
    description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
Modify the load threshold based on your CPU cores.
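The load average counts runnable processes across all cores, so a threshold of 1.5 is very low for a server with more than 10 cores. One way to pick a value is to scale it with the core count. The sketch below does that arithmetic. The helper name `load_threshold` and the per-core factor of 0.75 are assumptions for illustration, not values from this repository.

```shell
# Hypothetical helper: scale the node_load1 threshold with the core count.
# $1 = number of CPU cores, $2 = load per core that should trigger the alert
load_threshold() {
  awk -v cores="$1" -v per_core="$2" 'BEGIN { printf "%.2f\n", cores * per_core }'
}

# Usage on the Docker host:
# load_threshold "$(nproc)" 0.75
```

For a 16-core machine and a factor of 0.75 this prints 12.00, which would replace 1.5 in the expression above.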
Trigger an alert if the Docker host memory is almost full:
- alert: high_memory_load
  expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) ) / sum(node_memory_MemTotal_bytes) * 100 > 85
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server memory is almost full"
    description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
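The memory expression treats buffers and cache as available memory: only what is left after subtracting free, buffered and cached bytes from the total counts as used. The same arithmetic can be reproduced in the shell to sanity-check a threshold. The helper name `mem_used_percent` is hypothetical, and the numbers in the usage line are made up.

```shell
# Hypothetical helper mirroring the alert expression:
# (total - (free + buffers + cached)) / total * 100
# All four arguments must use the same unit (e.g. bytes or MB).
mem_used_percent() {
  awk -v total="$1" -v free="$2" -v buffers="$3" -v cached="$4" \
    'BEGIN { printf "%.1f\n", (total - (free + buffers + cached)) / total * 100 }'
}

# Usage with example values in MB (total, free, buffers, cached):
# mem_used_percent 16000 1000 500 500
```

With the example values the result is 87.5, which is above the 85 percent threshold and would fire the alert after 30 seconds.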
Trigger an alert if the Docker host storage is almost full:
- alert: high_storage_load
  expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server storage is almost full"
    description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
Docker Containers alerts
Trigger an alert if a container is down for more than 30 seconds:
- alert: jenkins_down
  expr: absent(container_memory_usage_bytes{name="jenkins"})
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Jenkins down"
    description: "Jenkins container is down for more than 30 seconds."
Trigger an alert if a container is using more than 10% of total CPU cores for more than 30 seconds:
- alert: jenkins_high_cpu
  expr: sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 10
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Jenkins high CPU usage"
    description: "Jenkins CPU usage is {{ humanize $value}}%."
Trigger an alert if a container is using more than 1.2GB of RAM for more than 30 seconds:
- alert: jenkins_high_memory
  expr: sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Jenkins high memory usage"
    description: "Jenkins memory consumption is at {{ humanize $value}}."
The AlertManager service is responsible for handling alerts sent by the Prometheus server. AlertManager can send notifications via email, Pushover, Slack, HipChat or any other system that exposes a webhook interface. A complete list of integrations can be found here.
You can view and silence notifications by accessing http://<host-ip>:9093.
The notification receivers can be configured in the alertmanager/config.yml file.
To receive alerts via Slack you need to create a custom integration by choosing incoming webhooks in your Slack team app page. You can find more details on setting up Slack integration here.
Copy the Slack Webhook URL into the api_url field and specify a Slack channel.
route:
  receiver: 'slack'
  group_by: ['...']

receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        text: "{{ .CommonAnnotations.description }}"
        username: 'Prometheus'
        channel: '#<channel>'
        api_url: 'https://hooks.slack.com/services/<webhook-id>'
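Before waiting for a real alert, it can be useful to confirm that the webhook URL itself works. Slack incoming webhooks accept a JSON body with a `text` field, so a test message can be posted directly with curl. The function names below are hypothetical, curl is assumed to be installed, and the payload builder assumes the message contains no double quotes or backslashes.

```shell
# Build a minimal Slack webhook payload.
# Assumption: $1 contains no double quotes or backslashes (no JSON escaping is done).
slack_payload() {
  printf '{"text": "%s"}\n' "$1"
}

# Hypothetical helper: post a test message to an incoming webhook.
# $1 = webhook URL, $2 = message text
slack_test_message() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(slack_payload "$2")" "$1"
}

# Usage:
# slack_test_message 'https://hooks.slack.com/services/<webhook-id>' 'AlertManager webhook test'
```

If the message shows up in the channel, any remaining delivery problem is on the AlertManager side rather than in the Slack integration.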
route:
  receiver: 'email'
  group_by: ['...']

receivers:
  - name: 'email'
    email_configs:
      - to: <recipient-address>
        from: <sender-address>
        smarthost: mail.server.biz:587
        auth_username: <smtp-username>
        auth_password: password
        require_tls: true
Note: sending alerts through popular services like Gmail is more complicated because of their stricter security precautions (for example, you need app passwords). It is easier to use a small, standards-compliant provider.
The pushgateway is used to collect data from batch jobs or from services.
To push data, simply execute:
echo "some_metric 3.14" | curl --data-binary @- http://user:password@localhost:9091/metrics/job/some_job
Please replace the user:password part with the user and password set in the initial configuration (default: admin:admin).
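The Pushgateway URL encodes the grouping key as path segments (`/metrics/job/<job>`, optionally followed by label pairs such as `/instance/<instance>`), and a group can be removed again with an HTTP DELETE on the same URL. The sketch below wraps both operations. The function names are hypothetical and curl is assumed to be installed.

```shell
# Build the Pushgateway URL for a grouping key.
# $1 = base URL (may include user:password@), $2 = job, $3 = optional instance label
pushgateway_url() {
  url="$1/metrics/job/$2"
  if [ -n "${3:-}" ]; then
    url="$url/instance/$3"
  fi
  echo "$url"
}

# Hypothetical helper: push one sample.
# $1 = base URL, $2 = job, $3 = metric name, $4 = value
push_metric() {
  echo "$3 $4" | curl -s --data-binary @- "$(pushgateway_url "$1" "$2")"
}

# Hypothetical helper: delete all metrics of a job group.
delete_job_metrics() {
  curl -s -X DELETE "$(pushgateway_url "$1" "$2")"
}

# Usage:
# push_metric http://admin:admin@localhost:9091 some_job some_metric 3.14
# delete_job_metrics http://admin:admin@localhost:9091 some_job
```

Pushed metrics stay in the Pushgateway until they are deleted, so cleaning up after one-off batch jobs avoids stale series in Prometheus.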
In Grafana versions >= 5.1 the id of the grafana user has been changed. Unfortunately this means that files created prior to 5.1 won’t have the correct permissions for later versions.
Version | User | User ID |
---|---|---|
< 5.1 | grafana | 104 |
>= 5.1 | grafana | 472 |
There are two possible solutions to this problem.
- Change ownership from 104 to 472
- Start the upgraded container as user 104
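To decide which of the two options applies, it helps to know which user ID a given image version expects. The sketch below encodes the table above as a shell function. The helper name `grafana_uid_for_version` is hypothetical, and the version comparison relies on `sort -V`, which is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: print the grafana user ID expected by an image version,
# following the table above (104 before 5.1, 472 from 5.1 on).
grafana_uid_for_version() {
  # sort -V orders version strings; if 5.1 sorts first (or equal), version >= 5.1.
  lowest="$(printf '%s\n%s\n' "$1" "5.1" | sort -V | head -n 1)"
  if [ "$lowest" = "5.1" ]; then
    echo 472
  else
    echo 104
  fi
}

# Usage:
# grafana_uid_for_version 5.2.2
# grafana_uid_for_version 5.0.4
```

If the files in the grafana_data volume are owned by a different ID than the one printed for your image version, one of the two fixes described here is needed.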
To change ownership of the files, run your grafana container as root and modify the permissions.
First perform a docker-compose down, then modify your docker-compose.yml to include the user: root option:
grafana:
  image: grafana/grafana:5.2.2
  container_name: grafana
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/datasources:/etc/grafana/datasources
    - ./grafana/dashboards:/etc/grafana/dashboards
    - ./grafana/setup.sh:/setup.sh
  entrypoint: /setup.sh
  user: root
  environment:
    - GF_SECURITY_ADMIN_USER=${ADMIN_USER:-admin}
    - GF_SECURITY_ADMIN_PASSWORD=${ADMIN_PASSWORD:-admin}
    - GF_USERS_ALLOW_SIGN_UP=false
  restart: unless-stopped
  expose:
    - 3000
  networks:
    - monitor-net
  labels:
    org.label-schema.group: "monitoring"
Perform a docker-compose up -d and then issue the following commands:
docker exec -it --user root grafana bash
# in the container you just started:
chown -R root:root /etc/grafana && \
chmod -R a+r /etc/grafana && \
chown -R grafana:grafana /var/lib/grafana && \
chown -R grafana:grafana /usr/share/grafana
To run the grafana container as user: 104, change your docker-compose.yml like this:
grafana:
  image: grafana/grafana:5.2.2
  container_name: grafana
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/datasources:/etc/grafana/datasources
    - ./grafana/dashboards:/etc/grafana/dashboards
    - ./grafana/setup.sh:/setup.sh
  entrypoint: /setup.sh
  user: "104"
  environment:
    - GF_SECURITY_ADMIN_USER=${ADMIN_USER:-admin}
    - GF_SECURITY_ADMIN_PASSWORD=${ADMIN_PASSWORD:-admin}
    - GF_USERS_ALLOW_SIGN_UP=false
  restart: unless-stopped
  expose:
    - 3000
  networks:
    - monitor-net
  labels:
    org.label-schema.group: "monitoring"