Add an alert and a runbook for low disk space #293

Merged
merged 1 commit into from
Jul 31, 2024
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -10,6 +10,7 @@

# Alert Runbooks

- [DiskSpaceLow](./runbooks/alerts/diskspacelow.md)
- [ZPoolStatusDegraded](./runbooks/alerts/zpoolstatusdegraded.md)

# "How to" Runbooks
35 changes: 35 additions & 0 deletions docs/src/runbooks/alerts/diskspacelow.md
@@ -0,0 +1,35 @@
DiskSpaceLow
============

This alert fires when a filesystem (other than `ramfs` or `tmpfs`) has less
than 10% of its space available.

The alert says which partitions are affected; `df -h` shows the same
information:

```
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 1.6G 0 1.6G 0% /dev
tmpfs 16G 112K 16G 1% /dev/shm
tmpfs 7.8G 9.8M 7.8G 1% /run
tmpfs 16G 1.1M 16G 1% /run/wrappers
local/volatile/root 1.7T 1.8G 1.7T 1% /
local/persistent/nix 1.7T 5.1G 1.7T 1% /nix
local/persistent/persist 1.7T 2.0G 1.7T 1% /persist
local/persistent/var-log 1.7T 540M 1.7T 1% /var/log
efivarfs 128K 40K 84K 33% /sys/firmware/efi/efivars
local/persistent/home 1.7T 32G 1.7T 2% /home
/dev/nvme0n1p2 487M 56M 431M 12% /boot
data/nas 33T 22T 11T 68% /mnt/nas
tmpfs 3.2G 12K 3.2G 1% /run/user/1000
```

Note that all ZFS datasets in the same pool (the `local/*` datasets, or the
`data/*` datasets, in the example above) share the underlying storage, so they
draw on the same free space.

Debugging steps:

- See the `node_filesystem_avail_bytes` metric for how quickly disk space is
being consumed
- Use `ncdu -x` to work out where the space is going
- Buy more storage if need be
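
As a rough sketch of a first check, the helper below filters `df -P` output
down to the filesystems at or above a given `Use%`. The function name
`low_space` and the default threshold of 90 are assumptions made for
illustration, and the `Use%` column reported by `df` only approximates the
ratio the alert is evaluated on, so treat the result as a hint rather than the
alert condition itself.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# point and Use% of every filesystem at or above the threshold (default 90).
low_space() {
  awk -v limit="${1:-90}" 'NR > 1 && $5 + 0 >= limit { print $6, $5 }'
}

# List filesystems that are at least 90% full on this host.
df -P | low_space 90
```

Running `ncdu -x` on any mount point this prints is then the natural next
step. Mount points containing spaces are not handled by this sketch.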
24 changes: 14 additions & 10 deletions shared/default.nix
@@ -146,16 +146,6 @@ in
services.zfs.autoSnapshot.enable = thereAreZfsFilesystems;
services.zfs.autoSnapshot.monthly = 3;

services.prometheus.rules = mkIf thereAreZfsFilesystems [
''
groups:
- name: zfs
rules:
- alert: ZPoolStatusDegraded
expr: node_zfs_zpool_state{state!="online"} > 0
''
];

# Actually panic when ZFS "panics"
# https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSPanicsNotKernelPanics
boot.extraModprobeConfig = mkIf thereAreZfsFilesystems ''
@@ -273,6 +263,20 @@ in
};
};

services.prometheus.rules = [
''
groups:
- name: disk
rules:
- alert: DiskSpaceLow
expr: node_filesystem_avail_bytes{fstype!~"(ramfs|tmpfs)"} / node_filesystem_size_bytes < 0.1
- name: zfs
rules:
- alert: ZPoolStatusDegraded
expr: node_zfs_zpool_state{state!="online"} > 0
''
];

# Host metrics
services.prometheus.exporters.node.enable = promcfg.enable;
