Monitor GPUs with Grafana, Prometheus, and DCGM Exporter. Track temperature, power, memory, and utilization in real time.
The monitoring solution provides several key visualizations:
- GPU Temperature: Real-time temperature monitoring with historical trends
- Power Consumption: Track power usage in watts with peaks and averages
- Memory Usage: Monitor GPU memory allocation in MB
- GPU Utilization: Track percentage utilization over time
- Memory Copy Utilization: Monitor memory bandwidth usage
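As a rough illustration of where these panels get their data, the sketch below maps each visualization to the DCGM Exporter metric that most plausibly backs it and parses a small sample in the Prometheus text format. The mapping is an assumption based on DCGM's standard field names, not something taken from this project's dashboards. The `SAMPLE` text is hand-written for illustration, not captured exporter output. The parser is deliberately simplified and assumes lines carry no trailing timestamp.

```python
# Assumed mapping from the panels listed above to DCGM Exporter metric names.
PANEL_METRICS = {
    "GPU Temperature": "DCGM_FI_DEV_GPU_TEMP",
    "Power Consumption": "DCGM_FI_DEV_POWER_USAGE",
    "Memory Usage": "DCGM_FI_DEV_FB_USED",
    "GPU Utilization": "DCGM_FI_DEV_GPU_UTIL",
    "Memory Copy Utilization": "DCGM_FI_DEV_MEM_COPY_UTIL",
}

# Hand-written sample in Prometheus text exposition format (illustrative only).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 54
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 71.5
DCGM_FI_DEV_FB_USED{gpu="0"} 2048
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 37
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0"} 12
"""


def parse_exposition(text):
    """Parse exposition text into {metric_name: [(labels, value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.setdefault(name, []).append((labels.rstrip("}"), float(value)))
    return samples


def panel_readings(text):
    """Return {panel_name: first sample value} for metrics present in text."""
    samples = parse_exposition(text)
    return {
        panel: samples[metric][0][1]
        for panel, metric in PANEL_METRICS.items()
        if metric in samples
    }


if __name__ == "__main__":
    for panel, value in panel_readings(SAMPLE).items():
        print(f"{panel}: {value}")
```

On a running stack, the same text can usually be fetched from the exporter's `/metrics` endpoint. The exact host and port depend on your deployment.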
Before you begin, ensure you have the following prerequisites installed:
Docker is required to run containerized applications. To install Docker:
- Linux: Follow the official Docker installation guide for your specific distribution.
The NVIDIA Container Toolkit is required to run GPU-accelerated containers. For installation instructions, refer to the official NVIDIA Container Toolkit documentation.
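A quick way to confirm both prerequisites are present is to look for their command-line tools on `PATH`. The following is a minimal sketch, assuming `docker` and `nvidia-ctk` (the CLI shipped with the NVIDIA Container Toolkit) are the executables to look for. The helper name `missing_tools` is hypothetical. Finding the binaries does not prove that Docker is configured to use the NVIDIA runtime, so treat this as a first sanity check only.

```python
import shutil

# Assumed executable names; adjust if your installation differs.
REQUIRED_TOOLS = ("docker", "nvidia-ctk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing. In normal use the default `shutil.which` is sufficient.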
The monitoring stack consists of the following components:
- DCGM Exporter: NVIDIA's Data Center GPU Manager exporter that collects GPU metrics
- Prometheus: Time-series database that stores the collected metrics
- Grafana: Visualization platform used to create dashboards and alerts
- Node Exporter: Collects system-level metrics including CPU, memory, disk, and network statistics
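In this arrangement the exporters expose metrics, Prometheus scrapes and stores them, and Grafana reads them back through Prometheus's HTTP API. The sketch below shows how a script could run the same kind of instant query against the `/api/v1/query` endpoint. It assumes Prometheus is reachable at `http://localhost:9090` (its default port, which may differ in your deployment) and that DCGM Exporter series carry a `gpu` label. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default Prometheus port


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for a Prometheus instant query of `expr`."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def extract_values(payload):
    """Convert an instant-query JSON response into {gpu_label: value}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        # Each result holds its labels and a [timestamp, "value"] pair.
        item["metric"].get("gpu", "?"): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query_gpu_metric(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run an instant query against a live Prometheus and return its values."""
    with urlopen(instant_query_url(expr, base_url), timeout=timeout) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    # Requires the monitoring stack to be running.
    print(query_gpu_metric("DCGM_FI_DEV_GPU_TEMP"))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a running Prometheus instance.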
The default configuration should work for most setups. To customize:
- Prometheus settings: Edit `prometheus/prometheus.yml`
- Grafana dashboards: Pre-configured dashboards are available at `config/grafana/dashboards`, or you can create your own
This project is based on gpu-monitoring-docker-compose by hongshibao.