This guide explains how to use the Makefile to set up a Kind (Kubernetes in Docker) cluster with NVIDIA GPU support, including monitoring capabilities.
For detailed configuration information, refer to the NVIDIA and Kind Configuration Guide and the DCGM Monitoring Setup Guide.
The Makefile automatically installs the following requirements:
- Go
- kubectl (latest stable version)
- Kind (v0.20.0)
- Helm
make all
This runs the complete setup process in the following order:
- Installs prerequisites
- Creates Kind cluster
- Sets up NVIDIA support
- Installs GPU operator
- Tests GPU access
- Sets up monitoring
- Configures port forwarding
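As a rough illustration, the sequence above corresponds to running the individual targets one after another and stopping at the first failure. The sketch below assumes the target names documented in this guide; the function name `run_all_steps` is hypothetical, and the actual Makefile may chain its targets differently.

```shell
# Hypothetical helper: runs the documented targets in the order listed for `make all`.
run_all_steps() {
  for target in prerequisites cluster setup-nvidia install-gpu-operator \
                test-gpu setup-monitoring port-forward; do
    echo "==> make ${target}"
    # Stop at the first failing step so later steps never run against a broken cluster.
    make "${target}" || { echo "step '${target}' failed" >&2; return 1; }
  done
}
```

Calling `run_all_steps` from the directory containing the Makefile should behave much like `make all`, with the added benefit that the failing step is named explicitly.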
make prerequisites
Installs all required tools and dependencies.
make cluster
Creates a Kind cluster using the configuration from kind-config.yaml. For detailed configuration information, see the NVIDIA and Kind Configuration Guide.
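For orientation, creating a cluster from a configuration file usually comes down to a single `kind` invocation followed by a quick sanity check. This is a sketch, not the Makefile's actual recipe: the function name `create_cluster_sketch` and the 120-second wait are assumptions, and it presumes kind-config.yaml is in the current directory.

```shell
# Hypothetical helper: creates the cluster from kind-config.yaml and checks that it responds.
create_cluster_sketch() {
  # --wait blocks until the control plane reports ready or the timeout expires.
  kind create cluster --config kind-config.yaml --wait 120s &&
    kubectl cluster-info &&
    kubectl get nodes -o wide
}
```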
make setup-nvidia
Runs the setup-nvidia-kind.sh script to configure NVIDIA container support. See the NVIDIA and Kind Configuration Guide for a detailed explanation of the setup process.
make install-gpu-operator
Installs the NVIDIA GPU operator with the following configurations:
- Driver disabled (uses host driver)
- Toolkit enabled
- Device plugin enabled
- MIG manager disabled
- Host mounts enabled
- Specific toolkit and device plugin versions
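The settings listed above map naturally onto Helm chart values. The following is a sketch, not the Makefile's actual command: it assumes the chart comes from NVIDIA's public Helm repository, that the release is named gpu-operator, and that it is installed into the gpu-operator namespace. The value names are those of the upstream gpu-operator chart. The host-mount settings and the pinned toolkit and device plugin versions are defined in the Makefile and are intentionally left out here.

```shell
# Hypothetical helper: installs the GPU operator with the documented toggles.
install_gpu_operator_sketch() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
    helm repo update &&
    # driver off (host driver is used), toolkit and device plugin on, MIG manager off
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace \
      --set driver.enabled=false \
      --set toolkit.enabled=true \
      --set devicePlugin.enabled=true \
      --set migManager.enabled=false \
      --wait
}
```

Using `helm upgrade --install` rather than `helm install` keeps the step repeatable, which is convenient when a target may be run more than once.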
make test-gpu
Runs a test pod with nvidia-smi to verify GPU access.
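A GPU test of this kind typically amounts to a short-lived pod that requests one GPU and runs nvidia-smi. The sketch below works under stated assumptions: the pod name gpu-test, the CUDA image tag, and the function names are illustrative choices, not necessarily what the Makefile uses.

```shell
# Prints a pod manifest that requests one GPU; pod name and image are assumptions.
gpu_test_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Applies the manifest, waits for the pod to finish, prints its output, then cleans up.
run_gpu_test() {
  gpu_test_manifest | kubectl apply -f - &&
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s &&
    kubectl logs gpu-test
  status=$?
  kubectl delete pod gpu-test --ignore-not-found >/dev/null
  return "$status"
}
```

If the pod stays in Pending, the scheduler most likely sees no `nvidia.com/gpu` resource on the node, which points back at the device plugin rather than at the test itself.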
make setup-monitoring
Sets up the monitoring stack:
- Installs kube-prometheus-stack
- Configures DCGM monitoring
- Sets up custom service monitors
For detailed information about DCGM monitoring setup, refer to the DCGM Monitoring Setup Guide.
make port-forward
Sets up port forwarding for monitoring services:
- Prometheus: 9090
- Grafana: 3000
- Alertmanager: 9093
make clean
Deletes the Kind cluster.
make debug
Shows debug information including:
- Pod status in gpu-operator namespace
- Pod descriptions
- GPU operator logs
- NVIDIA container information
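The first three items above can be approximated with plain kubectl commands, which is useful when the Makefile is not at hand. This is a sketch: the function name is hypothetical, and the label selector `app=gpu-operator` is an assumption about how the operator pods are labeled. The NVIDIA container information step is specific to the Makefile and is not reproduced here.

```shell
# Hypothetical helper: collects basic diagnostics from the gpu-operator namespace.
debug_gpu_operator() {
  ns=gpu-operator
  # Pod status, including the node each pod landed on.
  kubectl get pods -n "$ns" -o wide
  # Full descriptions, where scheduling and image pull events show up.
  kubectl describe pods -n "$ns"
  # Recent operator logs; the label selector is an assumption.
  kubectl logs -n "$ns" -l app=gpu-operator --tail=100
}
```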
make reinstall-nvidia-runtime
Completely reinstalls the NVIDIA runtime:
- Uninstalls GPU operator
- Deletes gpu-operator namespace
- Recreates cluster
- Reinstalls NVIDIA support
- Reinstalls GPU operator
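Expressed as commands, the steps above might look roughly like the following. This is a sketch under assumptions: the Helm release is taken to be named gpu-operator, and "recreates cluster" is interpreted as `make clean` followed by `make cluster`. The real target may differ in both respects.

```shell
# Hypothetical helper mirroring the documented reinstall steps.
reinstall_nvidia_runtime_sketch() {
  # A missing release is not fatal; the goal is a clean slate.
  helm uninstall gpu-operator -n gpu-operator ||
    echo "gpu-operator release not found, continuing" >&2
  kubectl delete namespace gpu-operator --ignore-not-found
  # Recreate the cluster, then reapply NVIDIA support and the operator.
  make clean && make cluster && make setup-nvidia && make install-gpu-operator
}
```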
If port forwarding fails:
- Check if ports are already in use
- Verify the services are running in the monitoring namespace
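Both checks can be scripted. The sketch below probes the three documented ports on localhost and then lists what is running in the monitoring namespace. The function names are hypothetical, and the port probe relies on bash's `/dev/tcp` feature, so under other shells it reports every port as free.

```shell
# Returns 0 if something accepts TCP connections on localhost:<port> (bash only).
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Reports the state of the Prometheus, Grafana, and Alertmanager ports.
check_monitoring_ports() {
  for port in 9090 3000 9093; do
    if port_in_use "$port"; then
      echo "port $port: in use"
    else
      echo "port $port: free"
    fi
  done
}

# Lists services and pods in the monitoring namespace.
check_monitoring_services() {
  kubectl get svc,pods -n monitoring
}
```

A port reported as in use before `make port-forward` has been started usually means an earlier forwarding session is still running.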
If GPU operator installation fails:
- Use make debug to check the operator logs
- Verify NVIDIA driver compatibility
- Check if all required mounts are properly configured
- See the NVIDIA and Kind Configuration Guide for proper setup requirements
If monitoring setup fails:
- Ensure CustomResourceDefinitions are properly established
- Check if the prometheus-operator is running
- Verify RBAC permissions are correctly configured
- Refer to the DCGM Monitoring Setup Guide for detailed monitoring configuration
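The first three checks can be run by hand with kubectl. The following sketch assumes the stack lives in the monitoring namespace and uses the ServiceMonitor CRD as a representative of the CRDs that kube-prometheus-stack installs; the function name is hypothetical.

```shell
# Hypothetical helper: basic health checks for the monitoring stack.
check_monitoring_setup() {
  # Blocks until the ServiceMonitor CRD is established, or fails after 60 seconds.
  kubectl wait --for=condition=Established \
    crd/servicemonitors.monitoring.coreos.com --timeout=60s
  # The prometheus-operator pod should appear here in the Running state.
  kubectl get pods -n monitoring
  # Checks whether the current user may read ServiceMonitor objects.
  kubectl auth can-i get servicemonitors.monitoring.coreos.com -n monitoring
  # Lists the service monitors that were actually created.
  kubectl get servicemonitors -n monitoring
}
```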