Automating GPU cluster monitoring with Prometheus + custom alerts
We've been running a 12-node GPU farm (mostly RTX 4090s) on Hetzner and were struggling with visibility into utilization patterns. Built a custom Prometheus exporter in Python (rough sketch after the list) that tracks:
- VRAM usage per GPU (not just overall)
- Temperature spikes (alerts if any GPU exceeds 80°C)
- Job queue depth in our Kubernetes cluster
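For anyone curious, this is roughly the shape of the scrape loop (a simplified sketch, not the production code; the metric names, port, and 15s interval are illustrative, and it assumes `prometheus_client` plus the `nvidia-smi` CLI are available):

```python
# Sketch of a per-GPU Prometheus exporter built on nvidia-smi.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# One time series per GPU via the "gpu" label, not just node-level totals.
VRAM_USED = Gauge("gpu_vram_used_mib", "VRAM in use per GPU (MiB)", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "Core temperature per GPU", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "Compute utilization per GPU", ["gpu"])

def scrape() -> None:
    # nvidia-smi prints one CSV row per GPU: index, memory.used,
    # temperature.gpu, utilization.gpu ("nounits" strips MiB/%/C suffixes).
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,temperature.gpu,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    for row in out.strip().splitlines():
        idx, mem, temp, util = (field.strip() for field in row.split(","))
        VRAM_USED.labels(gpu=idx).set(float(mem))
        GPU_TEMP.labels(gpu=idx).set(float(temp))
        GPU_UTIL.labels(gpu=idx).set(float(util))

if __name__ == "__main__":
    start_http_server(9101)  # port is arbitrary; Prometheus scrapes /metrics here
    while True:
        scrape()
        time.sleep(15)  # roughly match the Prometheus scrape interval
```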
Using Alertmanager to send Slack notifications when a node drops below 50% utilization for >30 min (usually a sign jobs are queuing instead of being scheduled onto the idle node). Reduced idle time by ~18% in the first week.
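The alert side is unremarkable: with the illustrative metric name from the sketch above, the Prometheus rule is roughly `avg by (instance) (gpu_utilization_percent) < 50` with a `for: 30m` clause, and Alertmanager just routes the firing alert to our Slack channel.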
Anyone else doing this? Curious if there's a better approach than custom exporters. Considering moving to NVIDIA DCGM Exporter but haven't tested it yet.
Nice setup. Have you considered adding power consumption metrics via IPMI or nvidia-smi's power draw? Idle-time reduction is cool, but we found power costs were the actual bottleneck; the real problem was inefficient jobs, not queue depth. Also check the NVIDIA docs on sustained clocks vs. thermal throttling: https://docs.nvidia.com/cuda/ might give you better tuning options beyond a flat 80°C threshold.
Good point! We're actually pulling power draw from nvidia-smi already, but haven't wired it into alerts yet. That's on the roadmap—curious what threshold you set for idle power warnings?
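(For anyone replicating: `power.draw` is just another field on the same `nvidia-smi --query-gpu=...` CSV call, so in an exporter like the sketch upthread it's one more `Gauge` and one more column to parse.)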
Have you looked into DCGM (Data Center GPU Manager) instead of rolling your own nvidia-smi parser? It handles thermal throttling prediction and can feed directly into Prometheus via the official exporter. Might catch issues before they hit your 80°C threshold.
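(If you do test it, the quickest smoke test I know of is running the official `dcgm-exporter` container with GPU access and curling its `/metrics` endpoint; it listens on 9400 by default, if memory serves.)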
DCGM is solid, but honestly the custom exporter gives you way more control over what you alert on—worth the extra maintenance IMO.
18% idle reduction is solid. We added job priority weighting to our queue and saw similar gains—worth exploring if you haven't.