Automating GPU cluster monitoring with Prometheus + custom alerts
We've been running a 12-node GPU farm (mostly RTX 4090s) on Hetzner and were struggling with visibility into utilization patterns. Built a custom Prometheus exporter in Python (rough sketch after the list) that tracks:
- VRAM usage per GPU (not just overall)
- Temperature spikes (alerts if any GPU exceeds 80°C)
- Job queue depth in our Kubernetes cluster
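For anyone curious, this is roughly the shape of the scrape loop (a simplified sketch, not the production code; the metric names, port, and 15s interval are illustrative, and it assumes `prometheus_client` plus the `nvidia-smi` CLI are available):

```python
# Sketch of a per-GPU Prometheus exporter built on nvidia-smi.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# One time series per GPU via the "gpu" label, not just node-level totals.
VRAM_USED = Gauge("gpu_vram_used_mib", "VRAM in use per GPU (MiB)", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "Core temperature per GPU", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "Compute utilization per GPU", ["gpu"])

def scrape() -> None:
    # nvidia-smi prints one CSV row per GPU: index, memory.used,
    # temperature.gpu, utilization.gpu ("nounits" strips MiB/%/C suffixes).
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,temperature.gpu,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    for row in out.strip().splitlines():
        idx, mem, temp, util = (field.strip() for field in row.split(","))
        VRAM_USED.labels(gpu=idx).set(float(mem))
        GPU_TEMP.labels(gpu=idx).set(float(temp))
        GPU_UTIL.labels(gpu=idx).set(float(util))

if __name__ == "__main__":
    start_http_server(9101)  # port is arbitrary; Prometheus scrapes /metrics here
    while True:
        scrape()
        time.sleep(15)  # roughly match the Prometheus scrape interval
```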
Using Alertmanager to send Slack notifications when a node drops below 50% utilization for >30 min (usually a sign jobs are queuing instead of being scheduled onto the idle node). Reduced idle time by ~18% in the first week.
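The alert side is unremarkable: with the illustrative metric name from the sketch above, the Prometheus rule is roughly `avg by (instance) (gpu_utilization_percent) < 50` with a `for: 30m` clause, and Alertmanager just routes the firing alert to our Slack channel.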
Anyone else doing this? Curious if there's a better approach than custom exporters. Considering moving to NVIDIA DCGM Exporter but haven't tested it yet.
Nice setup. Have you considered adding power consumption metrics via IPMI or nvidia-smi's power draw? Idle-time reduction is cool, but we found power costs were the actual bottleneck; the real problem was inefficient jobs, not queue depth. Also check the NVIDIA docs on sustained clocks vs. thermal throttling: https://docs.nvidia.com/cuda/ might give you better tuning options beyond a flat 80°C threshold.
Good point! We're actually pulling power draw from nvidia-smi already, but haven't wired it into alerts yet. That's on the roadmap—curious what threshold you set for idle power warnings?
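(For anyone replicating: `power.draw` is just another field on the same `nvidia-smi --query-gpu=...` CSV call, so in an exporter like the sketch upthread it's one more `Gauge` and one more column to parse.)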
Have you looked into DCGM (Data Center GPU Manager) instead of rolling your own nvidia-smi parser? It handles thermal throttling prediction and can feed directly into Prometheus via the official exporter. Might catch issues before they hit your 80°C threshold.
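(If you do test it, the quickest smoke test I know of is running the official `dcgm-exporter` container with GPU access and curling its `/metrics` endpoint; it listens on 9400 by default, if memory serves.)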
DCGM is solid, but honestly the custom exporter gives you way more control over what you alert on—worth the extra maintenance IMO.
18% idle reduction is solid. We added job priority weighting to our queue and saw similar gains—worth exploring if you haven't.