Cost Governor
A lightweight compute-efficiency analysis agent for GPU training workloads.
Python · GPU Telemetry · ML Infrastructure · FinOps
Samples per-second GPU, CPU, memory, disk, and network telemetry, estimates compute cost, and emits actionable findings that help teams identify idle capacity and underutilized GPUs. Designed to be invisible to the workload it monitors.
Inspired by infrastructure work that reduced a projected $6.5M AWS trajectory to $1.5M while scaling simulation throughput 10×. Read the case study →
The problem
GPU infrastructure is expensive, yet most ML workloads operate far below peak efficiency. A single 8×H100 node can cost $30+ per hour. Common issues like idle GPUs during data loading, inefficient batch sizing, CPU bottlenecks starving GPUs, and oversized instances waste thousands of dollars per training run.
Most teams discover these problems through billing surprises, not instrumentation. Cost Governor explores how lightweight telemetry and rule-based analysis can detect these inefficiencies automatically.
Architecture
The agent wraps the training process as a subprocess, collects per-second telemetry from pynvml and /proc directly, and exits with the training job's return code. No frameworks, no heavy dependencies, no network I/O beyond a single AWS metadata check at startup.
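The wrapper pattern described above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the agent's actual code: the function names (`run_wrapped`, `sample`) and the sample fields are hypothetical, while the `pynvml` calls are the library's real API, guarded so the sketch degrades to CPU-only sampling when no GPU is present.

```python
import json
import subprocess
import sys
import time

try:
    # pynvml is optional in this sketch: GPU fields stay None without it
    import pynvml
    pynvml.nvmlInit()
    _handle = pynvml.nvmlDeviceGetHandleByIndex(0)
except Exception:
    pynvml = None
    _handle = None


def sample():
    """Collect one telemetry sample (fields are illustrative, not the real schema)."""
    s = {"ts": time.time(), "gpu_util": None, "gpu_mem_pct": None}
    if pynvml is not None:
        util = pynvml.nvmlDeviceGetUtilizationRates(_handle)
        s["gpu_util"] = util.gpu
        mem = pynvml.nvmlDeviceGetMemoryInfo(_handle)
        s["gpu_mem_pct"] = 100.0 * mem.used / mem.total
    return s


def run_wrapped(cmd, jsonl_path="telemetry.jsonl", interval=1.0):
    """Run cmd as a subprocess, sample once per interval, return its exit code."""
    proc = subprocess.Popen(cmd)
    with open(jsonl_path, "w") as f:
        while proc.poll() is None:
            f.write(json.dumps(sample(), sort_keys=True) + "\n")
            time.sleep(interval)
    return proc.returncode
```

Because the wrapper simply propagates the child's return code, CI pipelines and schedulers see the training job's own success or failure, unchanged.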
The output contract (summary.json) is designed to support cross-run aggregation, enabling future analysis of GPU efficiency across projects and teams.
Stack
Python 3.9+ · pynvml (direct GPU queries, not nvidia-smi) · /proc reads (not psutil) · High test coverage enforced in CI · CI templates for GitLab CI and GitHub Actions
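Reading `/proc` directly, rather than pulling in psutil, looks roughly like this. A stdlib-only sketch under the assumption of a Linux host (both functions return `None` elsewhere); the function names are illustrative:

```python
import os


def memory_used_percent():
    """System memory utilization parsed from /proc/meminfo (Linux only)."""
    if not os.path.exists("/proc/meminfo"):
        return None
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields.get("MemAvailable", fields["MemFree"])
    return 100.0 * (total - available) / total


def agent_rss_mb():
    """This process's resident set size in MB, from /proc/self/status."""
    path = "/proc/self/status"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    return None
```

Reading `/proc` is a plain file read with no extra processes spawned, which is how the agent can report its own RSS and stay under its overhead budget.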
Design principles
- Invisible to workload: <1% CPU overhead, <50 MB RSS, no GPU contention, no network I/O during collection.
- Fail-safe: never crashes the training job; individual collector failures degrade gracefully without affecting others.
- Minimal dependencies: pynvml + Python stdlib. No numpy, pandas, or heavy libraries.
- Deterministic output: always produces summary.json and JSONL, with sorted keys for stable diffs and atomic writes.
- Honest numbers: never extrapolates; cost is clearly labeled as an estimate, and findings show both the observed value and the threshold.
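The deterministic-output principle can be sketched with a stdlib-only atomic writer (the function name is illustrative): write to a temp file in the same directory, then `os.replace` it into place.

```python
import json
import os
import tempfile


def write_summary_atomic(summary, path="summary.json"):
    """Write JSON with sorted keys via rename, so the file is atomic and diff-stable."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(summary, f, sort_keys=True, indent=2)
            f.write("\n")
            f.flush()
            os.fsync(f.fileno())
        # atomic on POSIX: readers see the old file or the new one, never a partial write
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Sorted keys mean two runs with identical metrics produce byte-identical files, so diffs in cross-run aggregation reflect real changes rather than dict ordering.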
Example finding
Observed: GPU utilization p50 = 34%
Threshold: p50 < 60% triggers a finding
Recommendation: Investigate data loader throughput or batch sizing.
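A finding rule of this shape reduces to a small pure function. A sketch only: the field names and rule signature are hypothetical, not the agent's actual schema, but the observed-value/threshold pairing mirrors the "honest numbers" principle above.

```python
def check_gpu_util(p50, threshold=60.0):
    """Return a finding dict when median GPU utilization is below threshold, else None."""
    if p50 >= threshold:
        return None
    return {
        "metric": "gpu_util_p50",
        "observed": p50,            # the value actually measured, never extrapolated
        "threshold": threshold,     # the limit that triggered the finding
        "recommendation": "Investigate data loader throughput or batch sizing.",
    }
```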
Example report output
GPU: NVIDIA GeForce RTX 4090 ×1 · Duration: 5m 18s · Agent overhead: 0.002% CPU / 41.2 MB RSS
| Metric | Min | P50 | P95 | Max | Mean |
|---|---|---|---|---|---|
| GPU Util % | 0.0 | 100.0 | 100.0 | 100.0 | 96.5 |
| GPU Mem % | 8.7 | 79.7 | 80.0 | 80.1 | 77.7 |
| CPU % | 0.0 | 3.8 | 6.4 | 16.5 | 4.1 |
| Memory % | 19.0 | 22.6 | 23.8 | 24.1 | 22.8 |
319 samples at 1.0s intervals
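The summary statistics above can be computed without numpy, in keeping with the minimal-dependencies principle. The agent's exact percentile method isn't specified; nearest-rank is one stdlib-only choice, sketched here with illustrative function names:

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Min / p50 / p95 / max / mean over one metric's raw per-second samples."""
    return {
        "min": min(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }
```

Nearest-rank always returns an observed sample rather than an interpolated value, which fits the "honest numbers" principle: every reported percentile is a number that actually occurred.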