Project
Cost Governor
Shows when ML training jobs are silently wasting GPU spend.
Python · GPU Telemetry · Cost Estimation · ML Infrastructure
Wraps a training command, samples per-second GPU, CPU, memory, disk, and network telemetry, estimates compute cost, and reports idle GPU capacity, estimated avoidable cost, and likely bottlenecks. One CLI command, one stable JSON contract, zero infrastructure.
Informed by infrastructure work that reduced a projected $6.5M AWS trajectory to $1.5M.
The problem
Most training runs don't fail. They become quietly expensive. GPUs sit half-idle for hours while you pay full price. A single 8×A100 node at $32.77/hr can burn $130+ in avoidable cost on a 6-hour run if GPU utilization averages 34%.
Most teams discover this through billing surprises, not instrumentation. Cost Governor detects these inefficiencies automatically at the run level.
Usage
pip install .
cost-governor monitor -- python train.py
Results in ./cost-governor-output/. No config files. No infrastructure.
Who it's for
- Engineers running PyTorch or TensorFlow jobs on single-node GPU instances
- ML infra teams trying to understand per-run cost and GPU efficiency
- Teams debugging slow or expensive training runs in CI or ad hoc environments
Not for v1: cluster-wide monitoring, distributed multi-node attribution, real-time dashboards.
Wasteful run example
6-hour training job on 8×A100. GPU utilization averaged 34%. About $130 of the $197 cost appears to be idle GPU capacity.
| Metric | Min | P50 | P95 | Max | Mean |
|---|---|---|---|---|---|
| GPU Util % | 0.0 | 31.0 | 72.0 | 82.0 | 34.0 |
| GPU Mem % | 8.0 | 27.0 | 40.0 | 42.0 | 28.5 |
| CPU % | 2.0 | 15.0 | 38.0 | 45.0 | 18.0 |
| Memory % | 12.0 | 23.0 | 32.0 | 35.0 | 24.0 |
21,600 samples at 1.0s intervals
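The summary row for each metric can be derived from the raw per-second samples with the standard library alone. The sketch below is illustrative: `percentile` and `summarize` are hypothetical names, and the nearest-rank percentile method is an assumption, since the source does not state which interpolation the tool uses.

```python
import math
from statistics import fmean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list.

    Assumption: nearest-rank is used here for simplicity; the tool's
    actual percentile method is not documented in the source.
    """
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples):
    """Return min/p50/p95/max/mean for one metric's samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
        "mean": round(fmean(ordered), 1),
    }
```

Calling `summarize` once per metric (GPU utilization, GPU memory, CPU, system memory) would produce one table row each.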
Estimated total: $196.62 (p4d.24xlarge at $32.77/hr on-demand)
Estimated avoidable cost: $129.77 (66% idle GPU capacity)
Findings
GPU utilization p50 was 31%. Sustained accelerator underutilization observed.
Estimated avoidable cost: $129.77. 66% idle GPU capacity at on-demand rates.
GPU memory peaked at 40% (p95). Batch size could likely increase.
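The figures in this example are consistent with a simple model: total cost is run duration times the hourly rate, idle capacity is 100% minus mean GPU utilization, and avoidable cost is the total scaled by the idle share. That model is inferred from the numbers shown here, not confirmed as the tool's exact formula, and `estimate_cost` is a hypothetical name.

```python
def estimate_cost(duration_seconds, hourly_rate_usd, mean_gpu_util_pct):
    """Estimate total and avoidable cost for one run.

    Assumption: idle capacity = 100 - mean GPU utilization, and
    avoidable cost = total cost * idle share. This matches the example
    figures but is an inference, not the documented formula.
    """
    hours = duration_seconds / 3600
    total = hours * hourly_rate_usd
    idle_pct = 100.0 - mean_gpu_util_pct
    avoidable = total * idle_pct / 100
    return {
        "estimated_total_usd": round(total, 2),
        "gpu_idle_capacity_pct": round(idle_pct, 1),
        "estimated_avoidable_cost_usd": round(avoidable, 2),
    }
```

For the run above (21,600 seconds at $32.77/hr with 34% mean utilization), this yields 6 h × $32.77 = $196.62 total and $196.62 × 0.66 = $129.77 avoidable.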
Bottleneck detection
Cost Governor goes beyond "GPU utilization was low." When CPU and GPU metrics tell a coherent story, the tool identifies the constraint.
| Metric | Value |
|---|---|
| CPU p95 | 96% |
| GPU mean | 52% |
| GPU 0 | 68% |
| GPU 1 | 36% |
CPU p95 was 96%. CPU may be a bottleneck.
GPU imbalance: utilization ranged from 36% to 68%.
CPU-GPU anti-correlation detected. Host-side constraint likely.
Taken together, these findings tell a coherent story: the host appears to be constraining GPU throughput. The pattern is consistent with a host-side bottleneck rather than a GPU-side compute limit.
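The two multi-signal checks in this example can be sketched with the standard library. The thresholds (spread above 25 percentage points, Pearson r at or below -0.5) come from the detection rules table; the function names and the `min_samples=60` gate are assumptions, since the source says advanced rules need a minimum sample count but does not give the value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def host_bottleneck_signals(cpu_samples, gpu_samples, per_gpu_means,
                            min_samples=60):
    """Return which host-side bottleneck signals are present.

    min_samples is a hypothetical gate value, not taken from the tool.
    """
    signals = {}
    spread = max(per_gpu_means) - min(per_gpu_means)
    signals["gpu_imbalance_detected"] = spread > 25.0
    if len(cpu_samples) >= min_samples:
        r = pearson_r(cpu_samples, gpu_samples)
        signals["cpu_gpu_anticorrelation"] = r <= -0.5
    else:
        signals["cpu_gpu_anticorrelation"] = False  # too few samples to judge
    return signals
```

With per-GPU means of 68% and 36%, the spread is 32 percentage points, which exceeds the 25pp threshold.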
Architecture
The wrapper launches the training process, samples per-second telemetry during the run, and returns the training job's exit code on completion. Telemetry comes from pynvml (GPU) and /proc (CPU, memory, disk I/O, network). No frameworks, no heavy dependencies, no network I/O beyond a single AWS metadata check at startup.
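The launch, sample, and return-exit-code flow described above might look roughly like the following. This is a minimal sketch, not the tool's implementation: `run_and_sample` and `sample_fn` are hypothetical names, and the sampler is passed in as a callable so the example does not depend on pynvml or /proc.

```python
import subprocess
import sys
import threading
import time


def run_and_sample(command, sample_fn, interval_s=1.0):
    """Run `command`, call `sample_fn` every `interval_s` seconds while it
    runs, and return (exit_code, samples)."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            try:
                samples.append(sample_fn())
            except Exception:
                pass  # a failing collector must never affect the job
            stop.wait(interval_s)

    proc = subprocess.Popen(command)
    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        exit_code = proc.wait()
    finally:
        stop.set()
        thread.join()
    return exit_code, samples
```

A caller would typically finish with `sys.exit(exit_code)` so that CI systems see the training job's own status rather than the wrapper's.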
The output contract (summary.json v1.3.0) is the stable API. Everything downstream reads from it: reports, CI checks, future integrations.
Detection rules
Seven rules covering threshold violations, multi-GPU imbalance, CPU-GPU anti-correlation, and waste estimation. Advanced rules require minimum sample counts and run duration before firing.
| Rule | Severity | Trigger |
|---|---|---|
| gpu_util_low | high | GPU utilization p50 < 60% |
| avoidable_gpu_cost_estimated | high | Estimated avoidable cost ≥ $1 |
| memory_high | high | System memory p95 > 90% |
| gpu_memory_low | medium | GPU memory p95 < 60% |
| cpu_util_high | medium | CPU utilization p95 > 90% |
| gpu_imbalance_detected | medium | Per-GPU utilization spread > 25pp |
| cpu_gpu_anticorrelation | medium | CPU-GPU anti-correlation (r ≤ -0.5) |
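One way to express the table above is as data, with a single loop that evaluates each rule against precomputed summary metrics. The rule IDs, severities, and thresholds below are copied from the table; the metric key names (`gpu_util_p50`, `cpu_gpu_r`, and so on) are hypothetical, and the minimum-sample and duration gating for advanced rules is omitted.

```python
import operator

# (rule_id, severity, metric key, comparison, threshold)
# Metric key names are illustrative assumptions, not the tool's schema.
RULES = [
    ("gpu_util_low", "high", "gpu_util_p50", operator.lt, 60.0),
    ("avoidable_gpu_cost_estimated", "high", "avoidable_cost_usd", operator.ge, 1.0),
    ("memory_high", "high", "memory_p95", operator.gt, 90.0),
    ("gpu_memory_low", "medium", "gpu_mem_p95", operator.lt, 60.0),
    ("cpu_util_high", "medium", "cpu_p95", operator.gt, 90.0),
    ("gpu_imbalance_detected", "medium", "gpu_spread_pp", operator.gt, 25.0),
    ("cpu_gpu_anticorrelation", "medium", "cpu_gpu_r", operator.le, -0.5),
]


def evaluate(metrics):
    """Return a finding for every rule whose trigger condition holds.

    Rules whose metric is missing from `metrics` are skipped rather
    than treated as failures.
    """
    findings = []
    for rule_id, severity, key, compare, threshold in RULES:
        observed = metrics.get(key)
        if observed is None:
            continue
        if compare(observed, threshold):
            findings.append({
                "rule_id": rule_id,
                "severity": severity,
                "observed": observed,
                "threshold": threshold,
            })
    return findings
```

Each finding carries both the observed value and the threshold, in line with the "Honest numbers" constraint.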
Output contract (summary.json)
{
  "schema_version": "1.3.0",
  "run": {
    "instance_type": "p4d.24xlarge",
    "gpu_model": "NVIDIA A100-SXM4-40GB",
    "gpu_count": 8,
    "duration_seconds": 21600.0,
    "agent_overhead": { "cpu_pct": 0.001, "peak_rss_mb": 41.8 }
  },
  "cost": {
    "estimated_total_usd": 196.62,
    "hourly_rate_usd": 32.77,
    "pricing_tier": "on_demand"
  },
  "waste": {
    "gpu_idle_capacity_pct": 66.0,
    "estimated_avoidable_cost_usd": 129.77
  },
  "findings": [
    { "rule_id": "gpu_util_low", "severity": "high",
      "message": "GPU utilization p50 was 31%" },
    { "rule_id": "avoidable_gpu_cost_estimated", "severity": "high",
      "message": "Estimated avoidable cost: $129.77 (66% idle)" },
    { "rule_id": "gpu_memory_low", "severity": "medium",
      "message": "GPU memory peaked at 40% (p95)" }
  ]
}
Abbreviated. Full schema includes per-GPU breakdown, disk/network I/O, and all 7 findings with recommendations.
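Because the contract is versioned, a downstream CI check can stay small. The sketch below reads only the fields shown in the abbreviated example (`schema_version`, `findings`, `severity`). The `gate` and `load_summary` helpers are hypothetical, and treating a major-version change as breaking is an assumption about how the schema is versioned.

```python
import json


def load_summary(path):
    """Load a summary.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def gate(summary, fail_on=("high",), supported_major="1"):
    """Return the findings that should fail a CI job.

    Raises ValueError when the schema's major version is not the one
    this check was written against (an assumed compatibility policy).
    """
    version = summary.get("schema_version", "")
    if version.split(".")[0] != supported_major:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return [
        finding
        for finding in summary.get("findings", [])
        if finding.get("severity") in fail_on
    ]
```

A CI step could call `gate(load_summary("cost-governor-output/summary.json"))` and exit non-zero when the returned list is non-empty. The path assumes the default output directory mentioned under Usage.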
Design constraints
- Invisible to workload: <1% CPU, <50 MB RSS, no GPU contention, no network I/O during collection
- Fail-safe: Agent failures never propagate to the training job. Individual collector failures degrade gracefully.
- Single dependency: pynvml only. Everything else is Python stdlib + /proc reads. No numpy, pandas, or psutil.
- Deterministic output: Always produces summary.json and JSONL. Sorted keys for stable diffs. Atomic writes.
- Honest numbers: Never extrapolates. Cost clearly labeled as estimate. Findings show observed value and threshold.
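The "Deterministic output" constraint (sorted keys, atomic writes) is commonly met by writing to a temporary file in the target directory and renaming it into place. The following is a sketch of that general technique using only the standard library, not the tool's actual writer; `write_json_atomic` is a hypothetical name.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write `payload` as JSON with sorted keys, replacing `path` atomically.

    The temp file lives in the same directory as the target so that
    os.replace is a same-filesystem rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True, indent=2)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers of `summary.json` then see either the previous complete file or the new complete file, never a partially written one.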
Implementation choices
- Python: Fast iteration, low-friction adoption by ML infra teams who already live in Python
- pynvml over nvidia-smi: Direct GPU queries without subprocess overhead. Read-only metadata, no compute operations.
- /proc over psutil: Zero runtime dependencies for system metrics. Direct reads keep the install footprint minimal.
- Stable JSON contract: Versioned schema designed for CI pipelines and downstream automation, not just human reading.
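To illustrate the "/proc over psutil" choice, CPU utilization can be computed from two snapshots of the aggregate `cpu` line in `/proc/stat`. This is a sketch of the general approach, with hypothetical function names; counting iowait as idle time is an assumption, and the tool's own accounting may differ.

```python
def parse_cpu_times(proc_stat_text):
    """Return (total_jiffies, idle_jiffies) from the aggregate 'cpu' line.

    Field order: user nice system idle iowait irq softirq steal guest
    guest_nice. Guest time is already included in user/nice, so only
    the first eight fields are summed.
    """
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(value) for value in fields[1:]]
            # Assumption: iowait counts as idle time.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return total, idle
    raise ValueError("no aggregate 'cpu' line found")


def cpu_util_pct(before, after):
    """CPU utilization percentage between two (total, idle) snapshots."""
    total_delta = after[0] - before[0]
    idle_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0  # no elapsed time between snapshots
    return 100.0 * (total_delta - idle_delta) / total_delta
```

On Linux, each snapshot would come from reading `/proc/stat` once per sampling interval and passing the text to `parse_cpu_times`.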
226 tests. 100% coverage enforced. Test suite passes on any Linux machine without a GPU.