Project

Cost Governor

Shows when ML training jobs are silently wasting GPU spend.

Python · GPU Telemetry · Cost Estimation · ML Infrastructure

Wraps a training command, samples per-second GPU, CPU, memory, disk, and network telemetry, estimates compute cost, and reports idle GPU capacity, estimated avoidable cost, and likely bottlenecks. One CLI command, one stable JSON contract, zero infrastructure.

Informed by infrastructure work that reduced a projected $6.5M AWS trajectory to $1.5M. Read the case study →

The problem

Most training runs don't fail. They become quietly expensive. GPUs sit half-idle for hours while you pay full price. A single 8×A100 node at $32.77/hr can burn about $130 in avoidable cost on a 6-hour run if GPU utilization averages 34%.

Most teams discover this through billing surprises, not instrumentation. Cost Governor detects these inefficiencies automatically at the run level.

Usage

pip install .
cost-governor monitor -- python train.py

Results are written to ./cost-governor-output/. No config files. No infrastructure.

Who it's for

Not for v1: cluster-wide monitoring, distributed multi-node attribution, real-time dashboards.

Wasteful run example

6-hour training job on 8×A100. GPU utilization averaged 34%. About $130 of the $197 cost appears to be idle GPU capacity.

p4d.24xlarge · NVIDIA A100-SXM4-40GB ×8 · 6h 0m · Agent overhead: 0.001% CPU / 41.8 MB RSS
Metric        Min    P50    P95    Max   Mean
GPU Util %    0.0   31.0   72.0   82.0   34.0
GPU Mem %     8.0   27.0   40.0   42.0   28.5
CPU %         2.0   15.0   38.0   45.0   18.0
Memory %     12.0   23.0   32.0   35.0   24.0

21,600 samples at 1.0s intervals
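The columns in the table above can be derived from raw samples with the standard library alone. The following is a minimal sketch, assuming a nearest-rank percentile; the page does not say which percentile method Cost Governor uses, and the function names and sample values here are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(values):
    """Return the five statistics shown in the table for one metric."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
        "mean": mean(ordered),
    }


# Hypothetical GPU utilization samples (percent), one per second.
gpu_util = [0.0, 12.0, 25.0, 31.0, 31.0, 40.0, 55.0, 72.0, 80.0, 82.0]
stats = summarize(gpu_util)
```

Nearest-rank always returns an observed sample rather than an interpolated value, which suits a report that shows only observed numbers.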

Estimated total: $196.62 (p4d.24xlarge at $32.77/hr on-demand)

Estimated avoidable cost: $129.77 (66% idle GPU capacity)
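The figures above are consistent with a simple model: total cost is the hourly rate times the run duration, and avoidable cost is the total scaled by idle GPU capacity (100% minus mean GPU utilization). The sketch below reproduces the numbers under that assumption. The function name is hypothetical, and the tool's actual formula may differ.

```python
def estimate_cost(hourly_rate_usd, duration_seconds, mean_gpu_util_pct):
    """Estimate total and avoidable cost for one run.

    Assumption: avoidable cost = total cost * idle GPU capacity,
    where idle capacity = 100 - mean GPU utilization.
    """
    hours = duration_seconds / 3600
    total = hourly_rate_usd * hours
    idle_pct = 100.0 - mean_gpu_util_pct
    avoidable = total * idle_pct / 100
    return {
        "estimated_total_usd": round(total, 2),
        "gpu_idle_capacity_pct": idle_pct,
        "estimated_avoidable_cost_usd": round(avoidable, 2),
    }


# The 6-hour p4d.24xlarge run from the example: $32.77/hr, 34% mean GPU util.
result = estimate_cost(32.77, 21600.0, 34.0)
```

The dictionary keys mirror the field names in the summary.json excerpt further down the page.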

Findings

High: GPU utilization p50 was 31%. Sustained accelerator underutilization observed.

High: Estimated avoidable cost: $129.77. 66% idle GPU capacity at on-demand rates.

Med: GPU memory peaked at 40% (p95). Batch size could likely increase.

Bottleneck detection

Cost Governor goes beyond "GPU utilization was low." When CPU and GPU metrics tell a coherent story, the tool identifies the constraint.

p3.8xlarge · Tesla V100 ×2 · 30m 0s

CPU p95: 96% · GPU mean: 52% · GPU 0: 68% · GPU 1: 36%

Med: CPU p95 was 96%. CPU may be a bottleneck.

Med: GPU imbalance: utilization ranged from 36% to 68%.

Med: CPU-GPU anti-correlation detected. Host-side constraint likely.

Taken together, the findings tell a coherent story: the host appears to be constraining GPU throughput. The pattern is consistent with a host-side bottleneck, not a GPU-side compute limit.
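The two multi-signal checks in this example can be sketched with the standard library. This is an illustration, not the tool's implementation: it assumes Pearson correlation for the anti-correlation check and a max-minus-min spread for imbalance. The thresholds (r ≤ -0.5, spread > 25pp) come from the rules table on this page, while MIN_SAMPLES, the function names, and the sample series are hypothetical.

```python
import math

MIN_SAMPLES = 5  # hypothetical gate; the real minimum is not stated


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def is_anticorrelated(cpu, gpu, threshold=-0.5):
    """True when CPU and GPU utilization move in opposite directions."""
    if len(cpu) != len(gpu) or len(cpu) < MIN_SAMPLES:
        return False  # too little data to make a claim
    return pearson_r(cpu, gpu) <= threshold


def gpu_spread_pp(per_gpu_util):
    """Spread between busiest and idlest GPU, in percentage points."""
    return max(per_gpu_util) - min(per_gpu_util)


# Hypothetical series: GPU utilization drops whenever the CPU saturates.
cpu = [60.0, 75.0, 90.0, 96.0, 70.0, 95.0]
gpu = [70.0, 55.0, 40.0, 35.0, 62.0, 38.0]
host_bound = is_anticorrelated(cpu, gpu)
spread = gpu_spread_pp([68.0, 36.0])  # the two V100s above
```

Gating on a minimum sample count keeps a short or noisy run from producing a spurious correlation finding.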

Architecture

Container / Node
  Training Job: wrapped by Cost Governor
  Agent: pynvml (GPU) + /proc (CPU, mem, disk, net)
  Output artifacts: samples.jsonl (raw telemetry), summary.json (v1.3.0 API)
  Report Generator: report.md (findings + cost)
  7 Detection Rules: threshold rules + correlation checks, waste estimate + GPU imbalance
  Exit with training job return code

The wrapper launches the training process, samples per-second telemetry during the run, and returns the training job's exit code on completion. Telemetry comes from pynvml (GPU) and /proc (CPU, memory, disk I/O, network). No frameworks, no heavy dependencies, no network I/O beyond a single AWS metadata check at startup.
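The wrap-sample-exit flow described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: run_with_sampling and read_load_average are hypothetical names, the sampler runs in a background thread, and /proc/loadavg stands in for the fuller set of collectors.

```python
import subprocess
import sys
import threading
import time


def run_with_sampling(command, sample_fn, interval_s=1.0):
    """Run `command`, call `sample_fn` periodically, return (exit_code, samples)."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            try:
                samples.append({"ts": time.time(), **sample_fn()})
            except Exception:
                pass  # a failing collector must never affect the training job
            stop.wait(interval_s)

    process = subprocess.Popen(command)
    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        return_code = process.wait()
    finally:
        stop.set()
        thread.join()
    return return_code, samples


def read_load_average():
    """Example collector: 1-minute load average from /proc (Linux only)."""
    with open("/proc/loadavg") as f:
        return {"load_1m": float(f.read().split()[0])}


# Stand-in for `python train.py`; a real wrapper would end with sys.exit(exit_code).
exit_code, samples = run_with_sampling(
    [sys.executable, "-c", "print('training...')"],
    read_load_average,
    interval_s=0.1,
)
```

Swallowing collector exceptions inside the sampling loop is what keeps a telemetry failure from changing the job's outcome.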

The output contract (summary.json v1.3.0) is the stable API. Everything downstream reads from it: reports, CI checks, future integrations.

Detection rules

Seven rules covering threshold violations, multi-GPU imbalance, CPU-GPU anti-correlation, and waste estimation. Advanced rules require minimum sample counts and run duration before firing.

Rule                           Severity   Trigger
gpu_util_low                   high       GPU utilization p50 < 60%
avoidable_gpu_cost_estimated   high       Estimated avoidable cost ≥ $1
memory_high                    high       System memory p95 > 90%
gpu_memory_low                 medium     GPU memory p95 < 60%
cpu_util_high                  medium     CPU utilization p95 > 90%
gpu_imbalance_detected         medium     Per-GPU utilization spread > 25pp
cpu_gpu_anticorrelation        medium     CPU-GPU anti-correlation (r ≤ -0.5)
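One way to express the table above is as data: each rule pairs a metric with a comparison and a threshold. The sketch below takes the rule IDs, severities, and thresholds from the table; the metric key names and the evaluate function are hypothetical, and the minimum-sample gating is omitted.

```python
import operator

# (rule_id, severity, metric key, comparison, threshold)
RULES = [
    ("gpu_util_low", "high", "gpu_util_p50", operator.lt, 60.0),
    ("avoidable_gpu_cost_estimated", "high", "avoidable_usd", operator.ge, 1.0),
    ("memory_high", "high", "memory_p95", operator.gt, 90.0),
    ("gpu_memory_low", "medium", "gpu_mem_p95", operator.lt, 60.0),
    ("cpu_util_high", "medium", "cpu_p95", operator.gt, 90.0),
    ("gpu_imbalance_detected", "medium", "gpu_spread_pp", operator.gt, 25.0),
    ("cpu_gpu_anticorrelation", "medium", "cpu_gpu_r", operator.le, -0.5),
]


def evaluate(metrics):
    """Return one finding per rule whose trigger condition holds."""
    findings = []
    for rule_id, severity, key, compare, threshold in RULES:
        observed = metrics.get(key)
        if observed is None:
            continue  # metric not collected; skip rather than guess
        if compare(observed, threshold):
            findings.append({
                "rule_id": rule_id,
                "severity": severity,
                "observed": observed,
                "threshold": threshold,
            })
    return findings


# Metrics from the 8×A100 example run on this page.
metrics = {
    "gpu_util_p50": 31.0,
    "gpu_mem_p95": 40.0,
    "cpu_p95": 38.0,
    "memory_p95": 32.0,
    "avoidable_usd": 129.77,
}
findings = evaluate(metrics)
fired = [f["rule_id"] for f in findings]
```

Each finding carries both the observed value and the threshold, so a reader can see why the rule fired.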

Output contract (summary.json)

{
  "schema_version": "1.3.0",
  "run": {
    "instance_type": "p4d.24xlarge",
    "gpu_model": "NVIDIA A100-SXM4-40GB",
    "gpu_count": 8,
    "duration_seconds": 21600.0,
    "agent_overhead": { "cpu_pct": 0.001, "peak_rss_mb": 41.8 }
  },
  "cost": {
    "estimated_total_usd": 196.62,
    "hourly_rate_usd": 32.77,
    "pricing_tier": "on_demand"
  },
  "waste": {
    "gpu_idle_capacity_pct": 66.0,
    "estimated_avoidable_cost_usd": 129.77
  },
  "findings": [
    { "rule_id": "gpu_util_low", "severity": "high",
      "message": "GPU utilization p50 was 31%" },
    { "rule_id": "avoidable_gpu_cost_estimated", "severity": "high",
      "message": "Estimated avoidable cost: $129.77 (66% idle)" },
    { "rule_id": "gpu_memory_low", "severity": "medium",
      "message": "GPU memory peaked at 40% (p95)" }
  ]
}

Abbreviated. Full schema includes per-GPU breakdown, disk/network I/O, and all 7 findings with recommendations.
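Because the contract is versioned JSON, a CI check can be a few lines of standard-library Python. The sketch below reads only fields that appear in the excerpt above; check_budget and the $50 budget are hypothetical, and the summary is inlined here where a real pipeline would load the file from the output directory.

```python
import json


def check_budget(summary, max_avoidable_usd):
    """Return human-readable violations; an empty list means the run passes."""
    problems = []
    avoidable = summary["waste"]["estimated_avoidable_cost_usd"]
    if avoidable > max_avoidable_usd:
        problems.append(
            f"avoidable cost ${avoidable:.2f} exceeds budget ${max_avoidable_usd:.2f}"
        )
    for finding in summary.get("findings", []):
        if finding["severity"] == "high":
            problems.append(f"{finding['rule_id']}: {finding['message']}")
    return problems


# Trimmed copy of the excerpt above.
summary = json.loads("""
{
  "schema_version": "1.3.0",
  "waste": {"gpu_idle_capacity_pct": 66.0, "estimated_avoidable_cost_usd": 129.77},
  "findings": [
    {"rule_id": "gpu_util_low", "severity": "high",
     "message": "GPU utilization p50 was 31%"},
    {"rule_id": "avoidable_gpu_cost_estimated", "severity": "high",
     "message": "Estimated avoidable cost: $129.77 (66% idle)"},
    {"rule_id": "gpu_memory_low", "severity": "medium",
     "message": "GPU memory peaked at 40% (p95)"}
  ]
}
""")
problems = check_budget(summary, max_avoidable_usd=50.0)
```

A CI job could exit non-zero whenever the returned list is non-empty.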

Design constraints

Invisible to workload: <1% CPU, <50 MB RSS, no GPU contention, no network I/O during collection.
Fail-safe: Agent failures never propagate to the training job. Individual collector failures degrade gracefully.
Single dependency: pynvml only. Everything else is Python stdlib + /proc reads. No numpy, pandas, or psutil.
Deterministic output: Always produces summary.json and JSONL. Sorted keys for stable diffs. Atomic writes.
Honest numbers: Never extrapolates. Cost clearly labeled as estimate. Findings show observed value and threshold.
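The "sorted keys" and "atomic writes" constraints map onto a common standard-library pattern: serialize with sort_keys=True into a temporary file in the target directory, then swap it into place with os.replace. The sketch below shows that pattern; write_json_atomic is a hypothetical name, and the project's own writer may differ.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write `payload` as key-sorted JSON so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2, sort_keys=True)
            f.write("\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader of the output path sees either the previous complete file or the new complete file, and sorted keys keep diffs between runs limited to changed values.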

Implementation choices

Python: Fast iteration, low-friction adoption by ML infra teams who already live in Python.
pynvml over nvidia-smi: Direct GPU queries without subprocess overhead. Read-only metadata, no compute operations.
/proc over psutil: Zero runtime dependencies for system metrics. Direct reads keep the install footprint minimal.
Stable JSON contract: Versioned schema designed for CI pipelines and downstream automation, not just human reading.
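To illustrate the /proc-over-psutil choice, CPU utilization can be computed from two snapshots of /proc/stat with no third-party code. This is a sketch of the general technique with hypothetical function names, not the project's collector. It treats idle plus iowait as idle time, which is one common convention.

```python
def parse_cpu_times(stat_text):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(v) for v in line.split()[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            total = sum(fields[:8])  # guest time is already counted in user/nice
            return idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before_text, after_text):
    """Busy CPU percentage between two /proc/stat snapshots."""
    idle_a, total_a = parse_cpu_times(before_text)
    idle_b, total_b = parse_cpu_times(after_text)
    delta_total = total_b - total_a
    if delta_total <= 0:
        return 0.0  # no ticks elapsed between snapshots
    return 100.0 * (1 - (idle_b - idle_a) / delta_total)


# Hypothetical snapshots taken one sampling interval apart.
before = "cpu  100 0 100 700 100 0 0 0 0 0\ncpu0 100 0 100 700 100 0 0 0 0 0\n"
after = "cpu  190 0 110 780 120 0 0 0 0 0\ncpu0 190 0 110 780 120 0 0 0 0 0\n"
busy = cpu_percent(before, after)
```

On a live system the two snapshots would come from reading /proc/stat at consecutive sampling ticks.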

226 tests. 100% coverage enforced. Test suite passes on any Linux machine without a GPU.

View source →