Project

Cost Governor

Shows when ML training jobs are silently wasting GPU spend.

Python · GPU Telemetry · Cost Estimation · ML Infrastructure

Wraps a training command, samples per-second GPU, CPU, memory, disk, and network telemetry, estimates compute cost, and reports idle GPU capacity, estimated avoidable cost, and likely bottlenecks. One CLI command, one stable JSON contract, zero infrastructure.

Informed by infrastructure work that reduced a projected $6.5M AWS trajectory to $1.5M. Read the case study →

The problem

Most training runs don't fail. They become quietly expensive. GPUs sit half-idle for hours while you pay full price. A single 8×A100 node at $32.77/hr can burn $130+ in avoidable cost on a 6-hour run if GPU utilization averages 34%.

Most teams discover this through billing surprises, not instrumentation. Cost Governor detects these inefficiencies automatically at the run level.

Usage

pip install .
cost-governor monitor -- python train.py

Results are written to ./cost-governor-output/. No config files. No infrastructure.

Who it's for

Engineers running PyTorch or TensorFlow jobs on single-node GPU instances
ML infra teams trying to understand per-run cost and GPU efficiency
Teams debugging slow or expensive training runs in CI or ad hoc environments

Not for v1: cluster-wide monitoring, distributed multi-node attribution, real-time dashboards.

Wasteful run example

6-hour training job on 8×A100. GPU utilization averaged 34%. About $130 of the $197 cost appears to be idle GPU capacity.

p4d.24xlarge · NVIDIA A100-SXM4-40GB ×8 · 6h 0m · Agent overhead: 0.001% CPU / 41.8 MB RSS

Metric	Min	P50	P95	Max	Mean
GPU Util %	0.0	31.0	72.0	82.0	34.0
GPU Mem %	8.0	27.0	40.0	42.0	28.5
CPU %	2.0	15.0	38.0	45.0	18.0
Memory %	12.0	23.0	32.0	35.0	24.0

21,600 samples at 1.0s intervals
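A row of the table above can be derived from raw samples with the standard library alone. The following is a minimal sketch, assuming nearest-rank percentiles; the page does not say which percentile method the tool uses, and the function names and sample values are illustrative only.

```python
import math
from statistics import fmean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list.

    Assumption: nearest-rank is one common definition; the tool may use another.
    """
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples):
    """Return the Min / P50 / P95 / Max / Mean values for one metric."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
        "mean": round(fmean(ordered), 1),
    }


# Made-up GPU utilization samples, not data from a real run.
gpu_util = [0.0, 12.0, 31.0, 45.0, 82.0]
print(summarize(gpu_util))
```

With 21,600 samples per metric, sorting once per metric at the end of the run keeps the per-second sampling path free of any statistics work.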

Estimated total: $196.62 (p4d.24xlarge at $32.77/hr on-demand)

Estimated avoidable cost: $129.77 (66% idle GPU capacity)
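The figures above are consistent with a simple calculation: total cost is the hourly rate times the run duration, and avoidable cost is that total scaled by the idle share of GPU capacity (100% minus mean utilization). The sketch below reproduces the numbers under that assumption. The formula is inferred from the reported values, not taken from the tool's source, and the function name is hypothetical.

```python
def estimate_cost(hourly_rate_usd, duration_seconds, mean_gpu_util_pct):
    """Estimate total and avoidable cost for one run.

    Assumption: avoidable cost = total cost * idle fraction, where the idle
    fraction is derived from mean GPU utilization.
    """
    hours = duration_seconds / 3600
    total = hourly_rate_usd * hours
    idle_pct = 100.0 - mean_gpu_util_pct
    avoidable = total * idle_pct / 100
    return {
        "estimated_total_usd": round(total, 2),
        "gpu_idle_capacity_pct": round(idle_pct, 1),
        "estimated_avoidable_cost_usd": round(avoidable, 2),
    }


# p4d.24xlarge on-demand rate, 6-hour run, 34% mean GPU utilization.
print(estimate_cost(32.77, 21600.0, 34.0))
# {'estimated_total_usd': 196.62, 'gpu_idle_capacity_pct': 66.0,
#  'estimated_avoidable_cost_usd': 129.77}
```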

Findings

Severity	Finding
High	GPU utilization p50 was 31%. Sustained accelerator underutilization observed.
High	Estimated avoidable cost: $129.77. 66% idle GPU capacity at on-demand rates.
Med	GPU memory peaked at 40% (p95). Batch size could likely increase.

Bottleneck detection

Cost Governor goes beyond "GPU utilization was low." When CPU and GPU metrics tell a coherent story, the tool identifies the constraint.

p3.8xlarge · Tesla V100 ×2 · 30m 0s

Metric	Value
CPU p95	96%
GPU mean	52%
GPU 0	68%
GPU 1	36%

Severity	Finding
Med	CPU p95 was 96%. CPU may be a bottleneck.
Med	GPU imbalance: utilization ranged from 36% to 68%.
Med	CPU-GPU anti-correlation detected. Host-side constraint likely.

Taken together, the findings tell a coherent story: the host appears to be constraining GPU throughput, a pattern consistent with a host-side bottleneck rather than a GPU-side compute limit.
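An anti-correlation check of this kind can be expressed as a Pearson correlation between the per-second CPU and GPU utilization series. This is a minimal stdlib-only sketch; whether the tool uses Pearson's r specifically is an assumption, and the sample series below are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Invented samples: GPU utilization drops whenever the CPU saturates.
cpu = [95.0, 97.0, 60.0, 96.0, 55.0, 98.0]
gpu = [40.0, 35.0, 80.0, 38.0, 85.0, 30.0]
r = pearson_r(cpu, gpu)
print(r <= -0.5)  # True: strongly anti-correlated
```

A strongly negative r means the GPU tends to be least busy exactly when the CPU is most busy, which is the signature of host-side work (data loading, preprocessing) starving the accelerator.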

Architecture

The wrapper launches the training process, samples per-second telemetry during the run, and returns the training job's exit code on completion. Telemetry comes from pynvml (GPU) and /proc (CPU, memory, disk I/O, network). No frameworks, no heavy dependencies, no network I/O beyond a single AWS metadata check at startup.
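The wrapper pattern described above can be sketched as a child process plus a background sampling thread. This is an illustration of the general technique, not the project's actual code: `run_with_sampling` and `sample_fn` are hypothetical names, and a real collector would read pynvml and /proc instead of the clock used in the demo.

```python
import subprocess
import sys
import threading
import time


def run_with_sampling(command, sample_fn, interval_s=1.0):
    """Run `command`, call `sample_fn()` every `interval_s` seconds,
    and return (child exit code, collected samples)."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            try:
                samples.append(sample_fn())
            except Exception:
                pass  # a failing collector must never disturb the training job
            stop.wait(interval_s)

    proc = subprocess.Popen(command)
    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        exit_code = proc.wait()
    finally:
        stop.set()
        thread.join()
    return exit_code, samples


if __name__ == "__main__":
    # Stand-in for a training command: sleeps briefly, then exits with code 3.
    child = [sys.executable, "-c", "import time, sys; time.sleep(0.3); sys.exit(3)"]
    code, collected = run_with_sampling(child, time.monotonic, interval_s=0.1)
    print(code, len(collected) >= 1)
```

Returning the child's exit code unchanged lets the wrapper sit in front of an existing CI step without altering its pass/fail behavior.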

The output contract (summary.json v1.3.0) is the stable API. Everything downstream reads from it: reports, CI checks, future integrations.

Detection rules

Seven rules covering threshold violations, multi-GPU imbalance, CPU-GPU anti-correlation, and waste estimation. Advanced rules require minimum sample counts and run duration before firing.

Rule	Severity	Trigger
gpu_util_low	high	GPU utilization p50 < 60%
avoidable_gpu_cost_estimated	high	Estimated avoidable cost ≥ $1
memory_high	high	System memory p95 > 90%
gpu_memory_low	medium	GPU memory p95 < 60%
cpu_util_high	medium	CPU utilization p95 > 90%
gpu_imbalance_detected	medium	Per-GPU utilization spread > 25pp
cpu_gpu_anticorrelation	medium	CPU-GPU anti-correlation (r ≤ -0.5)
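The threshold rules in the table lend themselves to a data-driven evaluator. The sketch below transcribes the seven triggers and applies them to a dictionary of precomputed metrics. The metric key names (`gpu_util_p50`, `cpu_gpu_r`, and so on) are hypothetical, and the minimum-sample and minimum-duration gating mentioned above is omitted for brevity.

```python
import operator

# (rule_id, severity, metric key, comparison, threshold), from the table above.
# Metric key names are illustrative assumptions, not the tool's schema.
RULES = [
    ("gpu_util_low", "high", "gpu_util_p50", operator.lt, 60.0),
    ("avoidable_gpu_cost_estimated", "high", "avoidable_cost_usd", operator.ge, 1.0),
    ("memory_high", "high", "memory_p95", operator.gt, 90.0),
    ("gpu_memory_low", "medium", "gpu_mem_p95", operator.lt, 60.0),
    ("cpu_util_high", "medium", "cpu_p95", operator.gt, 90.0),
    ("gpu_imbalance_detected", "medium", "gpu_util_spread_pp", operator.gt, 25.0),
    ("cpu_gpu_anticorrelation", "medium", "cpu_gpu_r", operator.le, -0.5),
]


def evaluate(metrics):
    """Return one finding per rule whose trigger condition holds."""
    findings = []
    for rule_id, severity, key, compare, threshold in RULES:
        observed = metrics.get(key)
        if observed is None:
            continue  # metric not collected for this run, so the rule cannot fire
        if compare(observed, threshold):
            findings.append({
                "rule_id": rule_id,
                "severity": severity,
                "observed": observed,
                "threshold": threshold,
            })
    return findings


# Values from the wasteful run example earlier on this page.
wasteful_run = {
    "gpu_util_p50": 31.0,
    "avoidable_cost_usd": 129.77,
    "memory_p95": 32.0,
    "gpu_mem_p95": 40.0,
    "cpu_p95": 38.0,
}
print([f["rule_id"] for f in evaluate(wasteful_run)])
# ['gpu_util_low', 'avoidable_gpu_cost_estimated', 'gpu_memory_low']
```

Each finding carries both the observed value and the threshold, so a reader can see how far past the trigger the run was.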

Output contract (summary.json)

{
  "schema_version": "1.3.0",
  "run": {
    "instance_type": "p4d.24xlarge",
    "gpu_model": "NVIDIA A100-SXM4-40GB",
    "gpu_count": 8,
    "duration_seconds": 21600.0,
    "agent_overhead": { "cpu_pct": 0.001, "peak_rss_mb": 41.8 }
  },
  "cost": {
    "estimated_total_usd": 196.62,
    "hourly_rate_usd": 32.77,
    "pricing_tier": "on_demand"
  },
  "waste": {
    "gpu_idle_capacity_pct": 66.0,
    "estimated_avoidable_cost_usd": 129.77
  },
  "findings": [
    { "rule_id": "gpu_util_low", "severity": "high",
      "message": "GPU utilization p50 was 31%" },
    { "rule_id": "avoidable_gpu_cost_estimated", "severity": "high",
      "message": "Estimated avoidable cost: $129.77 (66% idle)" },
    { "rule_id": "gpu_memory_low", "severity": "medium",
      "message": "GPU memory peaked at 40% (p95)" }
  ]
}

Abbreviated. The full schema includes a per-GPU breakdown, disk/network I/O, and every finding with its recommendation.
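Because the contract is plain JSON, a CI check can be a few lines of Python. The sketch below fails a build when the summary contains high-severity findings or when avoidable cost exceeds a budget. It relies only on fields shown in the abbreviated example; the `gate` function, the `max_avoidable_usd` budget, and the file location in the usage note are assumptions.

```python
import json
from pathlib import Path


def load_summary(path):
    """Read a summary.json file into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def gate(summary, max_avoidable_usd=50.0):
    """Return human-readable reasons to fail the build (empty list = pass)."""
    reasons = []
    for finding in summary.get("findings", []):
        if finding.get("severity") == "high":
            reasons.append(f"{finding['rule_id']}: {finding['message']}")
    avoidable = summary.get("waste", {}).get("estimated_avoidable_cost_usd", 0.0)
    if avoidable > max_avoidable_usd:
        reasons.append(f"avoidable cost ${avoidable:.2f} exceeds ${max_avoidable_usd:.2f}")
    return reasons


# Inline excerpt of the example above, so the sketch runs without a file.
example = {
    "waste": {"estimated_avoidable_cost_usd": 129.77},
    "findings": [
        {"rule_id": "gpu_util_low", "severity": "high",
         "message": "GPU utilization p50 was 31%"},
        {"rule_id": "gpu_memory_low", "severity": "medium",
         "message": "GPU memory peaked at 40% (p95)"},
    ],
}
for reason in gate(example):
    print(reason)
```

In a pipeline, one would call `gate(load_summary(...))` on the file in the output directory (assumed here to be ./cost-governor-output/summary.json) and exit non-zero when the returned list is not empty.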

Design constraints

Invisible to workload: <1% CPU, <50 MB RSS, no GPU contention, no network I/O during collection
Fail-safe: Agent failures never propagate to the training job. Individual collector failures degrade gracefully.
Single dependency: pynvml only. Everything else is Python stdlib + /proc reads. No numpy, pandas, or psutil.
Deterministic output: Always produces summary.json and JSONL. Sorted keys for stable diffs. Atomic writes.
Honest numbers: Never extrapolates. Cost clearly labeled as estimate. Findings show observed value and threshold.
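The deterministic-output constraint (sorted keys, atomic writes) maps onto a well-known pattern: serialize to a temporary file in the same directory, then rename it over the target. This is a sketch of that general technique, not the project's implementation; `write_json_atomic` is a hypothetical name.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write `payload` as JSON with sorted keys, replacing `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True, indent=2)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as out_dir:
    target = os.path.join(out_dir, "summary.json")
    write_json_atomic(target, {"schema_version": "1.3.0", "cost": {"hourly_rate_usd": 32.77}})
    with open(target, encoding="utf-8") as handle:
        print(handle.read())
```

Sorted keys mean two runs with identical metrics produce byte-identical files, which keeps diffs between runs limited to values that actually changed.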

Implementation choices

Python: Fast iteration, low-friction adoption by ML infra teams who already live in Python
pynvml over nvidia-smi: Direct GPU queries without subprocess overhead. Read-only metadata, no compute operations.
/proc over psutil: Zero runtime dependencies for system metrics. Direct reads keep the install footprint minimal.
Stable JSON contract: Versioned schema designed for CI pipelines and downstream automation, not just human reading.
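To illustrate the /proc approach, CPU utilization can be computed from two snapshots of the aggregate `cpu` line in /proc/stat without any third-party package. This is a sketch of the standard technique; the function names are hypothetical and the snapshot strings are fabricated so the example runs on any machine.

```python
def parse_cpu_times(proc_stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal guest guest_nice
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])  # guest time is already included in user/nice
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before, after):
    """CPU utilization in percent between two (busy, total) snapshots."""
    busy_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * busy_delta / total_delta


# Fabricated snapshots; on Linux these would come from open("/proc/stat").read().
snapshot_1 = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
snapshot_2 = "cpu  190 0 60 1700 50 0 0 0 0 0\ncpu0 190 0 60 1700 50 0 0 0 0 0\n"
print(cpu_percent(parse_cpu_times(snapshot_1), parse_cpu_times(snapshot_2)))  # 10.0
```

Utilization is computed from the difference between snapshots because the counters in /proc/stat are cumulative since boot.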

226 tests. 100% coverage enforced. Test suite passes on any Linux machine without a GPU.
