
Cost Governor

A lightweight compute-efficiency analysis agent for GPU training workloads.

Python  ·  GPU Telemetry  ·  ML Infrastructure  ·  FinOps

Samples GPU, CPU, memory, disk, and network telemetry once per second, estimates compute cost, and emits actionable findings that help teams identify idle capacity and underutilized GPUs. Designed to be invisible to the workload it monitors.

Inspired by infrastructure work that reduced a projected $6.5M AWS trajectory to $1.5M while scaling simulation throughput 10×. Read the case study →

The problem

GPU infrastructure is expensive, yet most ML workloads operate far below peak efficiency. A single 8×H100 node can cost $30+ per hour. Common issues like idle GPUs during data loading, inefficient batch sizing, CPU bottlenecks starving GPUs, and oversized instances waste thousands of dollars per training run.

Most teams discover these problems through billing surprises, not instrumentation. Cost Governor explores how lightweight telemetry and rule-based analysis can detect these inefficiencies automatically.

Architecture

[Architecture diagram: inside a container/node, the Cost Governor Agent (v1.2.0 API) wraps the Training Job, sampling via pynvml + /proc. It emits two output artifacts, a samples.jsonl JSONL stream and summary.json; a Report Generator consumes these to produce report.md with findings. The agent exits with the training job's return code.]

The agent wraps the training process as a subprocess, collects per-second telemetry from pynvml and /proc directly, and exits with the training job's return code. No frameworks, no heavy dependencies, no network I/O beyond a single AWS metadata check at startup.
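The wrapping pattern can be sketched in a few lines of stdlib Python (`run_wrapped` and `sample_fn` are illustrative names, not the agent's actual API):

```python
import subprocess
import sys
import time

def run_wrapped(cmd, sample_fn, interval=1.0):
    """Launch cmd, sample telemetry while it runs, return (exit code, samples)."""
    proc = subprocess.Popen(cmd)
    samples = []
    while proc.poll() is None:        # child still running
        samples.append(sample_fn())   # one telemetry reading per tick
        time.sleep(interval)
    return proc.returncode, samples
```

Propagating the child's return code unchanged (e.g. `sys.exit(rc)`) is what keeps the agent transparent to schedulers that key off job exit status.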

The output contract (summary.json) is designed to support cross-run aggregation, enabling future analysis of GPU efficiency across projects and teams.
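A minimal sketch of what such a contract could look like (the field names below are illustrative assumptions, not the actual schema), serialized with sorted keys so cross-run diffs and aggregation stay stable:

```python
import json

# Illustrative summary payload; the real schema's field names may differ.
summary = {
    "run_id": "example-run",
    "gpu_util_p50": 34.0,
    "gpu_util_p95": 88.0,
    "estimated_cost_usd": 12.40,   # labeled an estimate, per the design principles
    "findings": ["gpu_utilization_low"],
}

# Deterministic serialization: sorted keys yield byte-stable output.
payload = json.dumps(summary, sort_keys=True, indent=2)
```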

Stack

Python 3.9+  ·  pynvml (direct GPU queries, not nvidia-smi)  ·  /proc reads (not psutil)  ·  High test coverage enforced in CI  ·  CI templates for GitLab CI and GitHub Actions
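As an illustration of the /proc-over-psutil choice, memory utilization can be derived with stdlib-only parsing of /proc/meminfo (the helper name here is hypothetical):

```python
def meminfo_used_pct(text):
    """Compute used-memory % from /proc/meminfo-style text."""
    kb = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemAvailable"):
            kb[key] = int(rest.split()[0])   # values are reported in kB
    return round(100.0 * (1 - kb["MemAvailable"] / kb["MemTotal"]), 1)

# On a real node: meminfo_used_pct(open("/proc/meminfo").read())
sample = "MemTotal:       32768000 kB\nMemAvailable:   24576000 kB\n"
print(meminfo_used_pct(sample))  # -> 25.0
```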

Design principles

Invisible to workload
<1% CPU overhead, <50 MB RSS, no GPU contention, no network I/O during collection
Fail-safe
Never crashes the training job. Individual collector failures degrade gracefully without affecting others.
Minimal dependencies
pynvml + Python stdlib. No numpy, pandas, or heavy libraries.
Deterministic output
Always produces summary.json and JSONL. Sorted keys for stable diffs. Atomic writes.
Honest numbers
Never extrapolates. Cost is clearly labeled as an estimate. Each finding reports both the observed value and the threshold that triggered it.
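A sketch of how the fail-safe and deterministic-output principles might combine (collector names and the write path are illustrative): each collector runs inside its own try/except, so one failure degrades to a null field instead of crashing the job, and the summary lands via a temp-file rename.

```python
import json
import os
import tempfile

def collect_all(collectors):
    """Run each collector independently; a failure yields None, never an exception."""
    sample = {}
    for name, fn in collectors.items():
        try:
            sample[name] = fn()
        except Exception:      # degrade gracefully; other collectors still run
            sample[name] = None
    return sample

def write_atomic(path, data):
    """Write JSON with sorted keys via rename, so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, sort_keys=True)
    os.replace(tmp, path)      # atomic on POSIX filesystems
```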

Example finding

GPU utilization low (severity: high)

Observed: GPU utilization p50 = 34%

Threshold: <60% triggers finding

Recommendation: Investigate data loader throughput or batch sizing.
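The rule behind this finding could be sketched as follows; the threshold and field names mirror the example above, but the functions themselves are hypothetical:

```python
def p50(values):
    """Median via a simple sort; sufficient for per-second samples."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def gpu_util_finding(util_samples, threshold=60.0):
    """Return a finding dict when median GPU utilization falls below threshold."""
    observed = p50(util_samples)
    if observed >= threshold:
        return None
    return {
        "finding": "gpu_utilization_low",
        "observed_p50": observed,
        "threshold": threshold,
        "recommendation": "Investigate data loader throughput or batch sizing.",
    }
```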

Example report output

GPU: NVIDIA GeForce RTX 4090 ×1  ·  Duration: 5m 18s  ·  Agent overhead: 0.002% CPU / 41.2 MB RSS

Metric        Min    P50     P95     Max     Mean
GPU Util %    0.0    100.0   100.0   100.0   96.5
GPU Mem %     8.7    79.7    80.0    80.1    77.7
CPU %         0.0    3.8     6.4     16.5    4.1
Memory %      19.0   22.6    23.8    24.1    22.8

319 samples at 1.0s intervals
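The table's statistics can be reproduced from the JSONL samples with a small nearest-rank summary (a sketch; the agent's actual aggregation may differ):

```python
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy."""
    s = sorted(values)
    k = round(p / 100 * (len(s) - 1))
    return s[k]

def summarize(values):
    """Min/P50/P95/Max/Mean row for one metric, as in the report table."""
    return {
        "min": min(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "max": max(values),
        "mean": round(sum(values) / len(values), 1),
    }
```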

View source →