GPUFleet

Discrete-event GPU scheduler simulator for ML training workloads.

Python  ·  Discrete-Event Simulation  ·  GPU Scheduling  ·  ML Infrastructure

Simulates mixed ML workloads on heterogeneous GPU clusters and compares FIFO, Bin Packing, Cost-Aware, and Priority scheduling across utilization, cost, latency, fairness, and throughput. Deterministic, reproducible, and fully traced.

Built while exploring GPU scheduling tradeoffs for AI infrastructure, informed by prior work reducing a projected $6.5M AWS trajectory to $1.5M.

The problem

GPU clusters are expensive to operate, and scheduling strategy has a direct impact on utilization, cost, and job latency. But the tradeoffs between strategies are poorly understood because they depend on workload mix, cluster topology, and placement constraints.

GPUFleet provides a controlled environment to test these tradeoffs. Run the same workload through different schedulers, compare the outcomes, and inspect the decision trace to understand why.

Approach

A discrete-event simulation engine processes job arrivals, scheduling decisions, and completions using a min-heap event queue. The engine jumps between meaningful events rather than iterating through empty time. Time-stepped simulation becomes expensive and introduces artificial scheduling artifacts under sparse workloads; a discrete-event model keeps complexity proportional to actual system activity while preserving correctness.
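The event loop described above can be sketched in a few lines. This is a minimal illustration, not GPUFleet's actual engine: the `Event`, `EventQueue`, and `run` names are hypothetical, and the toy model assumes unlimited capacity so that every job starts the moment it arrives. It shows the two properties the paragraph relies on: the loop does work only when an event exists, and a sequence counter breaks ties so equal-time events pop in a deterministic order.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class Event:
    time: float
    seq: int  # tie-breaker: equal-time events pop in insertion order
    kind: str = field(compare=False)
    payload: Any = field(compare=False, default=None)


class EventQueue:
    """Min-heap of events ordered by (time, seq)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, time, kind, payload=None):
        heapq.heappush(self._heap, Event(time, next(self._counter), kind, payload))

    def pop(self):
        return heapq.heappop(self._heap)

    def __bool__(self):
        return bool(self._heap)


def run(arrivals):
    """arrivals: iterable of (arrival_time, job_id, duration).

    Returns the processed events as (time, kind, job_id) tuples.
    Toy assumption: capacity is unlimited, so a job starts on arrival.
    """
    queue = EventQueue()
    for t, job_id, duration in arrivals:
        queue.push(t, "arrival", (job_id, duration))

    log = []
    while queue:
        ev = queue.pop()  # jump straight to the next meaningful timestamp
        job_id, duration = ev.payload
        log.append((ev.time, ev.kind, job_id))
        if ev.kind == "arrival":
            queue.push(ev.time + duration, "completion", (job_id, duration))
    return log
```

Two jobs separated by a 5,000-second gap produce exactly four loop iterations, whereas a one-second time-stepped loop would iterate thousands of times over idle time.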

[Figure: architecture diagram with components Jobs, Event Queue (min-heap), Scheduler (pluggable), Cluster State, Trace Writer, and Metrics; outputs labeled .json and .jsonl]

Four pluggable scheduler strategies compete on the same synthetic workload. Each produces a full decision trace (JSONL) recording every scheduling pass: what jobs were pending, what placements were feasible, and what the scheduler chose and why.

FIFO
First-come first-served, first-fit placement. Simple baseline.
Bin Packing
Largest jobs first, pack onto fullest nodes. Reduces fragmentation at the cost of latency.
Cost-Aware
Evaluates all feasible placements, chooses the cheapest. Advantage grows with GPU cost variance.
Priority
High-priority jobs first. Maximizes throughput but starves low-priority work.
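One way to read the four strategies is as two small decisions: the order in which pending jobs are considered, and which feasible node is chosen. The sketch below is an interpretation of the descriptions above, not the project's code. The `Job` and `Node` types, the policy strings, and the field names are hypothetical; it handles single-node placement only, treats "fullest node" as the feasible node with the fewest idle GPUs, and assumes Cost-Aware considers jobs in arrival order and Priority uses first-fit placement, neither of which the descriptions specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    gpus: int
    priority: int  # assumption: higher number means more urgent
    arrival: float


@dataclass(frozen=True)
class Node:
    name: str
    idle_gpus: int
    cost_per_gpu_hour: float  # hypothetical price field


def order_jobs(policy, pending):
    """Return pending jobs in the order the policy would consider them."""
    if policy == "bin_packing":
        return sorted(pending, key=lambda j: (-j.gpus, j.arrival))  # largest first
    if policy == "priority":
        return sorted(pending, key=lambda j: (-j.priority, j.arrival))
    # "fifo" and, by assumption, "cost_aware": plain arrival order
    return sorted(pending, key=lambda j: j.arrival)


def pick_node(policy, job, nodes):
    """Return the node the policy would place the job on, or None if infeasible."""
    feasible = [n for n in nodes if n.idle_gpus >= job.gpus]
    if not feasible:
        return None
    if policy == "bin_packing":
        return min(feasible, key=lambda n: n.idle_gpus)  # fullest feasible node
    if policy == "cost_aware":
        return min(feasible, key=lambda n: n.cost_per_gpu_hour)  # cheapest feasible node
    return feasible[0]  # first fit
```

Keeping ordering and placement as separate functions makes it easy to see that the policies differ only in their sort key and selection rule, while the feasibility check is shared.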

Results

Showcase scenario: 64 GPUs (8 nodes, mixed A100 + H100), 150 jobs with Poisson arrivals, 50% same-node placement requirement.
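A workload of this shape can be generated reproducibly with a seeded random number generator. The sketch below is illustrative only: the function name, the dictionary fields, and the arrival rate are assumptions, since the scenario's actual rate and job-size distribution are not stated here. Poisson arrivals are produced by summing exponentially distributed inter-arrival gaps.

```python
import random


def generate_arrivals(n_jobs, rate_per_s, seed, same_node_fraction=0.5):
    """Generate jobs with Poisson arrivals (exponential inter-arrival gaps).

    rate_per_s is a hypothetical parameter: the mean number of arrivals per second.
    A private Random instance keeps the output independent of global RNG state.
    """
    rng = random.Random(seed)
    t = 0.0
    jobs = []
    for i in range(n_jobs):
        t += rng.expovariate(rate_per_s)
        jobs.append(
            {
                "job_id": f"job-{i}",
                "arrival": t,
                # each job independently requires same-node placement with this probability
                "same_node": rng.random() < same_node_fraction,
            }
        )
    return jobs
```

Calling `generate_arrivals(150, 0.01, seed=42)` twice returns identical lists, which is what lets every scheduler be compared on exactly the same workload.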

| Metric | FIFO | Bin Packing | Cost-Aware | Priority |
|---|---|---|---|---|
| GPU Utilization | 91.7% | 86.3% | 90.3% | 92.7% |
| Total Cost | $1,991 | $2,003 | $1,994 | $1,983 |
| Mean Latency | 6,206s | 7,466s | 6,302s | 6,083s |
| p95 Latency | 17,391s | 18,822s | 17,504s | 18,159s |
| Fairness (Jain) | 0.537 | 0.622 | 0.543 | 0.491 |
| Makespan | 23,505s | 24,955s | 23,855s | 23,229s |

No scheduler dominated. Priority delivered the best throughput and lowest mean latency, but at the cost of fairness. Bin Packing improved fairness by preserving larger contiguous placements, but increased latency. FIFO remained the simplest middle ground, while Cost-Aware behaved similarly to FIFO because placement constraints dominated cost differences in this scenario.

Key findings

Fragmentation dominated cost policy in this scenario.

Cost varied by only ~1% across schedulers ($1,983 to $2,003), while utilization moved by 6.4 percentage points and fairness varied materially (0.491 to 0.622). Under heavy same-node constraints, placement feasibility mattered more than cost optimization. Cost-aware scheduling becomes more meaningful once jobs are easily placeable and GPU price variance is higher.

High utilization does not imply efficient scheduling.

The cluster reported >90% utilization while simultaneously exhibiting stranded capacity. Most GPUs were busy, but the idle remainder was unusable for waiting jobs because of fragmentation. Utilization alone is a misleading health metric.

Scheduler policy determines how tradeoffs are expressed, not whether they exist.

All four schedulers completed the full workload, with cost, makespan, and utilization in a relatively narrow range despite meaningful differences in fairness and queue latency. Arrival rate and same-node constraints shaped outcomes more than policy choice here. Systematic parameter sweeps would be needed to generalize further.

Priority improves responsiveness at the cost of starvation.

Priority had the lowest mean latency (6,083s) but the worst Jain fairness index (0.491). Low-priority jobs accumulated severe queue times. This mirrors a common pattern in systems like Kubernetes and Slurm: priority improves responsiveness for urgent work but can starve lower-tier workloads under contention.
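For reference, Jain's fairness index over n values x_1..x_n is (sum of x_i)^2 / (n * sum of x_i^2). It equals 1.0 when all values are equal and approaches 1/n when a single value dominates. The sketch below implements that standard formula; which per-job quantity GPUFleet feeds into it (queue latency, slowdown, or something else) is not stated here, so the inputs in the example are purely illustrative.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    values = list(values)
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0:
        # Assumed convention: empty or all-zero input counts as perfectly fair.
        return 1.0
    return (total * total) / (n * sum_sq)
```

Four equal values give 1.0, while one non-zero value among four gives 0.25, so a single heavily delayed job pulls the index down sharply.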

Utilization over time

[Figure: GPU cluster utilization over time for all four scheduling strategies. Priority and FIFO hold about 92% until the queue drains, while Bin Packing dips lower.]

Priority and FIFO keep the cluster busier for longer, but the utilization chart alone hides fairness and fragmentation effects. Higher sustained utilization does not automatically imply better scheduling outcomes.

Latency distribution

[Figure: box plot of queue latency distribution. FIFO and Priority show tighter distributions, while Bin Packing has the highest median latency.]

Priority lowers median latency but creates the worst fairness score, showing how responsiveness for urgent jobs can come at the cost of starvation for lower-priority work.

Why FIFO stalls despite free GPUs

Every scheduling pass is recorded as a JSONL record. This makes it possible to answer questions like "why did FIFO fragment here?" with concrete evidence rather than intuition.

FIFO at t=12,399s: 5 GPUs idle, 4 jobs waiting, 0 placed

{
  "schema_version": "1.0",
  "time": 12398.6,
  "scheduler": "fifo",
  "cluster_state": {
    "total_gpus": 64,
    "idle_gpus": 5,
    "idle_gpus_by_node": { "node-5": 1, "node-6": 4 }
  },
  "candidates": [
    {
      "job_id": "job-15",
      "feasible": false,
      "reason": "requires 8 GPUs with >=40.0GB; only 5 available"
    },
    {
      "job_id": "job-25",
      "feasible": false,
      "reason": "requires 8 GPUs with >=40.0GB; only 5 available"
    }
  ],
  "decisions": []
}

The 5 idle GPUs are split across 2 nodes (1 on node-5, 4 on node-6). The waiting jobs shown here each need 8 GPUs on a single node, so no feasible placement exists. In constrained GPU clusters, nominal free capacity and schedulable capacity are not the same thing.
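The gap between nominal and schedulable capacity can be computed directly from a trace record. The sketch below uses the field names from the record shown above (`cluster_state`, `idle_gpus`, `idle_gpus_by_node`, `candidates`, `decisions`); the helper functions themselves are hypothetical and are not part of GPUFleet.

```python
import json


def schedulable_gpus(idle_by_node, same_node):
    """Largest request that could be placed right now.

    With a same-node requirement, only the single emptiest node counts;
    without one, idle GPUs on all nodes can be pooled.
    """
    if not idle_by_node:
        return 0
    if same_node:
        return max(idle_by_node.values())
    return sum(idle_by_node.values())


def is_stalled(record):
    """True when GPUs are idle and jobs are waiting, yet nothing was placed."""
    return (
        record["cluster_state"]["idle_gpus"] > 0
        and bool(record["candidates"])
        and not record["decisions"]
    )


def stalled_passes(lines):
    """Yield stalled scheduling passes from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if is_stalled(record):
            yield record
```

For the record above, nominal free capacity is 5 GPUs but same-node schedulable capacity is only 4, and both fall short of the 8 GPUs each waiting job requests. Passing an open trace file to `stalled_passes` would list every pass where this happened.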

Stack

Python 3.10+  ·  heapq-based discrete-event simulation  ·  matplotlib for charts  ·  PyYAML for scenario config  ·  stdlib-only simulation core for deterministic behavior  ·  124 tests, 98% coverage
