Case Study · Platform Engineering

Cutting a projected $6.5M AWS trajectory to $1.5M while scaling Monte Carlo throughput 10×

A platform architecture turnaround for a mission-critical aerospace program: lower spend, higher throughput, reliability that held across certification-scale campaigns.

Figures are materially accurate but generalized to protect proprietary program details.

$6.5M Projected Spend
$1.5M Actual Spend
77% Program Cost Reduction
10× Throughput Increase

Executive Summary

A simulation program was on track to overrun its approved AWS budget by millions of dollars, with a required 10× throughput increase still ahead. The turnaround did not come from discounts or one-off tuning. It came from redesigning how simulation work was scheduled, executed, decomposed, and merged — then instrumenting the system so cost and performance could be controlled at the workload level. The result was a shift from a projected $6.5M spend to $1.5M actual, while simultaneously delivering the 10× throughput increase that certification demanded.

01

The problem

An autonomous lunar lander program required certification-representative Monte Carlo analysis across many mission segments and campaign configurations. The scale was large, the schedule pressure was real, and the existing infrastructure was not built to handle either cleanly.

At the start of the role, projected amortized AWS spend was trending toward $6.5M against an approved budget of $2.5M. That projection came before a required tenfold increase in simulation throughput needed to support certification evidence.

The root issue was not a single bad decision. It was an accumulation of organic growth, limited ownership, and infrastructure being treated as a side job by domain engineers.


No workload-level cost visibility

There was no normalized view of cost per case, cost per study, or cost per campaign. Spend was visible only at the bill, not at the workload.


Monolithic post-processing

Post-processing stages were tightly coupled. Large parts of the fleet sat idle while downstream stages waited for full-batch completion.


Scale turned inefficiency into cost

Data aggregation and post-processing costs scaled poorly with case count, turning higher throughput into a multiplier on waste.


Infrastructure without durable ownership

Critical infrastructure knowledge was concentrated in a few people. When they were busy, campaigns slowed. When they were gone, the system regressed.

02

The scale that broke the old system

The core simulation workload was understandable. The difficulty came from scale, risk posture, and the cost of repeating it thousands of times. At this scale, every unnecessary reload, oversized node, or synchronization barrier became a recurring budget line item.

2,000
Monte Carlo Cases
15
Mission Segments
30,000
Simulation Runs
6
Pipeline Stages

Post-processing had become the main bottleneck. The original workflow repeatedly loaded full datasets for merging, plotting, and analysis. As case counts rose, memory consumption and wall-clock time scaled far worse than linearly.

CPU efficiency was also poor. Without affinity-aware execution, thread scheduling scattered work across cores and inflated runtime through avoidable cache churn and context switching.
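Affinity pinning of this kind can be sketched with Linux's `sched_setaffinity` syscall. The worker-to-core mapping below is hypothetical, not the program's actual scheme; the pin is a no-op on platforms without the syscall.

```python
import os

def core_range(worker_index: int, cores_per_worker: int = 2) -> set[int]:
    """Contiguous core set for a worker slot (hypothetical mapping)."""
    start = worker_index * cores_per_worker
    return set(range(start, start + cores_per_worker))

def pin_worker(worker_index: int, cores_per_worker: int = 2) -> None:
    """Pin the current process to its core range so its threads stop
    migrating across cores (Linux-only syscall; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, core_range(worker_index, cores_per_worker))
```

Under this mapping, worker 3 with two cores per worker owns cores {6, 7}, so its cache lines stay warm instead of bouncing between sockets.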

03

The turnaround architecture

The recovery plan focused on three things: eliminating machines stranded behind pipeline barriers, improving per-node efficiency, and decoupling the pipeline so work could progress continuously instead of waiting on full-batch synchronization points.

Observability first

Cost and performance instrumentation came before major tuning. That created a repeatable performance and cost baseline, exposed waste at the workload level, and meant architectural decisions were based on workload data instead of guesswork.

Node-aware orchestration

Memory requirements varied significantly across mission segments, driven by segment duration and decimation rate. A best-effort linear model estimated RAM needs based on segment length for initial placement. After each run, historical execution data fed back into the model, refining instance sizing for subsequent campaigns.

This two-pass approach — predictive placement followed by empirical correction — kept the fleet right-sized without requiring manual tuning as campaign configurations changed. CPU affinity pinning further improved per-node efficiency by reducing runtime inflation from cross-core scheduling noise.
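A minimal sketch of that two-pass sizing loop, assuming a linear samples-to-RAM relationship: the prior slope, the overhead constant, and the function name are illustrative, not the program's actual model.

```python
import statistics

def predict_ram_gb(segment_seconds: float, decimation: int,
                   history: list[tuple[float, float]]) -> float:
    """Two-pass node sizing sketch (hypothetical names and priors).

    Pass 1: best-effort linear estimate from segment length and
    decimation rate. Pass 2: once measured (samples, peak_ram_gb)
    pairs exist, refit the line through them and use that instead.
    """
    samples = segment_seconds / decimation   # logged points drive memory
    slope, base = 0.002, 4.0                 # assumed priors: GB/sample, GB overhead
    if len(history) >= 2:
        xs, ys = zip(*history)
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        denom = sum((x - mx) ** 2 for x in xs)
        if denom > 0:   # refit only when sample counts actually vary
            slope = sum((x - mx) * (y - my) for x, y in history) / denom
            base = my - slope * mx
    return base + slope * samples
```

With no history the prior drives placement; after a campaign, the refit slope corrects sizing for the next one.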

Slack-driven pipeline execution

In the original workflow, downstream processing began only after large simulation batches completed, stranding capacity while later stages waited for synchronization points. The redesign treated completed simulations as immediately consumable inputs: as seeds finished, the same fleet began decomposing outputs and preparing downstream work. That reduced idle time, shortened end-to-end campaign latency, and let post-processing absorb otherwise wasted compute capacity. The details of the decomposition are covered in the data pipeline redesign section.
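The drain pattern reduces to a producer/consumer handoff. In the sketch below, `simulate` and `decompose` stand in for the real containerized stages, and the sequential loop stands in for the parallel fleet; only the handoff structure is the point.

```python
from queue import Queue
from threading import Thread

def run_campaign(seeds, simulate, decompose):
    """Slack-driven drain sketch (hypothetical API, not the real orchestrator).

    Instead of waiting for the full batch, each finished seed is handed
    straight to decomposition, so otherwise-idle capacity drains
    downstream work while the rest of the batch is still running.
    """
    done: Queue = Queue()
    results = {}

    def drain():
        while True:
            seed, output = done.get()
            if seed is None:            # sentinel: no more seeds coming
                return
            results[seed] = decompose(seed, output)

    drainer = Thread(target=drain)
    drainer.start()
    for seed in seeds:                  # in the real fleet these run in parallel
        done.put((seed, simulate(seed)))
    done.put((None, None))
    drainer.join()
    return results
```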

System architecture overview

The key architectural shift was turning campaign execution from a batch-oriented workflow into a continuously draining system. Work was sized before placement, corrected with empirical execution data, and decomposed early enough that downstream stages could consume slack capacity instead of waiting behind full-campaign barriers.

The diagram below shows the campaign flow from initialization and dispersion generation through simulation, initial per-seed post-processing, and the full post-processing pipeline.

[Architecture diagram, Campaign Execution Flow: GitLab CI campaign trigger → job orchestrator with autoscaling runners → init (config, env setup) → dispersion generation (MC params, all seeds) → autoscaled EC2 fleet (m6i general-purpose baseline · c6i compute, affinity-pinned · r6i high-RAM for large segments · cost-optimized spot pool) → containerized per-seed simulation → initial post-processing (signal decomposition, slack-driven drain) → S3 object storage (raw signals, results, artifacts) → post-processing pipeline: plot grouping (greedy signal match) → subset plots (decimation, render) → plot merge (subset → final) → result merge (append analytics) → delivery of final artifacts.]

  • Disperse: MC parameter generation, all seeds at init
  • Simulate: affinity-aware execution, RAM-based node assignment
  • Decompose: signal extraction, slack-driven drain
  • Group: greedy signal grouping, subset index splits
  • Plot: subset generation, smart decimation
  • Merge: plot assembly, append-merged results
04

Data pipeline redesign

The post-processing pipeline was the biggest scaling bottleneck in the original system. The redesign broke it into four distinct stages, each designed to do bounded work and feed the next stage incrementally.

Before

  • Large monolithic post-processing passes
  • Repeated full-dataset reloads for visualization
  • Flat sequential merges
  • Poor scaling as case count increased
  • Workflow became memory-bound too early

After

  • Signal-level decomposition per seed
  • Greedy signal-grouped plot subsets
  • Decimated subset plots merged into finals
  • Append-optimized result structures
  • Sustained scaling to certification-scale workloads

Stage 1: Signal decomposition

As simulations completed, slack capacity on the fleet was used to immediately decompose each seed's raw output blob into individual signals needed for downstream processing. Each seed produced the smallest possible result structure for campaign-level analytics, which could be append-merged later. This was the first map-reduce boundary — converting large, monolithic output into granular, parallelizable work units. It reduced downstream data volume and created independently parallelizable units of work.
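A minimal sketch of the per-seed decomposition, assuming the raw blob is a mapping of signal names to sample arrays. The summary fields and function names are illustrative, not the program's actual schema.

```python
def decompose_seed(seed_id, raw_blob, write_signal):
    """Split one seed's monolithic output into per-signal artifacts
    and return the smallest summary structure needed for campaign
    analytics, so later merges append summaries instead of reloading
    raw data. (Names and fields are illustrative.)"""
    summary = {"seed": seed_id, "signals": {}}
    for name, samples in raw_blob.items():
        write_signal(seed_id, name, samples)    # one parallelizable unit
        summary["signals"][name] = {
            "n": len(samples),
            "min": min(samples),
            "max": max(samples),
        }
    return summary
```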

Stage 2: Plot grouping

Before any plot generation began, the pipeline inspected the full set of requested plots and their signal dependencies. A greedy algorithm grouped plots with similar signal requirements together, minimizing redundant data loads. Each group was then split into subset index ranges — indices 0–100, 101–200, and so on — so plot generation could proceed in bounded, parallel chunks rather than requiring the full dataset in memory. This eliminated repeated signal loads across related plots.
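The grouping step might look like the following greedy sketch. The load budget, the largest-first ordering, and the overlap tie-breaking are assumptions, not the production heuristic.

```python
def group_plots(plot_signals, max_signals_per_group=64):
    """Greedy plot-grouping sketch (illustrative, not the real algorithm).

    plot_signals: {plot_name: set_of_required_signals}. Largest plots
    first, each plot joins the group whose loaded-signal set it
    overlaps most, provided the union stays under the load budget;
    otherwise it starts a new group.
    """
    groups = []  # each: {"plots": [...], "signals": set()}
    for plot, needed in sorted(plot_signals.items(),
                               key=lambda kv: -len(kv[1])):
        best, best_overlap = None, -1
        for g in groups:
            overlap = len(g["signals"] & needed)
            if overlap > best_overlap and \
               len(g["signals"] | needed) <= max_signals_per_group:
                best, best_overlap = g, overlap
        if best is None:
            best = {"plots": [], "signals": set()}
            groups.append(best)
        best["plots"].append(plot)
        best["signals"] |= needed
    return groups
```

Each group's member list can then be split into bounded index ranges for parallel generation, as described above.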

Stage 3: Subset plot generation & decimation

Each subset index group generated its portion of the plots independently. Smart decimation reduced data density where full resolution was unnecessary — straight-line segments didn't need every data point, and smooth curves were decimated where visual fidelity was preserved. This cut rendering time, I/O volume, and final artifact size.
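One way to sketch tolerance-based decimation, assuming samples are sorted by strictly increasing x; the production implementation is not published, so this is only the shape of the idea.

```python
def decimate(points, tol=1e-3):
    """Drop points the eye cannot miss (illustrative sketch).

    A point survives only if removing it would move the rendered line
    by more than `tol`: straight segments collapse to their endpoints,
    smooth curves thin out wherever chord interpolation stays within
    tolerance. Assumes strictly increasing x values.
    """
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    for cur, nxt in zip(points[1:], points[2:]):
        (x0, y0), (x1, y1), (x2, y2) = kept[-1], cur, nxt
        t = (x1 - x0) / (x2 - x0)
        y_chord = y0 + t * (y2 - y0)     # where the line would pass without cur
        if abs(y1 - y_chord) > tol:
            kept.append(cur)
    kept.append(points[-1])
    return kept
```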

Stage 4: Final merge

Subset plots were merged into final campaign-level visualizations. Result structures from the initial per-seed decomposition were append-merged for set-level analytics. Because each upstream stage produced bounded, well-structured output, the final merge operated on predictable data sizes — keeping aggregation cost flat rather than letting it grow with campaign scale.
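The append-merge for set-level analytics can be sketched as a fold over per-seed summaries; field names here are illustrative. Because each summary is tiny and fixed-shape, merge cost tracks seed count rather than raw data volume.

```python
def merge_summaries(summaries):
    """Append-style merge sketch: fold per-seed summaries into
    campaign-level statistics without touching raw signal data.
    (Field names are illustrative, not the program's schema.)"""
    merged = {}
    for s in summaries:
        for name, stats in s["signals"].items():
            agg = merged.setdefault(name, {"n": 0, "min": float("inf"),
                                           "max": float("-inf")})
            agg["n"] += stats["n"]
            agg["min"] = min(agg["min"], stats["min"])
            agg["max"] = max(agg["max"], stats["max"])
    return merged
```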

05

Reliability at campaign scale

At this workload size, reliability is not a nice-to-have. A small per-stage failure rate compounds into a real campaign problem very quickly.

Each Monte Carlo seed simulated roughly three weeks of continuous flight time across its mission segments. The total campaign delivered over 1.5 billion simulated seconds. If a seed was lost mid-segment — due to a spot interruption, a silent container failure, or a scale-dependent race condition — it could not simply be restarted from the failure point. The seed was lost, reducing the statistical density of the verification evidence and potentially delaying verification closure.

30,000 simulation runs × 6 pipeline stages = 180,000 stage executions. Even tiny error rates create repeated intervention, rerun overhead, and schedule damage — but the real cost was not compute time. It was verification fidelity.

Failure Rate | Failures / Campaign | Across Dozens of Campaigns
0.01%        | ~3                  | ~60–90
0.1%         | ~30                 | ~600–900
1.0%         | ~300                | Schedule-breaking
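The per-campaign column follows directly from treating the failure rate as a per-run probability over a 30,000-run campaign:

```python
def expected_failures(rate, runs=30_000, campaigns=1):
    """Expected failed runs for a per-run failure probability,
    per campaign or across many campaigns."""
    return rate * runs * campaigns

# 0.01% per run is already ~3 lost seeds per 30,000-run campaign,
# and ~75 across 25 campaigns
```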

Spot interruptions

Fleet reclaim events need graceful recovery paths that preserve campaign integrity instead of forcing expensive reruns.

Capacity shifts

Instance availability moves over time. Orchestration has to adapt without turning every capacity event into a manual fire drill.

Silent bad output

Exit code zero is not enough. Validation has to check artifact integrity, completeness, and campaign-level correctness.

Scale-only failures

Concurrency bugs that never appear in development environments can surface for the first time at fleet scale. In production, they still count.

Unowned infrastructure does not stay stable. It drifts, regresses, and eventually turns reliability risk into schedule risk.

— Dustin Gardner

06

Cost controls & observability

The outcome did not come from a single dramatic optimization. It came from continuous month-over-month measurement and many smaller engineering decisions across fleet mix, pricing strategy, orchestration behavior, and workload design.

Monthly AWS (before)
$191K
Monthly AWS (after)
$48K
Campaign throughput
10×

What was measured

Instrumentation covered campaign-level cost attribution, per-segment resource profiles, per-seed cost normalization, instance-family effectiveness across workload types, spot versus on-demand tradeoff tracking, and regression detection against historical baselines. This instrumentation was used to make campaign-level placement, pricing, and regression decisions — not just to visualize spend after the fact.
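Per-seed cost normalization reduces to a simple amortization. The formula below is an illustrative sketch, not the program's internal metric: a node's spend is spread over the cases it actually processed so campaigns can be compared like-for-like.

```python
def cost_per_case(instance_hourly_usd, runtime_hours, cases_per_node):
    """Workload-normalized unit cost sketch (illustrative formula):
    amortize one node's spend over the cases it processed."""
    return instance_hourly_usd * runtime_hours / cases_per_node

# e.g. a $1.36/hr node running 10 hours over 4 cases costs $3.40 per case
```

Tracking this number per instance family and pricing mode is what turned "the bill went up" into "r6i spot is 2× cheaper per case for long segments."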

Measurement made the savings durable

Without measurement, every cost conversation becomes anecdotal. With workload-level visibility, optimization became an operational discipline — month-over-month decisions could be made against real unit economics rather than aggregate billing trends.

The core principle was simple: observability had to come before optimization. The savings were not the result of one dramatic fix. They came from dozens of smaller, data-informed decisions that compounded over time.

07

What I owned

I executed this turnaround as a single platform engineer, owning the entire simulation infrastructure stack outside of the flight dynamics models themselves.

AWS Infrastructure

Architecture and operation of the full AWS environment including compute fleet design, instance strategy, networking, storage layout, and cost controls.

Compute Orchestration

Design and implementation of the autoscaling compute fleet, job scheduling model, node assignment logic, and CPU affinity optimization.

Simulation Data Pipeline

Full redesign of the Monte Carlo data pipeline including decomposition strategy, hierarchical merge architecture, and plotting workflow.

CI/CD & Containers

Containerized simulation execution, pipeline automation, and reproducible execution environments across the compute fleet.

Reliability Engineering

Campaign-scale reliability hardening including recovery paths for spot interruptions, validation of simulation artifacts, and protection against scale-only failures.

Cost Observability

Implementation of workload-normalized cost visibility and monitoring that enabled sustained month-over-month optimization.

The only portion of the system not owned in this scope was the underlying simulation models themselves, which were developed by the guidance, navigation, and control (GNC) team. Everything required to execute those models at scale — infrastructure, orchestration, pipelines, reliability, and cost control — I designed and operated directly.

08

The results

$5M
Avoided versus projected trajectory, while delivering a 10× throughput increase
77%
Program cost reduction vs projected trajectory
$1M
Under the approved $2.5M budget
10×
Increase in Monte Carlo throughput

The program moved from a projected $6.5M trajectory to $1.5M actual — roughly $5M avoided against the projected spend and $1M under the approved $2.5M budget. Monthly run-rate dropped from roughly $191K to $48K. Certification-scale throughput was delivered rather than deferred.

This work spanned orchestration, data pipeline design, CI/CD, containerization, reliability engineering, and cost management. It was not a one-time cleanup. It was sustained platform ownership applied to a system whose economics and operational risk were already breaking down.

Once compute becomes mission-critical, platform ownership is not overhead. It is leverage.

Working on expensive or fragile compute at scale?

I help teams redesign simulation, HPC, and batch compute systems that have become too expensive, too brittle, or too operationally noisy to scale.

Get in touch