A platform architecture turnaround for a mission-critical aerospace program: lower spend, higher throughput, reliability that held across certification-scale campaigns.
Figures are materially accurate but generalized to protect proprietary program details.
A simulation program was on track to overrun its approved AWS budget by millions of dollars, with a required 10× throughput increase still ahead. The turnaround did not come from discounts or one-off tuning. It came from redesigning how simulation work was scheduled, executed, decomposed, and merged — then instrumenting the system so cost and performance could be controlled at the workload level. The result was a shift from a projected $6.5M spend to $1.5M actual, while simultaneously delivering the 10× throughput increase that certification demanded.
An autonomous lunar lander program required certification-representative Monte Carlo analysis across many mission segments and campaign configurations. The scale was large, the schedule pressure was real, and the existing infrastructure was not built to handle either cleanly.
At the start of the role, projected amortized AWS spend was trending toward $6.5M against an approved budget of $2.5M. That projection came before a required tenfold increase in simulation throughput needed to support certification evidence.
The root issue was not a single bad decision. It was an accumulation of organic growth, limited ownership, and infrastructure being treated as a side job by domain engineers.
- **No workload-level cost visibility.** There was no normalized view of cost per case, cost per study, or cost per campaign. Spend was visible only at the bill, not at the workload.
- **Tightly coupled post-processing.** Large parts of the fleet sat idle while downstream stages waited for full-batch completion.
- **Superlinear aggregation costs.** Data aggregation and post-processing costs scaled poorly with case count, turning higher throughput into a multiplier on waste.
- **Concentrated knowledge.** Critical infrastructure knowledge was concentrated in a few people. When they were busy, campaigns slowed. When they were gone, the system regressed.
The core simulation workload was understandable. The difficulty came from scale, risk posture, and the cost of repeating it thousands of times. At this scale, every unnecessary reload, oversized node, or synchronization barrier became a recurring budget line item.
Post-processing had become the main bottleneck. The original workflow repeatedly loaded full datasets for merging, plotting, and analysis. As case counts rose, memory consumption and wall-clock time scaled far worse than linearly.
CPU efficiency was also poor. Without affinity-aware execution, thread scheduling scattered work across cores and inflated runtime through avoidable cache churn and context switching.
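The affinity fix amounts to giving each worker its own block of cores so its threads stop migrating across the node. A minimal sketch of the idea in Python, using Linux's `sched_setaffinity`; the worker-to-core layout is illustrative, not the program's actual scheduler:

```python
import os

def core_block(worker_index: int, cores_per_worker: int) -> set[int]:
    """Contiguous block of core IDs reserved for one worker."""
    start = worker_index * cores_per_worker
    return set(range(start, start + cores_per_worker))

def pin_worker(worker_index: int, cores_per_worker: int) -> set[int]:
    """Pin the current process to its core block so its threads stop
    migrating across cores and churning caches."""
    cores = core_block(worker_index, cores_per_worker)
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, cores)    # 0 = current process
    return cores
```

On a node running several simulation workers, each worker pins itself once at startup; the kernel then keeps its threads on those cores.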
The recovery plan focused on three things: eliminating machines stranded behind pipeline barriers, improving per-node efficiency, and decoupling the pipeline so work could progress continuously instead of waiting on full-batch synchronization points.
Cost and performance instrumentation came before major tuning. That created a repeatable performance and cost baseline, exposed waste at the workload level, and meant architectural decisions were based on workload data instead of guesswork.
Memory requirements varied significantly across mission segments, driven by segment duration and decimation rate. A best-effort linear model estimated RAM needs based on segment length for initial placement. After each run, historical execution data fed back into the model, refining instance sizing for subsequent campaigns.
This two-pass approach — predictive placement followed by empirical correction — kept the fleet right-sized without requiring manual tuning as campaign configurations changed. CPU affinity pinning further improved per-node efficiency by reducing runtime inflation from cross-core scheduling noise.
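Sketched in Python, the two-pass sizing model might look like the following; the coefficients and the ordinary-least-squares refit are illustrative stand-ins for the program's actual fit:

```python
def predict_ram_gb(duration_s: float, decimation: int,
                   gb_per_sample: float = 2e-6, base_gb: float = 4.0) -> float:
    """First pass: best-effort linear estimate of RAM from segment
    length and decimation rate (coefficients are placeholders)."""
    samples = duration_s / decimation
    return base_gb + gb_per_sample * samples

def refit(history: list[tuple[float, float]]) -> tuple[float, float]:
    """Second pass: refit slope and intercept by ordinary least squares
    over (samples, observed_gb) pairs recorded after each run."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

The refit coefficients feed back into `predict_ram_gb` for the next campaign, so instance sizing tracks the workload as configurations change.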
In the original workflow, downstream processing began only after large simulation batches completed, stranding capacity while later stages waited for synchronization points. The redesign treated completed simulations as immediately consumable inputs: as seeds finished, the same fleet began decomposing outputs and preparing downstream work. That reduced idle time, shortened end-to-end campaign latency, and let post-processing absorb otherwise wasted compute capacity. The details of the decomposition are covered in the data pipeline redesign section.
The key architectural shift was turning campaign execution from a batch-oriented workflow into a continuously draining system. Work was sized before placement, corrected with empirical execution data, and decomposed early enough that downstream stages could consume slack capacity instead of waiting behind full-campaign barriers.
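In miniature, the draining pattern is a queue between stages rather than a barrier. A deliberately simplified Python sketch, assuming the real system dispatched work across a fleet rather than threads:

```python
import queue
import threading

def run_campaign(seeds, simulate, decompose):
    """Completed seeds are immediately consumable: a drain worker
    processes finished outputs while remaining seeds still run,
    instead of waiting for the full batch to complete."""
    done: queue.Queue = queue.Queue()
    results = []

    def drain():
        while True:
            item = done.get()
            if item is None:          # sentinel: campaign exhausted
                return
            results.append(decompose(item))

    t = threading.Thread(target=drain)
    t.start()
    for s in seeds:
        done.put(simulate(s))         # in production these finish out of order
    done.put(None)
    t.join()
    return results
```

The point is structural: downstream work starts the moment any seed finishes, so slack capacity is consumed continuously rather than stranded behind a full-campaign barrier.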
The diagram below shows the campaign flow from initialization and dispersion generation through simulation, initial per-seed post-processing, and the full post-processing pipeline.
The post-processing pipeline was the biggest scaling bottleneck in the original system. The redesign broke it into four distinct stages, each designed to do bounded work and feed the next stage incrementally.
As simulations completed, slack capacity on the fleet immediately decomposed each seed's raw output blob into the individual signals needed for downstream processing. Each seed also produced the smallest possible result structure for campaign-level analytics, which could be append-merged later. This was the first map-reduce boundary: large, monolithic output became granular, independently parallelizable work units, and downstream data volume dropped accordingly.
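A schematic of that boundary; the signal names and summary fields are invented for illustration:

```python
def decompose_seed(seed_id: int, raw_blob: dict[str, list[float]]):
    """Map step: split one seed's monolithic output into per-signal
    work units plus a minimal, append-mergeable summary record."""
    signals = [(seed_id, name, samples) for name, samples in raw_blob.items()]
    summary = {
        "seed": seed_id,
        "peaks": {name: max(s) for name, s in raw_blob.items()},
        "finals": {name: s[-1] for name, s in raw_blob.items()},
    }
    return signals, summary
```

Each `(seed, signal, samples)` tuple is an independent unit of downstream work, and the summary is the only per-seed state the campaign-level reduce ever has to touch.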
Before any plot generation began, the pipeline inspected the full set of requested plots and their signal dependencies. A greedy algorithm grouped plots with similar signal requirements, so related plots shared a single data load instead of each reloading the same signals. Each group was then split into subset index ranges — indices 0–100, 101–200, and so on — so plot generation could proceed in bounded, parallel chunks rather than holding the full dataset in memory.
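One plausible shape for that step, sketched in Python; the `max_extra` threshold and chunk size are illustrative parameters, not the program's actual tuning:

```python
def group_plots(plot_signals: dict[str, set[str]], max_extra: int = 2):
    """Greedy grouping: visit plots from largest signal set to smallest,
    folding each into the first group whose signal union grows by at
    most `max_extra` new signals; otherwise start a new group."""
    groups: list[tuple[set[str], list[str]]] = []
    for plot in sorted(plot_signals, key=lambda p: -len(plot_signals[p])):
        needed = plot_signals[plot]
        for sigs, members in groups:
            if len(needed - sigs) <= max_extra:
                sigs |= needed
                members.append(plot)
                break
        else:
            groups.append((set(needed), [plot]))
    return groups

def index_chunks(n_cases: int, chunk: int = 100):
    """Split case indices into bounded subset ranges for parallel plotting."""
    return [(i, min(i + chunk, n_cases) - 1) for i in range(0, n_cases, chunk)]
```

Each group loads its signal union once; each index chunk bounds the memory any one plotting worker needs.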
Each subset index group generated its portion of the plots independently. Smart decimation reduced data density where full resolution was unnecessary — straight-line segments didn't need every data point, and smooth curves were thinned wherever visual fidelity could be preserved. This cut rendering time, I/O volume, and final artifact size.
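The decimation test can be as simple as a colinearity check: drop any sample that lies on the line between its kept predecessor and its successor. A minimal sketch, with an illustrative tolerance:

```python
def decimate(points: list[tuple[float, float]], tol: float = 1e-9):
    """Drop samples that are (within tol) colinear with their kept
    predecessor and the next sample: straight segments collapse to
    their endpoints while curvature is preserved."""
    if len(points) <= 2:
        return points[:]
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = kept[-1], points[i], points[i + 1]
        # twice the triangle area: zero when the three points are colinear
        area2 = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        if area2 > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept
```

A production version would scale `tol` to the plot's visual resolution, but the effect is the same: long straight telemetry segments render from two points instead of thousands.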
Subset plots were merged into final campaign-level visualizations. Result structures from the initial per-seed decomposition were append-merged for set-level analytics. Because each upstream stage produced bounded, well-structured output, the final merge operated on predictable data sizes — keeping aggregation cost flat rather than letting it grow with campaign scale.
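A sketch of the append-merge, assuming each seed or subset emits a small summary record (field names invented for illustration):

```python
def append_merge(summaries: list[dict]) -> dict:
    """Reduce step: fold bounded summaries into campaign-level
    analytics; cost scales with summary count, not raw data volume."""
    merged = {"seeds": [], "peaks": {}}
    for s in summaries:
        merged["seeds"].append(s["seed"])
        for name, peak in s["peaks"].items():
            merged["peaks"][name] = max(peak, merged["peaks"].get(name, float("-inf")))
    return merged
```

Because every input record is small and fixed-shape, the merge cost stays flat as campaigns grow: only the number of summaries increases, never the size of each one.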
At this workload size, reliability is not a nice-to-have. A small per-stage failure rate compounds into a real campaign problem very quickly.
Each Monte Carlo seed simulated roughly three weeks of continuous flight time across its mission segments. The total campaign delivered over 1.5 billion simulated seconds. If a seed was lost mid-segment — due to a spot interruption, a silent container failure, or a scale-dependent race condition — it could not simply be restarted from the failure point. The seed was lost, reducing the statistical density of the verification evidence and potentially delaying verification closure.
30,000 simulation runs × 6 pipeline stages = 180,000 stage executions. Even tiny error rates create repeated intervention, rerun overhead, and schedule damage — but the real cost was not compute time. It was verification fidelity.
| Per-Run Failure Rate | Failed Runs / Campaign | Across Dozens of Campaigns |
|---|---|---|
| 0.01% | ~3 | ~60–90 |
| 0.1% | ~30 | ~600–900 |
| 1.0% | ~300 | Schedule-breaking |
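The arithmetic behind the table applies each rate per simulation run; the compounding helper below additionally shows why even small per-stage rates are dangerous, under the assumption that stage failures are independent:

```python
RUNS, STAGES = 30_000, 6

def failed_runs(per_run_rate: float) -> float:
    """Expected failed runs per campaign at a given per-run rate."""
    return RUNS * per_run_rate

def run_loss_probability(per_stage_rate: float, stages: int = STAGES) -> float:
    """Compounding: a run is lost if any of its pipeline stages fails
    (assumes independent stage failures)."""
    return 1.0 - (1.0 - per_stage_rate) ** stages

stage_executions = RUNS * STAGES  # 180,000 stage executions per campaign
```

A 0.1% per-stage failure rate compounds to roughly a 0.6% chance of losing each run, which is why per-stage reliability had to be driven far below what single-job intuition suggests.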
- Fleet reclaim events need graceful recovery paths that preserve campaign integrity instead of forcing expensive reruns.
- Instance availability moves over time. Orchestration has to adapt without turning every capacity event into a manual fire drill.
- Exit code zero is not enough. Validation has to check artifact integrity, completeness, and campaign-level correctness.
- Concurrency bugs that never appear in dev environments can become brutal at fleet scale. They still count in production.
- Unowned infrastructure does not stay stable. It drifts, regresses, and eventually turns reliability risk into schedule risk.
— Dustin Gardner
The outcome did not come from a single dramatic optimization. It came from continuous month-over-month measurement and many smaller engineering decisions across fleet mix, pricing strategy, orchestration behavior, and workload design.
Instrumentation covered campaign-level cost attribution, per-segment resource profiles, per-seed cost normalization, instance-family effectiveness across workload types, spot versus on-demand tradeoff tracking, and regression detection against historical baselines. This instrumentation was used to make campaign-level placement, pricing, and regression decisions — not just to visualize spend after the fact.
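A minimal sketch of per-seed cost normalization and baseline regression detection; the threshold is an illustrative parameter, not the program's actual alerting policy:

```python
def cost_per_seed(campaign_cost_usd: float, seeds: int) -> float:
    """Normalize spend to the workload unit instead of the bill."""
    return campaign_cost_usd / seeds

def cost_regression(current_unit_cost: float, baseline_unit_cost: float,
                    threshold: float = 0.10) -> tuple[bool, float]:
    """Flag a campaign whose unit cost drifts more than `threshold`
    above the historical baseline."""
    drift = (current_unit_cost - baseline_unit_cost) / baseline_unit_cost
    return drift > threshold, drift
```

Unit costs like these, tracked per campaign, are what turn "the bill went up" into "segment X got 25% more expensive per seed since the last baseline."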
Without measurement, every cost conversation becomes anecdotal. With workload-level visibility, optimization became an operational discipline — month-over-month decisions could be made against real unit economics rather than aggregate billing trends.
The core principle was simple: observability had to come before optimization. The savings came from dozens of smaller, data-informed decisions that compounded over time, and each of them depended on that visibility.
I executed this turnaround as a single platform engineer, owning the entire simulation infrastructure stack outside of the flight dynamics models themselves.
- Architecture and operation of the full AWS environment, including compute fleet design, instance strategy, networking, storage layout, and cost controls.
- Design and implementation of the autoscaling compute fleet, job scheduling model, node assignment logic, and CPU affinity optimization.
- Full redesign of the Monte Carlo data pipeline, including decomposition strategy, hierarchical merge architecture, and plotting workflow.
- Containerized simulation execution, pipeline automation, and reproducible execution environments across the compute fleet.
- Campaign-scale reliability hardening, including recovery paths for spot interruptions, validation of simulation artifacts, and protection against scale-only failures.
- Implementation of workload-normalized cost visibility and monitoring that enabled sustained month-over-month optimization.
The only portion of the system not owned in this scope was the underlying simulation models themselves, which were developed by the guidance, navigation, and control (GNC) team. I designed and operated everything required to execute those models at scale: infrastructure, orchestration, pipelines, reliability, and cost control.
The program moved from a projected $6.5M trajectory to $1.5M actual — roughly $5M avoided against the projected spend and $1M under the approved $2.5M budget. Monthly run-rate dropped from roughly $191K to $48K. Certification-scale throughput was delivered rather than deferred.
This work spanned orchestration, data pipeline design, CI/CD, containerization, reliability engineering, and cost management. It was not a one-time cleanup. It was sustained platform ownership applied to a system whose economics and operational risk were already breaking down.
Once compute becomes mission-critical, platform ownership is not overhead. It is leverage.
I help teams redesign simulation, HPC, and batch compute systems that have become too expensive, too brittle, or too operationally noisy to scale.