A platform architecture turnaround for a mission-critical aerospace program: lower spend, higher throughput, reliability that held across certification-scale campaigns.
Figures are materially accurate but generalized to protect proprietary program details.
A simulation program was on track to overrun its approved AWS budget by millions of dollars, with a required 10× throughput increase still ahead. The turnaround did not come from discounts or one-off tuning. It came from redesigning how simulation work was scheduled, executed, decomposed, and merged — then instrumenting the system so cost and performance could be controlled at the workload level. The result was a shift from a projected $6.5M spend to $1.5M actual, while simultaneously delivering the 10× throughput increase that certification demanded.
An autonomous lunar lander program required certification-representative Monte Carlo analysis across many mission segments and campaign configurations. The scale was large, the schedule pressure was real, and the existing infrastructure was not built to handle either cleanly.
At the start of the role, projected amortized AWS spend was trending toward $6.5M against an approved budget of $2.5M. That projection came before a required tenfold increase in simulation throughput needed to support certification evidence.
The root issue was not a single bad decision. It was an accumulation of organic growth, limited ownership, and infrastructure being treated as a side job by domain engineers.
- **No workload-level cost visibility.** There was no normalized view of cost per case, cost per study, or cost per campaign. Spend was visible only at the bill, not at the workload.
- **Tightly coupled post-processing.** Large parts of the fleet sat idle while downstream stages waited for full-batch completion.
- **Superlinear aggregation costs.** Data aggregation and post-processing costs scaled poorly with case count, turning higher throughput into a multiplier on waste.
- **Concentrated knowledge.** Critical infrastructure knowledge was concentrated in a few people. When they were busy, campaigns slowed. When they were gone, the system regressed.
The core simulation workload was understandable. The difficulty came from scale, risk posture, and the cost of repeating it thousands of times. At this scale, every unnecessary reload, oversized node, or synchronization barrier became a recurring budget line item.
Post-processing had become the main bottleneck. The original workflow repeatedly loaded full datasets for merging, plotting, and analysis. As case counts rose, memory consumption and wall-clock time scaled far worse than linearly.
CPU efficiency was also poor. Without affinity-aware execution, thread scheduling scattered work across cores and inflated runtime through avoidable cache churn and context switching.
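The affinity fix amounts to giving each worker its own block of cores so its threads stop migrating across the node. A minimal sketch of the idea in Python, using Linux's `sched_setaffinity`; the worker-to-core layout is illustrative, not the program's actual scheduler:

```python
import os

def core_block(worker_index: int, cores_per_worker: int) -> set[int]:
    """Contiguous block of core IDs reserved for one worker."""
    start = worker_index * cores_per_worker
    return set(range(start, start + cores_per_worker))

def pin_worker(worker_index: int, cores_per_worker: int) -> set[int]:
    """Pin the current process to its core block so its threads stop
    migrating across cores and churning caches."""
    cores = core_block(worker_index, cores_per_worker)
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, cores)    # 0 = current process
    return cores
```

On a node running several simulation workers, each worker pins itself once at startup; the kernel then keeps its threads on those cores.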
The recovery plan focused on three things: eliminating machines stranded behind pipeline barriers, improving per-node efficiency, and decoupling the pipeline so work could progress continuously instead of waiting on full-batch synchronization points.
Cost and performance instrumentation came before major tuning. That created a repeatable performance and cost baseline, exposed waste at the workload level, and meant architectural decisions were based on workload data instead of guesswork.
Memory requirements varied significantly across mission segments, driven by segment duration and decimation rate. A best-effort linear model estimated RAM needs based on segment length for initial placement. After each run, historical execution data fed back into the model, refining instance sizing for subsequent campaigns.
This two-pass approach — predictive placement followed by empirical correction — kept the fleet right-sized without requiring manual tuning as campaign configurations changed. CPU affinity pinning further improved per-node efficiency by reducing runtime inflation from cross-core scheduling noise.
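Sketched in Python, the two-pass sizing model might look like the following; the coefficients and the ordinary-least-squares refit are illustrative stand-ins for the program's actual fit:

```python
def predict_ram_gb(duration_s: float, decimation: int,
                   gb_per_sample: float = 2e-6, base_gb: float = 4.0) -> float:
    """First pass: best-effort linear estimate of RAM from segment
    length and decimation rate (coefficients are placeholders)."""
    samples = duration_s / decimation
    return base_gb + gb_per_sample * samples

def refit(history: list[tuple[float, float]]) -> tuple[float, float]:
    """Second pass: refit slope and intercept by ordinary least squares
    over (samples, observed_gb) pairs recorded after each run."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

The refit coefficients feed back into `predict_ram_gb` for the next campaign, so instance sizing tracks the workload as configurations change.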
In the original workflow, downstream processing began only after large simulation batches completed, stranding capacity while later stages waited for synchronization points. The redesign treated completed simulations as immediately consumable inputs: as seeds finished, the same fleet began decomposing outputs and preparing downstream work. That reduced idle time, shortened end-to-end campaign latency, and let post-processing absorb otherwise wasted compute capacity. The details of the decomposition are covered in the data pipeline redesign section.
The key architectural shift was turning campaign execution from a batch-oriented workflow into a continuously draining system. Work was sized before placement, corrected with empirical execution data, and decomposed early enough that downstream stages could consume slack capacity instead of waiting behind full-campaign barriers.
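In miniature, the draining pattern is a queue between stages rather than a barrier. A deliberately simplified Python sketch, assuming the real system dispatched work across a fleet rather than threads:

```python
import queue
import threading

def run_campaign(seeds, simulate, decompose):
    """Completed seeds are immediately consumable: a drain worker
    processes finished outputs while remaining seeds still run,
    instead of waiting for the full batch to complete."""
    done: queue.Queue = queue.Queue()
    results = []

    def drain():
        while True:
            item = done.get()
            if item is None:          # sentinel: campaign exhausted
                return
            results.append(decompose(item))

    t = threading.Thread(target=drain)
    t.start()
    for s in seeds:
        done.put(simulate(s))         # in production these finish out of order
    done.put(None)
    t.join()
    return results
```

The point is structural: downstream work starts the moment any seed finishes, so slack capacity is consumed continuously rather than stranded behind a full-campaign barrier.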
The diagram below shows the campaign flow from initialization and dispersion generation through simulation, initial per-seed post-processing, and the full post-processing pipeline.
The post-processing pipeline was the biggest scaling bottleneck in the original system. The redesign broke it into four distinct stages, each designed to do bounded work and feed the next stage incrementally.
As simulations completed, slack capacity on the fleet immediately decomposed each seed's raw output blob into the individual signals needed for downstream processing. Each seed also produced the smallest possible result structure for campaign-level analytics, which could be append-merged later. This was the first map-reduce boundary: large, monolithic output became granular, independently parallelizable work units, and downstream data volume dropped accordingly.
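A schematic of that boundary; the signal names and summary fields are invented for illustration:

```python
def decompose_seed(seed_id: int, raw_blob: dict[str, list[float]]):
    """Map step: split one seed's monolithic output into per-signal
    work units plus a minimal, append-mergeable summary record."""
    signals = [(seed_id, name, samples) for name, samples in raw_blob.items()]
    summary = {
        "seed": seed_id,
        "peaks": {name: max(s) for name, s in raw_blob.items()},
        "finals": {name: s[-1] for name, s in raw_blob.items()},
    }
    return signals, summary
```

Each `(seed, signal, samples)` tuple is an independent unit of downstream work, and the summary is the only per-seed state the campaign-level reduce ever has to touch.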
Before any plot generation began, the pipeline inspected the full set of requested plots and their signal dependencies. A greedy algorithm grouped plots with similar signal requirements, so related plots shared a single data load instead of each reloading the same signals. Each group was then split into subset index ranges — indices 0–100, 101–200, and so on — so plot generation could proceed in bounded, parallel chunks rather than holding the full dataset in memory.
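One plausible shape for that step, sketched in Python; the `max_extra` threshold and chunk size are illustrative parameters, not the program's actual tuning:

```python
def group_plots(plot_signals: dict[str, set[str]], max_extra: int = 2):
    """Greedy grouping: visit plots from largest signal set to smallest,
    folding each into the first group whose signal union grows by at
    most `max_extra` new signals; otherwise start a new group."""
    groups: list[tuple[set[str], list[str]]] = []
    for plot in sorted(plot_signals, key=lambda p: -len(plot_signals[p])):
        needed = plot_signals[plot]
        for sigs, members in groups:
            if len(needed - sigs) <= max_extra:
                sigs |= needed
                members.append(plot)
                break
        else:
            groups.append((set(needed), [plot]))
    return groups

def index_chunks(n_cases: int, chunk: int = 100):
    """Split case indices into bounded subset ranges for parallel plotting."""
    return [(i, min(i + chunk, n_cases) - 1) for i in range(0, n_cases, chunk)]
```

Each group loads its signal union once; each index chunk bounds the memory any one plotting worker needs.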
Each subset index group generated its portion of the plots independently. Smart decimation reduced data density where full resolution was unnecessary — straight-line segments didn't need every data point, and smooth curves were thinned wherever visual fidelity could be preserved. This cut rendering time, I/O volume, and final artifact size.
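The decimation test can be as simple as a colinearity check: drop any sample that lies on the line between its kept predecessor and its successor. A minimal sketch, with an illustrative tolerance:

```python
def decimate(points: list[tuple[float, float]], tol: float = 1e-9):
    """Drop samples that are (within tol) colinear with their kept
    predecessor and the next sample: straight segments collapse to
    their endpoints while curvature is preserved."""
    if len(points) <= 2:
        return points[:]
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = kept[-1], points[i], points[i + 1]
        # twice the triangle area: zero when the three points are colinear
        area2 = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        if area2 > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept
```

A production version would scale `tol` to the plot's visual resolution, but the effect is the same: long straight telemetry segments render from two points instead of thousands.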
Subset plots were merged into final campaign-level visualizations. Result structures from the initial per-seed decomposition were append-merged for set-level analytics. Because each upstream stage produced bounded, well-structured output, the final merge operated on predictable data sizes — keeping aggregation cost flat rather than letting it grow with campaign scale.
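A sketch of the append-merge, assuming each seed or subset emits a small summary record (field names invented for illustration):

```python
def append_merge(summaries: list[dict]) -> dict:
    """Reduce step: fold bounded summaries into campaign-level
    analytics; cost scales with summary count, not raw data volume."""
    merged = {"seeds": [], "peaks": {}}
    for s in summaries:
        merged["seeds"].append(s["seed"])
        for name, peak in s["peaks"].items():
            merged["peaks"][name] = max(peak, merged["peaks"].get(name, float("-inf")))
    return merged
```

Because every input record is small and fixed-shape, the merge cost stays flat as campaigns grow: only the number of summaries increases, never the size of each one.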
At this workload size, reliability is not a nice-to-have. A small per-stage failure rate compounds into a real campaign problem very quickly.
Each Monte Carlo seed simulated roughly three weeks of continuous flight time across its mission segments. The total campaign delivered over 1.5 billion simulated seconds. If a seed was lost mid-segment — due to a spot interruption, a silent container failure, or a scale-dependent race condition — it could not simply be restarted from the failure point. The seed was lost, reducing the statistical density of the verification evidence and potentially delaying verification closure.
30,000 simulation runs × 6 pipeline stages = 180,000 stage executions. Even tiny error rates create repeated intervention, rerun overhead, and schedule damage — but the real cost was not compute time. It was verification fidelity.
| Per-Run Failure Rate | Failed Runs / Campaign | Across Dozens of Campaigns |
|---|---|---|
| 0.01% | ~3 | ~60–90 |
| 0.1% | ~30 | ~600–900 |
| 1.0% | ~300 | Schedule-breaking |
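The arithmetic behind the table applies each rate per simulation run; the compounding helper below additionally shows why even small per-stage rates are dangerous, under the assumption that stage failures are independent:

```python
RUNS, STAGES = 30_000, 6

def failed_runs(per_run_rate: float) -> float:
    """Expected failed runs per campaign at a given per-run rate."""
    return RUNS * per_run_rate

def run_loss_probability(per_stage_rate: float, stages: int = STAGES) -> float:
    """Compounding: a run is lost if any of its pipeline stages fails
    (assumes independent stage failures)."""
    return 1.0 - (1.0 - per_stage_rate) ** stages

stage_executions = RUNS * STAGES  # 180,000 stage executions per campaign
```

A 0.1% per-stage failure rate compounds to roughly a 0.6% chance of losing each run, which is why per-stage reliability had to be driven far below what single-job intuition suggests.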
- Fleet reclaim events need graceful recovery paths that preserve campaign integrity instead of forcing expensive reruns.
- Instance availability moves over time. Orchestration has to adapt without turning every capacity event into a manual fire drill.
- Exit code zero is not enough. Validation has to check artifact integrity, completeness, and campaign-level correctness.
- Concurrency bugs that never appear in dev environments can become brutal at fleet scale. They still count in production.
- Unowned infrastructure does not stay stable. It drifts, regresses, and eventually turns reliability risk into schedule risk.
— Dustin Gardner
The outcome did not come from a single dramatic optimization. It came from continuous month-over-month measurement and many smaller engineering decisions across fleet mix, pricing strategy, orchestration behavior, and workload design.
Instrumentation covered campaign-level cost attribution, per-segment resource profiles, per-seed cost normalization, instance-family effectiveness across workload types, spot versus on-demand tradeoff tracking, and regression detection against historical baselines. This instrumentation was used to make campaign-level placement, pricing, and regression decisions — not just to visualize spend after the fact.
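A minimal sketch of per-seed cost normalization and baseline regression detection; the threshold is an illustrative parameter, not the program's actual alerting policy:

```python
def cost_per_seed(campaign_cost_usd: float, seeds: int) -> float:
    """Normalize spend to the workload unit instead of the bill."""
    return campaign_cost_usd / seeds

def cost_regression(current_unit_cost: float, baseline_unit_cost: float,
                    threshold: float = 0.10) -> tuple[bool, float]:
    """Flag a campaign whose unit cost drifts more than `threshold`
    above the historical baseline."""
    drift = (current_unit_cost - baseline_unit_cost) / baseline_unit_cost
    return drift > threshold, drift
```

Unit costs like these, tracked per campaign, are what turn "the bill went up" into "segment X got 25% more expensive per seed since the last baseline."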
Without measurement, every cost conversation becomes anecdotal. With workload-level visibility, optimization became an operational discipline — month-over-month decisions could be made against real unit economics rather than aggregate billing trends.
The core principle was simple: observability had to come before optimization. The savings came from dozens of smaller, data-informed decisions that compounded over time, and each of them depended on that visibility.
I executed this turnaround as a single platform engineer, owning the entire simulation infrastructure stack outside of the flight dynamics models themselves.
- Architecture and operation of the full AWS environment, including compute fleet design, instance strategy, networking, storage layout, and cost controls.
- Design and implementation of the autoscaling compute fleet, job scheduling model, node assignment logic, and CPU affinity optimization.
- Full redesign of the Monte Carlo data pipeline, including decomposition strategy, hierarchical merge architecture, and plotting workflow.
- Containerized simulation execution, pipeline automation, and reproducible execution environments across the compute fleet.
- Campaign-scale reliability hardening, including recovery paths for spot interruptions, validation of simulation artifacts, and protection against scale-only failures.
- Implementation of workload-normalized cost visibility and monitoring that enabled sustained month-over-month optimization.
The only portion of the system not owned in this scope was the underlying simulation models themselves, which were developed by the guidance, navigation, and control (GNC) team. I designed and operated everything required to execute those models at scale: infrastructure, orchestration, pipelines, reliability, and cost control.
The program moved from a projected $6.5M trajectory to $1.5M actual — roughly $5M avoided against the projected spend and $1M under the approved $2.5M budget. Monthly run-rate dropped from roughly $191K to $48K. Certification-scale throughput was delivered rather than deferred.
This work spanned orchestration, data pipeline design, CI/CD, containerization, reliability engineering, and cost management. It was not a one-time cleanup. It was sustained platform ownership applied to a system whose economics and operational risk were already breaking down.
Once compute becomes mission-critical, platform ownership is not overhead. It is leverage.
I help teams redesign simulation, HPC, and batch compute systems that have become too expensive, too brittle, or too operationally noisy to scale.