Dustin Gardner
I fix compute platforms when cost and scale break at the same time.
Sole infrastructure engineer responsible for a certification-critical simulation platform, where failure would have blocked program-level analysis.
Reduced a projected $6.5 M AWS spend trajectory to $1.5 M while scaling throughput 10× within the first year. Built and owned the entire cost-governed compute platform end to end as the sole infrastructure engineer on the program.
Reduced projected AWS spend by 77%
$6.5 M → $1.5 M over first year
1.5 B+ simulated flight seconds per week
Certification-scale throughput
30,000+ Monte Carlo runs per campaign
Across autoscaled EC2 fleets
Sole owner of the full platform stack
Every layer, end to end
Full architecture + cost breakdown →
This failure mode is common.
If your compute costs are rising faster than your output, you are already here.
Compute systems scale before cost control does. At first it looks like progress: more runs, more data, more throughput. Then it flips: cost grows faster than output, pipelines stall behind bottlenecks, and reliability issues appear only at scale.
When cost and scale break at the same time, the problem is not just financial. Teams slow down. Confidence in results drops. Critical decisions get delayed. In programs where simulation output drives certification or product direction, that becomes a schedule risk.
At that point the system is no longer accelerating the team. It is the bottleneck.
Cost-Governed Compute Platform
If you're dealing with this failure mode, this is the system I bring in.
- Monte Carlo orchestration engine: 30,000+ parallel simulation runs per campaign across autoscaled EC2 fleets with RAM-predictive placement and CPU affinity optimization
- Cost governance system with real-time per-workload cost attribution, pricing strategy automation, and regression detection—the mechanism behind the 77% spend reduction
- Map-reduce data pipeline with tree-structured merge architecture and signal-level decomposition, eliminating O(n²) scaling and enabling linear-time aggregation at campaign scale (sketched after this list)
- Campaign reliability framework covering 180,000+ stage executions per campaign: spot interruption recovery, artifact validation, silent-failure detection
- Flight dynamics compute bridge integrating MATLAB/Simulink autocoded flight software into containerized HPC execution, enabling GNC teams to run certification-scale analysis without infrastructure expertise
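The tree-structured merge is the easiest piece of this to show in miniature. Below is a minimal sketch of the idea only; the data shapes and function names are illustrative, and the production pipeline merges on-disk artifacts with signal-level decomposition rather than in-memory dicts.

```python
from functools import reduce

def merge(a: dict, b: dict) -> dict:
    """Combine two partial result sets.

    Illustrative shape only: signal name -> list of samples. The real merge
    unit is an on-disk artifact, but the scaling argument is the same.
    """
    out = {k: list(v) for k, v in a.items()}
    for k, v in b.items():
        out[k] = out.get(k, []) + list(v)
    return out

def sequential_merge(parts: list) -> dict:
    # Naive fold: the growing accumulator is rewritten at every step, so for
    # n similar-sized parts the total work is roughly n^2 / 2 record copies.
    return reduce(merge, parts, {})

def tree_merge(parts: list) -> dict:
    # Pairwise, tree-structured merge: each record passes through about
    # log2(n) merge levels instead of up to n, and no single intermediate
    # artifact is rewritten once per input.
    parts = list(parts)
    if not parts:
        return {}
    while len(parts) > 1:
        parts = [merge(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]
```

The shape of the merge is what matters: once no accumulator grows with campaign size, aggregation time stops curving upward as run counts climb.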
Architecture Pattern
- Decouple execution from orchestration. Let scheduling, placement, and pipeline stages evolve independently.
- Make cost a first-class control signal, not an afterthought measured at the billing page.
- Eliminate data movement bottlenecks. Decompose early, merge late, keep intermediate artifacts bounded.
- Design for campaign-scale reliability, not job-scale success. Small failure rates compound into schedule-breaking problems at 10k+ jobs.
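That last point is worth making concrete. The back-of-envelope below uses the 180,000-stage campaign size quoted above; the per-stage failure rates are illustrative assumptions, not measured numbers.

```python
# Why "small" per-stage failure rates still break campaigns at scale.
# The 180,000 stage executions per campaign comes from the platform above;
# the per-stage failure rates are illustrative assumptions.
stages = 180_000

for rate in (1e-3, 1e-4, 1e-5):
    expected_failures = stages * rate
    p_clean = (1 - rate) ** stages  # chance the whole campaign runs clean
    print(f"per-stage failure rate {rate:.3%}: "
          f"~{expected_failures:.1f} failed stages, "
          f"P(zero failures) = {p_clean:.2e}")
```

Even at a 0.001% per-stage failure rate, a campaign finishes with zero failures only about one time in six. That is why interruption recovery, artifact validation, and silent-failure detection are designed into the platform rather than handled by hand after the fact.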
This pattern is not unique to aerospace. It shows up anywhere large-scale compute is pushed without tight cost and execution control:
- Monte Carlo and simulation platforms
- ML training infrastructure and GPU fleets
- Batch compute systems with complex pipelines
Experience
Senior HPC Platform Engineer
2022 – Present | Blue Origin, Denver, CO
- Replaced a failing simulation platform under active program pressure, reducing projected AWS spend from $6.5 M to $1.5 M while scaling throughput 10×. Sole infrastructure engineer. Owned every layer.
- Broke apart tightly coupled simulation execution into a distributed orchestration platform bridging MATLAB/Simulink autocoded flight software into large-scale AWS compute. Tens of thousands of parallel runs per campaign.
- Diagnosed and eliminated quadratic merge costs by architecting map-reduce data pipelines with tree-structured merge strategies and signal-level decomposition, sustaining linear performance at certification scale.
- Owned campaign-scale reliability across 180,000+ stage executions: spot interruption recovery, scale-dependent race condition diagnosis, silent container failure detection.
- Built a real-time cost governance system (Datadog, Prometheus) providing per-campaign and per-workload spend visibility; authored proposal to establish a dedicated HPC & Simulation Platform team across multiple spacecraft programs.
- Sole technical bridge between GNC domain engineers and infrastructure, translating flight dynamics requirements into distributed compute architecture decisions.
- Mentor 2 GNC engineers on CS fundamentals, HPC patterns, and cloud infrastructure, with 3–5 additional informal mentees.
DevSecOps Tech Lead / Staff Software Engineer
2017 – 2022 | Lockheed Martin Space, Denver, CO
- Managed CI/CD and simulation infrastructure for 100+ engineers across 80+ servers, sustaining 40,000 jobs/week at 99.5% uptime. Promoted through three levels from Testbed & Simulation Software Engineer to Staff Software Engineer.
- Migrated 100+ legacy simulation and testbed workflows to cloud-native CI/CD in AWS with autoscaling compute.
- Owned hardware-in-the-loop testbeds through multiple successful qualification events as simulation product owner.
- Led Digital Twin initiative from concept to production as scrum master; delivered customer demonstrations to the Air Force.
- Mentored junior developers as part of day-to-day work and broader team culture.
Software Engineer Intern
Summers 2013 – 2016 | Lockheed Martin Space, Huntsville & Denver
Graduate Teaching Assistant
2012 – 2016 | Tennessee Tech, Cookeville, TN
Skills
- Distributed Compute Architecture: large-scale Monte Carlo orchestration, map-reduce pipeline design, CPU affinity optimization, campaign-scale reliability engineering
- Cloud Cost Engineering: AWS (EC2, S3, EKS), spot fleet strategy, workload-level cost attribution, autoscaling, Terraform, right-sizing
- Platform & Observability: Datadog, Prometheus, Grafana, GitLab CI, Docker, Kubernetes, OpenShift, infrastructure-as-code, Linux (RHEL, Ubuntu)
- Programming: Python, C/C++, Bash, Git
Teams That Bring Me In
- AWS / cloud compute bills scaling faster than throughput
- Simulation or HPC platforms that can't scale past prototypes
- Monte Carlo / batch compute with poor fleet utilization
- CI/CD bottlenecks in compute-heavy workflows
Education
M.S. Computer Science
May 2017 | 3.9 GPA | Tennessee Tech University
Advanced coursework in High-Performance Computing; Thesis: Autonomic Protection Systems
B.S. Computer Science, cum laude
May 2015 | 3.5 GPA | Tennessee Tech University
Minor: Statistical Mathematics
Other
Eagle Scout | Active in skiing, climbing, and mountaineering
Spending more but shipping less?
Typical engagements: architecture audit, cost reduction, platform redesign.