Dustin Gardner
HPC Platform Architect | Simulation Infrastructure | Cloud Cost Engineering
I rebuild compute platforms when cost and scale break at the same time. Sole engineer responsible for architecture, cost, reliability, and scaling of a certification-critical simulation platform — failure would have blocked program-level certification analysis.
Replaced failing simulation infrastructure with a cost-governed compute platform that cut a projected $6.5M AWS trajectory to $1.5M while scaling Monte Carlo throughput 50,000× to 1.5B+ simulated flight seconds per week. Redirected spend trajectory within the first year. Owned every layer end-to-end — orchestration, data pipelines, reliability, cost governance — as the only infrastructure engineer on the program.
Reduced projected AWS spend by 77% ($6.5M → $1.5M)
Delivered 1.5B+ simulated flight seconds per week
Scaled to 30,000+ Monte Carlo runs per campaign
Sole owner of the full simulation platform stack
Cost-Governed Compute Platform
Systems owned end-to-end
- Monte Carlo orchestration engine — 30,000+ parallel simulation runs per campaign across autoscaled EC2 fleets with RAM-predictive placement and CPU affinity optimization
- Cost governance system — real-time per-workload cost attribution, pricing strategy automation, and regression detection that cut a projected $6.5M trajectory to $1.5M
- Map-reduce data pipeline — tree-structured merge architecture with signal-level decomposition, eliminating O(n²) scaling and enabling linear-time aggregation at campaign scale
- Campaign reliability framework — recovery paths for spot interruptions, artifact validation, and silent-failure detection across 180,000+ stage executions per campaign
- Flight dynamics compute bridge — integrated MATLAB/Simulink autocoded flight software into containerized HPC execution, enabling GNC teams to run certification-scale analysis without infrastructure expertise
Architecture Pattern
- Decouple execution from orchestration — let scheduling, placement, and pipeline stages evolve independently
- Make cost a first-class control signal, not an afterthought measured at the billing page
- Eliminate data movement bottlenecks — decompose early, merge late, keep intermediate artifacts bounded
- Design for campaign-scale reliability, not job-scale success — small failure rates compound into schedule-breaking problems at 10k+ jobs
- These patterns apply beyond aerospace — ML training, GPU fleets, and any large-scale batch compute system with the same cost/scale failure mode
Experience
Senior HPC Platform Engineer
2022 – Present | Blue Origin, Denver, CO
- Cut projected AWS spend by $5M ($6.5M → $1.5M) while scaling throughput 10× by replacing failing simulation infrastructure with a cost-governed compute platform. Only infrastructure engineer on the program — owned every layer.
- Scaled Monte Carlo throughput 50,000× from early baselines to 1.5B+ simulated flight seconds per week across campaigns of 30,000+ runs.
- Broke apart tightly coupled simulation execution into a distributed orchestration platform bridging MATLAB/Simulink autocoded flight software into large-scale AWS compute — tens of thousands of parallel runs per campaign.
- Eliminated quadratic merge costs by architecting map-reduce data pipelines with tree-structured merge strategies and signal-level decomposition, sustaining linear performance at certification scale.
- Owned campaign-scale reliability across 180,000+ stage executions — spot interruption recovery, scale-dependent race condition diagnosis, silent container failure detection.
- Built a real-time cost governance system (Datadog, Prometheus) providing per-campaign and per-workload spend visibility; authored proposal to establish a dedicated HPC & Simulation Platform team across multiple spacecraft programs.
- Sole technical bridge between GNC domain engineers and infrastructure, translating flight dynamics requirements into distributed compute architecture decisions.
- Mentor 2 GNC engineers on CS fundamentals, HPC patterns, and cloud infrastructure, with 3–5 additional informal mentees.
DevSecOps Tech Lead / Staff Software Engineer
2017 – 2022 | Lockheed Martin Space, Denver, CO
- Managed CI/CD and simulation infrastructure for 100+ engineers across 80+ servers, sustaining 40,000 jobs/week at 99.5% uptime. Promoted through three levels from Testbed & Simulation Software Engineer to Staff Software Engineer.
- Migrated 100+ legacy simulation and testbed workflows to cloud-native CI/CD in AWS with autoscaling compute.
- Owned hardware-in-the-loop testbeds through multiple successful qualification events as simulation product owner.
- Led Digital Twin initiative from concept to production as scrum master; delivered customer demonstrations to the Air Force.
- Mentored junior developers as part of day-to-day work and broader team culture.
Software Engineer Intern
Summers 2013–2016 | Lockheed Martin Space, Huntsville & Denver
Graduate Teaching Assistant
2012 – 2016 | Tennessee Tech, Cookeville, TN
Skills
- Distributed Compute Architecture: large-scale Monte Carlo orchestration, map-reduce pipeline design, CPU affinity optimization, campaign-scale reliability engineering
- Cloud Cost Engineering: AWS (EC2, S3, EKS), spot fleet strategy, workload-level cost attribution, autoscaling, Terraform, right-sizing
- Platform & Observability: Datadog, Prometheus, Grafana, GitLab CI, Docker, Kubernetes, OpenShift, infrastructure-as-code, Linux (RHEL, Ubuntu)
- Programming: Python, C/C++, Bash, Git
Where I'm Useful
- AWS / cloud compute bills scaling faster than throughput
- Simulation or HPC platforms that can't scale past prototypes
- Monte Carlo / batch compute with poor fleet utilization
- AI training infrastructure with GPU scheduling problems
- CI/CD bottlenecks in compute-heavy workflows
Education
M.S. Computer Science
May 2017 | 3.9 GPA | Tennessee Tech University
Advanced coursework in High-Performance Computing; Thesis: Autonomic Protection Systems
B.S. Computer Science, cum laude
May 2015 | 3.5 GPA | Tennessee Tech University
Minor: Statistical Mathematics
Other
Eagle Scout | Active in skiing, climbing, and mountaineering
I apply these cost-governance and platform architecture patterns to new systems. Typical engagements focus on architecture audit, cost reduction, and platform redesign. Open to principal-level roles and select consulting.