Dustin Gardner
I fix compute platforms when cost and scale break at the same time.
Sole infrastructure engineer responsible for a certification-critical simulation platform, where failure would have blocked program-level analysis.
Reduced a projected $6.5 M AWS spend trajectory to $1.5 M while scaling throughput 10× within the first year. Built and owned the entire cost-governed compute platform end to end as the sole infrastructure engineer on the program.
Reduced projected AWS spend by 77%
$6.5 M → $1.5 M over first year
1.5 B+ simulated flight seconds per week
Certification-scale throughput
30,000+ Monte Carlo runs per campaign
Across autoscaled EC2 fleets
Sole owner of the full platform stack
Every layer, end to end
Full architecture + cost breakdown →
This failure mode is common.
If your compute costs are rising faster than your output, you are already here.
Compute systems scale before cost control does. At first it looks like progress: more runs, more data, more throughput. Then it flips: cost grows faster than output, pipelines stall behind bottlenecks, and reliability issues appear only at scale.
When cost and scale break at the same time, the problem is not just financial. Teams slow down. Confidence in results drops. Critical decisions get delayed. In programs where simulation output drives certification or product direction, that becomes a schedule risk.
At that point the system is no longer accelerating the team. It is the bottleneck.
Cost-Governed Compute Platform
If you're dealing with this failure mode, this is the system I bring in.
- Monte Carlo orchestration engine: 30,000+ parallel simulation runs per campaign across autoscaled EC2 fleets with RAM-predictive placement and CPU affinity optimization
- Cost governance system with real-time per-workload cost attribution, pricing strategy automation, and regression detection—the mechanism behind the 77% spend reduction
- Map-reduce data pipeline with tree-structured merge architecture and signal-level decomposition, eliminating O(n²) scaling and enabling linear-time aggregation at campaign scale (sketched after this list)
- Campaign reliability framework covering 180,000+ stage executions per campaign: spot interruption recovery, artifact validation, silent-failure detection
- Flight dynamics compute bridge integrating MATLAB/Simulink autocoded flight software into containerized HPC execution, enabling GNC teams to run certification-scale analysis without infrastructure expertise
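The tree-structured merge is the easiest piece of this to show in miniature. Below is a minimal sketch of the idea only; the data shapes and function names are illustrative, and the production pipeline merges on-disk artifacts with signal-level decomposition rather than in-memory dicts.

```python
from functools import reduce

def merge(a: dict, b: dict) -> dict:
    """Combine two partial result sets.

    Illustrative shape only: signal name -> list of samples. The real merge
    unit is an on-disk artifact, but the scaling argument is the same.
    """
    out = {k: list(v) for k, v in a.items()}
    for k, v in b.items():
        out[k] = out.get(k, []) + list(v)
    return out

def sequential_merge(parts: list) -> dict:
    # Naive fold: the growing accumulator is rewritten at every step, so for
    # n similar-sized parts the total work is roughly n^2 / 2 record copies.
    return reduce(merge, parts, {})

def tree_merge(parts: list) -> dict:
    # Pairwise, tree-structured merge: each record passes through about
    # log2(n) merge levels instead of up to n, and no single intermediate
    # artifact is rewritten once per input.
    parts = list(parts)
    if not parts:
        return {}
    while len(parts) > 1:
        parts = [merge(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]
```

The shape of the merge is what matters: once no accumulator grows with campaign size, aggregation time stops curving upward as run counts climb.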
Architecture Pattern
- Decouple execution from orchestration. Let scheduling, placement, and pipeline stages evolve independently.
- Make cost a first-class control signal, not an afterthought measured at the billing page.
- Eliminate data movement bottlenecks. Decompose early, merge late, keep intermediate artifacts bounded.
- Design for campaign-scale reliability, not job-scale success. Small failure rates compound into schedule-breaking problems at 10k+ jobs.
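That last point is worth making concrete. The back-of-envelope below uses the 180,000-stage campaign size quoted above; the per-stage failure rates are illustrative assumptions, not measured numbers.

```python
# Why "small" per-stage failure rates still break campaigns at scale.
# The 180,000 stage executions per campaign comes from the platform above;
# the per-stage failure rates are illustrative assumptions.
stages = 180_000

for rate in (1e-3, 1e-4, 1e-5):
    expected_failures = stages * rate
    p_clean = (1 - rate) ** stages  # chance the whole campaign runs clean
    print(f"per-stage failure rate {rate:.3%}: "
          f"~{expected_failures:.1f} failed stages, "
          f"P(zero failures) = {p_clean:.2e}")
```

Even at a 0.001% per-stage failure rate, a campaign finishes with zero failures only about one time in six. That is why interruption recovery, artifact validation, and silent-failure detection are designed into the platform rather than handled by hand after the fact.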
This pattern is not unique to aerospace. It shows up anywhere large-scale compute is pushed without tight cost and execution control:
- Monte Carlo and simulation platforms
- ML training infrastructure and GPU fleets
- Batch compute systems with complex pipelines
Experience
Senior HPC Platform Engineer
2022 – Present | Blue Origin, Denver, CO
- Replaced a failing simulation platform under active program pressure, reducing projected AWS spend from $6.5 M to $1.5 M while scaling throughput 10×. Sole infrastructure engineer. Owned every layer.
- Broke apart tightly coupled simulation execution into a distributed orchestration platform bridging MATLAB/Simulink autocoded flight software into large-scale AWS compute. Tens of thousands of parallel runs per campaign.
- Diagnosed and eliminated quadratic merge costs by architecting map-reduce data pipelines with tree-structured merge strategies and signal-level decomposition, sustaining linear performance at certification scale.
- Owned campaign-scale reliability across 180,000+ stage executions: spot interruption recovery, scale-dependent race condition diagnosis, silent container failure detection.
- Built a real-time cost governance system (Datadog, Prometheus) providing per-campaign and per-workload spend visibility; authored proposal to establish a dedicated HPC & Simulation Platform team across multiple spacecraft programs.
- Sole technical bridge between GNC domain engineers and infrastructure, translating flight dynamics requirements into distributed compute architecture decisions.
- Mentor 2 GNC engineers on CS fundamentals, HPC patterns, and cloud infrastructure, with 3–5 additional informal mentees.
DevSecOps Tech Lead / Staff Software Engineer
2017 – 2022 | Lockheed Martin Space, Denver, CO
- Managed CI/CD and simulation infrastructure for 100+ engineers across 80+ servers, sustaining 40,000 jobs/week at 99.5% uptime. Promoted through three levels from Testbed & Simulation Software Engineer to Staff Software Engineer.
- Migrated 100+ legacy simulation and testbed workflows to cloud-native CI/CD in AWS with autoscaling compute.
- Owned hardware-in-the-loop testbeds through multiple successful qualification events as simulation product owner.
- Led Digital Twin initiative from concept to production as scrum master; delivered customer demonstrations to the Air Force.
- Mentored junior developers as part of day-to-day work and broader team culture.
Software Engineer Intern
Summers 2013 – 2016 | Lockheed Martin Space, Huntsville & Denver
Graduate Teaching Assistant
2012 – 2016 | Tennessee Tech, Cookeville, TN
Skills
- Distributed Compute Architecture: large-scale Monte Carlo orchestration, map-reduce pipeline design, CPU affinity optimization, campaign-scale reliability engineering
- Cloud Cost Engineering: AWS (EC2, S3, EKS), spot fleet strategy, workload-level cost attribution, autoscaling, Terraform, right-sizing
- Platform & Observability: Datadog, Prometheus, Grafana, GitLab CI, Docker, Kubernetes, OpenShift, infrastructure-as-code, Linux (RHEL, Ubuntu)
- Programming: Python, C/C++, Bash, Git
Teams That Bring Me In
- AWS / cloud compute bills scaling faster than throughput
- Simulation or HPC platforms that can't scale past prototypes
- Monte Carlo / batch compute with poor fleet utilization
- CI/CD bottlenecks in compute-heavy workflows
Education
M.S. Computer Science
May 2017 | 3.9 GPA | Tennessee Tech University
Advanced coursework in High-Performance Computing; Thesis: Autonomic Protection Systems
B.S. Computer Science, cum laude
May 2015 | 3.5 GPA | Tennessee Tech University
Minor: Statistical Mathematics
Other
Eagle Scout | Active in skiing, climbing, and mountaineering
Spending more but shipping less?
Typical engagements: architecture audit, cost reduction, platform redesign.