Forward Deployed Engineer: AI + HPC at Cedana | Y Combinator
Introducing Cedana
The Problem
AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money.
Cluster productivity directly determines research output and revenue. Achieving high utilization and
throughput is increasingly challenging due to the complexity of workloads, hardware, and operations.
Cedana’s Solution
Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable
transparent and fast migration of GPU workloads across instances, without losing work. Workloads automatically migrate
to achieve new levels of reliability and throughput while accelerating time to results. Our system operates at the kernel/OS
level, requiring no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're
deploying into leading inference platforms, neoclouds, enterprise, and research clusters.
The Team
Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our
research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in
distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet
control planes, and OTA infrastructure that performs reliably over constrained networks. We bring repeat-founder
experience, having built and exited a healthcare AI company.
The Role
What you’ll own
As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagement from end to end. You’ll engage with
customers to understand their environments and deploy into them: from production SLURM at a university, to bare-metal
Kubernetes at an inference provider, to a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly understand
their key pain points and use Cedana to solve their problems. For each customer you own everything from the OS up: SLURM plugins,
Kubernetes operators, node configuration, networking, and observability.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research
and commercial customers to deliver a breakthrough solution.
What You'll Do
- Engineer solutions at client sites: Lead customer integrations. Install, configure, and deploy Cedana into SLURM,
Kubernetes, and Dynamo environments.
- Drive product innovation from the field: Identify technical gaps while embedded with clients, then provide product
feedback for new capabilities that become core product features.
- Measure and optimize platform performance: Measure reliability, throughput, and performance using our internal tools.
Design and implement policy-based migration automations to optimize reliability, throughput, and performance.
- Own critical deployments: Ensure our platform performs reliably for clients' critical operations, debugging issues
across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when
necessary.
- Improve scalability: Build the internal install playbook so the second customer in each segment is faster than the
first.
- Respect our customers: Find ways to make their lives easier and minimize the time and overhead we ask of them.
What we are looking for
- 3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments.
- A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write
effective status updates that keep your team informed and on schedule.
- Production experience standing up SLURM in a customer or research environment. You've configured slurmctld,
slurmdbd, accounting, cgroup integration, and GPU resource selection.
- Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, kernel module loading, and PAM
session modules. You read strace and dmesg output and form a hypothesis.
- Working Kubernetes operations experience, including operators, CRDs, device plugins, and node-level debugging. You've debugged a
controller in production even if you haven't written one from scratch.
Bonus if you have
- Experience on an HPC integrator field team.
- Client-facing technical experience working directly with customers.
- Background in national lab user services or university research computing.
- Experience developing SLURM plugins, with an understanding of their architecture and how they fit into the overall platform.
- Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks.
- Hands-on with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration.
- Contributions to open-source schedulers or job systems (SLURM, Flux, Torque, PBS).
- A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next
morning.
Logistics
- Remote, US-based. ~25% travel for customer installs.
- Base $140,000–$180,000 + meaningful early-stage equity.
Benefits
- 100% covered medical, dental, and vision insurance for employees and families
- Unlimited PTO policy
- 401(k) plan
Equal Opportunity Employer
Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without
regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran
status, or disability status.