Staff Production Engineer, Managed AI
Crusoe
Software Engineering, Data Science
San Francisco, CA, USA
Posted on Mar 8, 2026
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.
We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
About the Role:
At Crusoe, our Production Engineering team ensures the reliability, scalability, and operational excellence of Crusoe’s AI-optimized cloud platform. We’re looking for a Staff Production Engineer with deep experience in distributed systems and hands-on exposure to large language models to help build and operate managed AI services at scale.
This role sits at the intersection of software engineering and infrastructure, focusing on designing, operating, and improving the production systems that power Crusoe’s managed AI platform. You will help ensure highly available, performant, and cost-efficient infrastructure capable of supporting compute-intensive, latency-sensitive AI workloads for customers running large-scale training and inference.
What You’ll Work On:
- Design and operate reliable production systems for managed AI services, with a focus on serving and scaling LLM workloads
- Build automation, tooling, and reliability systems to support distributed AI pipelines and inference platforms
- Define, measure, and improve SLIs and SLOs across AI workloads to ensure performance and reliability targets are consistently met
- Partner with AI, platform, and infrastructure teams to improve reliability, efficiency, and scaling of large-scale training and inference clusters
- Build observability and telemetry systems to monitor latency-sensitive AI services and identify performance bottlenecks
- Investigate and resolve reliability issues in distributed production environments using logs, metrics, tracing, and profiling
- Contribute to the architecture of next-generation AI infrastructure and distributed systems designed for large-scale production environments
- Drive improvements in operational automation, incident response, and system resiliency across Crusoe’s AI platform
What You’ll Bring to the Team:
- Strong software engineering background, with experience building and operating production-grade systems beyond scripting or basic automation
- Demonstrated experience designing and operating large-scale distributed systems
- Hands-on experience working with LLMs or AI/ML infrastructure, including training or inference systems
- A Production Engineering / SRE mindset, including experience with:
  - Defining and measuring SLIs and SLOs
  - Building monitoring and observability systems
  - Driving performance and reliability improvements in production environments
  - Designing fault-tolerant systems and automated testing strategies
- Proficiency in at least one modern programming language such as Python, Go, Java, or C++
- Experience working with Kubernetes or container orchestration platforms
- Strong collaboration and communication skills across engineering teams
- Ability to thrive in a fast-moving, mission-driven environment
Bonus Points:
- Experience scaling LLM training or inference workloads in production environments
- Experience building or operating AI platforms or managed AI services
Benefits:
- Industry competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options including HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid Parental Leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month
Compensation will be paid in the range of $204,000 – $247,000 + bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.