Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.
We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
About This Role:
Crusoe is building the cloud infrastructure that powers the next generation of AI, and we're looking for a Senior Engineering Manager, Production Engineering to lead the team that keeps it running. This is a senior people management role reporting to the Director of Production Engineering — sitting at the intersection of deep technical leadership and organizational impact, with direct ownership over the reliability and operational health of Crusoe's production GPU infrastructure. You'll lead and develop a 24/7 team responsible for incident response, monitoring and alerting, automation, and continuous system improvement across a fast-scaling, high-stakes environment, while also shaping the broader strategy, culture, and structure of the function.
The ideal candidate is a seasoned technical leader who has built, scaled, and managed on-call operations teams in complex environments — someone who brings both rigor and vision to SLOs and postmortems, takes coaching and performance management seriously, and can drive alignment across engineering leadership on reliability strategy. If you're energized by the challenge of building a high-performing team while keeping complex systems reliable at scale, this role offers significant ownership and strategic impact at a critical moment in Crusoe's growth.
What You'll Be Working On:
Team Leadership & Development: Manage, coach, and grow a team of production engineers across shifts and time zones. Run structured 1:1s focused on career development, deliver candid performance feedback, and build a team culture grounded in ownership and continuous improvement.
Hiring & Onboarding: Partner with engineering leadership and recruiting to grow the team — owning the full hiring lifecycle from interview design to offer. Build and continuously improve onboarding and training programs that ramp new engineers quickly and effectively.
Incident Management: Serve as an escalation point for high-severity incidents. Lead postmortems with a focus on systemic fixes, ensure action items are tracked and completed, and drive down MTTR over time.
Reliability & SLO Ownership: Define, monitor, and report on SLIs, SLOs, and SLAs across Crusoe's production systems. Surface trends proactively and partner with engineering teams to address reliability gaps before they become customer issues.
Monitoring & Alerting: Oversee the design and maintenance of alerting and observability systems across bare-metal and cloud infrastructure, ensuring the team has the signal it needs to detect and respond to issues fast.
Automation & Toil Reduction: Identify and prioritize opportunities to automate repetitive operational work, improving team efficiency and system resilience over time.
Cross-Functional Partnership: Collaborate with infrastructure, platform engineering, product, and customer success teams to align on technical escalations, customer impact, and engineering priorities.
Operational Cadence: Own the team's day-to-day operational rhythm — stand-ups, on-call rotations, incident reviews, and sprint planning — ensuring the team runs smoothly across time zones.
What You'll Bring to the Team:
6+ years of experience managing 24/7 technical operations or SRE teams in cloud or data center environments, including demonstrated success developing senior engineers, building organizational capability, and improving operational outcomes at scale.
Strong Linux and infrastructure fundamentals, including hands-on experience with containerization, Kubernetes, and virtualization in production environments.
Observability and monitoring expertise, including experience with Prometheus, VictoriaMetrics, and custom exporters — ideally against bare-metal endpoints.
Familiarity with messaging and workflow systems such as RabbitMQ, Kafka, NATS, or Temporal, and an understanding of how they function in distributed production environments.
Working proficiency in Golang or Python — enough to review production code, contribute meaningfully to technical design discussions, and support your engineers' work.
Demonstrated people management skills, including experience with structured performance management, individualized coaching, and building or improving onboarding and training programs.
SLA/SLO ownership experience — you've set them, measured them, reported on them, and held teams accountable to them in a customer-facing environment.
A track record of influencing cross-functional strategy and driving alignment across engineering leadership on operational priorities.
Bonus Points:
Experience with GPU infrastructure, HPC, or AI/ML cloud environments.
Familiarity with infrastructure-as-code tooling such as Terraform or Ansible.
Experience scaling an operations team and function through a period of rapid headcount or infrastructure growth.
Background in data center operations, including familiarity with physical infrastructure, hardware lifecycle, and network fundamentals.
Benefits:
Crusoe offers a competitive benefits package designed to support financial security, health, and overall well-being, including pension contributions, private health and dental insurance, income protection, life assurance, and more.
Compensation:
Compensation will be paid as a salary or an hourly wage. It will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.