Data Center Systems Operations Engineer at Lambda
Lambda · San Francisco, United States of America · Hybrid
- Senior
- Office in San Francisco
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position is preferably based in our Bay Area offices, but is open to remote work for the right candidate.
About the Job
As Lambda continues to scale its AI platform and customer base, infrastructure decisions must be tightly aligned with product roadmaps, platform growth, and fiscal discipline. The Systems Operations Engineer will own availability analysis, long-term improvement of utilization, input into strategic design, and implementation of key programs across the entire Infrastructure Stack.
This role sits within the Data Center Infrastructure (DC Infra) team and will work cross-functionally with Product, Platform Engineering, and Observability to understand overall health, analyze ongoing and potential issues, recommend and make changes to our overall design, and own key programs to improve the overall business.
This position is a critical link between HPC/HW systems and DC Infra, and will help ensure our designs and operations maximize availability and reliability across our entire Platform.
What You’ll Do
Availability Analysis
Own end-to-end unification of availability (number of nines) calculations across Lambda's data center products and data center footprints, from power/BMS/cooling down to the rack/GPU level, and provide adequate telemetry back to facilities, site operations, and the platform level
Work with the thermal/hardware team to understand the impact of AI workloads on mechanical systems and the need for different BMS control methodologies as direct-to-chip liquid cooling (DLC) technologies improve and densities increase
Coordinate across DC Infra team to calculate estimated availabilities for new data center designs
Work with product teams and capacity forecasting to understand how design decisions affecting availability impact time to market and customer needs
Utilization Analysis and Oversubscription Strategy
Own end-to-end utilization analysis across Lambda's entire data center infrastructure
Analyze DC designs to understand peak possible capacity under varying conditions
Build an oversubscription strategy and lead/own the company workstream to maximize available MW without impacting GPU reliability or customer experience
Ensure appropriate availability considerations are included
Observability and Analytics
Coordinate with the Observability team to ensure appropriate points are monitored to understand data center load characteristics, especially under AI workloads
Help the team understand where approximate warning/danger levels are
Use observations and warning/danger levels to inform the basis of design (BOD) for future data centers and suggest upgrades/modifications to current data centers
Develop strategy for a data center fleet health dashboard
Help provide structure so that overall day-to-day and long-term health can be understood from a 20,000-foot level, with the ability to drill down into the details
Power Capping Strategy and Implementation
Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response/protection as we start aggressively employing oversubscription
Identify appropriate IT blocks where real-time data is monitored
Analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization
Site Selection Technical Review
Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility
Perform risk assessments and recommend sites based on infrastructure fit and growth capacity.
Coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines.
Cluster-to-Facility Requirements Alignment
Collaborate with HPC Architecture team and Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications.
Define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between compute stack and facility capabilities.
Support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns.
Work with Capacity Manager to understand various levers that can be employed to accelerate growth during demand surges.
You
Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
10+ years of experience working directly in or closely with data center infrastructure and HPC/HW operations
Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
Excellent communication and collaboration skills across technical, operational, and financial stakeholders
Preferred Experience
Prior experience in hyperscale or cloud infrastructure environments
Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
Working knowledge of typical Data Center Infrastructure designs, topologies, systems and associated reliability/availability calculations
Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
Engineering degree from a university; Master's preferred
Experience working across multi-disciplinary and non-technical teams to explain findings
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
Apply now