Superhuman offers a dynamic hybrid working model for this role. This flexible approach gives team members the best of both worlds: plenty of focus time along with in-person collaboration that helps foster trust, innovation, and a strong team culture.
About Superhuman
Grammarly is now part of Superhuman, the AI productivity platform on a mission to unlock the superhuman potential in everyone. The Superhuman suite of apps and agents brings AI wherever people work, integrating with over 1 million applications and websites. The company's products include Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, the proactive AI assistant that understands context and delivers help automatically. Founded in 2009, Superhuman empowers over 40 million people, 50,000 organizations, and 3,000 educational institutions worldwide to eliminate busywork and focus on what matters. Learn more at superhuman.com and about our values here.
The Opportunity
To achieve our ambitious goals, we're looking for an SRE to join our infrastructure team. This role will be responsible for building software to ensure the reliability of our back-end systems, working with engineers who develop them, and planning for our future growth. You will work with our existing production engineering teams in the EU as we transition away from a “you build it, you own it” model.
Superhuman's engineers and researchers have the freedom to innovate and uncover breakthroughs and, in turn, to influence our product roadmap. The complexity of our technical challenges is growing rapidly as we scale our interfaces, algorithms, and infrastructure. You can hear more from our team on our technical blog.
As an SRE, you will
Scale our Kubernetes-based control plane that processes billions of events per day.
Improve our automation mechanisms that react to our workload.
Deploy ML systems across the company.
Qualifications
Has 5+ years of relevant experience as an SRE or DevOps engineer.
Has experience participating in incident management processes.
Is familiar with Docker, Linux, and Terraform.
Has used AWS, Azure, or GCP.
Java and Kubernetes skills preferred, but not required.
Has a demonstrated ability to work independently with minimal guidance, proactively manages tasks and priorities across multiple projects, analyzes and executes work efficiently, collaborates effectively with cross‑functional teams, and thrives in fast‑paced, results‑driven environments.
Embodies our EAGER values: is ethical, adaptable, gritty, empathetic, and remarkable.
Is inspired by our MOVE principles: move fast and learn faster; obsess about creating customer value; value impact over activity; and embrace healthy disagreement rooted in trust.
Compensation and Benefits
Superhuman offers all team members competitive pay along with a benefits package encompassing the following and more:
Excellent health care (including a wide range of medical, dental, vision, mental health, and fertility benefits)
Disability and life insurance options
401(k) and RRSP matching
Paid parental leave
20 days of paid time off per year, 12 days of paid holidays per year, two floating holidays per year, and flexible sick time
Generous stipends (including those for caregiving, pet care, wellness, your home office, and more)
Annual professional development budget and opportunities
Superhuman takes a market-based approach to compensation, which means base pay may vary depending on your location. Our US locations are categorized into two compensation zones based on proximity to our hub locations.
Base pay may vary considerably depending on job-related knowledge, skills, and experience. The expected salary ranges for this position are outlined below by compensation zone and may be modified in the future.
United States:
Zone 1: $214,000 - $260,000 /year (USD)
We encourage you to apply
At Superhuman, we value our differences, and we encourage all to apply, especially those whose identities are traditionally underrepresented in tech organizations. We do not discriminate on the basis of race, religion, color, gender expression or identity, sexual orientation, ancestry, national origin, citizenship, age, marital status, veteran status, disability status, political belief, or any other characteristic protected by law. Superhuman is an equal opportunity employer and a participant in the US federal E-Verify program (US). We also abide by the Employment Equity Act (Canada).
$214k-260k yearly 2d ago
Site Reliability Engineer III
Veeam 4.1
San Francisco, CA jobs
Veeam, the #1 global market leader in data resilience, believes businesses should control all their data whenever and wherever they need it. Veeam provides data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running. Join us as we move forward together, growing, learning, and making a real impact for some of the world's biggest brands. The future of data resilience is here - go fearlessly forward with us.
About The Role
We are looking for an experienced Senior Site Reliability Engineer to join the Veeam Data Cloud (VDC) engineering team. You will be working with a global team to build the world's next modern data protection platform for Veeam. This is an excellent opportunity for someone with SaaS experience to work with a cutting‑edge technology stack based on containers, serverless infrastructure, Golang, and public cloud services in the SaaS domain.
What You'll Do
Design, implementation and maintenance of scalable and reliable infrastructure solutions on Microsoft Azure and additional cloud platforms in the future
Automation of the deployments, maintenance of a resilient, secure, and efficient SaaS application platform to meet established service levels
Upkeep and support of delivery and release pipelines
Continuous evaluation and improvement of the reliability, performance, and scalability of our systems
Development of comprehensive monitoring and alerting solutions
Incident response for distributed applications in production environments, including mandatory participation in on-call rotations
Proactively meet standards for information security and compliance, such as ISO (International Organization for Standardization), SOX (Sarbanes-Oxley), SSAE (Statement on Standards for Attestation Engagements) 16, etc.
Shepherd the definition, documentation, and improvement of our internal standards for style and maintainability
Technologies We Work With
Microsoft TFS, Azure DevOps, Git, BitBucket
Azure (Entra ID, API Management, Cosmos DB, Storage services, Azure Functions, static website hosting, Azure security, etc.)
IaC tools (Azure ARM templates, AWS CloudFormation, Terraform, the Serverless Framework, etc.)
Observability (Azure Monitor, AppInsights, Elastic Stack)
What You'll Bring
3+ years of experience in 24x7 production operations for a SaaS (Software as a Service) or cloud service provider
Experience with implementation and maintenance of leading infrastructure and application monitoring tools (Azure Monitor, AppInsights, Elastic Cloud)
Experience managing Azure IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) solutions
Strong problem‑solving skills and the ability to troubleshoot complex issues in a distributed, multi‑tenant environment
Experience with container orchestration and management platforms
Possess system programming skills in Python, PowerShell, Bash, Go, etc.
Experience with implementation, maintenance, and support of CI/CD practices and tools (Azure DevOps or similar)
Experienced with distributed, event‑based messaging architectures (Azure Event Hub, Azure Service Bus, Kafka, etc.)
English proficiency level sufficient to communicate with international teams
Bonus Skills
Industry‑recognized certifications in the relevant field (e.g. AZ‑400, AWS Certified DevOps Engineer, DCA)
Experience with migrating and adapting on‑premises products to cloud infrastructure
Experience with AWS (ECS, RDS, DynamoDB, VPCs, Step Functions, Lambda, IAM, EC2, S3, etc.)
Experience with C# and .NET
Remote work is only possible for employees located in the United States.
What You'll Get
Unlimited paid time off, 12 paid holidays, plus 4 extra global Veeam Days for self‑care and 24 paid volunteer hours annually through Veeam Cares
Paid parental leave: 8 weeks for all parents, 16 weeks for birthing parents
Medical, dental, and vision coverage starting on your first day
Mental health support, therapy sessions, and digital wellness tools via our Employee Assistance Program
401(k) retirement plan with company matching contributions
Fertility, adoption, and surrogacy support through Maven, plus paid volunteer time
AirVet: 24/7 virtual veterinary care at no cost
Legal services, identity protection, and supplemental health insurance options
Tax‑advantaged spending accounts for healthcare, dependent care, and commuting
Opportunities to learn and grow through on‑demand libraries (LinkedIn Learning, O'Reilly), mentoring, workshops, and learning events like our annual Global Day of Learning
Compensation Transparency
Veeam is committed to pay transparency and equitable compensation. For this role, the compensation range below reflects the expected total target compensation (TTC), inclusive of base pay and a competitive performance‑based bonus. For roles with a commission plan, the compensation range represents On Target Earnings (OTE), which includes base salary plus variable commission. When determining compensation, Veeam takes into consideration factors such as experience, education, skills, and geographic zone. Offers are typically made below the midpoint of the range.
In addition to compensation, Veeam provides a comprehensive benefits package, including health coverage, retirement plans, and unlimited time off.
Zone 1: San Francisco Bay Area, New York City Boroughs - $151,500 - $252,500 USD
Zone 2: Washington, California (excluding San Francisco Bay Area) - $138,900 - $231,400 USD
Zone 3: Texas, Illinois, North Carolina, Colorado, Massachusetts, Pennsylvania, Virginia, Oregon, Nevada, Hawaii, New York (excluding NYC boroughs); Sales roles located in Georgia, Ohio, and Arizona - $126,300 - $210,400 USD
Zone 4: All other US locations - $109,800 - $183,000 USD
Veeam Software is an equal opportunity employer and does not tolerate discrimination in any form on the basis of race, color, religion, gender, age, national origin, citizenship, disability, veteran status or any other classification protected by federal, state or local law. All your information will be kept confidential.
$151.5k-252.5k yearly 2d ago
Staff Site Reliability Engineer
Veeam 4.1
San Francisco, CA jobs
Veeam, the #1 global market leader in data resilience, believes businesses should control all their data whenever and wherever they need it. Veeam provides data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running. Join us as we move forward together, growing, learning, and making a real impact for some of the world's biggest brands. The future of data resilience is here - go fearlessly forward with us.
About the Role
Veeam is launching a global Site Reliability Engineering (SRE) function to support the rollout and operation of our new SaaS offering: the Veeam Data Cloud. As a Staff Site Reliability Engineer, you will serve as a hands‑on technical leader within the SRE team, guiding senior engineers, influencing product development teams, and ensuring the systems we operate are built to be reliable, scalable, and observable from the ground up. You will drive strategic initiatives, mentor others in the practice of SRE, and help define architectural best practices across our platform. This role is pivotal in aligning teams, enforcing high standards, and scaling SRE principles globally within Veeam.
Reliability Engineering & Resilience
Act as a technical authority in your area, mentoring senior engineers and guiding design choices that improve service reliability and resilience.
Lead the definition and enforcement of SLIs, SLOs, and error budgets; drive adherence across engineering teams.
Collaborate with Staff peers across teams to align strategy and champion shared reliability standards and goals.
Partner with development and product teams to proactively design for failure, build resilient architecture, and operationalize reliability from the start.
Observability & Operational Excellence
Drive company‑wide adoption of observability best practices and tooling.
Ensure metrics, logs, and traces provide deep, actionable insights across systems.
Lead complex incident responses, post‑mortems, and systemic reliability improvements.
Promote and enforce a blameless culture of learning and continuous improvement.
Engineering at Scale
Lead initiatives in infrastructure as code, deployment automation, and resilience testing.
Influence the development and adoption of chaos engineering practices and release validation frameworks.
Partner with platform and security teams to ensure production readiness.
Collaboration & Culture
Work closely with your peer Staff Engineers to plan, align, and deliver against reliability goals.
Provide architectural guidance and advocate for engineering rigor and consistency.
Represent the SRE team in technical leadership forums and product planning discussions.
What We Are Looking For
Required
8+ years of experience in a Software Engineering or SRE role, including technical leadership.
Demonstrated experience mentoring and guiding senior engineers.
Deep expertise in building distributed systems on public cloud (Azure preferred).
Strong skills in programming (e.g., JS, Go, TypeScript, Java, or C#).
Hands‑on experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).
Mastery of infrastructure automation tools (Terraform, Pulumi) and container orchestration (Kubernetes).
Ability to communicate clearly across geographies and disciplines.
Preferred
Experience leading SRE initiatives across multiple product teams.
Background in chaos engineering, incident learning, or performance and load testing.
Familiarity with global compliance standards (ISO, SOC 2, GDPR, FedRAMP, CMMC).
Why Join Veeam?
Be a core architect in the rollout of Veeam's first global SaaS offering - the Veeam Data Cloud.
Help shape a modern, engineering‑driven SRE practice from the ground up.
Influence long‑term reliability and architecture across a global product portfolio.
Work in a collaborative environment with engineering leaders who value strategic thinking, hands‑on problem solving, and customer empathy.
Enjoy competitive pay and benefits, flexible work arrangements, and a team culture built on learning, ownership, and impact.
Benefits
Unlimited PTO
Paid Holidays
Veeam Care Days: 24 hours paid time for volunteering
Medical, dental, and vision coverage starting on day one (multiple plan options)
Flexible Spending Accounts (FSA) and Health Savings Account (HSA) options
Employer HSA contributions (for HDHP participants)
Life and AD&D insurance (employee, spouse/partner, and child options)
Company‑paid short‑term and long‑term disability insurance
Supplemental individual disability insurance (IDI)
Family planning support: fertility, adoption, surrogacy, and parental resources
Paid parental leave
Employee Assistance Program
Additional voluntary benefits: accident, critical illness, hospital indemnity, legal, identity theft protection, commuter benefits, pet care
Mental health support
401(k) plan
Professional training and education, on‑demand learning libraries (LinkedIn Learning, O'Reilly), mentoring, workshops, and Global Day of Learning
Compensation Transparency
Veeam is committed to pay transparency and equitable compensation. For this role, the compensation range below reflects the expected total target compensation (TTC), inclusive of base pay and a competitive performance‑based bonus. For roles with a commission plan, the compensation range represents On Target Earnings (OTE), which includes base salary plus variable commission. Offers are typically made below the midpoint of the range.
Zone 1: San Francisco Bay Area, New York City Boroughs - $293,100 - $544,200 USD
Zone 2: Washington, California (excluding San Francisco Bay Area) - $268,600 - $498,900 USD
Zone 3: Texas, Illinois, North Carolina, Colorado, Massachusetts, Pennsylvania, Virginia, Oregon, Nevada, Hawaii, New York (excluding NYC boroughs); Sales roles located in Georgia, Ohio, and Arizona - $244,200 - $453,500 USD
Zone 4: All other US locations - $212,500 - $394,600 USD
Veeam Software is an equal opportunity employer and does not tolerate discrimination in any form on the basis of race, color, religion, gender, age, national origin, citizenship, disability, veteran status or any other classification protected by federal, state or local law. All your information will be kept confidential.
Please note that any personal data collected from you during the recruitment process will be processed in accordance with our Recruiting Privacy Notice.
The Privacy Notice sets out the basis on which the personal data collected from you, or that you provide to us, will be processed by us in connection with our recruitment processes.
By submitting your application, you acknowledge that the information provided in your job application and any supporting documents is complete and accurate to the best of your knowledge. Any misrepresentation, omission, or falsification of information may result in disqualification from consideration for employment or, if discovered after employment begins, termination of employment.
$118k-159k yearly est. 3d ago
Site Reliability Engineer US - San Francisco
Near Inc. 4.6
San Francisco, CA jobs
The NEAR AI engineering team is developing decentralized and confidential machine learning infrastructure to power user-owned AI. We currently focus on building infrastructure to enable private and confidential inference that works across different compute providers, as well as a blockchain-based coordination layer that incentivizes compute providers to join the decentralized inference network.
You will own various components and drive critical decisions throughout their life cycles, including architecture, implementation, and maintenance. You will collaborate with highly knowledgeable and skilled colleagues who are passionate about solving hard problems that can disrupt the industry.
What You'll Be Doing:
End-to-end infrastructure ownership (for handling telemetry data, performing testing, etc.)
Design and implementation of infrastructure components that manage clusters of GPUs with special configurations
Performance tuning and optimizations
Create and maintain runbooks that support the on-call rotation
Participate in the on-call rotation
Support code releases and delivery
Plan and implement infrastructure cost and security strategies
Plan and implement effective CI/CD pipelines to facilitate development processes
What We're Looking For:
Agility to quickly learn new programming languages and technologies
Ability to write clean and efficient code
Ability to transform ambiguous problems into tangible solutions or prototypes
Experience with software concurrency or parallelism
Experience in building, operating, and scaling cloud infrastructure (GCP, AWS, etc.)
Experience with data visualization and observability tooling (Grafana, Graphite, Zabbix, etc.)
Detail-oriented mindset with a focus on setting priorities and progressing towards objectives
Excellent communication and teamwork skills
Bachelor's Degree in Computer Science or a related field
We'd Love If You Have:
Experience with NEAR or other blockchain internals
Experience with GPUs
Experience with Trusted Execution Environments
Experience debugging and troubleshooting complex concurrent systems
Professional experience with Rust
Locations:
Onsite, San Francisco office
$126k-176k yearly est. 6d ago
Site Reliability Engineer - AI Inference Infra & GPU Clusters
Near Inc. 4.6
San Francisco, CA jobs
A tech company specializing in AI infrastructure based in San Francisco is looking for a candidate to own the development of decentralized machine learning infrastructure. The role involves designing components, performance tuning, and collaboration with skilled colleagues. The ideal candidate should have experience in Cloud infrastructure and software concurrency, along with a Bachelor's degree in Computer Science. Excellent communication skills and the ability to learn quickly are essential. The position is onsite at the San Francisco office.
$126k-176k yearly est. 6d ago
Low-Voltage Reliability Engineer for EV Electronics
Rivian 4.1
Palo Alto, CA jobs
A leading electric vehicle manufacturer in Palo Alto is seeking a Design-for-Reliability Engineer to enhance the reliability of low voltage electronics in their vehicles. The role involves monitoring product performance, utilizing statistical methods, and collaborating with manufacturing teams. The ideal candidate has a Bachelor's degree in Engineering and over five years in reliability engineering. Salaries are competitive, ranging from $146,900 to $194,610 based on experience.
$146.9k-194.6k yearly 4d ago
Site Reliability Engineer - Observability
Rivian 4.1
Palo Alto, CA jobs
About Us
Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive's next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we're addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world.
The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we'll map a new way forward. Working together, we'll create a future that's more connected, more intelligent, more sustainable for everyone.
Role Summary
We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems.
Responsibilities
Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting.
Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments.
Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements.
Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data.
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization.
Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle.
Qualifications
Educational Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability.
Technical Expertise:
Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog.
Experience with OpenTelemetry and distributed tracing in microservices architectures.
Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane.
Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions.
Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals.
Soft Skills: Exceptional problem-solving, communication, and a data-driven approach to decision-making.
Pay Disclosure
Salary Range/Hourly Rate for California Based Applicants: $146,900 - $194,610 USD
Actual Compensation will be determined based on experience, location, and other factors permitted by law.
Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical, prescription, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and their children up to age 26. Coverage is effective on the first day of employment.
Equal Opportunity
Rivian and Volkswagen Group Technologies is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law. We are also committed to ensuring compliance with all applicable fair employment practice laws regarding citizenship and immigration status.
Rivian and Volkswagen Group Technologies is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.
Candidate Data Privacy
Rivian and VW Group Technologies ("Rivian and Volkswagen Group Technologies") may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. Rivian and Volkswagen Group Technologies may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law.
Rivian and Volkswagen Group Technologies may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian and Volkswagen Group Technologies affiliates; and (iii) Rivian and Volkswagen Group Technologies' service providers, including providers of background checks, staffing services, and cloud services.
Rivian and Volkswagen Group Technologies may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.
Please see our Candidate Data Privacy Notice (English) and Candidate Data Privacy Notice (Serbian) for more information.
Please note that we are currently not accepting applications from third party application services.
$146.9k-194.6k yearly 2d ago
Reliability/DFX Engineer
OpenAI 4.2
San Francisco, CA jobs
About the Team
OpenAI's Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI's supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.
About the Role
We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features.
In this role, you will
Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware.
Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability.
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams).
Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack.
Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI's requirements and roadmap.
Qualifications
BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack.
Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred.
Detailed understanding of ML chip and platform architecture and ML workload characteristics is required.
Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis.
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.
For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.
Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.
To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.
We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
OpenAI Global Applicant Privacy Policy
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations. Candidates should have strong cloud proficiency, experience in containerization technologies, and a bachelor's degree in a related field.
$127k-176k yearly est. 4d ago
Site Reliability Engineer, Managed AI
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About the Role
At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe's AI-optimized cloud platform. We're looking for an SRE with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.
What You'll Work On:
Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
Build automation and reliability tooling to support distributed AI pipelines and inference services
Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
What You'll Bring:
Strong software engineering background - experience building production-grade systems beyond scripting or Bash
Demonstrated experience in distributed systems design and implementation
Hands-on work with large language models (LLMs) or AI/ML infrastructure
SRE mindset and experience (whether or not under the SRE title) including:
Defining and measuring SLIs/SLOs
Building monitoring and observability systems
Driving performance and reliability improvements
Designing fault‑tolerant systems and automated testing strategies
Proficiency in at least one modern programming language (Python, Go, Java, C++)
Familiarity with Kubernetes or container orchestration platforms
Strong collaboration and communication skills
Ability to thrive in a fast‑paced, mission‑driven environment
Bonus Points:
Experience scaling inference or training workloads for LLMs
Benefits:
Industry competitive pay
Restricted Stock Units in a fast growing, well‑funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short‑term and long‑term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company paid commuter benefit; $300 per month
Compensation:
Compensation will be paid in the range of $204,000 - $247,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
$124k-174k yearly est. 3d ago
Prediction & Planning ML Engineer for Autonomous Driving
Zoox 3.4
Foster City, CA jobs
An innovative technology company in California seeks a Prediction and Planning Machine Learning Engineer to develop cutting-edge machine learning models for autonomous vehicles. The ideal candidate holds a PhD or MSc with significant experience and expertise in reinforcement learning. The role includes designing driving plans, analyzing performance metrics, and collaborating across engineering teams. Competitive salary range of $214,000 - $257,000 along with benefits including stock options and insurance.
$214k-257k yearly 5d ago
Senior Site Reliability Engineer - Cloud SaaS & Automation
Veeam 4.1
San Francisco, CA jobs
A leading data resilience company is seeking a Senior Site Reliability Engineer to join their Veeam Data Cloud engineering team in San Francisco. This position involves working with a global team to build a modern data protection platform, focusing on scalable cloud infrastructure, automation, and system monitoring. Candidates should have robust SaaS experience and expertise in Azure platforms. The role offers extensive benefits including unlimited paid time off and competitive compensation based on experience.
$139k-181k yearly est. 2d ago
Senior+ Site Reliability Engineer
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About This Role:
Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform - and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe's GPU cloud.
This role is ideal for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform.
You'll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe's incident management practices.
What You'll Be Working On:
Collaborate with cross-functional teams to define and refine availability metrics for Crusoe's cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs.
Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews.
Build, operate, and monitor infrastructure health using Crusoe's observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability.
Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self‑healing capabilities.
Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness.
Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization.
Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.
What You'll Bring to the Team:
5+ years of experience in cloud operations, SRE, or related roles
Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
Basic scripting and automation experience (Go, Python, C, C++, or similar)
Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement
Bonus Points:
Experience with Kubernetes, container orchestration, or large-scale distributed systems
Exposure to change management, operational readiness reviews, or structured RCAs
Familiarity with self‑healing systems, automated remediation, or event‑driven operations
Interest in scaling AI/HPC infrastructure and solving reliability challenges in GPU‑heavy environments
Passion for learning, mentorship, and developing deeper SRE capabilities over time
Benefits:
Industry competitive pay
Restricted Stock Units in a fast growing, well‑funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company paid commuter benefit; $300 per month
Compensation:
Compensation will be paid in the range of $172,000 - $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
A leading energy technology firm seeks a Site Reliability Engineer to enhance its reliable, energy-efficient, AI-optimized cloud platform. In this role, you'll collaborate with cross-functional teams to improve system performance and incident management. Ideal candidates will have a strong background in cloud operations and automation, alongside critical problem-solving skills. Join this innovative team to drive sustainable technology and contribute to a cutting-edge infrastructure focused on operational excellence.
$142k-189k yearly est. 2d ago
Sr. Reliability Engineer/ Sustaining
Rivian 4.1
Palo Alto, CA jobs
Rivian is on a mission to keep the world adventurous forever. This goes for the emissions‑free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract.
As a company, we constantly challenge what's possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.
Role Summary
Reliability is at the core of Electric Adventure Vehicles. We are looking for engineers who have experience in bringing products from the early concept stages to production by working closely with design, validation and manufacturing teams on solving technical problems along the way to make the most robust products.
Responsibilities
As a Design‑for‑Reliability Engineer focusing on low voltage electronics, you will play a key role in monitoring the performance and health of vehicle control, connected car, and self‑driving products in the field. Essential responsibilities in this role include:
Apply reliability engineering concepts to continuously improve reliability for low voltage electrical systems across all products.
Use statistics to identify reliability trends in the field and provide a clear risk snapshot.
Develop tools and processes to track fleet health metrics for electronics.
Facilitate root cause investigations, with a major focus on physics of failure and a data‑driven approach.
Review electronic Contract Manufacturer test and end‑of‑line data to identify potential risks for downstream assembly and field.
Qualifications
Minimum requirements are a Bachelor of Science in an engineering discipline, or equivalent, and five or more years of industry experience in a reliability engineering role. The level of the role depends on experience and qualifications. A strong technical background is needed in the following areas:
Experience with applied statistics in the field of Reliability.
Experience with a coding language, preferably Python.
Understanding of Highly Accelerated Testing methods and governing equations for different types of failure mechanisms.
Technical knowledge of one or more aspects related to PCBA reliability: failure modes, reliability specification development, design guidelines, water‑cooled electronics, interconnects.
Experience working with Contract Manufacturers in Asia and North America, including performing on‑site audits of manufacturing facilities.
Experience with deploying reliability testing guidelines and inventing new ways of testing.
Pay Disclosure
Salary Range/Hourly Rate for Palo Alto, California Based Applicants: $146,900.00 - $194,610.00 (actual compensation will be determined based on experience, location, and other factors permitted by law).
Salary Range/Hourly Rate for Irvine, California Based Applicants: $135,100.00 - $179,040.00 (actual compensation will be determined based on experience, location, and other factors permitted by law).
Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical/Rx, dental and vision insurance packages for full‑time employees, their spouse or domestic partner, and children up to age 26. Coverage is effective on the first day of employment.
Equal Opportunity
Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.
Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.
Candidate Data Privacy
Rivian may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes (“Candidate Personal Data”). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. Rivian may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law.
Rivian may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian affiliates; and (iii) Rivian's service providers, including providers of background checks, staffing services, and cloud services.
Rivian may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, the United Kingdom, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.
Please note that we are currently not accepting applications from third party application services.
$146.9k-194.6k yearly 4d ago
AI Cloud Storage SRE - Scale & Reliability
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
A technology firm based in California is seeking a Site Reliability Engineer to optimize their AI-optimized cloud infrastructure. The role involves building automation tools, driving reliability initiatives, and collaborating with engineers to ensure high-performance storage systems. Candidates should have strong experience in SRE, distributed storage systems, and programming languages. The position offers competitive compensation and various benefits, including health insurance and stock options.
$92k-119k yearly est. 2d ago
Production Engineer, Storage
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About This Role:
At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE role is responsible for ensuring the availability, performance, and scalability of Crusoe's cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.
What You'll Be Working On:
In this role, you will build automation and self-healing tools to monitor and maintain Crusoe's distributed cloud storage infrastructure, which includes block, file, and object storage systems. You will drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms. Collaborating closely with storage engineers, you will help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters. Your responsibilities will also include supporting user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets. You'll investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling, while also partnering with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems. Additionally, you will contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments.
What You'll Bring to the Team:
5+ years of professional experience in SRE, systems, or storage engineering.
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms.
Proficiency in a programming language such as Python, Go, Java, or C.
Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet.
Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling.
Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker).
Excellent incident response, troubleshooting, and documentation practices.
Experience with building and operating managed services at scale such as object, file and block storage (AWS, GCP, Azure)
Excellent communication skills
Must be able to pass a background check
Embody the Company values
Bonus Points:
Contributions to open-source storage projects or the Linux storage stack.
Experience with hybrid storage models across on-prem and cloud environments.
Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand).
Benefits:
Industry competitive pay
Restricted Stock Units in a fast growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company paid commuter benefit; $300 per month
Compensation:
Compensation will be paid in the range of $166,000 - $201,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
$166k-201k yearly 2d ago
Production Engineer, Compute
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About This Role:
At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe's compute infrastructure. You'll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.
What You'll Be Working On:
In this role, you will develop automation and observability tools to monitor Crusoe's compute infrastructure, spanning from the kernel to orchestration layers. You will support and scale the company's virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you'll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.
What You'll Bring to the Team:
5+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel bypass techniques.
Expert-level skills in at least one programming language: Go, C or Rust.
Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.
Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.
Strong understanding of compute scheduling, resource management, and high-throughput networking.
Benefits:
Industry competitive pay
Restricted Stock Units in a fast growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company paid commuter benefit; $300 per pay period
Compensation Range:
Compensation will be paid in the range of $166,000 - $201,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
$166k-201k yearly 2d ago
Production Engineer
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
About the Role:
We are seeking a Production Engineer to play a critical role in managing Crusoe's fleet operations, focusing on foundational tools for provisioning and reprovisioning servers with a strong emphasis on Infrastructure as code. The role includes building automation tools, troubleshooting hardware, and scaling operations to support high growth. The candidate will be integral in transitioning to Kubernetes and optimizing Crusoe's infrastructure.
This position offers the opportunity to work on cutting‑edge technologies within a world‑class team and contribute directly to the success of a rapidly growing company while making a significant impact on the global energy landscape.
What You'll Be Working On:
Manage and maintain day‑to‑day operations of Crusoe's cloud infrastructure.
Develop automation tools to streamline server provisioning and reduce SLA times.
Scale infrastructure to support mass deployments (80-100 servers simultaneously).
Troubleshoot hardware issues, especially with GPUs, and liaise with vendors.
Transition Crusoe's environment to Kubernetes and containerized workflows.
What You'll Bring to the Team:
Solid hardware experience and GPU troubleshooting expertise.
Strong Linux background
Knowledge of PXE booting and server provisioning (bare metal)
Experience with BMC/IPMI, BIOS, and enterprise‑grade server management.
Kubernetes proficiency (admin or developer).
Familiarity with containerization technologies (Docker preferred).
Experience with version control systems (GitLab)
Problem‑solving skills: able to analyze complex technical issues and develop effective solutions
Strong communication and collaboration skills to work effectively with cross‑functional teams
Values: Embody the Company values
Experience with MAAS (nice to have)
Proficiency in Python or Golang, with Golang preferred (nice to have)
Kubernetes administration and deployment experience (nice to have)
Experience with Ansible and Terraform (nice to have)
Benefits:
Industry competitive pay
Restricted Stock Units in a fast‑growing, well‑funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short‑term and long‑term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company‑paid commuter benefit: $300 per month
Compensation:
Compensation will be paid in the range of $172,000 - $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
$92k-130k yearly est. 6d ago
Production Engineer - AI Infra, Automation & Kubernetes
Crusoe Energy Systems LLC 4.1
San Francisco, CA jobs
A leading technology firm in San Francisco is seeking a Production Engineer to manage fleet operations and develop automation tools for server provisioning. This role is essential for transitioning the infrastructure to Kubernetes and includes troubleshooting hardware issues. The ideal candidate will have solid hardware experience and a strong Linux background, along with Kubernetes proficiency. Competitive pay and benefits are offered including stock options and health insurance.