Reliability engineer jobs in Santa Clara, CA - 1,311 jobs

  • Site Reliability Engineer

    The Voleon Group 4.1 company rating

    Reliability engineer job in Berkeley, CA

    Voleon is a technology company that applies state‑of‑the‑art AI and machine learning techniques to real‑world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying AI/ML to investment management. We have become a multibillion‑dollar asset manager, and we have ambitious goals for the future. Your colleagues will include internationally recognized experts in artificial intelligence and machine learning research as well as highly experienced finance and technology professionals. The people who shape our company come from other backgrounds, including concert music performances, humanitarian aid, opera singing, sports writing, and BMX racing. You will be part of a team that loves to succeed together. In addition to our enriching and collegial working environment, we offer highly competitive compensation and benefits packages, technology talks by our experts, a beautiful modern office, daily catered lunches, and more.
    As a Site Reliability Engineer (SRE), you will work at the intersection of production operations and software development as you improve, manage, and monitor production‑critical infrastructure and data pipelines. At Voleon, many SREs serve together on a Production Operations team tasked with improving shared production infrastructure. Others are embedded with teams of software engineers to improve specific production systems owned by those teams. Voleon SREs work on important real‑world problems and collaborate with passionate and talented colleagues in an empowering, results‑driven environment. This role is a way to make a real difference: your contributions will make our critical systems more reliable, lower operational risk, and increase the efficiency of our engineering effort.
    Responsibilities
      • Improve fault‑tolerance and maintainability of code in proprietary data pipelines and trading systems
      • Diagnose and fix bugs in code
      • Lead complex deployments
      • Automate manual workflows
      • Track and prioritize outstanding production‑related issues
      • Share an on‑call rotation responding to incidents to ensure the continuous operation of production‑critical systems
    Requirements
      • Experience with coding and debugging Python
      • Experience with Linux
      • Familiarity with Relational Databases & SQL
      • Sharp analytical and problem‑solving skills and a persistent drive to make things work (better)
      • Strong growth mindset and a passion for learning
      • Strong technical communication skills
      • Attention to detail
      • 2 years of relevant industry experience
      • An undergraduate degree or comparable training in a quantitative field or equivalent, relevant industry experience
    Preferred Qualifications
      • Familiarity with best practices concerning code maintainability, documentation, quality assurance, continuous integration and deployment
      • Experience supporting production systems
      • Experience with any of the following: gRPC microservices, Postgres, Pandas, Golang, R, Git, Jenkins, Bazel, Prometheus, Grafana, Airflow, Kubernetes
    The base salary for this position is $120,000 to $160,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.
    Friends of Voleon Candidate Referral Program
    If you have a great candidate in mind for this role and would like to have the potential to earn $7,500 - $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program.
    Equal Opportunity Employer
    The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $120k-160k yearly 4d ago
  • Site Reliability Engineer US - San Francisco

    Near Inc. 4.6 company rating

    Reliability engineer job in San Francisco, CA

    The NEAR AI engineering team is developing decentralized and confidential machine learning infrastructure to power user-owned AI. We currently focus on building infrastructure to enable private and confidential inference that works across different compute providers, as well as a blockchain-based coordination layer that incentivizes compute providers to join the decentralized inference network. You will own various components and drive critical decisions throughout their life cycles, including architecture, implementation, and maintenance. You will collaborate with highly knowledgeable and skilled colleagues who are passionate about solving hard problems that can disrupt the industry.
    What You'll Be Doing:
      • End-to-end infrastructure ownership (for handling telemetry data, for performing testing, etc.)
      • Design and implementation of infrastructure components that manage clusters of GPUs with special configurations
      • Performance tuning and optimizations
      • Create and maintain runbooks that support the on-call rotation
      • Participate in the on-call rotation
      • Support code releases and delivery
      • Plan and implement infrastructure cost and security strategies
      • Plan and implement effective CI/CD pipelines to facilitate development processes
    What We're Looking For:
      • Agility to quickly learn new programming languages and technologies
      • Ability to write clean and efficient code
      • Ability to transform ambiguous problems into tangible solutions or prototypes
      • Experience with software concurrency or parallelism
      • Experience in building, operating, and scaling cloud infrastructure (GCP, AWS, etc.)
      • Experience with data visualization and observability tooling (Grafana, Graphite, Zabbix, etc.)
      • Detail-oriented mindset with a focus on setting priorities and progressing towards objectives
      • Excellent communication and teamwork skills
      • Bachelor's Degree in Computer Science or a related field
    We'd Love If You Have:
      • Experience with NEAR or other blockchain internals
      • Experience with GPUs
      • Experience with Trusted Execution Environments
      • Experience debugging and troubleshooting complex concurrent systems
      • Professional experience with Rust
    Locations: onsite, San Francisco office
    $126k-176k yearly est. 2d ago
  • Site Reliability Engineer - Cybersecurity

    Pantera Capital

    Reliability engineer job in Palo Alto, CA

    About xAI
    xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
    About the Role
    The Cybersecurity / SRE team is focused on ensuring the security and reliability of X Money. This role will primarily focus on the X Money platform but will also cross over with the X Social platform. The ideal candidate will have experience in the banking, money transmission, and P2P payments industry. We emphasize working with large distributed systems and security platforms at scale, with an automation-first mindset. You'll be responsible for securing and maintaining the reliability of X Money's infrastructure. You'll work closely with cross-functional teams to enhance security measures, improve system resilience, and implement best practices.
    Responsibilities
      • Building and securing mission-critical applications within AWS.
      • Ensuring proper identity and role management within AWS.
      • Implementing and maintaining KMS for data management in RDS and DynamoDB.
      • Strengthening Kubernetes and container security.
      • Writing and maintaining infrastructure code using Python and Terraform.
      • Integrating and maintaining code scanning platforms.
      • Taking ownership of cybersecurity projects, identifying problems, and implementing solutions.
      • Conducting critical analysis and applying strong problem-solving skills.
    Minimum qualifications
      • Proficiency in Python and Terraform.
      • Hands-on experience with code scanning platforms.
      • A proactive, problem-solving mindset with a strong sense of ownership.
      • Excellent critical thinking and analytical skills.
      • AWS experience, particularly with identity management and security.
      • Expertise in Kubernetes and container security & experience with self-managed Kubernetes or EKS on AWS.
      • Be based in the SF Bay Area, or willing to relocate here.
    Annual Salary Range
    $180,000 - $360,000 USD
    Benefits
    Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
    xAI is an equal opportunity employer.
    $112k-159k yearly est. 4d ago
  • Site Reliability Engineer

    Fluix

    Reliability engineer job in Palo Alto, CA

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America's compute capacity without building new data centers. We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid-based (Cloud & On-Prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, AI, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML/AI and experience with data center sites will be crucial in driving the success of our platform.
    Who you'll work closely with
    Founder & CEO Chase Overcash CTO
    What you'll do
      • Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment.
      • Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability.
      • Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while ensuring seamless integration and high performance of cutting-edge models within our technology stack.
      • Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies.
      • Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence.
      • Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications.
    Your background
      • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
      • Proven experience as a Site Reliability Engineer or similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP, Azure), experience with ML and AI technologies, and familiarity with data center operations integrations.
      • Proficiency in programming and scripting languages (e.g., Python), experience with containerization and orchestration tools (Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices.
      • Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment.
    Culture Fit
    We are looking for obsessed individuals who want to give it their all. We are not afraid to get our hands dirty with physical and software systems. We are eager to visit and work with clients and understand the importance and gravitas of their mission-critical work. We are eager to come into the office and on-site, as our work directly affects physical environments. Due to our mission-critical work, we understand and are eager to help our teammates and co-workers during holidays, weekends, and emergencies. We are cordial and over-communicate with teammates, co-workers, and management.
      • Attractive compensation package, including equity options.
      • Comprehensive health, dental, and vision insurance, along with other standard benefits.
      • A dynamic and collaborative San Francisco Bay Area work environment.
      • Opportunities for professional growth and development, with the chance to shape the future of technology in the industry.
    $112k-159k yearly est. 1d ago
  • Reliability Engineer for AI-Driven Materials Lab

    Periodic Labs

    Reliability engineer job in Menlo Park, CA

    A cutting-edge materials research lab in Menlo Park seeks a Reliability Engineer to lead maintenance operations and optimize experimental systems. The ideal candidate will have a relevant engineering degree and extensive experience in reliability and systems engineering, particularly in materials science. Responsibilities include establishing maintenance programs, managing a machine shop, and designing labware for automated workflows. Join this innovative team to contribute to groundbreaking scientific discoveries.
    $112k-160k yearly est. 1d ago
  • Site Reliability Engineer - Full-Time, San Francisco

    Latent, Inc.

    Reliability engineer job in San Francisco, CA

    SRE
    You are the infrastructure expert who enables our rapid product development and guarantees 99.9%+ stability and performance of our clinical AI platform for major health systems. Your focus on operational excellence is directly tied to a patient's access to life-saving treatment.
    What We Look for in a Great Engineer
    You have the intensity and technical mastery to own mission-critical infrastructure. You hold yourself and others to high standards and thrive in a high-energy, in-office culture where everyone is in it to win it.
      • Tool Proficiency: You are highly proficient with your tools; you speak command line fluently and have mastered keyboard shortcuts.
      • Ownership: You thrive on owning complex systems and have a proven track record of scaling mission-critical deployments.
      • Automation Drive: You love automating things, always finding new ways to increase your own leverage, and defining standards for operational excellence.
      • Problem Solver: You won't wait for someone else to solve a problem that you're in a position to solve; you are willing to jump into whatever needs to get done.
    What You'll Work On (Responsibilities)
    As our SRE, you will own the entire production environment and improve the development experience:
      • Infrastructure Ownership: Design, implement, and maintain the production environment, having previously handled 500+ machine deployments.
      • Kubernetes Mastery: Own our containerized infrastructure, leveraging deep expertise in Kubernetes and Helm to manage deployment, scaling, and operational health.
      • CI/CD & Deployment Optimization: Optimize and streamline both the TypeScript and Python/ML deployment pipelines to support high-velocity feature release while maintaining the highest reliability.
      • DevX Support: Support Developer Experience (DevX) work to streamline developer workflows, enhance tool proficiency, and improve CI/CD systems.
      • Infrastructure as Code (IaC): Manage and maintain infrastructure definitions using Terraform.
    Technical Qualifications & Environment
      • IaC & Orchestration: Deep, demonstrable experience with Kubernetes, Helm, and Terraform.
      • Scaling Systems: Proven ability to architect and maintain complex, distributed systems with high-availability requirements.
      • Deployment Experience: Hands-on experience optimizing deployment pipelines for both application code (TypeScript) and machine learning models (Python/ML). Also PostgreSQL, Redis, Kafka.
      • Core Team Member: Excitement about working five days per week in our San Francisco office.
    $113k-160k yearly est. 5d ago
  • Site Reliability Engineer - Scale Resilience & Observability

    Happyrobot Inc.

    Reliability engineer job in San Francisco, CA

    A leading AI startup based in San Francisco is seeking a Site Reliability Engineer to enhance operational resilience. In this role, you will oversee stability, observability, and debugging workflows, transforming complex failures into seamless operations. Ideal candidates have 3+ years in debugging production systems and are comfortable with coding in Python and Go. Excellent problem-solving skills and ability to communicate clearly under pressure are essential. Join a passionate team and help shape the future of AI-driven enterprises.
    $113k-160k yearly est. 3d ago
  • Site Reliability Engineer - Scale & Observability

    Gamma.App

    Reliability engineer job in San Francisco, CA

    A dynamic tech firm located in San Francisco is seeking a Site Reliability Engineer to enhance operational health across their production systems. This high-impact role demands expertise in AWS and strong programming skills. You will manage production systems' reliability and lead incident response efforts to prevent issues, all while contributing to the scalability and efficiency of their services. Ideal candidates will have 5+ years of relevant experience and a passion for leveraging technology to drive outcomes.
    $113k-160k yearly est. 2d ago
  • Site Reliability Engineer

    Mercor, Inc.

    Reliability engineer job in San Francisco, CA

    About Mercor
    Mercor is at the intersection of labor markets and AI research. We partner with leading AI labs and enterprises to provide the human intelligence essential to AI development. Our vast talent network trains frontier AI models in the same way teachers teach students: by sharing knowledge, experience, and context that can't be captured in code alone. Today, more than 30,000 experts in our network collectively earn over $1.5 million a day. Mercor is creating a new category of work where expertise powers AI advancement. Achieving this requires an ambitious, fast‑paced and deeply committed team. You'll work alongside researchers, operators, and AI companies at the forefront of shaping the systems that are redefining society. Mercor is a profitable Series C company valued at $10 billion. We work in‑person five days a week in our new San Francisco headquarters.
    About the Role
    As a Site Reliability Engineer (SRE) at Mercor, you'll own production reliability across our most critical systems, partnering directly with infrastructure leadership. You'll play a foundational role in building our SRE function from the ground up and shaping how Mercor operates large‑scale, high‑availability systems.
    What You'll Do
      • Own reliability and production safety for core shared services and customer‑facing systems.
      • Partner directly with infrastructure leadership to define SRE priorities, reliability standards, and production safety roadmap.
      • Repair and improve how our production systems are structured so they are stable, resource‑efficient, isolated, and well‑observed.
      • Introduce and champion modern SRE practices (e.g., incident response, postmortems, SLIs/SLOs) across engineering teams.
      • Collaborate with leverage engineering and applied AI teams to ensure sustainable growth.
      • Represent SRE best practices internally and help teams onboard onto production in a way that is safe, scalable, and consistent with SRE principles.
    What We're Looking For
      • Experience doing true SRE work (not just operations) across multiple roles or companies.
      • Deep familiarity with SRE practices as popularized by Google (e.g., error budgets, reliability vs. risk trade‑offs, large‑scale distributed systems).
      • 5+ years of SRE experience; 15+ years of overall experience is ideal for this first SRE hire.
      • Proven success operating systems at scale, with a strong understanding of the challenges of large, distributed production environments.
      • Strong collaboration skills; able to work efficiently with cross‑functional engineering teams.
      • Ability to drive cultural change around reliability while remaining hands‑on in building and fixing systems.
      • Comfort working in high‑intensity, high‑availability environments where uptime and production quality are critical.
    Nice to Haves
      • Experience as a founding SRE or early SRE hire, standing up SRE practices and orgs from scratch.
      • Hands‑on experience in the AWS ecosystem, Kubernetes, and modern IaC tooling (Terraform, Spacelift, etc.).
    $113k-160k yearly est. 5d ago
  • Founding Site Reliability Engineer

    Relevance AI

    Reliability engineer job in San Francisco, CA

    About Us 🚀
    At Relevance AI, our mission is to empower anyone to delegate work to the AI workforce. We're building a new category of AI automation, enabling teams to create and deploy intelligent AI agents that replicate human-quality work, decision-making, and collaboration at scale. We're scaling fast, backed by top global investors including Bessemer Venture Partners, Insight Partners, Peak XV, and King River Capital, and our platform is already trusted by industry leaders like Canva, Databricks, Confluent, KPMG, Autodesk, and more. With offices in Sydney 🇦🇺 and San Francisco 🇺🇸 (and a new hub launching in Barcelona 🇪🇸), this is your chance to shape the future of work on a global stage.
    The Role 🧠
    We're looking for a Founding Site Reliability Engineer to join us as our first SRE hire in San Francisco. We are open to hiring someone who is Senior, Lead or Principal level and will be candidate led. This role is perfect for someone ready to establish and scale the SRE discipline from the ground up in one of the fastest-growing AI companies globally. You'll own the reliability, scalability, and security of our platform as we power tens of thousands of multi-agent workloads across multiple regions. You'll partner closely with our founders, engineering leads, and product teams to define our reliability culture, shape long-term strategy, and build world-class infrastructure for enterprise scale.
    What You'll Do 💪
      • Own SRE, establishing best practices, tooling, and culture
      • Tackle reliability challenges unique to multi-agent orchestration at enterprise scale
      • Guarantee >99.9% uptime of production systems, ensuring reliability at global scale
      • Architect and automate AWS infrastructure with Terraform and CI/CD pipelines
      • Design observability systems across microservices, APIs, and vector infrastructure (metrics, tracing, logging)
      • Drive down incidents and MTTR through runbooks, alerting, and incident response excellence
      • Help scale infra to support hundreds of thousands of agents and billions of API calls
      • Partner with engineering teams to embed SRE principles into the SDLC and shape org-wide reliability strategy
      • Act as a founding voice in our SF office, influencing product direction and engineering culture
    What We're Looking For 🧠
      • 5+ years in SRE/DevOps/Infrastructure roles, with experience in enterprise SaaS environments.
      • Deep AWS expertise (EC2, ECS/EKS, Lambda, RDS, VPC, IAM).
      • Proven track record with Infrastructure as Code (Terraform, Kubernetes/EKS, CDK, or CloudFormation).
      • Hands-on with observability stacks (CloudWatch, Grafana, Prometheus, Datadog).
      • Incident management experience in production SaaS systems, including on-call, postmortems, and reliability improvements.
      • Bonus: Prior exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads.
    Tech Stack 🧰
    AWS, Kubernetes/EKS, Terraform, GitHub Actions, Postgres/Mongo, Prometheus/Grafana, CloudWatch, PagerDuty/BetterStack
    $113k-160k yearly est. 5d ago
  • Site Reliability Engineer, Managed AI

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.
    About the Role
    At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe's AI-optimized cloud platform. We're looking for an SRE with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.
    What You'll Work On:
      • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
      • Build automation and reliability tooling to support distributed AI pipelines and inference services
      • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
      • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
      • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
      • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
      • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
    What You'll Bring:
      • Strong software engineering background - experience building production-grade systems beyond scripting or Bash
      • Demonstrated experience in distributed systems design and implementation
      • Hands-on work with large language models (LLMs) or AI/ML infrastructure
      • SRE mindset and experience (whether or not under the SRE title) including: defining and measuring SLIs/SLOs; building monitoring and observability systems; driving performance and reliability improvements; designing fault‑tolerant systems and automated testing strategies
      • Proficiency in at least one modern programming language (Python, Go, Java, C++)
      • Familiarity with Kubernetes or container orchestration platforms
      • Strong collaboration and communication skills
      • Ability to thrive in a fast‑paced, mission‑driven environment
    Bonus Points:
      • Experience scaling inference or training workloads for LLMs
    Benefits:
      • Industry competitive pay
      • Restricted Stock Units in a fast growing, well‑funded technology company
      • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
      • Employer contributions to HSA accounts
      • Paid Parental Leave
      • Paid life insurance, short‑term and long‑term disability
      • Teladoc
      • 401(k) with a 100% match up to 4% of salary
      • Generous paid time off and holiday schedule
      • Cell phone reimbursement
      • Tuition reimbursement
      • Subscription to the Calm app
      • MetLife Legal
      • Company paid commuter benefit; $300 per month
    Compensation:
    Compensation will be paid in the range of $204,000 - $247,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
    Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
    $124k-174k yearly est. 4d ago
  • Reliability/DFX Engineer

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    About the Team
    OpenAI's Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI's supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.
    About the Role
    We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features.
    In this role, you will
      • Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware.
      • Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability.
      • Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams).
      • Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack.
      • Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI's requirements and roadmap.
    Qualifications
      • BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack.
      • Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred.
      • Detailed understanding of ML chip and platform architecture and ML workload characteristics is required.
      • Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis.
    About OpenAI
    OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
    We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.
    Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.
    To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.
    We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
    OpenAI Global Applicant Privacy Policy
    At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
    $127k-176k yearly est. 2d ago
  • Senior PostgreSQL DBRE - Scale, Reliability & Automation

    Okta, Inc. 4.3 company rating

    Reliability engineer job in San Francisco, CA

    A leading identity management firm is looking for a Senior Database Reliability Engineer (DBRE) in San Francisco, California. The ideal candidate will have over 4 years of experience specifically with PostgreSQL and will be responsible for designing and optimizing data persistence layers for mission-critical systems. Key responsibilities include leading database incidents, working cross-functionally with platform teams, developing automation for tasks, and ensuring high availability across database environments. This position is essential for operational excellence in a hybrid environment.
    $157k-199k yearly est. 4d ago
  • Site Reliability Engineer

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    Clay is a creative tool for growth. Our mission is to help businesses grow - without huge investments in tooling or manual labor. We're already helping over 100,000 people grow their business with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly translate any idea that you have for growing your company into reality.

    We believe that modern GTM teams win by finding GTM alpha - a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We're looking for sharp, low-ego people to help teams find their GTM alpha.

    Why is Clay the best place to work?

    * Customers love the product (100K+ users and growing)
    * We're growing a lot (6x YoY last year, and 10x YoY the two years before that)
    * Incredible culture (our customers keep applying to work here)
    * Well-resourced - We raised a $100M Series C in 2025 at a $3.1B valuation and are backed by world-class investors like Capital G (Google), Sequoia and Meritech

    Read more about why people love working at Clay and explore our wall of love to learn more about the product.

    SRE @ Clay

    In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles.

    What You'll Do

    * Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
    * Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
    * Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
    * Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
    * Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
    * Participate in an on-call rotation.
    * Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.

    What You'll Bring

    * 5+ years of experience
    * Experience with containerization and orchestration tools
    * Strong understanding of CI/CD concepts and tools
    * Knowledge of infrastructure automation tools
    * Experience with on-call and incident response
    * Proficiency in one or more programming languages
    * Familiarity with our stack or ability to learn unfamiliar technologies quickly:
      * Aurora Postgres RDS, ElastiCache Redis, Docker + ECS, Lambda, OpenSearch
      * Terraform and Atlantis
      * CircleCI, Netlify, Playwright
      * CloudWatch, Datadog, Mezmo
      * TypeScript, Python
    $109k-150k yearly est. 5d ago
  • Founding SRE Engineer - Reliability & Growth

    Asana 4.6 company rating

    Reliability engineer job in San Francisco, CA

    A leading software company is seeking experienced Software Engineers to join the new Site Reliability Engineering team. This role focuses on building reliable, scalable systems and leading projects across infrastructure. Candidates should have strong software engineering skills and a passion for reliability. The position offers a hybrid work model and generous compensation packages with additional benefits.
    $147k-189k yearly est. 2d ago
  • Senior Technology Site Reliability Engineer

    Cooley LLP 4.8 company rating

    Reliability engineer job in San Francisco, CA

    Locations: San Francisco, New York, Santa Monica, Los Angeles, Palo Alto
    Time type: Full time
    Posted on: Yesterday
    Job requisition ID: Req 4348

    Cooley is seeking a Senior Site Reliability Engineer to join the Infrastructure & Development Operations team.

    The Senior Technology Site Reliability Engineer (“SRE”) is responsible for ensuring the reliability, scalability, and performance of the firm's critical infrastructure and applications. The SRE blends software engineering with systems engineering to build and maintain automated, resilient, and observable systems that support high availability and operational excellence. In addition to being technically advanced, the SRE will have a high degree of emotional intelligence and the ability to work as a team towards complex and layered objectives. Specific duties and responsibilities include, but are not limited to, the following:

    **Position responsibilities:**

    * Monitor and maintain production systems to ensure high availability and performance
    * Implement and manage service-level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets
    * Participate in on-call rotations and incident response, including root cause analysis and postmortems
    * Develop and maintain infrastructure as code (IaC) using Terraform
    * Automate deployment, scaling, and recovery processes to reduce manual intervention
    * Partner with DevOps to build and maintain CI/CD pipelines to support safe and efficient software delivery
    * Implement observability solutions using metrics, logs, traces, and alerting systems (Prometheus, Grafana, DataDog, etc.)
    * Proactively identify and resolve system bottlenecks and reliability risks
    * Work closely with Infrastructure, DevOps, Development, and security teams to embed reliability into the development lifecycle
    * Contribute to a culture of blameless post-mortems and continuous improvement
    * Document operational procedures and share knowledge across teams
    * All other duties as assigned or required

    **Skills and experience:**

    Required:

    * After orientation at Cooley LLP, exhibit proficiency in the Microsoft Office suite, iManage and other firm applications
    * Ability to work extended and/or weekend hours, as required
    * Ability to travel, as required
    * 6+ years direct applicable experience (e.g. site reliability engineering or related field)
    * Proficiency in Terraform and programming languages such as Python, Go, or Java
    * Deep expertise in cloud platforms, particularly AWS, and container orchestration
    * Strong background in distributed systems, performance tuning, and automation
    * Hands-on experience with configuration management tools such as Puppet, Chef, or Salt

    Preferred:

    * Bachelor's Degree in Computer Science, Information Technology, Engineering, or associated discipline
    * Experience working with advanced ETL data workflows including technologies such as AWS EMR, Azure Synapse, Azure Data Factory, or Apache Hive/Spark/Airflow
    * Experience with IaC deployment of AKS/EKS/GKE architecture
    * Experience with enterprise Data Lake environments using technologies such as DataBricks or Snowflake

    **Competencies:**

    * Expert analytical/quantitative, problem-solving, and deductive reasoning skills, experience performing advanced troubleshooting and root cause analysis of complex technical issues
    * Excellent organizational, planning, and time management skills and ability to work independently and in a team environment to manage competing priorities and meet deadlines
    * Advanced verbal and written communication skills with the ability to present findings, conclusions, alternatives, and information clearly and concisely
    * Experience working with all levels of business professionals, management, stakeholders, and vendors with the ability to build effective relationships through trust and diplomacy

    Cooley offers a competitive compensation and excellent benefits package and is committed to fair and equitable employment practices. EOE.

    The expected annual pay range for this position with a full-time schedule is $140,000 - $205,000. Please note that final offer amount will be dependent on geographic location, applicable experience and skillset of the candidate.

    We offer a full range of elective benefits including medical, health savings account (with applicable medical plan), dental, vision, health and/or dependent care flexible spending accounts, pre-tax commuter benefits, life insurance, AD&D, long-term care coverage, backup care for children and/or adults and other parental support benefits. In addition to elective benefit options, benefited employees receive firm-paid life insurance, AD&D, LTD, short term medical benefits as well as 21 days of Paid Time Off (“PTO”) and 10 paid holidays each year. We provide generous parental leave and fertility benefits. New employees will attend a detailed benefit orientation to learn more about our many benefits and resources.

    Welcome to Cooley. We are counselors, strategists and advocates for today's and tomorrow's leaders of the business economy. We seek to meet the evolving needs of our clients by building a community of professionals of the highest caliber who share our vision and embrace our values.

    Working at Cooley provides an opportunity to work in an environment of collaboration, challenge and reward. We are all part of one firm dedicated to maintaining a diverse workplace that values and celebrates differences, from the way we relate to and support each other, to the way we work together to meet the needs of our clients. It is the unique abilities and perspectives of every individual at Cooley that create a rewarding workplace.

    For Cooley, this means offering all employees the tools, training and mentoring they need to succeed. It enables every individual to balance work and family obligations. It looks beyond the Firm's four walls, fostering community involvement. It includes becoming leaders and contributors in our communities.

    Our cooperative spirit is the trademark of the Cooley Culture and every employee in every department is instrumental to the success of the Firm. We invite you to take a look at our open positions.
    $140k-205k yearly 4d ago
  • Site Reliability Engineer - AI Inference Infra & GPU Clusters

    Near Inc. 4.6 company rating

    Reliability engineer job in San Francisco, CA

    A tech company specializing in AI infrastructure based in San Francisco is looking for a candidate to own the development of decentralized machine learning infrastructure. The role involves designing components, performance tuning, and collaboration with skilled colleagues. The ideal candidate should have experience in Cloud infrastructure and software concurrency, along with a Bachelor's degree in Computer Science. Excellent communication skills and the ability to learn quickly are essential. The position is onsite at the San Francisco office.
    $126k-176k yearly est. 2d ago
  • Reliability Engineer: Scale Systems, Observe & Automate

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations. Candidates should have strong cloud proficiency, experience in containerization technologies, and a bachelor's degree in a related field.
    $127k-176k yearly est. 5d ago
  • Site Reliability Engineer: Scale, Automate & Own Cloud Infra

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    A leading technology firm in San Francisco is looking for a Site Reliability Engineer to design and manage scalable infrastructure solutions. You will ensure high performance and availability while collaborating across teams. Candidates should have at least 5 years of experience, strong coding skills, and familiarity with various cloud services and automation tools. Join us in a culture of continuous improvement and innovation.
    $109k-150k yearly est. 5d ago
  • Senior+ Site Reliability Engineer

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

    About This Role:

    Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform - and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe's GPU cloud. This role is ideal for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform. You'll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe's incident management practices.

    What You'll Be Working On:

    * Collaborate with cross-functional teams to define and refine availability metrics for Crusoe's cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs.
    * Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews.
    * Build, operate, and monitor infrastructure health using Crusoe's observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
    * Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability.
    * Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities.
    * Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness.
    * Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization.
    * Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.

    What You'll Bring to the Team:

    * 5+ years of experience in cloud operations, SRE, or related roles
    * Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
    * Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
    * Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
    * Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
    * Basic scripting and automation experience (Go, Python, C, C++, or similar)
    * Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
    * Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
    * A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement

    Bonus Points:

    * Experience with Kubernetes, container orchestration, or large-scale distributed systems
    * Exposure to change management, operational readiness reviews, or structured RCAs
    * Familiarity with self-healing systems, automated remediation, or event-driven operations
    * Interest in scaling AI/HPC infrastructure and solving reliability challenges in GPU-heavy environments
    * Passion for learning, mentorship, and developing deeper SRE capabilities over time

    Benefits:

    * Industry competitive pay
    * Restricted Stock Units in a fast growing, well-funded technology company
    * Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
    * Employer contributions to HSA accounts
    * Paid Parental Leave
    * Paid life insurance, short-term and long-term disability
    * Teladoc
    * 401(k) with a 100% match up to 4% of salary
    * Generous paid time off and holiday schedule
    * Cell phone reimbursement
    * Tuition reimbursement
    * Subscription to the Calm app
    * MetLife Legal
    * Company paid commuter benefit; $300 per month

    Compensation:

    Compensation will be paid in the range of $172,000 - $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

    Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
    $142k-189k yearly est. 3d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in Santa Clara, CA?

The average reliability engineer in Santa Clara, CA earns between $95,000 and $187,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in Santa Clara, CA

$133,000

What are the biggest employers of Reliability Engineers in Santa Clara, CA?

The biggest employers of Reliability Engineers in Santa Clara, CA are:
  1. ByteDance
  2. TikTok
  3. Apple
  4. Google
  5. Walmart
  6. Fortinet
  7. NVIDIA
  8. Freelance Computer Services
  9. Cerebras
  10. Vantage Data Centers