Reliability engineer jobs in San Mateo, CA - 1,396 jobs

  • Site Reliability Engineer US - San Francisco

    Near Inc. 4.6 company rating

    Reliability engineer job in San Francisco, CA

    The NEAR AI engineering team is developing decentralized and confidential machine learning infrastructure to power user-owned AI. We currently focus on building infrastructure to enable private and confidential inference that works across different compute providers, as well as a blockchain-based coordination layer that incentivizes compute providers to join the decentralized inference network. You will own various components and drive critical decisions throughout their life cycles, including architecture, implementation, and maintenance. You will collaborate with highly knowledgeable and skilled colleagues who are passionate about solving hard problems that can disrupt the industry. What You'll Be Doing: End-to-end infrastructure ownership (for handling telemetry data, for performing testing, etc.) Design and implementation of infrastructure components that manage clusters of GPUs with special configurations Performance tuning and optimizations Create and maintain runbooks that support the on-call rotation Participate in the on-call rotation.
Support code releases and delivery Plan and implement infrastructure cost and security strategies Plan and implement effective CI/CD pipelines to facilitate development processes What We're Looking For: Agility to quickly learn new programming languages and technologies Ability to write clean and efficient code Ability to transform ambiguous problems into tangible solutions or prototypes Experience with software concurrency or parallelism Experience in building, operating, and scaling cloud infrastructure (GCP, AWS, etc.) Experience with data visualization and observability tooling (Grafana, Graphite, Zabbix, etc.) Detail-oriented mindset with a focus on setting priorities and progressing towards objectives Excellent communication and teamwork skills Bachelor's Degree in Computer Science or a related field We'd Love If You Have: Experience with NEAR or other blockchain internals Experience with GPUs Experience with Trusted Execution Environments Experience debugging and troubleshooting complex concurrent systems Professional experience with Rust Locations: onsite, San Francisco office
    $126k-176k yearly est. 3d ago
  • Site Reliability Engineer

    Mercor, Inc.

    Reliability engineer job in San Francisco, CA

    About Mercor Mercor is at the intersection of labor markets and AI research. We partner with leading AI labs and enterprises to provide the human intelligence essential to AI development. Our vast talent network trains frontier AI models in the same way teachers teach students: by sharing knowledge, experience, and context that can't be captured in code alone. Today, more than 30,000 experts in our network collectively earn over $1.5 million a day. Mercor is creating a new category of work where expertise powers AI advancement. Achieving this requires an ambitious, fast‑paced and deeply committed team. You'll work alongside researchers, operators, and AI companies at the forefront of shaping the systems that are redefining society. Mercor is a profitable Series C company valued at $10 billion. We work in‑person five days a week in our new San Francisco headquarters. About the Role As a Site Reliability Engineer (SRE) at Mercor, you'll own production reliability across our most critical systems, partnering directly with infrastructure leadership. You'll play a foundational role in building our SRE function from the ground up and shaping how Mercor operates large‑scale, high‑availability systems. What You'll Do Own reliability and production safety for core shared services and customer‑facing systems. Partner directly with infrastructure leadership to define SRE priorities, reliability standards, and production safety roadmap. Repair and improve how our production systems are structured so they are stable, resource‑efficient, isolated, and well‑observed. Introduce and champion modern SRE practices (e.g., incident response, postmortems, SLIs/SLOs) across engineering teams. Collaborate with leverage engineering and applied AI teams to ensure sustainable growth. Represent SRE best practices internally and help teams onboard onto production in a way that is safe, scalable, and consistent with SRE principles. 
What We're Looking For Experience doing true SRE work (not just operations) across multiple roles or companies. Deep familiarity with SRE practices as popularized by Google (e.g., error budgets, reliability vs. risk trade‑offs, large‑scale distributed systems). 5+ years of SRE experience; 15+ years of overall experience is ideal for this first SRE hire. Proven success operating systems at scale, with a strong understanding of the challenges of large, distributed production environments. Strong collaboration skills; able to work efficiently with cross‑functional engineering teams. Ability to drive cultural change around reliability while remaining hands‑on in building and fixing systems. Comfort working in high‑intensity, high‑availability environments where uptime and production quality are critical. Nice to Haves Experience as a founding SRE or early SRE hire, standing up SRE practices and orgs from scratch. Hands‑on experience in the AWS ecosystem, Kubernetes, and modern IaC tooling (Terraform, Spacelift, etc.).
    $113k-160k yearly est. 1d ago
  • Founding Site Reliability Engineer

    Relevance Ai

    Reliability engineer job in San Francisco, CA

    About Us 🚀 At Relevance AI, our mission is to empower anyone to delegate work to the AI workforce. We're building a new category of AI automation, enabling teams to create and deploy intelligent AI agents that replicate human-quality work, decision-making, and collaboration at scale. We're scaling fast, backed by top global investors including Bessemer Venture Partners, Insight Partners, Peak XV, and King River Capital, and our platform is already trusted by industry leaders like Canva, Databricks, Confluent, KPMG, Autodesk, and more. With offices in Sydney 🇦🇺 and San Francisco 🇺🇸 (and a new hub launching in Barcelona 🇪🇸), this is your chance to shape the future of work on a global stage. The Role 🧠 We're looking for a Founding Site Reliability Engineer to join us as our first SRE hire in San Francisco. We are open to hiring someone who is Senior, Lead or Principal level and will be candidate led. This role is perfect for someone ready to establish and scale the SRE discipline from the ground up in one of the fastest-growing AI companies globally. You'll own the reliability, scalability, and security of our platform as we power tens of thousands of multi-agent workloads across multiple regions. You'll partner closely with our founders, engineering leads, and product teams to define our reliability culture, shape long-term strategy, and build world-class infrastructure for enterprise scale.
What You'll Do 💪 Own SRE establishing best practices, tooling, and culture Tackle reliability challenges unique to multi-agent orchestration at enterprise scale Guarantee >99.9% uptime of production systems, ensuring reliability at global scale Architect and automate AWS infrastructure with Terraform and CI/CD pipelines Design observability systems across microservices, APIs, and vector infrastructure (metrics, tracing, logging) Drive down incidents and MTTR through runbooks, alerting, and incident response excellence Help scale infra to support hundreds of thousands of agents and billions of API calls Partner with engineering teams to embed SRE principles into the SDLC and shape org-wide reliability strategy Act as a founding voice in our SF office, influencing product direction and engineering culture What We're Looking For 🧠 5+ years in SRE/DevOps/Infrastructure roles, with experience in enterprise SaaS environments. Deep AWS expertise (EC2, ECS/EKS, Lambda, RDS, VPC, IAM). Proven track record with Infrastructure as Code (Terraform, Kubernetes/EKS, CDK, or CloudFormation). Hands-on with observability stacks (CloudWatch, Grafana, Prometheus, Datadog). Incident management experience in production SaaS systems, including on-call, postmortems, and reliability improvements. Bonus: Prior exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads. Tech Stack 🧰 AWS, Kubernetes/EKS, Terraform, GitHub Actions, Postgres/Mongo, Prometheus/Grafana, CloudWatch, PagerDuty/BetterStack
    $113k-160k yearly est. 1d ago
  • Site Reliability Engineer - Scale Resilience & Observability

    Happyrobot Inc.

    Reliability engineer job in San Francisco, CA

    A leading AI startup based in San Francisco is seeking a Site Reliability Engineer to enhance operational resilience. In this role, you will oversee stability, observability, and debugging workflows, transforming complex failures into seamless operations. Ideal candidates have 3+ years in debugging production systems and are comfortable with coding in Python and Go. Excellent problem-solving skills and ability to communicate clearly under pressure are essential. Join a passionate team and help shape the future of AI-driven enterprises.
    $113k-160k yearly est. 4d ago
  • Site Reliability Engineer

    The Voleon Group 4.1 company rating

    Reliability engineer job in Berkeley, CA

    Voleon is a technology company that applies state‑of‑the‑art AI and machine learning techniques to real‑world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying AI/ML to investment management. We have become a multibillion‑dollar asset manager, and we have ambitious goals for the future. Your colleagues will include internationally recognized experts in artificial intelligence and machine learning research as well as highly experienced finance and technology professionals. The people who shape our company come from other backgrounds, including concert music performances, humanitarian aid, opera singing, sports writing, and BMX racing. You will be part of a team that loves to succeed together. In addition to our enriching and collegial working environment, we offer highly competitive compensation and benefits packages, technology talks by our experts, a beautiful modern office, daily catered lunches, and more. As a Site Reliability Engineer (SRE), you will work at the intersection of production operations and software development as you improve, manage, and monitor production‑critical infrastructure and data pipelines. At Voleon, many SREs serve together on a Production Operations team tasked with improving shared production infrastructure. Others are embedded with teams of software engineers to improve specific production systems owned by those teams. Voleon SREs work on important real‑world problems and collaborate with passionate and talented colleagues in an empowering, results‑driven environment. This role is a way to make a real difference: your contributions will make our critical systems more reliable, lower operational risk, and increase the efficiency of our engineering effort. 
Responsibilities Improve fault‑tolerance and maintainability of code in proprietary data pipelines and trading systems Diagnose and fix bugs in code Lead complex deployments Automate manual workflows Track and prioritize outstanding production‑related issues Share an on‑call rotation responding to incidents to ensure the continuous operation of production‑critical systems Requirements Experience with coding and debugging Python Experience with Linux Familiarity with Relational Databases & SQL Sharp analytical and problem‑solving skills and a persistent drive to make things work (better) Strong growth mindset and a passion for learning Strong technical communication skills Attention to detail 2 years of relevant industry experience An undergraduate degree or comparable training in a quantitative field or equivalent, relevant industry experience Preferred Qualifications Familiarity with best practices concerning code maintainability, documentation, quality assurance, continuous integration and deployment Experience supporting production systems Experience with any of the following: gRPC microservices, Postgres, Pandas, Golang, R, Git, Jenkins, Bazel, Prometheus, Grafana, Airflow, Kubernetes The base salary for this position is $120,000 to $160,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match. 
Friends of Voleon Candidate Referral Program If you have a great candidate in mind for this role and would like to have the potential to earn $7,500 - $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program. Equal Opportunity Employer The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $120k-160k yearly 5d ago
  • Site Reliability Engineer

    Neara

    Reliability engineer job in Palo Alto, CA

    Job type: Full Time · Department: Backend Engineer · Work type: Remote About Archetype AI Archetype AI is developing the world's first AI platform to bring AI into the real world. Formed by an exceptionally high-caliber team from Google, Archetype AI is building a foundation model for the physical world, a real-time multimodal LLM for real life, transforming real-world data into valuable insights and knowledge that people will be able to interact with naturally. It will help people in their real lives, not just online, because it understands the real-time physical environment and everything that happens in it. Supported by deep tech venture funds in Silicon Valley, Archetype AI is currently pre-Series A, progressing rapidly to develop technology for their next stage. This presents a unique and once-in-a-lifetime opportunity to be part of an exciting AI team at the beginning of their journey, located in the heart of Silicon Valley. Our team is headquartered in Palo Alto, California, with team members throughout the US and Europe. We are actively growing, so if you are an exceptional candidate excited to work on the cutting edge of physical AI and don't see a role that exactly fits you below you can contact us directly with your resume via jobsarchetypeaiio. About the Role As a Site Reliability Engineer (SRE) at Archetype AI, you will be responsible for designing, scaling, and maintaining the infrastructure that powers our AI-driven products. You will collaborate with backend engineers and ML researchers to ensure that our distributed platforms are fault-tolerant, performant, and highly available. Core Responsibilities Design, build, and operate highly available distributed systems. Collaborate with engineering and ML teams to ensure reliable deployment of backend services (in Rust, C++ or similar). Implement monitoring, alerting, and observability solutions across infrastructure.
Automate deployments, scaling, and infrastructure provisioning using infrastructure-as-code. Diagnose and resolve performance bottlenecks, system outages, and production incidents. Support AI/ML infrastructure for training and serving models at scale, including GPU clusters, pipelines, and inference services. Contribute to infrastructure architecture, standards, and operational best practices. Minimum Qualifications 5+ years of experience as SRE, DevOps, or Systems Engineer. Strong expertise in distributed systems, fault-tolerant architectures, and large-scale production environments. Proficiency in Rust, C++, or other backend languages with willingness to learn. Solid experience with Kubernetes, containers, and cloud platforms (AWS, GCP, Azure). Hands‑on experience with monitoring and observability tools (Prometheus, Grafana, ELK, OpenTelemetry). Experience with data pipelines, messaging systems, and streaming technologies (Kafka, Pulsar, etc.). Familiarity with AI/ML infrastructure (training pipelines, GPU clusters, inference systems). Strong debugging, problem‑solving, and automation mindset (Terraform, Ansible, Pulumi, scripting). Excellent communication and collaboration skills. Preferred Qualifications Experience with real‑time or low‑latency systems. Open‑source contributions to distributed systems or infrastructure projects. Knowledge of security best practices for distributed environments. Experience with edge or embedded systems and sensor‑based infrastructure. Background in multimodal data fusion or physical‑world perception systems. What We Value Ownership - You take initiative, follow through, and care deeply about quality and outcomes. Motivation - You're driven to solve complex problems and continuously raise the bar for yourself and your team. Excellence - You bring discipline, clarity, and rigor to your craft, and help others do the same.
Collaboration - You work well with others, mentor generously, and contribute to a high‑trust, high‑performance culture.
    $112k-159k yearly est. 2d ago
  • Site Reliability Engineer

    Fluix

    Reliability engineer job in Palo Alto, CA

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America's compute capacity without building new data centers. We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid-based (Cloud & On-Prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, AI, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML/AI and experience with data center sites will be crucial in driving the success of our platform. Who you'll work closely with Founder & CEO Chase Overcash CTO What you'll do Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment. Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability. Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while ensuring seamless integration and high performance of cutting-edge models within our technology stack. Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies. Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence. 
Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications. Your background Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience). Proven experience as a Site Reliability Engineer or similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP, Azure), experience with ML and AI technologies, and familiarity with data center operations integrations. Proficiency in programming and scripting languages (e.g., Python), experience with containerization and orchestration tools (Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices. Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment. Culture Fit We are looking for obsessed individuals who want to give it their all. We are not afraid to get our hands dirty with physical and software systems. We are eager to visit and work with clients and understand the importance and gravitas of their mission-critical work. We are eager to come into the office and on-site, as our work directly affects physical environments. Due to our mission-critical work, we understand and are eager to help our teammates and co-workers during holidays, weekends, and emergencies. We are cordial and over-communicate with teammates, co-workers, and management. Attractive compensation package, including equity options. Comprehensive health, dental, and vision insurance, along with other standard benefits. A dynamic and collaborative San Francisco Bay Area work environment.
Opportunities for professional growth and development, with the chance to shape the future of technology in the industry.
    $112k-159k yearly est. 2d ago
  • Site Reliability Engineer - Kubernetes

    Theklicker

    Reliability engineer job in Palo Alto, CA

    theklicker is an online platform specializing in electronic product price comparison, enabling users to browse prices across multiple booking sites effortlessly. We are dedicated to being a one-stop solution for purchasing electronic products. With a focus on delivering the best user experience, theklicker empowers users to make informed purchasing decisions quickly and efficiently. Role Description This is a full-time on-site role for a Site Reliability Engineer - Kubernetes, based in Palo Alto, CA. The role involves maintaining and optimizing system reliability, managing infrastructure, troubleshooting technical issues, and supporting software deployments. The Site Reliability Engineer will work closely with development and operations teams to ensure seamless operations and robust technology solutions. Qualifications Proven expertise in Site Reliability Engineering and troubleshooting complex system issues Experience in Software Development with a strong understanding of coding best practices Proficiency in System Administration, managing Linux/Unix environments, and implementing automation scripts Knowledge of Kubernetes and infrastructure management Strong problem-solving, analytical, and communication skills Experience with monitoring and incident management tools is a plus Bachelor's degree in Computer Science, Engineering, or a related field
    $112k-159k yearly est. 4d ago
  • Reliability Engineer

    Periodiclabs

    Reliability engineer job in Menlo Park, CA

    About Periodic Labs We are an AI + physical sciences lab building state of the art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission. About the Role Join a world-class team of scientists and engineers pushing the boundaries of materials research in a groundbreaking lab where AI and automation unlock discoveries at unprecedented speed and scale. The Periodic Labs team is developing AI that can both simulate science as well as verify its predictions to train on the full scientific method. As part of this mission, the team is building a high throughput experimental materials science lab. For this facility, we are seeking a hands‑on Reliability Engineer to drive uptime and throughput across our experimental platforms. You'll lead maintenance operations, plan and manage the associated machine shop buildout, and design and integrate custom labware and fixtures to improve experimental efficiency and repeatability. Responsibilities Establish preventive and predictive maintenance programs for lab and automation systems and associated CMMS. Lead root cause analysis and develop corrective actions for system failures or bottlenecks. Build and manage a machine shop for fabricating custom components. Design, prototype, and test custom labware tailored to automated experimental workflows. Collaborate with automation, controls, and scientific teams to integrate fixtures into processes. Track and improve equipment uptime, MTBF, and OEE. Build and train a world‑class maintenance team of technicians to support the above systems Qualifications Bachelor's or Master's degree in an engineering field. 
5+ years experience in reliability, mechanical, or systems engineering, especially related to supporting materials science/solid‑state chemistry and/or thin‑film processing research equipment Familiarity with precision machining, mechatronics, or custom part design. Experience with CAD (SolidWorks/Fusion), shop tooling, and rapid prototyping. Strong knowledge of preventive maintenance systems. Comfortable collaborating across engineering, automation, and scientific teams. Certification and preferably contribution to accredited standards or bodies relevant to your field - e.g. SMRP, IEEE Reliability Society, relevant ASTMs. Bonus Qualifications We'd love to hear about your accomplishments in globally recognized competitions relevant to your field (e.g. NASA Lunabotics, ASME's competitions, etc.)
    $112k-160k yearly est. 2d ago
  • Site Reliability Engineer - Observability

    Rivian 4.1 company rating

    Reliability engineer job in Palo Alto, CA

    About Us Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive's next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we're addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world. The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we'll map a new way forward. Working together, we'll create a future that's more connected, more intelligent, more sustainable for everyone. Role Summary We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems. Responsibilities Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting. Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments. Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements. Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity. Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data. 
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization. Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle. Qualifications Educational Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience. Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability. Technical Expertise: Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog. Experience with OpenTelemetry and distributed tracing in microservices architectures. Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane. Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions. Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals. Soft Skills: Exceptional problem-solving, communication, and a data-driven approach to decision-making. Pay Disclosure Salary Range/Hourly Rate for California Based Applicants: $146,900 - $194,610 USD Actual Compensation will be determined based on experience, location, and other factors permitted by law. Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical, prescription, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and their children up to age 26. Coverage is effective on the first day of employment. Equal Opportunity Rivian and Volkswagen Group Technologies is committed to creating a diverse environment and is proud to be an equal opportunity employer. 
All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law. We are also committed to ensuring compliance with all applicable fair employment practice laws regarding citizenship and immigration status. Rivian and Volkswagen Group Technologies is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com. Candidate Data Privacy Rivian and VW Group Technologies ("Rivian and Volkswagen Group Technologies") may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. 
Rivian and Volkswagen Group Technologies may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law. Rivian and Volkswagen Group Technologies may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian and Volkswagen Group Technologies affiliates; and (iii) Rivian and Volkswagen Group Technologies' service providers, including providers of background checks, staffing services, and cloud services. Rivian and Volkswagen Group Technologies may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions. Please see our Candidate Data Privacy Notice (English) and Candidate Data Privacy Notice (Serbian) for more information. Please note that we are currently not accepting applications from third party application services.
    $146.9k-194.6k yearly 4d ago
  • Senior PostgreSQL DBRE - Scale, Reliability & Automation

    Okta, Inc. 4.3 company rating

    Reliability engineer job in San Francisco, CA

    A leading identity management firm is looking for a Senior Database Reliability Engineer (DBRE) in San Francisco, California. The ideal candidate will have over 4 years of experience specifically with PostgreSQL and will be responsible for designing and optimizing data persistence layers for mission-critical systems. Key responsibilities include leading database incidents, working cross-functionally with platform teams, developing automation for tasks, and ensuring high availability across database environments. This position is essential for operational excellence in a hybrid environment.
    $157k-199k yearly est. 5d ago
  • Reliability Engineer: Scale Systems, Observe & Automate

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations. Candidates should have strong cloud proficiency, experience in containerization technologies, and a bachelor's degree in a related field.
    $127k-176k yearly est. 1d ago
  • Founding SRE Engineer - Reliability & Growth

    Asana 4.6 company rating

    Reliability engineer job in San Francisco, CA

    A leading software company is seeking experienced Software Engineers to join the new Site Reliability Engineering team. This role focuses on building reliable, scalable systems and leading projects across infrastructure. Candidates should have strong software engineering skills and a passion for reliability. The position offers a hybrid work model and generous compensation packages with additional benefits.
    $147k-189k yearly est. 3d ago
  • Senior Site Reliability Engineer, Compute

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

    About This Role:
    At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe's compute infrastructure. You'll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

    What You'll Be Working On:
    In this role, you will develop automation and observability tools to monitor Crusoe's compute infrastructure, spanning from the kernel to orchestration layers. You will support and scale the company's virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you'll help identify and resolve performance bottlenecks, driver issues, and optimize hardware offloads. A key focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

    What You'll Bring to the Team:
    * 5+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
    * Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
    * Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
    * Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel bypass techniques.
    * Expert-level skills in at least one programming language: Go, C or Rust.
    * Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.
    * Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.
    * Strong understanding of compute scheduling, resource management, and high-throughput networking.

    Benefits:
    * Industry competitive pay
    * Restricted Stock Units in a fast growing, well-funded technology company
    * Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
    * Employer contributions to HSA accounts
    * Paid Parental Leave
    * Paid life insurance, short-term and long-term disability
    * Teladoc
    * 401(k) with a 100% match up to 4% of salary
    * Generous paid time off and holiday schedule
    * Cell phone reimbursement
    * Tuition reimbursement
    * Subscription to the Calm app
    * MetLife Legal
    * Company paid commuter benefit; $300 per pay period

    Compensation Range:
    Compensation will be paid in the range of $166,000 - $201,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

    Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
    $166k-201k yearly 3d ago
  • Site Reliability Engineer

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    Clay is a creative tool for growth. Our mission is to help businesses grow - without huge investments in tooling or manual labor. We're already helping over 100,000 people grow their business with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly translate any idea that you have for growing your company into reality. We believe that modern GTM teams win by finding GTM alpha - a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We're looking for sharp, low-ego people to help teams find their GTM alpha.

    Why is Clay the best place to work?
    * Customers love the product (100K+ users and growing)
    * We're growing a lot (6x YoY last year, and 10x YoY the two years before that)
    * Incredible culture (our customers keep applying to work here)
    * Well-resourced - We raised a $100M Series C in 2025 at a $3.1B valuation and are backed by world-class investors like Capital G (Google), Sequoia and Meritech

    Read more about why people love working at Clay here and explore our wall of love to learn more about the product.

    SRE @ Clay
    In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles.

    What You'll Do
    * Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
    * Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
    * Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
    * Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
    * Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
    * Participate in an oncall rotation.
    * Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.

    What You'll Bring
    * 5+ years of experience
    * Experience with containerization and orchestration tools
    * Strong understanding of CI/CD concepts and tools
    * Knowledge of infrastructure automation tools
    * Experience with oncall and incident response
    * Proficiency in one or more programming languages
    * Familiarity with our stack or ability to learn unfamiliar technologies quickly:
      * Aurora Postgres RDS, Elasticache Redis, Docker + ECS, Lambda, OpenSearch
      * Terraform and Atlantis
      * CircleCI, Netlify, Playwright
      * Cloudwatch, Datadog, Mezmo
      * Typescript, Python
    $109k-150k yearly est. 1d ago
  • Senior Technology Site Reliability Engineer

    Cooley LLP 4.8 company rating

    Reliability engineer job in San Francisco, CA

    Locations: San Francisco; New York; Santa Monica; Los Angeles; Palo Alto
    Time type: Full time
    Posted on: Yesterday
    Job requisition ID: Req 4348

    Cooley is seeking a Senior Site Reliability Engineer to join the Infrastructure & Development Operations team.

    The Senior Technology Site Reliability Engineer (“SRE”) is responsible for ensuring the reliability, scalability, and performance of the firm's critical infrastructure and applications. The SRE blends software engineering with systems engineering to build and maintain automated, resilient, and observable systems that support high availability and operational excellence. In addition to being technically advanced, the SRE will have a high degree of emotional intelligence and the ability to work as a team towards complex and layered objectives. Specific duties and responsibilities include, but are not limited to, the following:

    Position responsibilities:
    * Monitor and maintain production systems to ensure high availability and performance
    * Implement and manage service-level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets
    * Participate in on-call rotations and incident response, including root cause analysis and postmortems
    * Develop and maintain infrastructure as code (IaC) using Terraform
    * Automate deployment, scaling, and recovery processes to reduce manual intervention
    * Partner with DevOps to build and maintain CI/CD pipelines to support safe and efficient software delivery
    * Implement observability solutions using metrics, logs, traces, and alerting systems (Prometheus, Grafana, DataDog, etc.)
    * Proactively identify and resolve system bottlenecks and reliability risks
    * Work closely with Infrastructure, DevOps, Development, and security teams to embed reliability into the development lifecycle
    * Contribute to a culture of blameless post-mortems and continuous improvement
    * Document operational procedures and share knowledge across teams
    * All other duties as assigned or required

    Skills and experience:

    Required:
    * After orientation at Cooley LLP, exhibit proficiency in the Microsoft Office suite, iManage and other firm applications
    * Ability to work extended and/or weekend hours, as required
    * Ability to travel, as required
    * 6+ years direct applicable experience (e.g. site reliability engineering or related field)
    * Proficiency in Terraform and programming languages such as Python, Go, or Java
    * Deep expertise in cloud platforms, particularly AWS, and container orchestration
    * Strong background in distributed systems, performance tuning, and automation
    * Hands-on experience with configuration management tools such as Puppet, Chef, or Salt

    Preferred:
    * Bachelor's Degree in Computer Science, Information Technology, Engineering, or associated discipline
    * Experience working with advanced ETL data workflows including technologies such as AWS EMR, Azure Synapse, Azure Data Factory, or Apache Hive/Spark/Airflow
    * Experience with IaC deployment of AKS/EKS/GKE architecture
    * Experience with enterprise Data Lake environments using technologies such as DataBricks or Snowflake

    Competencies:
    * Expert analytical/quantitative, problem-solving, and deductive reasoning skills, experience performing advanced troubleshooting and root cause analysis of complex technical issues
    * Excellent organizational, planning, and time management skills and ability to work independently and in a team environment to manage competing priorities and meet deadlines
    * Advanced verbal and written communication skills with the ability to present findings, conclusions, alternatives, and information clearly and concisely
    * Experience working with all levels of business professionals, management, stakeholders, and vendors with the ability to build effective relationships through trust and diplomacy

    Cooley offers a competitive compensation and excellent benefits package and is committed to fair and equitable employment practices. EOE.

    The expected annual pay range for this position with a full-time schedule is $140,000 - $205,000. Please note that final offer amount will be dependent on geographic location, applicable experience and skillset of the candidate.

    We offer a full range of elective benefits including medical, health savings account (with applicable medical plan), dental, vision, health and/or dependent care flexible spending accounts, pre-tax commuter benefits, life insurance, AD&D, long-term care coverage, backup care for children and/or adults and other parental support benefits. In addition to elective benefit options, benefited employees receive firm-paid life insurance, AD&D, LTD, short term medical benefits as well as 21 days of Paid Time Off (“PTO”) and 10 paid holidays each year. We provide generous parental leave and fertility benefits. New employees will attend a detailed benefit orientation to learn more about our many benefits and resources.

    Welcome to Cooley. We are counselors, strategists and advocates for today's and tomorrow's leaders of the business economy. We seek to meet the evolving needs of our clients by building a community of professionals of the highest caliber who share our vision and embrace our values.

    Working at Cooley provides an opportunity to work in an environment of collaboration, challenge and reward. We are all part of one firm dedicated to maintaining a diverse workplace that values and celebrates differences, from the way we relate to and support each other, to the way we work together to meet the needs of our clients. It is the unique abilities and perspectives of every individual at Cooley that creates a rewarding workplace.

    For Cooley, this means offering all employees the tools, training and mentoring they need to succeed. It enables every individual to balance work and family obligations. It looks beyond the Firm's four walls, fostering community involvement. It includes becoming leaders and contributors in our communities.

    Our cooperative spirit is the trademark of the Cooley Culture and every employee in every department is instrumental to the success of the Firm. We invite you to take a look at our open positions.
    $140k-205k yearly 5d ago
  • Low-Voltage Reliability Engineer for EV Electronics

    Rivian 4.1 company rating

    Reliability engineer job in Palo Alto, CA

    A leading electric vehicle manufacturer in Palo Alto is seeking a Design-for-Reliability Engineer to enhance the reliability of low voltage electronics in their vehicles. The role involves monitoring product performance, utilizing statistical methods, and collaborating with manufacturing teams. The ideal candidate has a Bachelor's degree in Engineering and over five years in reliability engineering. Salaries are competitive, ranging from $146,900 to $194,610 based on experience.
    $146.9k-194.6k yearly 1d ago
  • Reliability/DFX Engineer

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    About the Team
    OpenAI's Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI's supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.

    About the Role
    We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features.

    In this role, you will
    * Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware.
    * Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability.
    * Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams).
    * Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack.
    * Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI's requirements and roadmap.

    Qualifications
    * BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack.
    * Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred.
    * Detailed understanding of ML chip and platform architecture and ML workload characteristics is required.
    * Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis.

    About OpenAI
    OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

    We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

    Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

    To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

    We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

    OpenAI Global Applicant Privacy Policy

    At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
    $127k-176k yearly est. 3d ago
  • Site Reliability Engineer: Scale, Automate & Own Cloud Infra

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    A leading technology firm in San Francisco is looking for a Site Reliability Engineer to design and manage scalable infrastructure solutions. You will ensure high performance and availability while collaborating across teams. Candidates should have at least 5 years of experience, strong coding skills, and familiarity with various cloud services and automation tools. Join us in a culture of continuous improvement and innovation.
    $109k-150k yearly est. 1d ago
  • Senior SRE - AI-Driven Cloud Reliability & Automation

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    A leading energy technology firm seeks a Site Reliability Engineer to enhance its reliable, energy-efficient, AI-optimized cloud platform. In this role, you'll collaborate with cross-functional teams to improve system performance and incident management. Ideal candidates will have a strong background in cloud operations and automation, alongside critical problem-solving skills. Join this innovative team to drive sustainable technology and contribute to a cutting-edge infrastructure focused on operational excellence.
    $142k-189k yearly est. 4d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in San Mateo, CA?

The average reliability engineer in San Mateo, CA earns between $96,000 and $187,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in San Mateo, CA

$134,000

What are the biggest employers of Reliability Engineers in San Mateo, CA?

The biggest employers of Reliability Engineers in San Mateo, CA are:
  1. Replit
  2. Robust.Ai
  3. Zoox
  4. Skydio
  5. Visa
  6. Verkada
  7. Coinbase
  8. Attain
  9. Oneco 1