Reliability engineer jobs in Daly City, CA - 1,298 jobs

  • Reliability Engineer

    Periodiclabs

    Reliability engineer job in Menlo Park, CA

    About Periodic Labs We are an AI + physical sciences lab building state of the art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission. About the Role Join a world-class team of scientists and engineers pushing the boundaries of materials research in a groundbreaking lab where AI and automation unlock discoveries at unprecedented speed and scale. The Periodic Labs team is developing AI that can both simulate science as well as verify its predictions to train on the full scientific method. As part of this mission, the team is building a high throughput experimental materials science lab. For this facility, we are seeking a hands‑on Reliability Engineer to drive uptime and throughput across our experimental platforms. You'll lead maintenance operations, plan and manage the associated machine shop buildout, and design and integrate custom labware and fixtures to improve experimental efficiency and repeatability. Responsibilities Establish preventive and predictive maintenance programs for lab and automation systems and associated CMMS. Lead root cause analysis and develop corrective actions for system failures or bottlenecks. Build and manage a machine shop for fabricating custom components. Design, prototype, and test custom labware tailored to automated experimental workflows. Collaborate with automation, controls, and scientific teams to integrate fixtures into processes. Track and improve equipment uptime, MTBF, and OEE. Build and train a world‑class maintenance team of technicians to support the above systems Qualifications Bachelor's or Master's degree in an engineering field. 
5+ years of experience in reliability, mechanical, or systems engineering, especially related to supporting materials science/solid‑state chemistry and/or thin‑film processing research equipment. Familiarity with precision machining, mechatronics, or custom part design. Experience with CAD (SolidWorks/Fusion), shop tooling, and rapid prototyping. Strong knowledge of preventive maintenance systems. Comfortable collaborating across engineering, automation, and scientific teams. Certification and preferably contribution to accredited standards or bodies relevant to your field - e.g. SMRP, IEEE Reliability Society, relevant ASTMs. Bonus Qualifications We'd love to hear about your accomplishments in globally recognized competitions relevant to your field (e.g. NASA Lunabotics, ASME's competitions, etc.)
    $112k-160k yearly est. 3d ago
  • Site Reliability Engineer - AI Infra & Platform Resilience

    Rethink Recruit

    Reliability engineer job in San Francisco, CA

    A technology firm in San Francisco is seeking a Site Reliability Engineer (SRE) to ensure reliability and performance of their platform. The role involves designing infrastructure, monitoring systems, and mentoring peers. Ideal candidates have experience in SRE or DevOps, strong programming skills (Python or Go), and knowledge of containerization and infrastructure-as-code tools. This position offers a competitive salary, equity, and comprehensive health benefits while working on-site 4 days a week with remote flexibility.
    $113k-160k yearly est. 3d ago
  • Site Reliability Engineer

    Mercor, Inc.

    Reliability engineer job in San Francisco, CA

    About Mercor Mercor is at the intersection of labor markets and AI research. We partner with leading AI labs and enterprises to provide the human intelligence essential to AI development. Our vast talent network trains frontier AI models in the same way teachers teach students: by sharing knowledge, experience, and context that can't be captured in code alone. Today, more than 30,000 experts in our network collectively earn over $1.5 million a day. Mercor is creating a new category of work where expertise powers AI advancement. Achieving this requires an ambitious, fast‑paced and deeply committed team. You'll work alongside researchers, operators, and AI companies at the forefront of shaping the systems that are redefining society. Mercor is a profitable Series C company valued at $10 billion. We work in‑person five days a week in our new San Francisco headquarters. About the Role As a Site Reliability Engineer (SRE) at Mercor, you'll own production reliability across our most critical systems, partnering directly with infrastructure leadership. You'll play a foundational role in building our SRE function from the ground up and shaping how Mercor operates large‑scale, high‑availability systems. What You'll Do Own reliability and production safety for core shared services and customer‑facing systems. Partner directly with infrastructure leadership to define SRE priorities, reliability standards, and production safety roadmap. Repair and improve how our production systems are structured so they are stable, resource‑efficient, isolated, and well‑observed. Introduce and champion modern SRE practices (e.g., incident response, postmortems, SLIs/SLOs) across engineering teams. Collaborate with leverage engineering and applied AI teams to ensure sustainable growth. Represent SRE best practices internally and help teams onboard onto production in a way that is safe, scalable, and consistent with SRE principles. 
What We're Looking For Experience doing true SRE work (not just operations) across multiple roles or companies. Deep familiarity with SRE practices as popularized by Google (e.g., error budgets, reliability vs. risk trade‑offs, large‑scale distributed systems). 5+ years of SRE experience; 15+ years of overall experience is ideal for this first SRE hire. Proven success operating systems at scale, with a strong understanding of the challenges of large, distributed production environments. Strong collaboration skills; able to work efficiently with cross‑functional engineering teams. Ability to drive cultural change around reliability while remaining hands‑on in building and fixing systems. Comfort working in high‑intensity, high‑availability environments where uptime and production quality are critical. Nice to Haves Experience as a founding SRE or early SRE hire, standing up SRE practices and orgs from scratch. Hands‑on experience in the AWS ecosystem, Kubernetes, and modern IaC tooling (Terraform, Spacelift, etc.).
    $113k-160k yearly est. 2d ago
  • Site Reliability Engineer - Scale Resilience & Observability

    Happyrobot Inc.

    Reliability engineer job in San Francisco, CA

    A leading AI startup based in San Francisco is seeking a Site Reliability Engineer to enhance operational resilience. In this role, you will oversee stability, observability, and debugging workflows, transforming complex failures into seamless operations. Ideal candidates have 3+ years in debugging production systems and are comfortable with coding in Python and Go. Excellent problem-solving skills and ability to communicate clearly under pressure are essential. Join a passionate team and help shape the future of AI-driven enterprises.
    $113k-160k yearly est. 5d ago
  • Site Reliability Engineer

    Fluix

    Reliability engineer job in Palo Alto, CA

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America's compute capacity without building new data centers. We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid-based (Cloud & On-Prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, AI, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML/AI and experience with data center sites will be crucial in driving the success of our platform. Who you'll work closely with Founder & CEO Chase Overcash CTO What you'll do Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment. Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability. Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while ensuring seamless integration and high performance of cutting-edge models within our technology stack. Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies. Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence. 
Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications. Your background Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience). Proven experience as a Site Reliability Engineer or similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP, Azure), experience with ML and AI technologies, and familiarity with data center operations integrations. Proficiency in programming and scripting languages (e.g., Python), experience with containerization and orchestration tools (Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices. Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment. Culture Fit We are looking for obsessed individuals who want to give it their all. We are not afraid to get our hands dirty with physical and software systems. We are eager to visit and work with clients and understand the importance and gravitas of their mission-critical work. We are eager to come into the office and on-site, as our work directly affects physical environments. Due to our mission-critical work, we understand and are eager to help our teammates and co-workers during holidays, weekends, and emergencies. We are cordial and over-communicate with teammates, co-workers, and management. Attractive compensation package, including equity options. Comprehensive health, dental, and vision insurance, along with other standard benefits. A dynamic and collaborative San Francisco Bay Area work environment.
Opportunities for professional growth and development, with the chance to shape the future of technology in the industry.
    $112k-159k yearly est. 3d ago
  • Site Reliability Engineer - Kubernetes

    Theklicker

    Reliability engineer job in Palo Alto, CA

    theklicker is an online platform specializing in electronic product price comparison, enabling users to browse prices across multiple booking sites effortlessly. We are dedicated to being a one-stop solution for purchasing electronic products. With a focus on delivering the best user experience, theklicker empowers users to make informed purchasing decisions quickly and efficiently. Role Description This is a full-time on-site role for a Site Reliability Engineer - Kubernetes, based in Palo Alto, CA. The role involves maintaining and optimizing system reliability, managing infrastructure, troubleshooting technical issues, and supporting software deployments. The Site Reliability Engineer will work closely with development and operations teams to ensure seamless operations and robust technology solutions. Qualifications Proven expertise in Site Reliability Engineering and troubleshooting complex system issues. Experience in Software Development with a strong understanding of coding best practices. Proficiency in System Administration, managing Linux/Unix environments, and implementing automation scripts. Knowledge of Kubernetes and infrastructure management. Strong problem-solving, analytical, and communication skills. Experience with monitoring and incident management tools is a plus. Bachelor's degree in Computer Science, Engineering, or a related field.
    $112k-159k yearly est. 5d ago
  • Site Reliability Engineer - Observability & Automation

    Black.Ai

    Reliability engineer job in Palo Alto, CA

    A leading quantum computing company is seeking a Site Reliability Engineer to join their OS/Platform team in Palo Alto. This role involves maintaining the health and performance of services through effective monitoring using Grafana, Prometheus, and more. The ideal candidate will have extensive experience in SRE or DevOps roles, hands-on expertise with observability tools, and strong automation skills. This position offers a competitive salary and unique opportunities in the evolving field of quantum computing.
    $112k-159k yearly est. 1d ago
  • Site Reliability Engineer - Kubernetes Platform

    Pantera Capital

    Reliability engineer job in Palo Alto, CA

    About xAI xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates. About the Role We are seeking a highly skilled Senior Site Reliability Storage Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI's infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment. Responsibilities Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently. Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads. Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs. 
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems. Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible. Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs. Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components. This is an in-person role based in Palo Alto, CA, with up to 25% travel required. Required Qualifications 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems. Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm. Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible. Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components. Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs. Preferred Qualifications Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments. Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience. Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation. Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges. Passion for problem-solving and a proactive drive to deliver impactful results. A sense of adventure and humor to navigate challenges with a positive mindset. 
Annual Salary Range $180,000 - $440,000 USD Benefits Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks. xAI is an equal opportunity employer. California Consumer Privacy Act (CCPA) Notice
    $112k-159k yearly est. 3d ago
  • Site Reliability Engineer - Observability

    Rivian (4.1 company rating)

    Reliability engineer job in Palo Alto, CA

    About Us Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive's next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we're addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world. The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we'll map a new way forward. Working together, we'll create a future that's more connected, more intelligent, more sustainable for everyone. Role Summary We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems. Responsibilities Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting. Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments. Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements. Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity. Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data. 
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization. Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle. Qualifications Educational Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience. Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability. Technical Expertise: Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog. Experience with OpenTelemetry and distributed tracing in microservices architectures. Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane. Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions. Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals. Soft Skills: Exceptional problem-solving, communication, and a data-driven approach to decision-making. Pay Disclosure Salary Range/Hourly Rate for California Based Applicants: $146,900 - $194,610 USD Actual Compensation will be determined based on experience, location, and other factors permitted by law. Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical, prescription, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and their children up to age 26. Coverage is effective on the first day of employment. Equal Opportunity Rivian and Volkswagen Group Technologies is committed to creating a diverse environment and is proud to be an equal opportunity employer. 
All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law. We are also committed to ensuring compliance with all applicable fair employment practice laws regarding citizenship and immigration status. Rivian and Volkswagen Group Technologies is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com. Candidate Data Privacy Rivian and VW Group Technologies ("Rivian and Volkswagen Group Technologies") may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. 
Rivian and Volkswagen Group Technologies may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law. Rivian and Volkswagen Group Technologies may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian and Volkswagen Group Technologies affiliates; and (iii) Rivian and Volkswagen Group Technologies' service providers, including providers of background checks, staffing services, and cloud services. Rivian and Volkswagen Group Technologies may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions. Please see our Candidate Data Privacy Notice (English) and Candidate Data Privacy Notice (Serbian) for more information. Please note that we are currently not accepting applications from third party application services.
    $146.9k-194.6k yearly 5d ago
  • Founding SRE Engineer - Reliability & Growth

    Asana (4.6 company rating)

    Reliability engineer job in San Francisco, CA

    A leading software company is seeking experienced Software Engineers to join the new Site Reliability Engineering team. This role focuses on building reliable, scalable systems and leading projects across infrastructure. Candidates should have strong software engineering skills and a passion for reliability. The position offers a hybrid work model and generous compensation packages with additional benefits.
    $147k-189k yearly est. 4d ago
  • Reliability/DFX Engineer

    OpenAI (4.2 company rating)

    Reliability engineer job in San Francisco, CA

    About the Team OpenAI's Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI's supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI. About the Role We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features. In this role, you will Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware. Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability. 
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams). Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack. Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI's requirements and roadmap. Qualifications BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack. Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred. Detailed understanding of ML chip and platform architecture and ML workload characteristics is required. Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis. About OpenAI OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. 
For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement. Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link. OpenAI Global Applicant Privacy Policy At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
    $127k-176k yearly est. 4d ago
  • Senior PostgreSQL DBRE - Scale, Reliability & Automation

    Okta, Inc. (4.3 company rating)

    Reliability engineer job in San Francisco, CA

    A leading identity management firm is looking for a Senior Database Reliability Engineer (DBRE) in San Francisco, California. The ideal candidate will have over 4 years of experience specifically with PostgreSQL and will be responsible for designing and optimizing data persistence layers for mission-critical systems. Key responsibilities include leading database incidents, working cross-functionally with platform teams, developing automation for tasks, and ensuring high availability across database environments. This position is essential for operational excellence in a hybrid environment.
    $157k-199k yearly est. 1d ago
  • Site Reliability Engineer: Scale, Automate & Own Cloud Infra

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    A leading technology firm in San Francisco is looking for a Site Reliability Engineer to design and manage scalable infrastructure solutions. You will ensure high performance and availability while collaborating across teams. Candidates should have at least 5 years of experience, strong coding skills, and familiarity with various cloud services and automation tools. Join us in a culture of continuous improvement and innovation.
    $109k-150k yearly est. 2d ago
  • Senior Site Reliability Engineer, Compute

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

    About This Role: At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe's compute infrastructure. You'll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

    What You'll Be Working On: In this role, you will develop automation and observability tools to monitor Crusoe's compute infrastructure, spanning from the kernel to orchestration layers. You will support and scale the company's virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you'll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

    What You'll Bring to the Team:
    * 5+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
    * Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
    * Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
    * Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel bypass techniques.
    * Expert-level skills in at least one programming language: Go, C or Rust.
    * Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.
    * Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.
    * Strong understanding of compute scheduling, resource management, and high-throughput networking.

    Benefits:
    * Industry competitive pay
    * Restricted Stock Units in a fast growing, well-funded technology company
    * Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
    * Employer contributions to HSA accounts
    * Paid Parental Leave
    * Paid life insurance, short-term and long-term disability
    * Teladoc
    * 401(k) with a 100% match up to 4% of salary
    * Generous paid time off and holiday schedule
    * Cell phone reimbursement
    * Tuition reimbursement
    * Subscription to the Calm app
    * MetLife Legal
    * Company paid commuter benefit; $300 per pay period

    Compensation Range: Compensation will be paid in the range of $166,000 - $201,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

    Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
    $166k-201k yearly 4d ago
  • Senior Technology Site Reliability Engineer

    Cooley LLP 4.8 company rating

    Reliability engineer job in San Francisco, CA

    Locations: San Francisco; New York; Santa Monica; Los Angeles; Palo Alto
    Time type: Full time
    Posted on: Yesterday
    Job requisition ID: Req 4348

    Cooley is seeking a Senior Site Reliability Engineer to join the Infrastructure & Development Operations team.

    The Senior Technology Site Reliability Engineer (“SRE”) is responsible for ensuring the reliability, scalability, and performance of the firm's critical infrastructure and applications. The SRE blends software engineering with systems engineering to build and maintain automated, resilient, and observable systems that support high availability and operational excellence. In addition to being technically advanced, the SRE will have a high degree of emotional intelligence and the ability to work as a team towards complex and layered objectives. Specific duties and responsibilities include, but are not limited to, the following:

    Position responsibilities:
    * Monitor and maintain production systems to ensure high availability and performance
    * Implement and manage service-level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets
    * Participate in on-call rotations and incident response, including root cause analysis and postmortems
    * Develop and maintain infrastructure as code (IaC) using Terraform
    * Automate deployment, scaling, and recovery processes to reduce manual intervention
    * Partner with DevOps to build and maintain CI/CD pipelines to support safe and efficient software delivery
    * Implement observability solutions using metrics, logs, traces, and alerting systems (Prometheus, Grafana, DataDog, etc.)
    * Proactively identify and resolve system bottlenecks and reliability risks
    * Work closely with Infrastructure, DevOps, Development, and security teams to embed reliability into the development lifecycle
    * Contribute to a culture of blameless post-mortems and continuous improvement
    * Document operational procedures and share knowledge across teams
    * All other duties as assigned or required

    Skills and experience:

    Required:
    * After orientation at Cooley LLP, exhibit proficiency in the Microsoft Office suite, iManage and other firm applications
    * Ability to work extended and/or weekend hours, as required
    * Ability to travel, as required
    * 6+ years direct applicable experience (e.g. site reliability engineering or related field)
    * Proficiency in Terraform and programming languages such as Python, Go, or Java
    * Deep expertise in cloud platforms, particularly AWS, and container orchestration
    * Strong background in distributed systems, performance tuning, and automation
    * Hands-on experience with configuration management tools such as Puppet, Chef, or Salt

    Preferred:
    * Bachelor's Degree in Computer Science, Information Technology, Engineering, or associated discipline
    * Experience working with advanced ETL data workflows including technologies such as AWS EMR, Azure Synapse, Azure Data Factory, or Apache Hive/Spark/Airflow
    * Experience with IaC deployment of AKS/EKS/GKE architecture
    * Experience with enterprise Data Lake environments using technologies such as DataBricks or Snowflake

    Competencies:
    * Expert analytical/quantitative, problem-solving, and deductive reasoning skills, experience performing advanced troubleshooting and root cause analysis of complex technical issues
    * Excellent organizational, planning, and time management skills and ability to work independently and in a team environment to manage competing priorities and meet deadlines
    * Advanced verbal and written communication skills with the ability to present findings, conclusions, alternatives, and information clearly and concisely
    * Experience working with all levels of business professionals, management, stakeholders, and vendors with the ability to build effective relationships through trust and diplomacy

    Cooley offers a competitive compensation and excellent benefits package and is committed to fair and equitable employment practices. EOE.

    The expected annual pay range for this position with a full-time schedule is $140,000 - $205,000. Please note that final offer amount will be dependent on geographic location, applicable experience and skillset of the candidate.

    We offer a full range of elective benefits including medical, health savings account (with applicable medical plan), dental, vision, health and/or dependent care flexible spending accounts, pre-tax commuter benefits, life insurance, AD&D, long-term care coverage, backup care for children and/or adults and other parental support benefits. In addition to elective benefit options, benefited employees receive firm-paid life insurance, AD&D, LTD, short term medical benefits as well as 21 days of Paid Time Off (“PTO”) and 10 paid holidays each year. We provide generous parental leave and fertility benefits. New employees will attend a detailed benefit orientation to learn more about our many benefits and resources.

    Welcome to Cooley. We are counselors, strategists and advocates for today's and tomorrow's leaders of the business economy. We seek to meet the evolving needs of our clients by building a community of professionals of the highest caliber who share our vision and embrace our values.

    Working at Cooley provides an opportunity to work in an environment of collaboration, challenge and reward. We are all part of one firm dedicated to maintaining a diverse workplace that values and celebrates differences, from the way we relate to and support each other, to the way we work together to meet the needs of our clients. It is the unique abilities and perspectives of every individual at Cooley that creates a rewarding workplace.

    For Cooley, this means offering all employees the tools, training and mentoring they need to succeed. It enables every individual to balance work and family obligations. It looks beyond the Firm's four walls, fostering community involvement. It includes becoming leaders and contributors in our communities.

    Our cooperative spirit is the trademark of the Cooley Culture and every employee in every department is instrumental to the success of the Firm. We invite you to take a look at our open positions.
    $140k-205k yearly 1d ago
  • Low-Voltage Reliability Engineer for EV Electronics

    Rivian 4.1 company rating

    Reliability engineer job in Palo Alto, CA

    A leading electric vehicle manufacturer in Palo Alto is seeking a Design-for-Reliability Engineer to enhance the reliability of low voltage electronics in their vehicles. The role involves monitoring product performance, utilizing statistical methods, and collaborating with manufacturing teams. The ideal candidate has a Bachelor's degree in Engineering and over five years in reliability engineering. Salaries are competitive, ranging from $146,900 to $194,610 based on experience.
    $146.9k-194.6k yearly 2d ago
  • Reliability Engineer: Scale Systems, Observe & Automate

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations. Candidates should have strong cloud proficiency, experience in containerization technologies, and a bachelor's degree in a related field.
    $127k-176k yearly est. 2d ago
  • Senior PostgreSQL DBRE - Reliability at Scale

    Okta, Inc. 4.3 company rating

    Reliability engineer job in San Francisco, CA

    A leading identity management company is seeking a skilled Senior Database Reliability Engineer (DBRE) to optimize and manage their PostgreSQL database environment. The role emphasizes designing resilient data infrastructure, automating key database processes, and collaborating with engineering teams. With a focus on high availability and performance optimization, the ideal candidate will possess extensive experience in high-volume production environments, specifically with PostgreSQL and MySQL. This hybrid position requires in-person onboarding in San Francisco.
    $157k-199k yearly est. 1d ago
  • Site Reliability Engineer

    Clay 4.0 company rating

    Reliability engineer job in San Francisco, CA

    Clay is a creative tool for growth. Our mission is to help businesses grow - without huge investments in tooling or manual labor. We're already helping over 100,000 people grow their business with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly translate any idea that you have for growing your company into reality. We believe that modern GTM teams win by finding GTM alpha - a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We're looking for sharp, low-ego people to help teams find their GTM alpha.

    Why is Clay the best place to work?
    * Customers love the product (100K+ users and growing)
    * We're growing a lot (6x YoY last year, and 10x YoY the two years before that)
    * Incredible culture (our customers keep applying to work here)
    * Well-resourced - We raised a $100M Series C in 2025 at a $3.1B valuation and are backed by world-class investors like Capital G (Google), Sequoia and Meritech

    Read more about why people love working at Clay and explore our wall of love to learn more about the product.

    SRE @ Clay
    In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles.

    What You'll Do
    * Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
    * Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
    * Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
    * Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
    * Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
    * Participate in an oncall rotation.
    * Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.

    What You'll Bring
    * 5+ years of experience
    * Experience with containerization and orchestration tools
    * Strong understanding of CI/CD concepts and tools
    * Knowledge of infrastructure automation tools
    * Experience with oncall and incident response
    * Proficiency in one or more programming languages
    * Familiarity with our stack or ability to learn unfamiliar technologies quickly:
      * Aurora Postgres RDS, Elasticache Redis, Docker + ECS, Lambda, OpenSearch
      * Terraform and Atlantis
      * CircleCI, Netlify, Playwright
      * Cloudwatch, Datadog, Mezmo
      * Typescript, Python
    $109k-150k yearly est. 2d ago
  • Senior SRE - AI-Driven Cloud Reliability & Automation

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    A leading energy technology firm seeks a Site Reliability Engineer to enhance its reliable, energy-efficient, AI-optimized cloud platform. In this role, you'll collaborate with cross-functional teams to improve system performance and incident management. Ideal candidates will have a strong background in cloud operations and automation, alongside critical problem-solving skills. Join this innovative team to drive sustainable technology and contribute to a cutting-edge infrastructure focused on operational excellence.
    $142k-189k yearly est. 5d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in Daly City, CA?

The average reliability engineer in Daly City, CA earns between $96,000 and $188,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in Daly City, CA

$134,000

What are the biggest employers of Reliability Engineers in Daly City, CA?

The biggest employers of Reliability Engineers in Daly City, CA are:
  1. Cloudflare
  2. OpenAI
  3. Redwood Materials
  4. Rethink Recruit
  5. Cisco
  6. Google
  7. Clay & Company
  8. NEAR Protocol
  9. Intelliswift
  10. Apple