Reliability engineer jobs in Walnut Creek, CA - 1,409 jobs

  • Site Reliability Engineer US - San Francisco

    Near Inc. 4.6 company rating

    Reliability engineer job in San Francisco, CA

    The NEAR AI engineering team is developing decentralized and confidential machine learning infrastructure to power user-owned AI. We currently focus on building infrastructure to enable private and confidential inference that works across different compute providers, as well as a blockchain-based coordination layer that incentivizes compute providers to join the decentralized inference network. You will own various components and drive critical decisions throughout their life cycles, including architecture, implementation, and maintenance. You will collaborate with highly knowledgeable and skilled colleagues who are passionate about solving hard problems that can disrupt the industry.

    What You'll Be Doing:
      • End-to-end infrastructure ownership (for handling telemetry data, for performing testing, etc.)
      • Design and implementation of infrastructure components that manage clusters of GPUs with special configurations
      • Performance tuning and optimizations
      • Create and maintain runbooks that support the on-call rotation
      • Participate in the on-call rotation
      • Support code releases and delivery
      • Plan and implement infrastructure cost and security strategies
      • Plan and implement effective CI/CD pipelines to facilitate development processes

    What We're Looking For:
      • Agility to quickly learn new programming languages and technologies
      • Ability to write clean and efficient code
      • Ability to transform ambiguous problems into tangible solutions or prototypes
      • Experience with software concurrency or parallelism
      • Experience in building, operating, and scaling cloud infrastructure (GCP, AWS, etc.)
      • Experience with data visualization and observability tooling (Grafana, Graphite, Zabbix, etc.)
      • Detail-oriented mindset with a focus on setting priorities and progressing towards objectives
      • Excellent communication and teamwork skills
      • Bachelor's Degree in Computer Science or a related field

    We'd Love If You Have:
      • Experience with NEAR or other blockchain internals
      • Experience with GPUs
      • Experience with Trusted Execution Environments
      • Experience debugging and troubleshooting complex concurrent systems
      • Professional experience with Rust

    Locations: onsite, San Francisco office
    $126k-176k yearly est. 1d ago
  • Site Reliability Engineer

    The Voleon Group 4.1 company rating

    Reliability engineer job in Berkeley, CA

    Voleon is a technology company that applies state‑of‑the‑art AI and machine learning techniques to real‑world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying AI/ML to investment management. We have become a multibillion‑dollar asset manager, and we have ambitious goals for the future. Your colleagues will include internationally recognized experts in artificial intelligence and machine learning research as well as highly experienced finance and technology professionals. The people who shape our company come from other backgrounds, including concert music performances, humanitarian aid, opera singing, sports writing, and BMX racing. You will be part of a team that loves to succeed together. In addition to our enriching and collegial working environment, we offer highly competitive compensation and benefits packages, technology talks by our experts, a beautiful modern office, daily catered lunches, and more. As a Site Reliability Engineer (SRE), you will work at the intersection of production operations and software development as you improve, manage, and monitor production‑critical infrastructure and data pipelines. At Voleon, many SREs serve together on a Production Operations team tasked with improving shared production infrastructure. Others are embedded with teams of software engineers to improve specific production systems owned by those teams. Voleon SREs work on important real‑world problems and collaborate with passionate and talented colleagues in an empowering, results‑driven environment. This role is a way to make a real difference: your contributions will make our critical systems more reliable, lower operational risk, and increase the efficiency of our engineering effort. 

    Responsibilities
      • Improve fault‑tolerance and maintainability of code in proprietary data pipelines and trading systems
      • Diagnose and fix bugs in code
      • Lead complex deployments
      • Automate manual workflows
      • Track and prioritize outstanding production‑related issues
      • Share an on‑call rotation responding to incidents to ensure the continuous operation of production‑critical systems

    Requirements
      • Experience with coding and debugging Python
      • Experience with Linux
      • Familiarity with Relational Databases & SQL
      • Sharp analytical and problem‑solving skills and a persistent drive to make things work (better)
      • Strong growth mindset and a passion for learning
      • Strong technical communication skills
      • Attention to detail
      • 2 years of relevant industry experience
      • An undergraduate degree or comparable training in a quantitative field or equivalent, relevant industry experience

    Preferred Qualifications
      • Familiarity with best practices concerning code maintainability, documentation, quality assurance, continuous integration and deployment
      • Experience supporting production systems
      • Experience with any of the following: gRPC microservices, Postgres, Pandas, Golang, R, Git, Jenkins, Bazel, Prometheus, Grafana, Airflow, Kubernetes

    The base salary for this position is $120,000 to $160,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

    Friends of Voleon Candidate Referral Program
    If you have a great candidate in mind for this role and would like to have the potential to earn $7,500 - $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program.

    Equal Opportunity Employer
    The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $120k-160k yearly 3d ago
  • Site Reliability Engineer - AI Infra & Platform Resilience

    Rethink Recruit

    Reliability engineer job in San Francisco, CA

    A technology firm in San Francisco is seeking a Site Reliability Engineer (SRE) to ensure reliability and performance of their platform. The role involves designing infrastructure, monitoring systems, and mentoring peers. Ideal candidates have experience in SRE or DevOps, strong programming skills (Python or Go), and knowledge of containerization and infrastructure-as-code tools. This position offers a competitive salary, equity, and comprehensive health benefits while working on-site 4 days a week with remote flexibility.
    $113k-160k yearly est. 5d ago
  • Site Reliability Engineer - Scale & Observability

    Gamma.App

    Reliability engineer job in San Francisco, CA

    A dynamic tech firm located in San Francisco is seeking a Site Reliability Engineer to enhance operational health across their production systems. This high-impact role demands expertise in AWS and strong programming skills. You will manage production systems' reliability and lead incident response efforts to prevent issues, all while contributing to the scalability and efficiency of their services. Ideal candidates will have 5+ years of relevant experience and a passion for leveraging technology to drive outcomes.
    $113k-160k yearly est. 1d ago
  • Site Reliability Engineer - Scale Resilience & Observability

    Happyrobot Inc.

    Reliability engineer job in San Francisco, CA

    A leading AI startup based in San Francisco is seeking a Site Reliability Engineer to enhance operational resilience. In this role, you will oversee stability, observability, and debugging workflows, transforming complex failures into seamless operations. Ideal candidates have 3+ years in debugging production systems and are comfortable with coding in Python and Go. Excellent problem-solving skills and ability to communicate clearly under pressure are essential. Join a passionate team and help shape the future of AI-driven enterprises.
    $113k-160k yearly est. 2d ago
  • Founding Site Reliability Engineer

    Assort Health Inc.

    Reliability engineer job in San Francisco, CA

    Our mission is to make exceptional healthcare accessible anytime, anywhere, for everyone. At Assort Health, we believe healthcare should feel effortless and connected - quick answers, clear communication, and seamless access to care. That's why we're building a new foundation for how patients and providers connect, driven by AI, built to embrace the complexities of healthcare, and tailored to each provider's unique needs. Assort is the most comprehensive patient experience platform powered by specialty-specific agentic AI. Assort's omnichannel AI agents seamlessly integrate with EHR/PMS and complicated provider preferences to eliminate lengthy hold times and inefficiencies that stand in the way of patients getting the care they need. Since launching in 2023, Assort has managed over 50M patient interactions, slashing average hold times from 11 minutes to 1 minute. Our platform now handles calls for thousands of providers with 98%+ resolution rates and 99% scheduling accuracy. Patient satisfaction averages 4.4/5, and we've achieved 11× revenue growth since Q4 2024. We're scaling rapidly and expanding adoption across the entire healthcare industry.

    What You'll Do
    You'll be the go-to expert for keeping our systems fast, stable, and resilient. While your primary mission is reliability, you'll also help shape the infrastructure, CI/CD, and tooling that enable the team to move faster and safer.

    Your scope may include:
      • Define, own, and improve SLIs / SLOs / error budgets - set measurable targets around availability, latency, and error rates, and drive toward achieving them
      • Build and maintain observability across the stack (metrics, logging, tracing, dashboards, alerts, anomaly detection) and lead incident management - coordinating responses, improving runbooks and postmortems, automating with AI tools, and collaborating with partners like Deepgram, Cartesia, GCP, and EHRs to ensure capacity
      • Reduce operational toil by automating repetitive tasks and building self-healing systems and remediation workflows
      • Improve deployment safety through canary or blue/green rollouts, automated rollbacks, chaos experiments, and deployment guardrails
      • Contribute to infrastructure work: IaC, cloud architecture, networking, autoscaling, and related systems
      • Ensure reliability across services, databases, caches, queues, third-party integrations, and networks
      • Drive capacity planning, performance tuning, and cost optimization
      • Mentor others on reliability best practices and champion a reliability mindset across engineering

    Experience & Background
      • 3+ years focused on reliability, SRE, or production infrastructure
      • Hands-on experience running production systems in startups or growth-stage companies (not just large enterprises)
      • Comfortable balancing firefighting with strategic reliability improvements

    Technical Must-haves:
      • Cloud infrastructure experience (GCP preferred; AWS or Azure fine)
      • Implemented or maintained observability stacks (Datadog, Prometheus, Grafana, Honeycomb, OpenTelemetry, Sentry, PagerDuty)
      • Can code and automate
      • Comfortable with Kubernetes

    Nice-to-haves:
      • Infrastructure-as-code (Terraform strongly preferred)
      • CI/CD pipelines and modern deployment strategies
      • Early-stage/high-growth experience
      • Exposure to security, compliance, or resilience architectures
      • Voice infra experience (Twilio, etc.)

    Why This Role Matters
      • You'll build the reliability foundation - systems, culture, and practices - from the ground up
      • You'll work cross-functionally with product, engineering, and leadership to influence priorities and tradeoffs
      • You'll see direct impact: fewer outages, faster incident resolution, and more confidence in launches

    Benefits & Perks for Assorties
      • 💸 Competitive Compensation - Including salary and employee stock options so you share in our success.
      • 📚 Lifelong Learning - Annual budget for professional development, plus training opportunities to help you grow.
      • 💻 Office Setup Stipend - We'll outfit your in-office workspace so it's as comfy as it's productive.
      • 🩺 Top-Tier Health Coverage - Medical, dental, and vision insurance, because your health comes first.
      • 🏖 Unlimited PTO - We trust you to take the time you need to recharge and come back ready to crush it.
      • 🥗 Meals & Snacks - Lunch, dinner, and snack breaks that fuel great ideas.
      • 💪 Wellness Stipend - Your physical and mental well-being matters, and we've got a yearly stipend to prove it.
      • 👵 401(k) - Let us help you plan for the future. We've got you covered.

    How We Work & What We Value
    Our team at Assort Health moves fast, stays focused, and is fueled by a desire to serve our customers and patients. Our company values guide how we work; they are present in how we show up, make decisions and work together to move our mission forward. We bring a Day One Drive, relentlessly striving to improve, keep a 5-Star Focus, as our customers are our lifeblood, always Answer the Call, remembering that ownership and accountability are paramount, and show up with One Pulse, because we are one team, with one rhythm and one result. Our team is growing and we are looking for motivated, hardworking, and passionate talent. If you want to make healthcare accessible for everyone, we'd love to hear from you!
    $113k-160k yearly est. 3d ago
  • Site Reliability Engineer - Observability & Automation

    Black.Ai

    Reliability engineer job in Palo Alto, CA

    A leading quantum computing company is seeking a Site Reliability Engineer to join their OS/Platform team in Palo Alto. This role involves maintaining the health and performance of services through effective monitoring using Grafana, Prometheus, and more. The ideal candidate will have extensive experience in SRE or DevOps roles, hands-on expertise with observability tools, and strong automation skills. This position offers a competitive salary and unique opportunities in the evolving field of quantum computing.
    $112k-159k yearly est. 3d ago
  • Site Reliability Engineer

    Fluix

    Reliability engineer job in Palo Alto, CA

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America's compute capacity without building new data centers. We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid-based (Cloud & On-Prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, AI, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML/AI and experience with data center sites will be crucial in driving the success of our platform.

    Who you'll work closely with
      • Founder & CEO Chase Overcash
      • CTO

    What you'll do
      • Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment.
      • Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability.
      • Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while ensuring seamless integration and high performance of cutting-edge models within our technology stack.
      • Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies.
      • Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence.
      • Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications.

    Your background
      • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
      • Proven experience as a Site Reliability Engineer or similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP, Azure), experience with ML and AI technologies, and familiarity with data center operations integrations.
      • Proficiency in programming and scripting languages (e.g., Python), experience with containerization and orchestration tools (Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices.
      • Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment.

    Culture Fit
      • We are looking for obsessed individuals who want to give it their all.
      • We are not afraid to get our hands dirty with physical and software systems.
      • We are eager to visit and work with clients and understand the importance and gravitas of their mission-critical work.
      • We are eager to come into the office and on-site, as our work directly affects physical environments.
      • Due to our mission-critical work, we understand and are eager to help our teammates and co-workers during holidays, weekends, and emergencies.
      • We are cordial and over-communicate with teammates, co-workers, and management.

      • Attractive compensation package, including equity options.
      • Comprehensive health, dental, and vision insurance, along with other standard benefits.
      • A dynamic and collaborative San Francisco Bay Area work environment.
      • Opportunities for professional growth and development, with the chance to shape the future of technology in the industry.
    $112k-159k yearly est. 5d ago
  • Site Reliability Engineer - Kubernetes Platform

    Pantera Capital

    Reliability engineer job in Palo Alto, CA

    About xAI
    xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

    About the Role
    We are seeking a highly skilled Senior Site Reliability Storage Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI's infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

    Responsibilities
      • Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
      • Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
      • Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
      • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
      • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
      • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
      • Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.

    This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

    Required Qualifications
      • 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
      • Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
      • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
      • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
      • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

    Preferred Qualifications
      • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
      • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
      • Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.
      • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
      • Passion for problem-solving and a proactive drive to deliver impactful results.
      • A sense of adventure and humor to navigate challenges with a positive mindset.

    Annual Salary Range
    $180,000 - $440,000 USD

    Benefits
    Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

    xAI is an equal opportunity employer.

    California Consumer Privacy Act (CCPA) Notice
    $112k-159k yearly est. 5d ago
  • Site Reliability Engineer - Kubernetes

    Theklicker

    Reliability engineer job in Palo Alto, CA

    theklicker is an online platform specializing in electronic product price comparison, enabling users to browse prices across multiple booking sites effortlessly. We are dedicated to being a one-stop solution for purchasing electronic products. With a focus on delivering the best user experience, theklicker empowers users to make informed purchasing decisions quickly and efficiently.

    Role Description
    This is a full-time on-site role for a Site Reliability Engineer - Kubernetes, based in Palo Alto, CA. The role involves maintaining and optimizing system reliability, managing infrastructure, troubleshooting technical issues, and supporting software deployments. The Site Reliability Engineer will work closely with development and operations teams to ensure seamless operations and robust technology solutions.

    Qualifications
      • Proven expertise in Site Reliability Engineering and troubleshooting complex system issues
      • Experience in Software Development with a strong understanding of coding best practices
      • Proficiency in System Administration, managing Linux/Unix environments, and implementing automation scripts
      • Knowledge of Kubernetes and infrastructure management
      • Strong problem-solving, analytical, and communication skills
      • Experience with monitoring and incident management tools is a plus
      • Bachelor's degree in Computer Science, Engineering, or a related field
    $112k-159k yearly est. 2d ago
  • Reliability Engineer

    Periodiclabs

    Reliability engineer job in Menlo Park, CA

    About Periodic Labs
    We are an AI + physical sciences lab building state of the art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

    About the Role
    Join a world-class team of scientists and engineers pushing the boundaries of materials research in a groundbreaking lab where AI and automation unlock discoveries at unprecedented speed and scale. The Periodic Labs team is developing AI that can both simulate science as well as verify its predictions to train on the full scientific method. As part of this mission, the team is building a high throughput experimental materials science lab. For this facility, we are seeking a hands‑on Reliability Engineer to drive uptime and throughput across our experimental platforms. You'll lead maintenance operations, plan and manage the associated machine shop buildout, and design and integrate custom labware and fixtures to improve experimental efficiency and repeatability.

    Responsibilities
      • Establish preventive and predictive maintenance programs for lab and automation systems and associated CMMS.
      • Lead root cause analysis and develop corrective actions for system failures or bottlenecks.
      • Build and manage a machine shop for fabricating custom components.
      • Design, prototype, and test custom labware tailored to automated experimental workflows.
      • Collaborate with automation, controls, and scientific teams to integrate fixtures into processes.
      • Track and improve equipment uptime, MTBF, and OEE.
      • Build and train a world‑class maintenance team of technicians to support the above systems.

    Qualifications
      • Bachelor's or Master's degree in an engineering field.
      • 5+ years experience in reliability, mechanical, or systems engineering, especially related to supporting materials science/solid‑state chemistry and/or thin‑film processing research equipment.
      • Familiarity with precision machining, mechatronics, or custom part design.
      • Experience with CAD (SolidWorks/Fusion), shop tooling, and rapid prototyping.
      • Strong knowledge of preventive maintenance systems.
      • Comfortable collaborating across engineering, automation, and scientific teams.
      • Certification and preferably contribution to accredited standards or bodies relevant to your field - e.g. SMRP, IEEE Reliability Society, relevant ASTMs.

    Bonus Qualifications
    We'd love to hear about your accomplishments in globally recognized competitions relevant to your field (e.g. NASA Lunabotics, ASME's competitions, etc.)
    $112k-160k yearly est. 5d ago
  • Site Reliability Engineer - Observability

    Rivian 4.1 company rating

    Reliability engineer job in Palo Alto, CA

    About Us Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive's next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we're addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world. The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we'll map a new way forward. Working together, we'll create a future that's more connected, more intelligent, more sustainable for everyone. Role Summary We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems. Responsibilities Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting. Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments. Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements. Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity. Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data. 
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization. Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle. Qualifications Educational Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience. Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability. Technical Expertise: Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog. Experience with OpenTelemetry and distributed tracing in microservices architectures. Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane. Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions. Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals. Soft Skills: Exceptional problem-solving, communication, and a data-driven approach to decision-making. Pay Disclosure Salary Range/Hourly Rate for California Based Applicants: $146,900 - $194,610 USD Actual Compensation will be determined based on experience, location, and other factors permitted by law. Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical, prescription, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and their children up to age 26. Coverage is effective on the first day of employment. Equal Opportunity Rivian and Volkswagen Group Technologies is committed to creating a diverse environment and is proud to be an equal opportunity employer. 
All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law. We are also committed to ensuring compliance with all applicable fair employment practice laws regarding citizenship and immigration status. Rivian and Volkswagen Group Technologies is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com. Candidate Data Privacy Rivian and VW Group Technologies ("Rivian and Volkswagen Group Technologies") may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. 
Rivian and Volkswagen Group Technologies may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law. Rivian and Volkswagen Group Technologies may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian and Volkswagen Group Technologies affiliates; and (iii) Rivian and Volkswagen Group Technologies' service providers, including providers of background checks, staffing services, and cloud services. Rivian and Volkswagen Group Technologies may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions. Please see our Candidate Data Privacy Notice (English) and Candidate Data Privacy Notice (Serbian) for more information. Please note that we are currently not accepting applications from third party application services.
    $146.9k-194.6k yearly 2d ago
  • Senior PostgreSQL DBRE - Reliability at Scale

    Okta, Inc. 4.3 company rating

    Reliability engineer job in San Francisco, CA

    A leading identity management company is seeking a skilled Senior Database Reliability Engineer (DBRE) to optimize and manage their PostgreSQL database environment. The role emphasizes designing resilient data infrastructure, automating key database processes, and collaborating with engineering teams. With a focus on high availability and performance optimization, the ideal candidate will possess extensive experience in high-volume production environments, specifically with PostgreSQL and MySQL. This hybrid position requires in-person onboarding in San Francisco.
    $157k-199k yearly est. 3d ago
  • Reliability/DFX Engineer

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    About the Team OpenAI's Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI's supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI. About the Role We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features. In this role, you will Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware. Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability. 
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams). Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack. Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI's requirements and roadmap. Qualifications BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack. Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred. Detailed understanding of ML chip and platform architecture and ML workload characteristics is required. Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis. About OpenAI OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. 
For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement. Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link. OpenAI Global Applicant Privacy Policy At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
    $127k-176k yearly est. 1d ago
  • Founding SRE Engineer - Reliability & Growth

    Asana 4.6 company rating

    Reliability engineer job in San Francisco, CA

    A leading software company is seeking experienced Software Engineers to join the new Site Reliability Engineering team. This role focuses on building reliable, scalable systems and leading projects across infrastructure. Candidates should have strong software engineering skills and a passion for reliability. The position offers a hybrid work model and generous compensation packages with additional benefits.
    $147k-189k yearly est. 1d ago
  • Senior+ Site Reliability Engineer

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure. About This Role: Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform - and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe's GPU cloud. This role is ideal for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform. You'll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe's incident management practices. What You'll Be Working On: Collaborate with cross-functional teams to define and refine availability metrics for Crusoe's cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs. Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews. Build, operate, and monitor infrastructure health using Crusoe's observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry). Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability. 
Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities. Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness. Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization. Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities. What You'll Bring to the Team: 5+ years of experience in cloud operations, SRE, or related roles; understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems); familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.); experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn; familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible; basic scripting and automation experience (Go, Python, C, C++, or similar); strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders; ability to stay calm, focused, and effective in fast-moving or high-pressure situations; a growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement. Bonus Points: Experience with Kubernetes, container orchestration, or large-scale distributed systems; exposure to change management, operational readiness reviews, or structured RCAs; familiarity with self-healing systems, automated remediation, or event-driven operations; interest in scaling AI/HPC infrastructure and solving reliability challenges in GPU-heavy environments; passion for learning, mentorship, and developing deeper SRE capabilities over time.
Benefits: Industry competitive pay; Restricted Stock Units in a fast-growing, well-funded technology company; health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents; employer contributions to HSA accounts; paid parental leave; paid life insurance, short-term and long-term disability; Teladoc; 401(k) with a 100% match up to 4% of salary; generous paid time off and holiday schedule; cell phone reimbursement; tuition reimbursement; subscription to the Calm app; MetLife Legal; company-paid commuter benefit ($300 per month). Compensation: Compensation will be paid in the range of $172,000 - $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data. Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
    $142k-189k yearly est. 2d ago
  • Senior Technology Site Reliability Engineer

    Cooley LLP 4.8 company rating

    Reliability engineer job in San Francisco, CA

    Senior Technology Site Reliability Engineer
    Locations: San Francisco, New York, Santa Monica, Los Angeles, Palo Alto
    Time type: Full time
    Posted on: Yesterday
    Job requisition ID: Req 4348
    Cooley is seeking a Senior Site Reliability Engineer to join the Infrastructure & Development Operations team. The Senior Technology Site Reliability Engineer (“SRE”) is responsible for ensuring the reliability, scalability, and performance of the firm's critical infrastructure and applications. The SRE blends software engineering with systems engineering to build and maintain automated, resilient, and observable systems that support high availability and operational excellence. In addition to being technically advanced, the SRE will have a high degree of emotional intelligence and the ability to work as a team towards complex and layered objectives. Specific duties and responsibilities include, but are not limited to, the following:
    Position responsibilities:
    * Monitor and maintain production systems to ensure high availability and performance
    * Implement and manage service-level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets
    * Participate in on-call rotations and incident response, including root cause analysis and postmortems
    * Develop and maintain infrastructure as code (IaC) using Terraform
    * Automate deployment, scaling, and recovery processes to reduce manual intervention
    * Partner with DevOps to build and maintain CI/CD pipelines to support safe and efficient software delivery
    * Implement observability solutions using metrics, logs, traces, and alerting systems (Prometheus, Grafana, DataDog, etc.)
    * Proactively identify and resolve system bottlenecks and reliability risks
    * Work closely with Infrastructure, DevOps, Development, and security teams to embed reliability into the development lifecycle
    * Contribute to a culture of blameless post-mortems and continuous improvement
    * Document operational procedures and share knowledge across teams
    * All other duties as assigned or required
    Skills and experience:
    Required:
    * After orientation at Cooley LLP, exhibit proficiency in the Microsoft Office suite, iManage and other firm applications
    * Ability to work extended and/or weekend hours, as required
    * Ability to travel, as required
    * 6+ years direct applicable experience (e.g. site reliability engineering or related field)
    * Proficiency in Terraform and programming languages such as Python, Go, or Java
    * Deep expertise in cloud platforms, particularly AWS, and container orchestration
    * Strong background in distributed systems, performance tuning, and automation
    * Hands-on experience with configuration management tools such as Puppet, Chef, or Salt
    Preferred:
    * Bachelor's Degree in Computer Science, Information Technology, Engineering, or associated discipline
    * Experience working with advanced ETL data workflows including technologies such as AWS EMR, Azure Synapse, Azure Data Factory, or Apache Hive/Spark/Airflow
    * Experience with IaC deployment of AKS/EKS/GKE architecture
    * Experience with enterprise Data Lake environments using technologies such as DataBricks or Snowflake
    Competencies:
    * Expert analytical/quantitative, problem-solving, and deductive reasoning skills, experience performing advanced troubleshooting and root cause analysis of complex technical issues
    * Excellent organizational, planning, and time management skills and ability to work independently and in a team environment to manage competing priorities and meet deadlines
    * Advanced verbal and written communication skills with the ability to present findings, conclusions, alternatives, and information clearly and concisely
    * Experience working with all levels of business professionals, management, stakeholders, and vendors with the ability to build effective relationships through trust and diplomacy
    Cooley offers a competitive compensation and excellent benefits package and is committed to fair and equitable employment practices. EOE. The expected annual pay range for this position with a full-time schedule is $140,000 - $205,000. Please note that final offer amount will be dependent on geographic location, applicable experience and skillset of the candidate. We offer a full range of elective benefits including medical, health savings account (with applicable medical plan), dental, vision, health and/or dependent care flexible spending accounts, pre-tax commuter benefits, life insurance, AD&D, long-term care coverage, backup care for children and/or adults and other parental support benefits. In addition to elective benefit options, benefited employees receive firm-paid life insurance, AD&D, LTD, short term medical benefits as well as 21 days of Paid Time Off (“PTO”) and 10 paid holidays each year. We provide generous parental leave and fertility benefits. New employees will attend a detailed benefit orientation to learn more about our many benefits and resources.
    Welcome to Cooley. We are counselors, strategists and advocates for today's and tomorrow's leaders of the business economy. We seek to meet the evolving needs of our clients by building a community of professionals of the highest caliber who share our vision and embrace our values. Working at Cooley provides an opportunity to work in an environment of collaboration, challenge and reward. We are all part of one firm dedicated to maintaining a diverse workplace that values and celebrates differences, from the way we relate to and support each other, to the way we work together to meet the needs of our clients. It is the unique abilities and perspectives of every individual at Cooley that create a rewarding workplace. For Cooley, this means offering all employees the tools, training and mentoring they need to succeed. It enables every individual to balance work and family obligations. It looks beyond the Firm's four walls, fostering community involvement. It includes becoming leaders and contributors in our communities. Our cooperative spirit is the trademark of the Cooley Culture and every employee in every department is instrumental to the success of the Firm. We invite you to take a look at our open positions.
    $140k-205k yearly 3d ago
  • Low-Voltage Reliability Engineer for EV Electronics

    Rivian 4.1 company rating

    Reliability engineer job in Palo Alto, CA

    A leading electric vehicle manufacturer in Palo Alto is seeking a Design-for-Reliability Engineer to enhance the reliability of low voltage electronics in their vehicles. The role involves monitoring product performance, utilizing statistical methods, and collaborating with manufacturing teams. The ideal candidate has a Bachelor's degree in Engineering and over five years in reliability engineering. Salaries are competitive, ranging from $146,900 to $194,610 based on experience.
    $146.9k-194.6k yearly 4d ago
  • Reliability Engineer: Scale Systems, Observe & Automate

    OpenAI 4.2 company rating

    Reliability engineer job in San Francisco, CA

    A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations. Candidates should have strong cloud proficiency, experience in containerization technologies, and a bachelor's degree in a related field.
    $127k-176k yearly est. 4d ago
  • Senior SRE - AI-Driven Cloud Reliability & Automation

    Crusoe Energy Systems LLC 4.1 company rating

    Reliability engineer job in San Francisco, CA

    A leading energy technology firm seeks a Site Reliability Engineer to enhance its reliable, energy-efficient, AI-optimized cloud platform. In this role, you'll collaborate with cross-functional teams to improve system performance and incident management. Ideal candidates will have a strong background in cloud operations and automation, alongside critical problem-solving skills. Join this innovative team to drive sustainable technology and contribute to a cutting-edge infrastructure focused on operational excellence.
    $142k-189k yearly est. 2d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in Walnut Creek, CA?

The average reliability engineer in Walnut Creek, CA earns between $96,000 and $187,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in Walnut Creek, CA

$134,000

What are the biggest employers of Reliability Engineers in Walnut Creek, CA?

The biggest employers of Reliability Engineers in Walnut Creek, CA are:
  1. Marathon Petroleum