Reliability engineer jobs in Palo Alto, CA

- 1,324 jobs
  • Site Reliability Engineer

    Hamilton Barnes 🌳

    Reliability engineer job in Fremont, CA

    Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure. Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access, as well as supporting their exciting new products coming to market. This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Responsibilities: Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads. Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments. Develop observability, alerting, and auto-healing systems for high-availability GPU workloads. Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes. Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput. Required Skills & Experience: Customer-facing experience and the attitude to be a Swiss army knife! Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management. Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred). Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning. Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks. Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale. Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
    $112k-159k yearly est. 3d ago
  • Site Reliability Engineer

    Ascendion

    Reliability engineer job in San Jose, CA

    Ascendion is a full-service digital engineering solutions company. We make and manage software platforms and products that power growth and deliver captivating experiences to consumers and employees. Our engineering, cloud, data, experience design, and talent solution capabilities accelerate transformation and impact for enterprise clients. Headquartered in New Jersey, our workforce of 6,000+ Ascenders delivers solutions from around the globe. Ascendion is built differently to engineer the next. Ascendion | Engineering to elevate life. We have a culture built on opportunity, inclusion, and a spirit of partnership. Come, change the world with us: Build the coolest tech for the world's leading brands. Solve complex problems and learn new skills. Experience the power of transforming digital engineering for Fortune 500 clients. Master your craft with leading training programs and hands-on experience. Experience a community of change makers! Join a culture of high-performing innovators with endless ideas and a passion for tech. Our culture is the fabric of our company, and it is what makes us unique and diverse. The way we share ideas, learning, experiences, successes, and joy allows everyone to be their best at Ascendion. About the Role: Title: SRE. Location: REMOTE. Qualifications: Identity-related efforts. Knowledge of identity (authentication, authorization, and directory services). Experience with Okta. Experience with Terraform; specifically, must be able to prioritize, plan, build, test, and launch Terraform workflows end-to-end. Proficiency in Python. Recent portfolio projects will be required for review, and interviews will include coding challenges. Experience with CI/CD pipelines. Infrastructure as Code experience. Knowledge of observability (monitoring and alerting). Experience with Kubermetheus Stack (Kubernetes, Prometheus, Loki, Grafana, Alert Manager). Experience with AWS. Salary Range: The salary for this position is between $130,000 and $145,000 annually. Factors which may affect pay within this range may include geography/market, skills, education, experience, and other qualifications of the successful candidate. Benefits: The Company offers the following benefits for this position, subject to applicable eligibility requirements: medical insurance; dental insurance; vision insurance; 401(k) retirement plan; long-term disability insurance; short-term disability insurance; 5 personal days accrued each calendar year (the paid time off benefits meet the paid sick and safe time laws that pertain to the city/state); 10-15 days of paid vacation time; 6 paid holidays and 1 floating holiday per calendar year; Ascendion Learning Management System.
    $130k-145k yearly 19h ago
  • Senior Reliability Engineer

    Aivres

    Reliability engineer job in Fremont, CA

    Aivres is a leading global data center and cloud computing solutions provider committed to delivering innovative technologies that propel the world's leading industries to new frontiers. We deliver and deploy robust, performance-optimized, purpose-built platforms to major data centers around the globe. About the Role: We are seeking a highly motivated and experienced Sr. Reliability Engineer to join our team. You will play a critical role in ensuring the reliability of our server and storage products through comprehensive testing, analysis, and design improvements. This is an exciting opportunity to contribute to the development of cutting-edge technology and make a significant impact on product quality and customer satisfaction. Responsibilities: Lead system reliability verification for new development projects, testing reliability indicators and recommending design improvements to meet customer requirements. Develop and execute project reliability test plans, including building test environments, coordinating with internal and external labs, tracking progress, and verifying bug fixes. Contribute to the creation of reliability test cases, specifications, and outlines, and actively participate in improving test processes. Analyze customer complaints and project issues, identify root causes, and coordinate cross-functional efforts to resolve them. Continuously improve testing capabilities by researching new technologies, sharing industry best practices, and participating in discussions with testing labs. Research new testing technologies and introduce them into our reliability testing. Qualifications: Bachelor's degree or higher in Electronics, Computer Science, Mechanical Engineering, Measurement & Control, or a related field. 2+ years of experience in hardware reliability testing, with experience in reliability design being a strong advantage. In-depth understanding of server and storage architecture, as well as relevant local, international, and industry standards for reliability testing. Strong grasp of reliability testing theory and expertise in writing test cases. Passion for technical research and staying at the forefront of reliability testing technology. Excellent communication and collaboration skills, with a proactive and responsible work ethic to deliver tasks on schedule. Bilingual in both Chinese and English preferred, including reading technical documents and writing emails and reports. Benefits: Competitive salary and benefits package. Opportunity to work on cutting-edge technology and contribute to impactful projects. Collaborative and supportive work environment. Career development and growth opportunities. EEO Statement: Aivres is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Aivres to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.
    $128k-177k yearly est. 19h ago
  • Senior DevOps / Site Reliability Engineer

    Meeruai

    Reliability engineer job in Pleasanton, CA

    We are hiring: Senior DevOps / Site Reliability Engineer (Full-time, Hybrid - Pleasanton, CA). MeeruAI | Pleasanton, California. At MeeruAI, we are building next-generation AI-powered, multi-tenant SaaS platforms designed to scale securely and reliably for enterprise customers. We're at an exciting growth stage and are looking for a Senior DevOps / Site Reliability Engineer to help lay the cloud and reliability foundation for our newest platform while supporting existing products. This is a foundational, high-impact role where you'll define AWS architecture, establish DevOps and SRE best practices, and ensure production-grade reliability as we scale. About the Role: As a Senior DevOps / SRE, you will partner closely with Platform, Backend, Frontend, and AI teams to enable fast, secure deployments and maintain 99.9%+ uptime for a multi-tenant SaaS environment. What You Will Do: Architect and manage AWS infrastructure (EKS, RDS, VPC, IAM, S3). Build and maintain Terraform-based Infrastructure as Code. Own Kubernetes/EKS clusters, including scaling, upgrades, and deployments. Design and optimize CI/CD pipelines (GitHub Actions, Jenkins, GitOps). Implement monitoring, alerting, and observability (Datadog, CloudWatch). Lead incident response, on-call processes, and postmortems. Define and track SLOs, SLIs, and error budgets. Implement security and compliance controls (SOC 2, IAM, encryption). Required Qualifications: 7-10+ years of DevOps / SRE experience in production environments. Deep expertise in AWS and Kubernetes (EKS). Strong experience with Terraform or CloudFormation. Proven ownership of CI/CD, monitoring, and incident management. Experience supporting multi-tenant B2B SaaS platforms. Strong scripting skills (Python or Bash). Security-first mindset with hands-on compliance exposure. Compensation: Salary Range: $150,000 - $170,000 base salary (California market). Work Location: Hybrid - Pleasanton, CA. This role requires on-site presence at least three (3) days per week in our Pleasanton office. Please apply only if you are currently local to the Pleasanton area. Relocation assistance is not available for this role.
    $150k-170k yearly 2d ago
  • Sr. Site Reliability Engineer (SRE)

    Avenue Code (3.5 company rating)

    Reliability engineer job in Mountain View, CA

    About the Opportunity: We're seeking an experienced, highly collaborative SRE to partner with product teams and tackle our most critical infrastructure challenges. You'll be hands-on in designing, building, and operating our cloud platform, and in driving the reliability, performance, and security that empower our engineering organization. Responsibilities: Infrastructure as Code & CI/CD: Automate provisioning and deployments with Terraform and integrate best-practice pipelines (GitHub Actions, ArgoCD, etc.). Reliability Engineering: Define SLIs/SLOs, manage error budgets, and build dashboards & alerts to proactively measure and improve system health. Security & Compliance: Enforce least-privilege IAM policies, automate vulnerability scans, and maintain audit logging for compliance. Monitoring & Observability: Instrument services with metrics, logs, and distributed tracing to enable rapid troubleshooting, and aid teams with alerting, custom metrics, and dashboarding. Incident Management: Own on-call rotations, lead real-time incident response, conduct post-mortems, and drive continuous improvements. Cost Optimization: Implement tagging strategies, right-size resources, and leverage concrete data to decide on optimal methods to control cloud spend at scale. Documentation & Mentorship: Author runbooks, standards, and best-practice guides, and coach dev teams on implementing modern DevOps, reliability, and security patterns. Required Qualifications: Have 5+ years of experience running production critical systems. Deep proficiency with the AWS Cloud and Cloud-Native best practices. Experience with Kubernetes (EKS, GKE) and Container Orchestration at scale. Skilled in Terraform to declaratively provision and maintain infrastructure services. Working knowledge of managing and debugging databases like Redis and Postgres. Strong familiarity with VPC, VPN, Load Balancing, and cloud networking components. Proficiency with Git workflows, branching strategies, and CI/CD system integrations. Solid understanding of web and network protocols and standards (HTTP, REST, TLS, DNS, etc.). Professional proficiency in English (both written and spoken) is required for this role. Nice to Have Skills: Bachelor's degree, or equivalent in Computer Science, Engineering, or a related field. Experience with ArgoCD, GitHub Actions, Jenkins, or other CI/CD pipeline solutions. Working knowledge of Python, Golang, and Helm templating languages. Node.js experience a plus, including running scalable, resilient Node microservices. Grasp of foundational security best practices for cloud infrastructure. Awareness of Terragrunt, managing Terraform state, and optimal project structure. Seasoned in production readiness fundamentals amidst a fast-moving team. Avenue Code reinforces its commitment to privacy and to all the principles guaranteed by the most accurate global data protection laws, such as GDPR, LGPD, CCPA and CPRA. The Candidate data shared with Avenue Code will be kept confidential and will not be transmitted to disinterested third parties, nor will it be used for purposes other than the application for open positions. As a Consultancy company, Avenue Code may share your information with its clients and other Companies from the CompassUol Group to which Avenue Code's consultants are allocated to perform its services.
    $144k-188k yearly est. 4d ago
  • Sustaining Engineer

    Intelliswift-An LTTS Company

    Reliability engineer job in Sunnyvale, CA

    Job Title: Sustaining Engineer. Duration: 12 Months. As a Sustaining Engineer, you will own post-ramp qualifications, second-source qualifications, and quality improvements for devices. You will partner with other electrical engineers during development, take over products as they reach maturity, and support them through end-of-life. You must be a great hands-on engineer and generalist who can wear many hats across EE, RF, Displays, Power, Audio, etc., and have strong interpersonal skills and experience working with cross-functional teams. The ideal candidate will have expertise in electrical and hardware design and have shipped consumer devices. Responsibilities: Take ownership of hardware projects from production ramp through end-of-life. Develop and qualify hardware changes, including new components and design/process changes. Participate in engineering reviews, such as schematic, layout, and DFM, with both cross-functional teams and external engineering partners. Develop system test plans, including test methodology and coverage, to validate changes and ensure product reliability. Create documentation such as Engineering Change Orders and general process documentation. Monitor data and test reports from design validation, production builds, and field returns to identify potential issues. Participate in Failure Analysis. Qualifications: BS in Electrical or Computer Systems Engineering, or equivalent experience. Direct experience in electrical hardware engineering and/or sustaining engineering. Knowledge of the technical rigor and processes required to bring a product to market at scale. Strong interpersonal and communication skills, and the ability to work with cross-functional engineering teams. Experience working with external partners, such as vendors and manufacturers. Experience reading and creating technical documentation such as datasheets, technical drawings, block diagrams, etc. Experience with common lab and test equipment such as oscilloscopes, spectrum analyzers, multimeters, etc. Experience utilizing schematic and layout tools such as Cadence. Preferred Skills: MS in Electrical or Computer Systems Engineering. Experience within the Consumer Electronics industry. Experience with offshore manufacturing partners. Experience with electrical subsystems, such as Power/Battery, Display/Camera, Wireless/RF, Audio, etc.
    $96k-133k yearly est. 2d ago
  • ONLY W2 & LOCAL CANDIDATES :: Quality Engineer in Sunnyvale, CA

    Infotree Global Solutions (4.1 company rating)

    Reliability engineer job in Sunnyvale, CA

    Key Skills: Full Stack Validation & Hands-On Testing, Automation Scripting and Tool Development, Data Collection & Analysis, DevOps Required Experience: Proven experience with full stack validation, test case development, and test strategy creation. Proven experience running data collection protocols and managing datasets. Strong experience with Python (and ideally Bash) scripting for data processing and automation. Data analysis and statistical interpretation skills. Meticulous attention to detail with excellent written and verbal communication.
    $85k-113k yearly est. 3d ago
  • Validation Engineer

    Collabera (4.5 company rating)

    Reliability engineer job in Palo Alto, CA

    Role: Firmware Validation Software Engineer. Type: Contract to Hire. Pay Range: $48-$53/hr. Mission: This role will be part of the supercharger team and will be responsible for testing our EV charger features to ensure the quality and safety of the charging experience for both client owners and third-party EVs. You will architect, design, and implement firmware validation procedures, equipment, tooling, and automation to efficiently test charging components and subsystems. You will work closely with development and integration teams to explore and validate the performance capabilities of our hardware and firmware to ensure code quality is high. Must Haves: Degree in Electrical Engineering, Computer Engineering, or a related technical field, or equivalent experience. Experience in embedded systems validation, firmware testing, or related fields. Hands-on expertise with hardware debugging tools (oscilloscopes, protocol analyzers, etc.). Strong understanding of software development in systems languages (e.g. C, C++, Rust), Linux software architecture, and embedded firmware (e.g. RTOS). Ability to translate complex requirements into scalable test solutions. Day-to-Day: Design and deploy advanced automated test frameworks for embedded Linux and RTOS-based products. Develop software-in-the-loop (SIL) and hardware-in-the-loop (HIL) test systems using tools like oscilloscopes, logic analyzers, and custom automation. Create actionable test reports to track code coverage, regression metrics, and release readiness. Reverse-engineer complex systems to identify edge cases and failure modes. Collaborate with cross-functional teams to refine validation strategies and troubleshoot issues. Drive adoption of best practices for test automation, CI/CD, code robustness, and infrastructure scalability. Pluses: Communication protocols (Ethernet, CAN, RS485, etc.). 
The Company offers the following benefits for this position, subject to applicable eligibility requirements: medical insurance, dental insurance, vision insurance, 401(k) retirement plan, life insurance, long-term disability insurance, short-term disability insurance, paid parking/public transportation, paid time off, paid sick and safe time, hours of paid vacation time, weeks of paid parental leave, and paid holidays annually - as applicable.
    $48-53 hourly 3d ago
  • Business Intelligence Engineer

    Comrise (4.3 company rating)

    Reliability engineer job in Foster City, CA

    Foster City, CA (On-Site) Contract | 6-12 Months | $90-100/hr. About the Role: We're an autonomous mobility company building an on-demand, driverless ride-hailing service, and we're looking for a Business Intelligence Engineer to help power the insights behind our safety, operations, and commercial readiness efforts. In this role, you'll partner closely with data scientists, engineers, and operational leaders to build scalable data models, high-impact dashboards, and reliable metrics that support informed, data-driven decisions. What You'll Do: Partner with technical and non-technical teams to gather requirements and deliver automated, actionable BI solutions. Design, build, and maintain data models, datamarts, and ETL/ELT pipelines. Collaborate with data scientists and engineers to define consistent and trustworthy metrics. Develop dashboards and visualizations that drive operational insights and support leadership decisions. Enable self-service analytics and promote data literacy across the organization. Ensure reporting best practices: data integrity, validation, documentation, and scalability. Translate business needs into well-structured data assets under fast-paced timelines. Ideal Candidate Profile: dbt certification or strong hands-on experience with dbt. Experience with Airflow for workflow orchestration. Strong background in analytics engineering, SQL, and dimensional data modeling. Full-stack BI skill set: ~40-50% dashboarding and ~50-60% backend datamart development. Proven ability to build and maintain datamarts, not just frontend dashboards. Skilled in creating self-serve dashboards and working directly with stakeholders. Must have Looker (not Looker Studio) experience, including LookML modeling. Required Skills: 6+ years of relevant industry experience. Degree or background in Computer Science, Engineering, Applied Math, Statistics, or similar. High proficiency in SQL, dbt, and data modeling. Expertise in Looker and BI best practices. Strong communication and collaboration skills. Interview Process: Coding assessment; 30-minute Zoom interview with Hiring Manager; 1.5-hour technical panel interview.
    $90-100 hourly 1d ago
  • Process Development Engineer

    Solidigm

    Reliability engineer job in San Jose, CA

    San Jose Full-time Department: NAND Development Join a multibillion-dollar global company that brings together amazing technology, people, and operational scale to become a powerhouse in the memory industry. Headquartered in Rancho Cordova, California, Solidigm combines elements of an established, successful technology company with the spirit, agility, and entrepreneurial mindset of a start-up. In addition to the U.S. headquarters and other facilities in the U.S., the company has international presence in Asia, Europe, and the Americas. Solidigm will continue to lead the world in innovating new Memory technologies with aspirations to be the #1 NAND memory company in the world. At Solidigm, we view problems as opportunities to define innovative solutions that hold the power to change the world and unleash the potential technological needs that the future holds. At Solidigm, we are One Team that fosters a diverse, equitable, and inclusive culture that embraces individual uniqueness and empowers us to bring our best selves to deliver excellence in support of Solidigm's vision and mission to be the go-to partner for optimized data storage solutions. You can be part of the takeoff of an innovative business that develops cutting-edge products, delivers strong business value for customers, provides an engaging workplace for its employees, and serves a greater impact on the world. This is a golden opportunity for the right applicant to join us and help design, build, and lead Solidigm. We want a diverse team of dedicated professionals who will not just be Solidigm team members but contribute to how we shape the future of the organization. We are seeking applicants who will grow and thrive in our culture; be customer inspired, trusting, innovative, team-oriented, inclusive, results driven, collaborative, passionate, and flexible. 
    Job Description: As a Process Development Engineer at Solidigm you will be responsible for delivering module capability for the next node on the 3D NAND technology roadmap, partnering with the development team at the factory, as well as engaging with equipment vendors to deliver cost-effective and manufacturable process technology. Key Responsibilities: Lead the development and optimization of processes for 3D NAND semiconductor manufacturing. Collaborate with cross-functional teams to identify and solve process issues and drive yield improvement. Utilize your strong understanding of process principles, equipment, and materials to design and implement innovative solutions as well as drive cost reduction. Conduct experiments and analyze data to make data-driven decisions for process improvements. Monitor and track process performance, identify areas for improvement, and implement corrective actions. Stay updated on industry trends and advancements in dry etch technology and integrate them. Train and mentor junior engineers on processes and techniques. Collaborate with equipment suppliers to ensure the smooth integration of new equipment and processes. Contribute to the development of new process technologies to maintain a competitive edge in the industry. Qualifications: Required Qualifications: A Master's degree in Electrical Engineering, Material Science Engineering, Physics, or similar technical degree. 12+ years of semiconductor industry experience. 7+ years of direct process development experience. Established track record of innovation and results. Preferred Qualification: The ideal candidate will have direct experience in 3D NAND process development. Additional Information: The compensation range for this role is $127,260 - $220,060 USD. Actual compensation is influenced by a variety of factors including but not limited to skills, experience, qualifications, and geographic location.
    $127.3k-220.1k yearly 19h ago
  • AI/ML Engineer - Build Core Intelligence for a New Class of Enterprise AI Products

    Evolution USA

    Reliability engineer job in San Jose, CA

    Our client is a well-funded, product-focused AI startup building next-generation systems that help organizations capture, organize, and leverage their internal knowledge at scale. Their platform blends modern machine learning, intelligent data pipelines, and context-aware models to make teams more effective, faster, and dramatically more aligned. They're growing quickly and looking for engineers with strong CS fundamentals and a passion for building ML systems that operate in the real world, not just in research settings. If you're energized by early-stage ownership and solving hard problems that matter, you'll thrive here. What You'll Work On: Designing and shipping ML components end-to-end: training pipelines, data ingestion, evaluation, optimization, and deployment. Building models and services that learn from unstructured signals (text, docs, conversations, workflows) and surface actionable insights. Transforming research concepts into scalable production systems used by enterprise customers. Architecting efficient, privacy-conscious model-serving infrastructure. Collaborating across engineering, product, and research to rapidly experiment and iterate. Owning technical decisions in a fast-paced environment where craftsmanship and velocity both matter. Who Thrives Here: Engineers with strong computer science foundations (algorithms, systems, distributed computing, data structures). Builders with experience in ML frameworks (PyTorch, TensorFlow, JAX) and modern backend systems. Candidates coming from top CS programs or AI-first startups where they've shipped real, user-facing ML/AI features. People who enjoy working in ambiguous, zero-to-one environments and want significant ownership. Curious thinkers who care about model performance, reliability, and how their work impacts real users. Bonus skills (not required): experience with LLMs, embeddings, personalization models, agentic workflows, or model evaluation at scale. Why This Role Is Compelling: You'll build foundational AI systems at a moment when the company is scaling quickly. Your work will directly shape how customers interact with and benefit from AI. You'll partner closely with experienced founders and senior engineers who value autonomy and deep thinking. You'll have room to experiment, influence architecture, and meaningfully impact technical direction. You'll join early enough to feel the upside, but after core customer traction is already proven. What's On Offer: A high-ownership engineering role with scope to grow into senior or staff pathways. A culture that balances speed with thoughtful engineering. Mission-driven work with clear real-world utility. Competitive compensation, strong equity, and flexibility in how you build. If you're excited by the idea of building intelligent systems that help real teams work smarter, and you want your engineering decisions to matter, we'd love to talk.
    $98k-136k yearly est. 2d ago
  • Sr Manufacturing Process Engineering

    Omega Electronics Manufacturing Services

    Reliability engineer job in San Jose, CA

    The Senior Manufacturing / Process Engineer is responsible for developing, implementing, and improving manufacturing processes for electronic assemblies in a high-mix, low-volume environment. This role supports both the U.S. and Vietnam facilities, ensuring process repeatability, cost efficiency, and product quality across SMT, Through-Hole, and system assembly operations. The Senior Manufacturing / Process Engineer will report directly to the VP of Operations. Goals: Build products and provide services with the highest Flexibility, Productivity, and Quality. Achieve total customer satisfaction through technical excellence and responsive engineering support. Ensure successful NPI launches through cross-functional collaboration, process validation, and data-driven feedback to design and quality teams. Objectives: 1. Support production operations in the following categories: a. Reduce downtime caused by engineering issues (programming, MPI errors, tooling, design, or line stoppage). b. Improve quality yield through root cause analysis, corrective actions, and robust process setup. c. Lead NPI and prototype builds, ensuring process readiness, documentation completeness, and manufacturability validation prior to production release. 2. Provide engineering services to meet customer needs and expectations in the following areas: a. Design for Manufacturability (DFM). b. Manufacturing Process Instruction (MPI) creation and maintenance. c. Engineering Change Order (ECO) implementation. d. Defect Reduction Team (DRT) meetings and follow-up actions. e. Failure analysis and corrective action documentation. f. Develop and validate new or modified processes, including process capability studies, DOE validation, and reflow/wave solder profile optimization. g. Other engineering requests as required by customers or management. Job Description: SMT / Through-Hole / 2nd Ops / 3rd Ops Process Support: Review daily SMT or build schedule to ensure process readiness. 
Confirm all required items are complete and available prior to production: Job package with full build documentation. Manufacturing Process Instruction (MPI) reviewed and approved. Routing definitions for data collection. Validated reflow or wave solder profiles. ECOs, deviations, or special instructions incorporated into the MPI and/or job package. All required tooling available and verified. Review pre-build DFM, document known defects, and hyperlink details in the MPI. Lead cross-functional NPI kickoff meetings to review design requirements, risk areas, and special process considerations. Document and track NPI issues and lessons learned for future builds. Coordinate with Program Managers to resolve DFM showstoppers prior to build. Analyze previous quality data, identify recurring defects, determine root causes, and implement corrective actions. Design, order, and verify all required tooling (stencils, wave solder pallets, press-fit fixtures, conformal coat fixtures, etc.). Maintain tooling logs, labeling, and readiness tracking within Omega Build Readiness. Inspect and sign off first article setups for critical processes (stencil printer, reflow oven, wave solder, etc.) using the First Article Checklist. Inspect initial boards after print and reflow for solder release, bridging, voids, and process anomalies. Document findings and sign off the First Article Report. Provide on-the-floor training for operators and technicians regarding new processes, corrective actions, or observed deficiencies. Support production by promptly responding to technical inquiries or line support issues. Exercise full authority to stop the line if repeated defects or safety concerns are observed. Quality Data Review & Root Cause Analysis Review production data in Omega Data Collection, identifying root causes and corrective actions. Review Daily, Weekly, and Customer Quality Reports to identify trends, recurring issues, or process gaps. 
Provide structured analysis and report findings to Quality and Production (using 8D or equivalent methodology). Document corrective actions and verify implementation during the next production run. Present findings and improvement updates in internal and customer quality meetings. Other Responsibilities: Create and submit Post-Build DFM reports to Program Managers with improvement recommendations. Implement and validate ECO changes per revision control procedures. Perform and document detailed failure analyses for internal and customer returns. Participate in process improvement projects and defect-reduction initiatives. Provide customer-driven engineering services or special support requests. Develop and deliver internal technical training for operators and peers. Support ISO 9001 and AS9100 activities, including audits, documentation, and Work Instruction updates. Qualifications: Bachelor's degree in Manufacturing, Industrial, or Mechanical Engineering (or related discipline). 8-12 years of hands-on experience in electronics manufacturing (PCBA, box-build, system integration). Deep understanding of SMT, Through-Hole, and system assembly processes. Proficient in process validation, FAI, SPC, DOE, and yield improvement. Familiarity with FactoryLogix and related MES/ERP systems. Experience leading NPI builds and developing new assembly processes from prototype through production release. Familiarity with DFM/DFT analysis tools and PCB CAD systems (e.g., Altium, Valor, Mentor). Experience with Lean, Six Sigma, and structured problem-solving tools. Strong communication and analytical skills with the ability to multitask in a fast-paced environment. U.S. Citizen or Permanent Resident (ITAR requirement). Compensation: $120-$150K Annually Benefits: Medical Dental Vision 401K + Roth 401K Vacation Paid Holidays
    $120k-150k yearly 1d ago
  • Lead WMS Functional Quality Engineer

    Gspann Technologies, Inc. 3.4 company rating

    Reliability engineer job in San Ramon, CA

    About GSPANN Headquartered in California, U.S.A., GSPANN provides consulting and IT services to global clients. We help clients transform how they deliver business value by helping them optimize their IT capabilities, practices, and operations with our experience in retail, high-technology, and manufacturing. With five global delivery centers and 1900+ employees, we provide the intimacy of a boutique consultancy with the capabilities of a large IT services firm. Title: Lead WMS Functional Quality Engineer Location: San Ramon, CA (4 days/week) Job Type: Contractual Duration: Long Term 15+ years of total IT experience and expertise in technical configuration of WMS products, Solution Integration, Quality Assurance, System Testing, PLSQL, SQL Server, and Performance Tuning. Must have Retail/Ecommerce ERP experience, such as Manhattan, Blue Yonder, or Fluent. 10+ years of experience in progressive web and mobile app technology and the Retail Ecommerce domain. Comfort and experience working in an Agile SCRUM and test-driven environment. Ensure that the WMS is correctly set up, tested, and commissioned within the warehouse operations. Must have working experience in the Retail domain with Ecommerce platforms, OMS, WMS, supply chain, and inventory. Exposure to and understanding of modern web standards and technology. Coordinate changes and reconfiguration of the WMS effectively to ensure that day-to-day business operations are not adversely affected during the transition. Gather and understand process requirements for effective process management via the WMS. Conduct system testing prior to any process change go-live. Ensure all relevant and required test cases are executed, documented with results, and implemented in a controlled manner. Draft user acceptance testing (UAT) cases and prepare the system for UAT. Conduct UAT as required. Working at GSPANN GSPANN is a diverse, prosperous, and rewarding place to work. 
We provide competitive benefits, educational assistance, and career growth opportunities to our employees. Every employee is valued for their talent and contribution. Working with us will give you an opportunity to work globally with some of the best brands in the industry. The company does and will take affirmative action to employ and advance in the employment of individuals with disabilities and protected veterans and to treat qualified individuals without discrimination based on their physical or mental disability status. GSPANN is an equal opportunity employer for minorities/females/veterans/disabled.
    $108k-144k yearly est. 1d ago
  • Manufacturing Engineer

    HCLTech

    Reliability engineer job in Milpitas, CA

    HCLTech is looking for a highly talented and self-motivated Manufacturing Engineer to join it in advancing the technological world through innovation and creativity. Job Title: Manufacturing Engineer Position Type: Full-time Location: Milpitas, CA Responsibilities: Manufacturing Process & Documentation Manage general manufacturing processes including BOM management, assembly procedures, tools, and drawings. Maintain accurate documentation and ensure compliance with established standards. Production Support Provide day-to-day manufacturing production support. Drive process improvements, conduct root cause investigations, and perform failure analysis. Redline and update manufacturing instructions and test methods as needed. Change Management Execute change order processes with proper reasoning and justification. Manage change requests requiring impact assessments, coordinate meetings with cross-functional teams, and ensure timely closure. Validation & Qualification Prepare IQ/OQ/PQ protocols in line with standard procedures. Organize review meetings with the validation board and ensure report completion and closure. Conduct periodic reviews of validation plans to align with current state. Risk & Compliance Demonstrate proficiency in PFMEA and risk assessments at both process and product levels. Ensure adherence to IEC and EMC standards and medical regulatory requirements. Technical Expertise Interpret and create/edit electrical schematics and drawings. Collaborate with Quality teams regularly. Perform NCMR disposition per SOP guidelines. Troubleshoot component-level issues related to PCBA, electrical designs, test fixtures, tools, and equipment. Build electro-mechanical test fixtures/equipment/tools based on specifications. Perform preventive maintenance of equipment and fixtures periodically. Manage End-of-Life (EOL) and component obsolescence. Audit & Compliance Support audits, address observations, and implement corrective actions. 
Own NC and CAPA processes and deliver on time. Design & Development Work with CAD models and PCB layouts. Write technical reports for component evaluations and engineering studies. Systems & Tools Proficient in SAP and Agile systems. Cross-Functional Collaboration Communicate effectively with cross-functional teams, buyers, planners, suppliers, and calibration coordinators. Provide project updates to senior management and take ownership of deliverables. Work closely with technicians to collect data for continuous improvement. Pay and Benefits Pay Range Minimum: $71000 per year Pay Range Maximum: $108000 per year HCLTech is an equal opportunity employer, committed to providing equal employment opportunities to all applicants and employees regardless of race, religion, sex, color, age, national origin, pregnancy, sexual orientation, physical disability or genetic information, military or veteran status, or any other protected classification, in accordance with federal, state, and/or local law. Should any applicant have concerns about discrimination in the hiring process, they should provide a detailed report of those concerns to ****************** for investigation. Compensation and Benefits A candidate's pay within the range will depend on their work location, skills, experience, education, and other factors permitted by law. This role may also be eligible for performance-based bonuses subject to company policies. In addition, this role is eligible for the following benefits subject to company policies: medical, dental, vision, pharmacy, life, accidental death & dismemberment, and disability insurance; employee assistance program; 401(k) retirement plan; 10 days of paid time off per year (some positions are eligible for need-based leave with no designated number of leave days per year); and 10 paid holidays per year. How You'll Grow At HCLTech, we offer continuous opportunities for you to find your spark and grow with us. 
We want you to be happy and satisfied with your role and to really learn what type of work sparks your brilliance the best. Throughout your time with us, we offer transparent communication with senior-level employees, learning and career development programs at every level, and opportunities to experiment in different roles or even pivot industries. We believe that you should be in control of your career with unlimited opportunities to find the role that fits you best.
    $71k-108k yearly 4d ago
  • Lab Operations Continuous Improvement Engineer

    Applicantz

    Reliability engineer job in Fremont, CA

    This is a 100% onsite role in Fremont, CA 94538. The Lab Operations team is expanding and transforming our labs and services to meet the new business needs and to continue to provide a safe, customer-trusted and efficient lab environment. This position will work on lab improvement project(s) in a Semiconductor Equipment Research and Development environment to achieve lab operational business goals and serve lab user communities. This role will apply 6S, PSDM (Problem Solving & Decision Making) and project management methodologies to develop or improve the prioritization system, execution efficiency and service quality for lab user communities. Top skills: Continuous Improvement expertise using 5S/6S, Lean and Six Sigma methodologies Strong project ownership with cross-functional execution capability Data-driven analytical problem solving with KPI and performance metrics focus Requirements: 5S or 6-sigma methodology knowledge required Project management skill required Fast learner and highly motivated individual Team player with good communication skills Good presentation, analytical and problem solving skills Able to lift ~25 lb and stand/walk/work in the clean room lab BS Industrial engineer major with 3+ years of lab/manufacturing operation experience Primary Responsibilities: Apply industrial engineering knowledge to determine improvement opportunities and solutions Be project owner to develop, implement, drive continuous improvement of lab process and system in a cross-functional environment Coordinate testing and collect lab user feedback on system usability and performance Analyze data and create metrics for lab process or system performance Sustain lab operation systems for daily operation including answering questions, resolving issues, performing routine health checks and continuous improvement Present projects in lab operation or cross-functional meetings Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of 
semiconductor processing equipment.
    $82k-113k yearly est. 19h ago
  • Site Reliability Engineer

    Hamilton Barnes 🌳

    Reliability engineer job in San Mateo, CA

    Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access, as well as supporting their extremely exciting new products coming to the market! This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Responsibilities Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads. Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments. Develop observability, alerting, and auto-healing systems for high-availability GPU workloads. Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes. Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput. Required Skills & Experience Customer-facing experience and the attitude to be a Swiss army knife! Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management. Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred). 
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning. Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks. Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale. Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
    $112k-160k yearly est. 3d ago
  • AI/ML Engineer - Build Core Intelligence for a New Class of Enterprise AI Products

    Evolution USA

    Reliability engineer job in Fremont, CA

    Our client is a well-funded, product-focused AI startup building next-generation systems that help organizations capture, organize, and leverage their internal knowledge at scale. Their platform blends modern machine learning, intelligent data pipelines, and context-aware models to make teams more effective, faster, and dramatically more aligned. They're growing quickly and looking for engineers with strong CS fundamentals and a passion for building ML systems that operate in the real world, not just in research settings. If you're energized by early-stage ownership and solving hard problems that matter, you'll thrive here. What You'll Work On Designing and shipping ML components end-to-end: training pipelines, data ingestion, evaluation, optimization, and deployment Building models and services that learn from unstructured signals (text, docs, conversations, workflows) and surface actionable insights Transforming research concepts into scalable production systems used by enterprise customers Architecting efficient, privacy-conscious model-serving infrastructure Collaborating across engineering, product, and research to rapidly experiment and iterate Owning technical decisions in a fast-paced environment where craftsmanship and velocity both matter Who Thrives Here Engineers with strong computer science foundations (algorithms, systems, distributed computing, data structures) Builders with experience in ML frameworks (PyTorch, TensorFlow, JAX) and modern backend systems Candidates coming from top CS programs or AI-first startups where they've shipped real, user-facing ML/AI features People who enjoy working in ambiguous, zero-to-one environments and want significant ownership Curious thinkers who care about model performance, reliability, and how their work impacts real users Bonus skills (not required): experience with LLMs, embeddings, personalization models, agentic workflows, or model evaluation at scale Why This Role Is Compelling You'll build foundational 
AI systems at a moment where the company is scaling quickly Your work will directly shape how customers interact with and benefit from AI You'll partner closely with experienced founders and senior engineers who value autonomy and deep thinking You'll have room to experiment, influence architecture, and meaningfully impact technical direction You'll join early enough to feel the upside, but after core customer traction is already proven What's On Offer A high-ownership engineering role with scope to grow into senior or staff pathways A culture that balances speed with thoughtful engineering Mission-driven work with clear real-world utility Competitive compensation, strong equity, and flexibility in how you build If you're excited by the idea of building intelligent systems that help real teams work smarter, and you want your engineering decisions to matter, we'd love to talk.
    $98k-136k yearly est. 2d ago
  • Industrial Engineer

    Applicantz

    Reliability engineer job in Fremont, CA

    This is a 100% onsite role in Fremont, CA 94538. The Lab Operations team is expanding and transforming our labs and services to meet the new business needs and to continue to provide a safe, customer-trusted and efficient lab environment. This position will work on lab improvement project(s) in a Semiconductor Equipment Research and Development environment to achieve lab operational business goals and serve lab user communities. This role will apply 6S, PSDM (Problem Solving & Decision Making) and project management methodologies to develop or improve the prioritization system, execution efficiency and service quality for lab user communities. Top skills: Continuous Improvement expertise using 5S/6S, Lean and Six Sigma methodologies Strong project ownership with cross-functional execution capability Data-driven analytical problem solving with KPI and performance metrics focus Requirements: 5S or 6-sigma methodology knowledge required Project management skill required Fast learner and highly motivated individual Team player with good communication skills Good presentation, analytical and problem solving skills Able to lift ~25 lb and stand/walk/work in the clean room lab BS Industrial engineer major with 3+ years of lab/manufacturing operation experience Primary Responsibilities: Apply industrial engineering knowledge to determine improvement opportunities and solutions Be project owner to develop, implement, drive continuous improvement of lab process and system in a cross-functional environment Coordinate testing and collect lab user feedback on system usability and performance Analyze data and create metrics for lab process or system performance Sustain lab operation systems for daily operation including answering questions, resolving issues, performing routine health checks and continuous improvement Present projects in lab operation or cross-functional meetings Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of 
semiconductor processing equipment.
    $83k-110k yearly est. 19h ago
  • Site Reliability Engineer

    Hamilton Barnes 🌳

    Reliability engineer job in San Jose, CA

    Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access, as well as supporting their extremely exciting new products coming to the market! This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Responsibilities Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads. Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments. Develop observability, alerting, and auto-healing systems for high-availability GPU workloads. Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes. Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput. Required Skills & Experience Customer-facing experience and the attitude to be a Swiss army knife! Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management. Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred). 
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning. Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks. Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale. Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
    $112k-159k yearly est. 3d ago
  • AI/ML Engineer - Build Core Intelligence for a New Class of Enterprise AI Products

    Evolution USA

    Reliability engineer job in San Francisco, CA

    Our client is a well-funded, product-focused AI startup building next-generation systems that help organizations capture, organize, and leverage their internal knowledge at scale. Their platform blends modern machine learning, intelligent data pipelines, and context-aware models to make teams more effective, faster, and dramatically more aligned. They're growing quickly and looking for engineers with strong CS fundamentals and a passion for building ML systems that operate in the real world, not just in research settings. If you're energized by early-stage ownership and solving hard problems that matter, you'll thrive here. What You'll Work On Designing and shipping ML components end-to-end: training pipelines, data ingestion, evaluation, optimization, and deployment Building models and services that learn from unstructured signals (text, docs, conversations, workflows) and surface actionable insights Transforming research concepts into scalable production systems used by enterprise customers Architecting efficient, privacy-conscious model-serving infrastructure Collaborating across engineering, product, and research to rapidly experiment and iterate Owning technical decisions in a fast-paced environment where craftsmanship and velocity both matter Who Thrives Here Engineers with strong computer science foundations (algorithms, systems, distributed computing, data structures) Builders with experience in ML frameworks (PyTorch, TensorFlow, JAX) and modern backend systems Candidates coming from top CS programs or AI-first startups where they've shipped real, user-facing ML/AI features People who enjoy working in ambiguous, zero-to-one environments and want significant ownership Curious thinkers who care about model performance, reliability, and how their work impacts real users Bonus skills (not required): experience with LLMs, embeddings, personalization models, agentic workflows, or model evaluation at scale Why This Role Is Compelling You'll build foundational 
AI systems at a moment where the company is scaling quickly Your work will directly shape how customers interact with and benefit from AI You'll partner closely with experienced founders and senior engineers who value autonomy and deep thinking You'll have room to experiment, influence architecture, and meaningfully impact technical direction You'll join early enough to feel the upside, but after core customer traction is already proven What's On Offer A high-ownership engineering role with scope to grow into senior or staff pathways A culture that balances speed with thoughtful engineering Mission-driven work with clear real-world utility Competitive compensation, strong equity, and flexibility in how you build If you're excited by the idea of building intelligent systems that help real teams work smarter, and you want your engineering decisions to matter, we'd love to talk.
    $98k-136k yearly est. 2d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in Palo Alto, CA?

The average reliability engineer in Palo Alto, CA earns between $96,000 and $187,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in Palo Alto, CA

$134,000

What are the biggest employers of Reliability Engineers in Palo Alto, CA?

The biggest employers of Reliability Engineers in Palo Alto, CA are:
  1. Tesla
  2. Google
  3. Fortinet
  4. xAI
  5. Cerebras
  6. Alibaba Group Ltd.
  7. JPMorgan Chase & Co.
  8. Luma Ai, Inc.
  9. Robust.Ai
  10. Freelance Computer Services