Site Reliability Engineer
Reliability engineer job in Santa Clara, CA
Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure
Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You'll also support their exciting new products as they come to market.
This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.
If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.
Responsibilities
Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Required Skills & Experience
Customer-facing experience and the attitude to be a Swiss Army knife!
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Site Reliability Engineer
Reliability engineer job in San Jose, CA
Ascendion is a full-service digital engineering solutions company. We make and manage software platforms and products that power growth and deliver captivating experiences to consumers and employees. Our engineering, cloud, data, experience design, and talent solution capabilities accelerate transformation and impact for enterprise clients. Headquartered in New Jersey, our workforce of 6,000+ Ascenders delivers solutions from around the globe. Ascendion is built differently to engineer the next.
Ascendion | Engineering to elevate life
We have a culture built on opportunity, inclusion, and a spirit of partnership. Come, change the world with us:
Build the coolest tech for world's leading brands
Solve complex problems - and learn new skills
Experience the power of transforming digital engineering for Fortune 500 clients
Master your craft with leading training programs and hands-on experience
Experience a community of change makers!
Join a culture of high-performing innovators with endless ideas and a passion for tech. Our culture is the fabric of our company, and it is what makes us unique and diverse. The way we share ideas, learning, experiences, successes, and joy allows everyone to be their best at Ascendion.
About the Role:
Title: SRE
Location: REMOTE
Qualifications
Identity-Related Efforts
Knowledge of identity - authentication, authorization, and directory services
Experience with Okta
Experience with Terraform
Specifically, must be able to prioritize, plan, build, test, and launch Terraform workflows from end-to-end
Proficiency in Python
Recent portfolio projects will be required for review and interviews will include coding challenges
Experience with CI/CD pipelines
Infrastructure as Code experience
Knowledge of observability - monitoring and alerting
Experience with the kube-prometheus stack (Kubernetes, Prometheus, Loki, Grafana, Alertmanager)
Experience with AWS
Salary Range: The salary for this position is between $130,000 and $145,000 annually. Factors that may affect pay within this range include geography/market, skills, education, experience, and other qualifications of the successful candidate.
Benefits: The Company offers the following benefits for this position, subject to applicable eligibility requirements: [medical insurance] [dental insurance] [vision insurance] [401(k) retirement plan] [long-term disability insurance] [short-term disability insurance] [5 personal days accrued each calendar year. The paid time off benefits meet the paid sick and safe time laws that pertain to the City/State] [10-15 days of paid vacation time] [6 paid holidays and 1 floating holiday per calendar year] [Ascendion Learning Management System]
Sr. Site Reliability Engineer (SRE)
Reliability engineer job in Mountain View, CA
About the Opportunity:
We're seeking an experienced, highly collaborative SRE to partner with product teams and tackle our most critical infrastructure challenges. You'll be hands-on in designing, building, and operating our cloud platform, and in driving the reliability, performance, and security that empower our engineering organization.
Responsibilities:
Infrastructure as Code & CI/CD: Automate provisioning and deployments with Terraform and integrate best-practice pipelines (GitHub Actions, ArgoCD, etc.).
Reliability Engineering: Define SLIs/SLOs, manage error budgets, and build dashboards & alerts to proactively measure and improve system health.
Security & Compliance: Enforce least-privilege IAM policies, automate vulnerability scans, and maintain audit logging for compliance.
Monitoring & Observability: Instrument services with metrics, logs, and distributed tracing to enable rapid troubleshooting, and help teams with alerting, custom metrics, and dashboarding.
Incident Management: Own on-call rotations, lead real-time incident response, conduct post-mortems, and drive continuous improvements.
Cost Optimization: Implement tagging strategies, right-size resources, and leverage concrete data to decide on optimal methods to control cloud spend at scale.
Documentation & Mentorship: Author runbooks, standards, and best-practice guides, and coach dev teams on implementing modern DevOps, reliability, and security patterns.
Required Qualifications:
Have 5+ years of experience running production critical systems.
Deep proficiency with the AWS Cloud and Cloud-Native best practices.
Experience with Kubernetes (EKS, GKE) and Container Orchestration at scale.
Skilled in Terraform to declaratively provision and maintain infrastructure services.
Working knowledge of managing and debugging databases like Redis and Postgres.
Strong familiarity with VPC, VPN, Load Balancing, and cloud networking components.
Proficiency with Git workflows, branching strategies, and CI/CD system integrations.
Solid understanding of web and network protocols and standards (HTTP, REST, TLS, DNS, etc.).
Professional proficiency in English (both written and spoken) is required for this role.
Nice to Have Skills:
Bachelor's degree, or equivalent in Computer Science, Engineering, or a related field.
Experience with ArgoCD, GitHub Actions, Jenkins, or other CI/CD pipeline solutions.
Working knowledge of Python, Golang, and Helm templating languages.
Node.js experience a plus, including running scalable, resilient Node microservices.
Grasp of foundational security best practices for cloud infrastructure.
Awareness of Terragrunt, managing Terraform state, and optimal project structure.
Seasoned in production readiness fundamentals amidst a fast-moving team.
Avenue Code reinforces its commitment to privacy and to all the principles guaranteed by the most accurate global data protection laws, such as GDPR, LGPD, CCPA and CPRA. The Candidate data shared with Avenue Code will be kept confidential and will not be transmitted to disinterested third parties, nor will it be used for purposes other than the application for open positions. As a Consultancy company, Avenue Code may share your information with its clients and other Companies from the CompassUol Group to which Avenue Code's consultants are allocated to perform its services.
Senior DevOps / Site Reliability Engineer
Reliability engineer job in Pleasanton, CA
We Are Hiring: Senior DevOps / Site Reliability Engineer (Full-time, Hybrid - Pleasanton, CA)
MeeruAI | Pleasanton, California
At MeeruAI, we are building next-generation AI-powered, multi-tenant SaaS platforms designed to scale securely and reliably for enterprise customers. We're at an exciting growth stage and are looking for a Senior DevOps / Site Reliability Engineer to help lay the cloud and reliability foundation for our newest platform while supporting existing products.
This is a foundational, high-impact role where you'll define AWS architecture, establish DevOps and SRE best practices, and ensure production-grade reliability as we scale.
About the Role
As a Senior DevOps / SRE, you will partner closely with Platform, Backend, Frontend, and AI teams to enable fast, secure deployments and maintain 99.9%+ uptime for a multi-tenant SaaS environment.
What You Will Do
Architect and manage AWS infrastructure (EKS, RDS, VPC, IAM, S3)
Build and maintain Terraform-based Infrastructure as Code
Own Kubernetes/EKS clusters, including scaling, upgrades, and deployments
Design and optimize CI/CD pipelines (GitHub Actions, Jenkins, GitOps)
Implement monitoring, alerting, and observability (Datadog, CloudWatch)
Lead incident response, on-call processes, and postmortems
Define and track SLOs, SLIs, and error budgets
Implement security and compliance controls (SOC 2, IAM, encryption)
Required Qualifications
7-10+ years of DevOps / SRE experience in production environments
Deep expertise in AWS and Kubernetes (EKS)
Strong experience with Terraform or CloudFormation
Proven ownership of CI/CD, monitoring, and incident management
Experience supporting multi-tenant B2B SaaS platforms
Strong scripting skills (Python or Bash)
Security-first mindset with hands-on compliance exposure
Compensation
Salary Range: $150,000 - $170,000 base salary (California market)
Work Location: Hybrid - Pleasanton, CA
This role requires on-site presence at least three (3) days per week in our Pleasanton office.
Please apply only if you are currently local to the Pleasanton area. Relocation assistance is not available for this role.
Business Intelligence Engineer
Reliability engineer job in Foster City, CA
Foster City, CA (On-Site)
Contract | 6-12 Months | $90-100/hr
About the Role
We're an autonomous mobility company building an on-demand, driverless ride-hailing service, and we're looking for a Business Intelligence Engineer to help power the insights behind our safety, operations, and commercial readiness efforts.
In this role, you'll partner closely with data scientists, engineers, and operational leaders to build scalable data models, high-impact dashboards, and reliable metrics that support informed, data-driven decisions.
What You'll Do
Partner with technical and non-technical teams to gather requirements and deliver automated, actionable BI solutions.
Design, build, and maintain data models, datamarts, and ETL/ELT pipelines.
Collaborate with data scientists and engineers to define consistent and trustworthy metrics.
Develop dashboards and visualizations that drive operational insights and support leadership decisions.
Enable self-service analytics and promote data literacy across the organization.
Ensure reporting best practices: data integrity, validation, documentation, and scalability.
Translate business needs into well-structured data assets under fast-paced timelines.
Ideal Candidate Profile
dbt certification or strong hands-on experience with dbt.
Experience with Airflow for workflow orchestration.
Strong background in analytics engineering, SQL, and dimensional data modeling.
Full-stack BI skill set: ~40-50% dashboarding and ~50-60% backend datamart development.
Proven ability to build and maintain datamarts, not just frontend dashboards.
Skilled in creating self-serve dashboards and working directly with stakeholders.
Must have Looker (not Looker Studio) experience, including LookML modeling.
Required Skills
6+ years of relevant industry experience.
Degree or background in Computer Science, Engineering, Applied Math, Statistics, or similar.
High proficiency in SQL, dbt, and data modeling.
Expertise in Looker and BI best practices.
Strong communication and collaboration skills.
Interview Process
Coding Assessment
30-minute Zoom interview with Hiring Manager
1.5-hour technical panel interview
Process Development Engineer
Reliability engineer job in San Jose, CA
San Jose
Full-time
Department: NAND Development
Join a multibillion-dollar global company that brings together amazing technology, people, and operational scale to become a powerhouse in the memory industry. Headquartered in Rancho Cordova, California, Solidigm combines elements of an established, successful technology company with the spirit, agility, and entrepreneurial mindset of a start-up. In addition to the U.S. headquarters and other facilities in the U.S., the company has international presence in Asia, Europe, and the Americas. Solidigm will continue to lead the world in innovating new Memory technologies with aspirations to be the #1 NAND memory company in the world. At Solidigm, we view problems as opportunities to define innovative solutions that hold the power to change the world and unleash the potential technological needs that the future holds. At Solidigm, we are One Team that fosters a diverse, equitable, and inclusive culture that embraces individual uniqueness and empowers us to bring our best selves to deliver excellence in support of Solidigm's vision and mission to be the go-to partner for optimized data storage solutions. You can be part of the takeoff of an innovative business that develops cutting-edge products, delivers strong business value for customers, provides an engaging workplace for its employees, and serves a greater impact on the world. This is a golden opportunity for the right applicant to join us and help design, build, and lead Solidigm. We want a diverse team of dedicated professionals who will not just be Solidigm team members but contribute to how we shape the future of the organization. We are seeking applicants who will grow and thrive in our culture; be customer inspired, trusting, innovative, team-oriented, inclusive, results driven, collaborative, passionate, and flexible.
Job Description
As a Process Development Engineer at Solidigm you will be responsible for delivering module capability for the next node on the 3D NAND technology roadmap, partnering with the development team at the factory, as well as engaging with equipment vendors to deliver cost-effective and manufacturable process technology.
As a Process Development Principal Engineer at Solidigm you will be responsible for delivering module capability for the next node on the 3D NAND technology roadmap, partnering with the development team at the factory, as well as engaging with equipment vendors to deliver cost-effective and manufacturable process technology.
Key Responsibilities
Lead the development and optimization of processes for 3D NAND semiconductor manufacturing
Collaborate with cross-functional teams to identify and solve process issues and drive yield improvement
Utilize your strong understanding of process principles, equipment, and materials to design and implement innovative solutions as well as driving cost reduction
Conduct experiments and analyze data to make data-driven decisions for process improvements
Monitor and track process performance, identify areas for improvement, and implement corrective actions
Stay updated on industry trends and advancements in dry etch technology and integrate them
Train and mentor junior engineers on processes and techniques
Collaborate with equipment suppliers to ensure the smooth integration of new equipment and processes
Contribute to the development of new process technologies to maintain competitive edge in the industry
Qualifications
Required Qualifications:
A Master's degree in Electrical Engineering, Material Science Engineering, Physics, or similar technical degree.
12+ years of semiconductor industry experience
7+ years of direct process development experience
Established track record of innovation and results
Preferred Qualification
The ideal candidate will have direct experience in 3D NAND process development.
Additional Information
The compensation range for this role is $127,260 - $220,060 USD. Actual compensation is influenced by a variety of factors including but not limited to skills, experience, qualifications, and geographic location.
Lead WMS Functional Quality Engineer
Reliability engineer job in San Ramon, CA
About GSPANN
Headquartered in California, U.S.A., GSPANN provides consulting and IT services to global clients. We help clients transform how they deliver business value by helping them optimize their IT capabilities, practices, and operations with our experience in retail, high-technology, and manufacturing. With five global delivery centers and 1900+ employees, we provide the intimacy of a boutique consultancy with the capabilities of a large IT services firm.
Title: Lead WMS Functional Quality Engineer
Location: San Ramon, CA (4 days/week)
Job Type: Contractual
Duration: Long Term
15+ years of total IT Experience and expertise in technical configuration of WMS product, Solution Integration, Quality Assurance, System Testing, PLSQL, SQL Server, and Performance Tuning.
Must have Retail/Ecommerce ERP experience, such as Manhattan, Blue Yonder, or Fluent.
10+ years of experience in progressive web and mobile apps technology and Retail Ecommerce domain.
Comfort and experience working in an Agile SCRUM and Test-driven environment.
Ensure that the WMS is correctly set up, tested, and commissioned within the warehouse operations.
Must have working experience in the Retail domain with Ecommerce platforms, OMS, WMS, supply chain, and inventory.
Exposure to and understanding of modern web standards and technology.
Coordinate changes and reconfiguration of the WMS effectively to ensure that day-to-day business operations are not adversely affected during the transition.
Gather and understand process requirements for effective process management via the WMS.
Conduct system testing prior to any process change go-live.
Ensure all relevant and required test cases are executed, documented with results, and implemented in a controlled manner.
Draft user acceptance testing (UAT) cases and prepare the system for UAT.
Conduct UAT as required.
Working at GSPANN
GSPANN is a diverse, prosperous, and rewarding place to work. We provide competitive benefits, educational assistance, and career growth opportunities to our employees. Every employee is valued for their talent and contribution. Working with us will give you an opportunity to work globally with some of the best brands in the industry.
The company does and will take affirmative action to employ and advance in the employment of individuals with disabilities and protected veterans and to treat qualified individuals without discrimination based on their physical or mental disability status. GSPANN is an equal opportunity employer for minorities/females/veterans/disabled.
AI/ML Engineer - Build Core Intelligence for a New Class of Enterprise AI Products
Reliability engineer job in San Francisco, CA
Our client is a well-funded, product-focused AI startup building next-generation systems that help organizations capture, organize, and leverage their internal knowledge at scale. Their platform blends modern machine learning, intelligent data pipelines, and context-aware models to make teams more effective, faster, and dramatically more aligned.
They're growing quickly and looking for engineers with strong CS fundamentals and a passion for building ML systems that operate in the real world, not just in research settings. If you're energized by early-stage ownership and solving hard problems that matter, you'll thrive here.
What You'll Work On
Designing and shipping ML components end-to-end: training pipelines, data ingestion, evaluation, optimization, and deployment
Building models and services that learn from unstructured signals (text, docs, conversations, workflows) and surface actionable insights
Transforming research concepts into scalable production systems used by enterprise customers
Architecting efficient, privacy-conscious model-serving infrastructure
Collaborating across engineering, product, and research to rapidly experiment and iterate
Owning technical decisions in a fast-paced environment where craftsmanship and velocity both matter
Who Thrives Here
Engineers with strong computer science foundations (algorithms, systems, distributed computing, data structures)
Builders with experience in ML frameworks (PyTorch, TensorFlow, JAX) and modern backend systems
Candidates coming from top CS programs or AI-first startups where they've shipped real, user-facing ML/AI features
People who enjoy working in ambiguous, zero-to-one environments and want significant ownership
Curious thinkers who care about model performance, reliability, and how their work impacts real users
Bonus skills (not required): experience with LLMs, embeddings, personalization models, agentic workflows, or model evaluation at scale
Why This Role Is Compelling
You'll build foundational AI systems at a moment where the company is scaling quickly
Your work will directly shape how customers interact with and benefit from AI
You'll partner closely with experienced founders and senior engineers who value autonomy and deep thinking
You'll have room to experiment, influence architecture, and meaningfully impact technical direction
You'll join early enough to feel the upside, but after core customer traction is already proven
What's On Offer
A high-ownership engineering role with scope to grow into senior or staff pathways
A culture that balances speed with thoughtful engineering
Mission-driven work with clear real-world utility
Competitive compensation, strong equity, and flexibility in how you build
If you're excited by the idea of building intelligent systems that help real teams work smarter, and you want your engineering decisions to matter, we'd love to talk.
Materials Science AI Engineer
Reliability engineer job in Santa Clara, CA
Role: Materials Science AI Engineer
We are seeking an AI Scientist/Engineer to join our team in developing and supporting materials discovery and design. The ideal candidate will have strong experience building AI-based solutions, including neural network architectures, attention mechanisms, multi-modal learning, aggregating and structuring training data, statistical theory, and cloud-based compute for parallelized, scalable, and automated workflows.
Key Responsibilities
• Design, develop and deploy multi-modal AI, ML, and hybrid physical-based models to solve ground-breaking material physics and design problems.
• Aggregate, process, transform and quality-control experimental and simulation data for modeling and analysis.
• Design, develop, and maintain data workflows to support materials informatics initiatives. Optimize data pipelines and model execution on parallel cloud systems (e.g., Azure, GCP, AWS).
• Collaborate with materials scientists, chemists, and software engineers to integrate analytics and predictive modeling into core R&D workflows.
• Document code, workflows, and best practices to support reproducible research.
• Apply AI and data analytics to optimize material synthesis and processing parameters in real time, minimizing defects and improving consistency.
• Build materials-informatics pipelines combining DFT/MD simulations, high-throughput experiments, and fab/metrology data to learn process-structure-property relationships for materials used in CVD/ALD/etch equipment.
• Develop deep learning models for forecasting thermal, mechanical, chemical, and plasma-compatibility behavior of candidate materials.
Technical Skills:
• Strong proficiency in programming languages like Python and C++.
• Experience with machine learning and deep learning frameworks (e.g., PyTorch, TensorFlow).
• Knowledge of generative modeling techniques and architectures (e.g., GANs, VAEs, transformers).
• Knowledge of MLOps, model deployment pipelines, and CI/CD.
• Experience with data cleansing, preprocessing, and feature engineering
Qualifications
• Graduate or undergraduate degree in Computer Science, Engineering, Applied Mathematics, or a related technical field.
• 2-4 years of work experience (depending on educational degree) in data science, AI, machine learning, or data engineering roles.
• A strong foundation in the principles of materials science is essential to understand the underlying science and set up meaningful problems for AI.
• Expert in Python and data science libraries (e.g., pandas, NumPy, scikit-learn, TensorFlow or PyTorch).
• Expertise in use of cloud-based compute environments and tools for parallel or distributed computing.
• Strong problem-solving and communication skills.
Sr. Autonomy Systems Safety Engineer
Reliability engineer job in Fremont, CA
Company Description
Jobs for Humanity is partnering with Supernal to build an inclusive and just employment ecosystem. Therefore, we prioritize individuals coming from the following communities: Refugee, Neurodivergent, Single Parent, Blind or Low Vision, Deaf or Hard of Hearing, Black, Hispanic, Asian, Military Veterans, the Elderly, the LGBTQ, and Justice Impacted individuals. This position is open to candidates who reside in and have the legal right to work in the country where the job is located.
Company Name: Supernal
Job Description
Supernal is at the forefront of creating emerging mobility solutions that will foster the development of human-centered cities. We are designing a completely new electric vertical take-off and landing (eVTOL) aircraft tailored to the mobility needs of future cities. This allows passengers a seamless intermodal journey that safely transports them to their final destination. We fuse research in autonomy, robotics, aviation and services to define a new category of mobility for the world's communities. We believe in creative thinking and collaboration to help build a better mobility experience for everyone, improving people's ability to move - whether for work or play.
What we do:
The Sr. Autonomy Systems Safety Engineer is responsible for Autonomy product requirements definition and decomposition, systems design and analysis, systems integration, test, and verification. The engineer works as part of the Autonomy team to develop and mature the automation technologies that are central to the Supernal Advanced Air Mobility vision.
This position will be required to work on-site 5 days a week.
What you can do:
Generate and review safety assessments and analysis at the aircraft, system, and component levels.
Generate and maintain safety-derived requirements through the development life cycle
Collaborate with designers and developers to optimize the system and software architecture to maximize capability and safety, while minimizing costs and validation efforts
Support aircraft-level development and certification activities, including aircraft-level design reviews, system simulations and evaluations, safety reviews, and aircraft integration
May require up to 10% of domestic and international travel
Other duties as assigned
What you can contribute:
Bachelor's degree in science, technology, engineering, or mathematics field required; Master's degree preferred
Minimum of five (5) years of experience (an equivalent combination of education and experience may be considered)
Experience with guidance material - such as Advisory Circular 23.1309-1E, SAE ARP 4761, and SAE ARP 4754A.
Experience with Requirement Management Tools: DOORS and MBSE Tool Rhapsody (IBM tool suite)
Experience with model based, code generation tools (e.g. MATLAB/Simulink, SCADE)
Experience with an embedded programming language (e.g., C, C++, Rust, Ada)
Experience with agile software development practices
Experience with embedded devices
Experience in a regulatory environment
Experience with highly automated systems
Strong understanding of aviation software development or autonomy software development, and system communication protocols.
Basic understanding of RTCA/DO-178 and RTCA/DO-254
Proficient with Jira (Agile tool)
Strong architectural problem solver with attention to detail
Excellent verbal and written communication skills
Proactive delivery of communication and follow up
Excellent organizational skills and attention to detail
Must have the ability to independently prioritize and accomplish work within time constraints
You may also be able to contribute:
Experience working with software and hardware requirements for onboard fault management systems preferred
Experience with analysis including but not limited to: Functional Hazard Assessment (FHA), Preliminary System Safety Assessment (PSSA), System Safety Assessment (SSA), Fault Tree Analysis (FTA), Common Mode Analysis (CMA), Failure Modes and Effects Analysis (FMEA), Failure Modes and Effects Diagnostics Analysis (FMEDA), System-Theoretic Process Analysis (STPA)
Any offer of employment is conditioned upon the successful completion of a background check. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, citizenship, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other category or class protected under applicable federal, state or local law. Individuals with disabilities may request a reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation at: [email protected]
This position may include access to certain technology and/or software source code subject to U.S. export controls laws and regulations. If an export authorization from an applicable US regulatory agency is required in connection with your employment, your employment is contingent upon Supernal's receipt of such regulatory authorization(s) and your continued compliance with all conditions and limitations pursuant to such authorization(s).
Base pay offered may vary depending on skills, experience, job-related knowledge and location. This position is also eligible for a bonus as part of total compensation. The pay range for this position is: $166,400-$232,960 USD
Senior Video Standards Engineer
Reliability engineer job in Cupertino, CA
Do you want to shape the future of technology and multimedia? Apple's Multimedia Engineering team is looking for a video standards engineer who will represent the voice of Apple Multimedia in standards bodies around the world. In this role, you will join a small team that is responsible for Apple's engagement with several standards developing organizations including Alliance for Open Media (AOM), ITU, MPEG, 3GPP, and others. You will investigate and develop new media technologies collaborate with other teams within Apple in such development and then engage with these organizations for their standardization.
As a video standards engineer, you will be a member of a team that represents Apple in the international standards' community. In this role you will be providing a clear position on our analysis of ongoing standards, refining and commenting on work in progress, proposing and driving to completion Apples initiatives and in general, steering the work to develop the field in line with Apple's vision and values. You may also represent Apple at standardization calls and meeting, integrating the feedback of the Apple community on the ongoing work and representing that at meetings. You will engage with various standards bodies, including the Alliance for Open Media, MPEG, 3GPP, ITU etc. and you will be involved in the development of new 2D and 3D multimedia standards. This position requires travel.
Awareness of the interaction of intellectual property and standards. Curiosity and interest in new technologies and ideas, especially pertaining to multimedia. Excellent judgement and integrity with the ability to make timely and sound decisions. Self-motivation and creative and critical thinking capabilities. Strong collaboration skills. The ability to prioritize, drive and supervise initiatives. M.S. or higher in EE/CE/CS with a focus in software engineering and/or signal processing.
Knowledgeable in video compression and the multimedia field in general, including 3D visual information coding. Good knowledge of the development of international standards, especially those pertaining to video/multimedia compression. Experience in developing sophisticated video compression technologies such as MPEG-4 AVC, HEVC, VVC, or AOMedia's AV1 and AV2 is helpful. Track record of effective participation. Able to develop, analyze, propose and refine new algorithms and techniques; help develop Apple's intellectual property position with inventions and filings. Experienced in key multimedia specifications from AOM, ISO, ITU, SMPTE, etc. Familiar with software development tools including Xcode, Git, etc. Proficient in practical software development and able to contribute to large codebases, e.g. of reference software (C, C++). Excellent writing, speaking, presentation and other interpersonal skills.
Process Quality Engineer - Swing shift
Reliability engineer job in Fremont, CA
@HYVE Solutions, our mission is to help customers, business partners, and employees achieve success through shared goals, strategies, resources and technology solutions.
Salary range: $90K-120K
THE SYNNEX CULTURE
SYNNEX creates additional value for all of our partners at all transaction points. For the company to succeed, each SYNNEX associate is focused on delivering the finest products, services, and solutions in the industry. SYNNEX values and rewards loyalty, teamwork, integrity, and industry. We encourage team collaboration and the spirit of entrepreneurship. Our associates are our greatest asset, and we are dedicated to providing our team members with the opportunity to realize personal growth and professional success.
Get in S•Y•N•C• with SYNNEX
Start Your New Career as a……..Quality Engineer
THE RIGHT FIT
SYNNEX Corporation is looking for a detail-oriented, hands-on, results-driven individual with proven communication skills and a strong work ethic to work in a challenging, fast-paced, energetic environment with responsibilities that include managing all aspects of the quality control production process, fall-out, audits and ISO; ensuring that division and departmental practices comply with company requirements; achieve stated objectives and meet current ISO standards.
PRINCIPAL DUTIES AND RESPONSIBILITIES (ESSENTIAL FUNCTIONS)
Main point of contact for process quality issues, including any inspection activities, priorities and escalations.
Collaborate with production teams to address quality issues and implement corrective actions.
Collaborate with PE/TE/PM to ensure alignment on quality objectives and priorities.
Support regular inspections and audits of manufacturing processes and products to identify defects or deviations from quality standards.
Provide guidance on the acceptance criteria for cosmetic issues to the QC and MFG teams.
Coordinate and resolve Stop line, Quarantine and QRQC (Quick Response Quality Control) issues.
Refocus QA resources from data gathering/reporting to using audits to drive process improvement opportunities.
Direct QA resources away from performing primarily in-line audits and toward auditing primary upstream processes.
Establish and build closer links between site QA teams and Engineering / Manufacturing teams.
Work with internal Production, Engineering, Shipping/Receiving, Warehouse, Program Managers and Procurement to meet quality standards.
Develop proactive solutions and implement Quality department strategies across the organization.
Customer-facing site-based QA representative who can effectively present and communicate to internal customers and other areas.
Direct site QA teams to maintain consistent standards & metrics & to share/implement best practices across products.
Review and approve Product and Processes Corrective and Prevention Action Plans (8Ds) and perform additional assessment and analysis as assigned.
Perform failure analyses, root cause analysis and corrective action follow-up.
Assess and evaluate all reliability testing, equipment service and calibration and the verification process.
Execute internal audits on QMS, EMS, ISO 9001 and ISO 14001 standards.
Coordinate UL (Underwriters Laboratories) and other regulatory factory audits.
ESSENTIAL CRITERIA
BS degree in Computer Science, Electrical, Mechanical or Industrial Engineering or relevant discipline plus 3 years of experience including a combination of 2+ years in contract manufacturing, 3+ years in quality control and 3+ years in a leadership position or equivalent combination of experience.
Prior manufacturing engineering and quality experience.
Proven understanding of mechanical drawing and/or tools.
Experience with server/computer build or repair processes.
Knowledge of key customer processes (i.e. Microsoft, etc.)
Demonstrated background in interfacing with key customers within the high tech industry and experience working across multiple sites sharing best practices & implementing process improvements.
Working knowledge of MS Office programs; Word, Excel and PowerPoint.
Hands-on experience with quality system training, including understanding of SPC (Statistical Process Control) principles and tools.
Established ability to prioritize and manage multiple projects to meet strict deadlines.
Flexibility to work in a fast-paced, high volume, and diverse environment across functions to produce expected results.
Able to work as business needs require which may include long days, occasional evenings and weekends, and travel to all manufacturing and warehouse locations, for business meetings or training.
WHAT SYNNEX OFFERS YOU
SYNNEX Corporation (SNX) is committed to investing in our associates. They are our greatest asset, and we are dedicated to providing our team members with the opportunity to realize professional and personal growth. If you share our mission, our strong work ethic, and our values of integrity, continuous learning, quality of work, commitment, teamwork, execution and results, respect for the individual, and taking manageable risks, then SYNNEX may be the place for you.
Competitive Compensation
Profit Sharing
Employee Stock Purchase Plan
Paid Vacation Days
Paid Holidays
Paid Sick Days
Direct Deposit
Tuition Reimbursement
Medical and Prescription Insurance
Dental Insurance
Vision Care
Life & Accident Insurance
Development Scholarship Program
Flexible Spending Accounts (FSA)
Short- & Long-Term Disability
Bereavement and Jury Duty Leaves
Casual Dress Code
Employee Assistance Program
Live Well Work Well Program
Training Opportunities
Pet Insurance
“SYNNEX Corporation is an Equal Employment Opportunity employer M/F/D/V and is committed to the Quality Policy.”
Note: The preceding job description has been designed to indicate the general nature and level of work performed by employees with this classification. It is not designed to contain or be interpreted as a comprehensive inventory of all duties, responsibilities, and qualifications required of employees assigned to this job.
@ HYVE Solutions, we believe employees are our greatest asset and we empower them to make a difference in our business. Diversity and inclusion make us all better. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status.
Drinking Water Process Engineer
Reliability engineer job in Santa Clara, CA
Kennedy Jenks is seeking a Drinking Water Process Engineer to provide technical expertise and support for drinking water treatment projects. The ideal candidate will have experience in drinking water treatment technologies and regulations, with a strong interest in continuing to develop their skills in water treatment. This role offers opportunities to work alongside experienced engineers and support clients on a variety of water quality and treatment projects.
Key Responsibilities:
Provide technical support for drinking water treatment projects, including treatment process evaluation, process selection, and operations optimization.
Assist in preliminary engineering studies and feasibility assessments for municipal drinking water treatment systems.
Support project teams with design and technical contributions, including developing process flow diagrams and design criteria for water treatment facilities.
Collaborate with client service managers by contributing technical insights during project meetings and presentations.
Participate in research and process improvements related to water quality and treatment technologies.
Provide input on water treatment facility performance evaluations and assist in operations optimization.
Stay engaged in water-focused professional organizations and present technical material at conferences.
Qualifications:
Bachelor's or Master's degree in civil / environmental engineering, or related scientific discipline required.
7+ years of experience in drinking water treatment engineering
Practical professional engineer (PE) license or ability to obtain licensure within one year of hire. License in one or multiple states (CA, CO, FL, HI, OR, TX, VA, WA) preferred.
Strong familiarity with drinking water treatment regulations and technologies.
Ability to work as part of a project team, supporting senior staff and contributing to technical deliverables.
Strong communication skills and ability to convey technical information clearly to colleagues and clients.
Kennedy Jenks supports a healthy work-life balance and utilizes a hybrid model of home and office work, with a minimum of two days per week in the office. This approach empowers our people to thrive, collaborate, and achieve their full potential.
The salary range for this position is anticipated to be $110,000 to $140,000, and may vary based upon education, experience, qualifications, licensure/certifications and geographic location.
This position is eligible for performance and incentive compensation.
#LI-Hybrid
Standard Cell Characterization Engineer
Reliability engineer job in Santa Clara, CA
Job Title: Standard Cell Characterization Engineer
Position Type: Full-Time
About Us
Auradine is a fast-growing, well-funded startup building scalable, sustainable, and secure infrastructure for blockchain and AI applications. Founded in 2022 and backed by top-tier investors, our team includes veterans from Palo Alto Networks, Marvell, Intel, Google, and other leading tech companies. We are headquartered in Santa Clara, CA.
Role Overview
You will own the cell characterization flow at Auradine and play a critical role in building the foundation of our ASIC and silicon platform by enabling high-quality, high-performance cell libraries.
Job Responsibilities
● Characterize standard cells for advanced technology nodes.
● Generate Liberty (NLDM/CCS/ECSM) timing, power, and noise models.
● Drive methodology improvement to achieve accurate and self-consistent characterization with reasonable runtime.
● Develop automation and analysis scripts in Python or Perl.
● Interface with design, physical implementation, and CAD teams to ensure library quality and smooth integration.
● Perform silicon correlation and validation of characterized models as needed.
Qualifications
● Bachelor's or advanced degree in a related field.
● Minimum 3+ years of experience in standard cell library characterization.
● Experience with Liberate or PrimeLib.
● Strong command of SPICE simulations and Liberty model generation.
● Strong scripting ability in Python, Perl, or TCL.
● Understanding of advanced node design challenges (e.g., 7nm, 5nm, 3nm).
● Knowledge of standard cell design and physical layout.
● Familiarity with library QA methodologies.
● Deep understanding of power analysis, variation-aware characterization and LVF methodology.
Site Reliability Engineer
Reliability engineer job in San Francisco, CA
Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure
Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You'll also support their extremely exciting new products coming to the market!
This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.
If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.
Responsibilities
Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Required Skills & Experience
Customer facing experience and the attitude to be a Swiss army knife!
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
AI/ML Engineer - Build Core Intelligence for a New Class of Enterprise AI Products
Reliability engineer job in Sonoma, CA
Our client is a well-funded, product-focused AI startup building next-generation systems that help organizations capture, organize, and leverage their internal knowledge at scale. Their platform blends modern machine learning, intelligent data pipelines, and context-aware models to make teams more effective, faster, and dramatically more aligned.
They're growing quickly and looking for engineers with strong CS fundamentals and a passion for building ML systems that operate in the real world, not just in research settings. If you're energized by early-stage ownership and solving hard problems that matter, you'll thrive here.
What You'll Work On
Designing and shipping ML components end-to-end: training pipelines, data ingestion, evaluation, optimization, and deployment
Building models and services that learn from unstructured signals (text, docs, conversations, workflows) and surface actionable insights
Transforming research concepts into scalable production systems used by enterprise customers
Architecting efficient, privacy-conscious model-serving infrastructure
Collaborating across engineering, product, and research to rapidly experiment and iterate
Owning technical decisions in a fast-paced environment where craftsmanship and velocity both matter
Who Thrives Here
Engineers with strong computer science foundations (algorithms, systems, distributed computing, data structures)
Builders with experience in ML frameworks (PyTorch, TensorFlow, JAX) and modern backend systems
Candidates coming from top CS programs or AI-first startups where they've shipped real, user-facing ML/AI features
People who enjoy working in ambiguous, zero-to-one environments and want significant ownership
Curious thinkers who care about model performance, reliability, and how their work impacts real users
Bonus skills (not required): experience with LLMs, embeddings, personalization models, agentic workflows, or model evaluation at scale
Why This Role Is Compelling
You'll build foundational AI systems at a moment where the company is scaling quickly
Your work will directly shape how customers interact with and benefit from AI
You'll partner closely with experienced founders and senior engineers who value autonomy and deep thinking
You'll have room to experiment, influence architecture, and meaningfully impact technical direction
You'll join early enough to feel the upside, but after core customer traction is already proven
What's On Offer
A high-ownership engineering role with scope to grow into senior or staff pathways
A culture that balances speed with thoughtful engineering
Mission-driven work with clear real-world utility
Competitive compensation, strong equity, and flexibility in how you build
If you're excited by the idea of building intelligent systems that help real teams work smarter, and you want your engineering decisions to matter, we'd love to talk.
Sr. Autonomy Systems Safety Engineer
Reliability engineer job in Fremont, CA
Jobs for Humanity is partnering with Supernal to build an inclusive and just employment ecosystem. Therefore, we prioritize individuals coming from the following communities: Refugee, Neurodivergent, Single Parent, Blind or Low Vision, Deaf or Hard of Hearing, Black, Hispanic, Asian, Military Veterans, the Elderly, the LGBTQ, and Justice Impacted individuals. This position is open to candidates who reside in and have the legal right to work in the country where the job is located.
Company Name: Supernal
Job Description
Supernal is at the forefront of creating emerging mobility solutions that will foster the development of human-centered cities. We are designing a completely new electric vertical take-off and landing (eVTOL) aircraft tailored to the mobility needs of future cities. This allows passengers a seamless intermodal journey that safely transports them to their final destination. We fuse research in autonomy, robotics, aviation and services to define a new category of mobility for the world's communities. We believe in creative thinking and collaboration to help build a better mobility experience for everyone, improving people's ability to move - whether for work or play.
What we do:
The Sr. Autonomy Systems Safety Engineer is responsible for Autonomy product requirements definition and decomposition, systems design and analysis, systems integration, test, and verification. The engineer works as part of the Autonomy team to develop and mature the automation technologies that are central to the Supernal Advanced Air Mobility vision.
This position will be required to work on-site 5 days a week.
What you can do:
Generate and review safety assessments and analysis at the aircraft, system, and component levels.
Generate and maintain safety-derived requirements through the development life cycle
Collaborate with designers and developers to optimize the system and software architecture to maximize capability and safety, while minimizing costs and validation efforts
Support aircraft-level development and certification activities, including aircraft-level design reviews, system simulations and evaluations, safety reviews, and aircraft integration
May require up to 10% of domestic and international travel
Other duties as assigned
What you can contribute:
Bachelor's degree in science, technology, engineering, or mathematics field required; Master's degree preferred
Minimum of five (5) years of experience (an equivalent combination of education and experience may be considered)
Experience with guidance material - such as Advisory Circular 23.1309-1E, SAE ARP 4761, and SAE ARP 4754A.
Experience with Requirement Management Tools: DOORS and MBSE Tool Rhapsody (IBM tool suite)
Experience with model based, code generation tools (e.g. MATLAB/Simulink, SCADE)
Experience with an embedded programming language (e.g., C, C++, Rust, Ada)
Experience with agile software development practices
Experience with embedded devices
Experience in a regulatory environment
Experience with highly automated systems
Strong understanding of aviation software development or autonomy software development, and system communication protocols.
Basic understanding of RTCA/DO-178 and RTCA/DO-254
Proficient with Jira (Agile tool)
Strong architectural problem solver with attention to detail
Excellent verbal and written communication skills
Proactive delivery of communication and follow up
Excellent organizational skills and attention to detail
Must have the ability to independently prioritize and accomplish work within time constraints
You may also be able to contribute:
Experience working with software and hardware requirements for onboard fault management systems preferred
Experience with analysis including but not limited to: Functional Hazard Assessment (FHA), Preliminary System Safety Assessment (PSSA), System Safety Assessment (SSA), Fault Tree Analysis (FTA), Common Mode Analysis (CMA), Failure Modes and Effects Analysis (FMEA), Failure Modes and Effects Diagnostics Analysis (FMEDA), System-Theoretic Process Analysis (STPA)
Any offer of employment is conditioned upon the successful completion of a background check. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, citizenship, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other category or class protected under applicable federal, state or local law. Individuals with disabilities may request a reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation at:
[email protected]
This position may include access to certain technology and/or software source code subject to U.S. export controls laws and regulations. If an export authorization from an applicable US regulatory agency is required in connection with your employment, your employment is contingent upon Supernal's receipt of such regulatory authorization(s) and your continued compliance with all conditions and limitations pursuant to such authorization(s).
Base pay offered may vary depending on skills, experience, job-related knowledge and location. This position is also eligible for a bonus as part of total compensation.
The pay range for this position is: $166,400-$232,960 USD
Click HERE or visit: *********************************** to view our benefits!
Process Engineer
Reliability engineer job in Fremont, CA
@HYVE Solutions, our mission is to help customers, business partners, and employees achieve success through shared goals, strategies, resources and technology solutions.
Process Engineer
Hyve Solutions is a leader in the data center solutions industry, designing, manufacturing, and delivering custom Server, Storage, and Networking Solutions to the world's largest Cloud, Social Media, and Enterprise companies. We pride ourselves on collaboration, innovation and thought leadership. Our team consists of diverse, forward-thinking individuals who dare to challenge the status quo, while working with many of the world's biggest customers. Hyve Solutions is a part of Synnex Corporation, a Fortune 500 company. Become part of a team that thrives on excellence in a fast changing, high-growth technology environment!
Responsibilities:
• Assess and set up production infrastructure and production process flow.
• Manage NPI introduction process and production documentation management to include:
• WO BOM Validation
• Shop Floor Control System Routing
• Repair and Quality Error Code Set-Up
• Production Flow Diagram
• Work Instructions
• Pack Out Instructions
• Accountable for Change Order Management to include:
• Assess, document and communicate all production floor and/or customer driven changes
• Coordinate ECN phase in/cut in process
• Control ECN/ECO Documentation
• Control Manufacturing Notice Documentation
• Update all applicable work instructions
• Responsible for production training in the areas of NPI, ongoing KPIs and ECNs.
• Provide project detail status reports to internal core team managers, and when applicable, our customers.
• Calculate Task time and workforce requirements.
Qualifications:
• Bachelor's Degree in an Engineering discipline (preferably Industrial or Mechanical)
• Established analytical and problem solving skills with a strong continuous improvement mindset to drive and identify improvement plans.
• Understanding of Lean Manufacturing, Six Sigma and 5S disciplines.
• Detail-oriented with excellent written and verbal communication skills, to include presentation skills.
• Flexibility to travel to other client sites as needed (5%+ Domestic & International).
• Flexibility to travel to customer locations as needed (5%+ Domestic & International).
Hyve Perks
Every Day is Casual Day • Company Discounts • Community Involvement Opportunities • Profit Sharing • Medical, Dental & Vision Insurance • 401k • FSA & HSA • Paid Vacation, Holiday & Sick Days • Employee Stock Purchase Plan • Tuition Reimbursement • Live Well Work Well Program • And More
@ HYVE Solutions, we believe employees are our greatest asset and we empower them to make a difference in our business. Diversity and inclusion make us all better. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status.
Drinking Water Process Engineer
Reliability engineer job in San Francisco, CA
Kennedy Jenks is seeking a Drinking Water Process Engineer to provide technical expertise and support for drinking water treatment projects. The ideal candidate will have experience in drinking water treatment technologies and regulations, with a strong interest in continuing to develop their skills in water treatment. This role offers opportunities to work alongside experienced engineers and support clients on a variety of water quality and treatment projects.
Key Responsibilities:
Provide technical support for drinking water treatment projects, including treatment process evaluation, process selection, and operations optimization.
Assist in preliminary engineering studies and feasibility assessments for municipal drinking water treatment systems.
Support project teams with design and technical contributions, including developing process flow diagrams and design criteria for water treatment facilities.
Collaborate with client service managers by contributing technical insights during project meetings and presentations.
Participate in research and process improvements related to water quality and treatment technologies.
Provide input on water treatment facility performance evaluations and assist in operations optimization.
Stay engaged in water-focused professional organizations and present technical material at conferences.
Qualifications:
Bachelor's or Master's degree in civil / environmental engineering, or related scientific discipline required.
7+ years of experience in drinking water treatment engineering
Practical professional engineer (PE) license or ability to obtain licensure within one year of hire. License in one or multiple states (CA, CO, FL, HI, OR, TX, VA, WA) preferred.
Strong familiarity with drinking water treatment regulations and technologies.
Ability to work as part of a project team, supporting senior staff and contributing to technical deliverables.
Strong communication skills and ability to convey technical information clearly to colleagues and clients.
Kennedy Jenks supports a healthy work-life balance and utilizes a hybrid model of home and office work, with a minimum of two days per week in the office. This approach empowers our people to thrive, collaborate, and achieve their full potential.
The salary range for this position is anticipated to be $110,000 to $140,000, and may vary based upon education, experience, qualifications, licensure/certifications and geographic location.
This position is eligible for performance and incentive compensation.
#LI-Hybrid
Site Reliability Engineer
Reliability engineer job in San Jose, CA
Senior Platform Engineer/Site Reliability Engineer - AI Infrastructure
Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You'll also support their extremely exciting new products coming to the market!
This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.
If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.
Responsibilities
Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Required Skills & Experience
Customer facing experience and the attitude to be a Swiss army knife!
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.