Senior Reliability Engineer jobs at JPMorgan Chase & Co. - 618 jobs

Senior Lead Site Reliability Engineer
JPMorgan Chase & Co. 4.8
Senior reliability engineer job at JPMorgan Chase & Co.
JobID: 210666813. JobSchedule: Full time.

Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure Platforms team of Corporate Technology, you work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products' design and test phases, that your service level indicators are effectively measuring customer experience, and that service level objectives are defined with stakeholders and implemented in production.

Job responsibilities
* Creates high-quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
* Provides advice and mentoring to other engineers and acts as a key resource for technologists seeking advice on technical and business-related issues
* Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
* Collaborates with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
* Works toward becoming an expert on the applications and platforms in your remit while understanding their interdependencies and limitations
* Evolves and debugs critical components of applications and platforms
* Provides comprehensive and ongoing guidance, tools, and solutions to support the firm's growth
* Makes significant contributions to JPMorgan Chase's site reliability community via internal forums, communities of practice, guilds, and conferences

Required qualifications, capabilities, and skills
* Formal training or certification on software engineering concepts and 5+ years of applied experience
* Advanced knowledge in site reliability culture and principles with demonstrated ability to implement site reliability within an application or platform
* Advanced knowledge and experience in observability such as white and black box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
* Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines
* Ability to communicate data-based solutions with complex reporting and visualization methods
* Ability to anticipate, identify, and troubleshoot defects found during testing

Preferred qualifications, capabilities, and skills
* Strong communication skills with ability to mentor and educate others on site reliability principles and practices
* Recognized as an active contributor to the engineering community
* Continues to expand network and leads evaluation sessions with vendors to see how offerings can fit into the firm's strategy
* Fin-tech background is a plus
$108k-132k yearly est. Auto-Apply 60d+ ago

Senior Lead Site Reliability Engineer- Network Reliability
JPMorgan Chase 4.8
Senior reliability engineer job at JPMorgan Chase & Co.
Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

As a Senior Manager of Site Reliability Engineering at JPMorgan Chase within the Network Product, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team's strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.

Job responsibilities
* Demonstrates expertise in network reliability principles, including Permit to Operate, FMEA, and operational readiness, balancing new features, efficiency, and stability.
* Collaborates closely with network engineering teams (Datacenter, Firewall, Proxies, DMZ, Load Balancing, etc.) and Lines of Business to ensure alignment and optimal outcomes.
* Drives the adoption of network reliability best practices and robust observability across the organization, empirically demonstrating improvements through stability and reliability metrics.
* Acts as the bridge between Engineering, Operations, DevOps, and customers to build and maintain resilient, scalable, and secure network services.
* Leads Tier-3 network support, providing operational support for major incidents and ensuring rapid resolution and root cause analysis.
* Fosters a culture of continual improvement, soliciting real-time feedback to enhance the customer and user experience.
* Ensures knowledge sharing and collaboration across teams, avoiding duplication of work and promoting innovation.
* Conducts blameless, data-driven post-mortems and regular team debriefs to enable learning from both successes and failures.
* Provides personalized coaching and development for team members at all levels.
* Documents and shares knowledge, innovations, and best practices via internal forums, communities of practice, and industry conferences.
* Works with internal specialists, product, and engineering teams to package approaches, best practices, and lessons learned into thought leadership, methodologies, and published assets.
* Interacts with business, partners, and customer technical stakeholders to manage project scope, priorities, deliverables, risks and issues, and timelines for successful client outcomes.

Required qualifications, capabilities, and skills
* Advanced proficiency in network reliability engineering, including Permit to Operate, FMEA, and operational readiness processes.
* Experience leading technologists to manage and solve complex network issues at a firmwide level.
* Ability to influence team culture by championing innovation and change for success.
* Proficiency in SD-WAN, cloud platforms (AWS, Azure, etc.), and major network technologies (Palo Alto, Juniper, F5, Broadcom, Arista, Cisco, etc.).
* Proficiency in observability and monitoring tools such as Grafana, SevOne, Prometheus, Kibana, ThousandEyes, and Splunk.
* Demonstrated proficiency in troubleshooting and supporting complex networking environments, including Tier-3 operational support for major incidents.
* Experience with continuous integration and delivery tools (e.g., Jenkins, GitLab, Terraform).
* Formal training or certification in network engineering concepts and 5+ years of applied experience.
* 10+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise.
* 5+ years of managing a team, with experience hiring, developing, and recognizing network engineering talent.
* Experience in scalable networking design, including high availability, redundancy, failover, and load balancing.
* Experience troubleshooting networking protocols such as TCP/IP, HTTPS, and BGP.
* Experience in customer-facing migration, including service discovery, assessment, planning, execution, and operations.

Preferred qualifications, capabilities, and skills
* Ability to code and demonstrate data fluency
$108k-132k yearly est. Auto-Apply 56d ago
Senior AI SRE: Scale GenAI Reliability & Impact
Charles Schwab Corporation 4.8
San Francisco, CA jobs
A leading financial services firm is seeking a Senior AI Site Reliability Engineer responsible for designing and managing the reliability of AI-driven applications. In this role, you'll work on innovative projects and mentor junior engineers while collaborating with cross-functional teams. Candidates should have extensive experience in software development and reliability engineering, with a particular focus on AI systems. This on-site position is located in San Francisco and offers opportunities for professional growth and development.
$118k-152k yearly est. 2d ago
Site Reliability Engineer
The Voleon Group 4.1
Berkeley, CA jobs
Voleon is a technology company that applies state-of-the-art AI and machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying AI/ML to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

Your colleagues will include internationally recognized experts in artificial intelligence and machine learning research as well as highly experienced finance and technology professionals. The people who shape our company come from other backgrounds, including concert music performances, humanitarian aid, opera singing, sports writing, and BMX racing. You will be part of a team that loves to succeed together. In addition to our enriching and collegial working environment, we offer highly competitive compensation and benefits packages, technology talks by our experts, a beautiful modern office, daily catered lunches, and more.

As a Site Reliability Engineer (SRE), you will work at the intersection of production operations and software development as you improve, manage, and monitor production-critical infrastructure and data pipelines. At Voleon, many SREs serve together on a Production Operations team tasked with improving shared production infrastructure. Others are embedded with teams of software engineers to improve specific production systems owned by those teams. Voleon SREs work on important real-world problems and collaborate with passionate and talented colleagues in an empowering, results-driven environment. This role is a way to make a real difference: your contributions will make our critical systems more reliable, lower operational risk, and increase the efficiency of our engineering effort.

Responsibilities
* Improve fault-tolerance and maintainability of code in proprietary data pipelines and trading systems
* Diagnose and fix bugs in code
* Lead complex deployments
* Automate manual workflows
* Track and prioritize outstanding production-related issues
* Share an on-call rotation responding to incidents to ensure the continuous operation of production-critical systems

Requirements
* Experience with coding and debugging Python
* Experience with Linux
* Familiarity with Relational Databases & SQL
* Sharp analytical and problem-solving skills and a persistent drive to make things work (better)
* Strong growth mindset and a passion for learning
* Strong technical communication skills
* Attention to detail
* 2 years of relevant industry experience
* An undergraduate degree or comparable training in a quantitative field or equivalent, relevant industry experience

Preferred Qualifications
* Familiarity with best practices concerning code maintainability, documentation, quality assurance, continuous integration and deployment
* Experience supporting production systems
* Experience with any of the following: gRPC microservices, Postgres, Pandas, Golang, R, Git, Jenkins, Bazel, Prometheus, Grafana, Airflow, Kubernetes

The base salary for this position is $120,000 to $160,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

Friends of Voleon Candidate Referral Program
If you have a great candidate in mind for this role and would like to have the potential to earn $7,500 - $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this form to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program.

Equal Opportunity Employer
The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
$120k-160k yearly 1d ago
Senior ML Engineer: Production Pipelines & HPC Expert
Capital One 4.7
McLean, VA jobs
A leading financial services company in Virginia seeks an experienced professional to design and build data-intensive solutions. The role requires expertise in C, C++, Python, Scala, and machine learning, along with the ability to lead teams and communicate complex concepts effectively. Candidates should possess a Bachelor's and preferably a Master's degree, with a proven track record in production-ready data pipelines and ML lifecycle. Competitive compensation and comprehensive benefits are offered.
$90k-111k yearly est. 3d ago
Process Designer
Tata Consulting Engineers 4.3
Washington, WV jobs
“Together We Make Life Better”. Our quality engineering, sustainable solutions and safety record inspire everything we do. Our diverse and inclusive workforce allows for all employees to feel valued and safe to give their opinions and improve our company. Tata Consulting Engineers USA, LLC, (TCE), is a multi-disciplinary engineering organization offering a full range of integrated engineering design, project support, procurement and construction management services to the energy and chemicals industries.

Position Summary:
The Process / Mechanical Designer will support capital and cost projects at the Washington Works site by developing detailed mechanical and process design deliverables. This role does not require a formal engineering degree but demands strong technical and design experience in an industrial setting. The designer will collaborate closely with project managers, engineers, drafters, and construction teams to ensure safe, efficient, and cost-effective project execution.

Responsibilities:
* Adhere to company core values of Safety, Integrity, Partnership, Respect, and Ownership.
* Develop complete design packages including conceptual, preliminary, and construction deliverables.
* Develop and revise mechanical and process design drawings including P&IDs, equipment drawings, and general arrangements.
* Interpret sketches, field notes, and verbal instructions to produce accurate design documents.
* Conduct field verification and site walkdowns to support design accuracy.
* Collaborate with engineers, vendors, and stakeholders to resolve design challenges and optimize solutions.
* Review and verify drawings for accuracy, compliance, and constructability.
* Incorporate redlines and as-built updates into final drawing packages.
* Ensure compliance with applicable codes, standards, and company procedures.
* Participate in design reviews and provide input on constructability and safety.
* Prepare and revise Bills of Materials (BOMs) and technical specifications.
* Follow company QA/QC requirements.
* Support design change documentation.

Qualifications & Experience:
Required:
* Minimum 5-8 years of experience in mechanical or process design in a chemical or industrial environment.
* Familiarity with P&IDs, piping systems, and mechanical equipment layouts.
* Ability to read and interpret engineering drawings and specifications.
* Strong attention to detail and organizational skills.
* Ability to work independently and as part of a cross-functional team.
* Working knowledge of Microsoft products (Word and Excel).
* Proficiency in CAD software (2D/3D), primarily MicroStation with knowledge of AutoCAD.

Preferred:
* Familiarity with Management of Change (MOC) and Process Safety Management (PSM) documentation practices.

Work Environment:
* Primarily office-based with frequent fieldwork in active chemical manufacturing areas.
* Must be able to access all areas of the plant, including elevated platforms.
* Exposure to industrial hazards such as moving equipment, chemicals, and varying weather conditions.
* Use of appropriate PPE is required.

Physical Requirements:
* Ability to sit, stand, walk, climb, and stoop as needed.
* Must be able to lift up to 25 pounds occasionally.

Additional Expectations:
* Strong problem-solving and reasoning abilities.
* Effective communication skills for working with cross-functional teams.
* Ability to manage multiple priorities and meet deadlines.

Education Requirements:
* High school diploma or equivalent required.
* Associate degree or technical certification in drafting/design preferred.
$60k-75k yearly est. 2d ago
Process Engineer
CTC 4.6
Cincinnati, OH jobs
20 hrs/week, ONSITE, Cincinnati, OH 45224

The Manufacturing Process Engineer will be responsible for evaluating, improving, and maintaining manufacturing processes and equipment to ensure efficiency, safety, and compliance. This role requires strong analytical skills, technical expertise, and the ability to drive continuous improvement initiatives across the plant.

Responsibilities
* Evaluate existing manufacturing processes and identify areas for improvement.
* Inspect and maintain mechanical equipment performance within the plant.
* Diagnose production issues and implement effective solutions.
* Conduct cost-benefit analyses for new processes and equipment.
* Design detailed layouts for equipment, processes, and workflows.
* Research and develop new processes, equipment, and products.
* Implement cost-saving measures and quality control systems.
* Ensure compliance with safety standards and legal regulations.
* Maintain documentation and prepare technical reports.

Must Have
* Process evaluation and continuous improvement experience.
* Mechanical equipment inspection and maintenance knowledge.
* Strong problem-solving and root cause analysis skills.
* Ability to perform cost-benefit analysis.
* Process design and workflow optimization expertise.
* Knowledge of quality control systems and regulatory compliance.
* Technical documentation and report preparation skills.
* Bachelor's degree in Mechanical, Industrial, or Manufacturing Engineering (or equivalent).
* 2 years of experience.

Nice to Have
* Experience with advanced manufacturing technologies (automation, robotics, Industry 4.0).
* Familiarity with Lean Manufacturing, Six Sigma, or Kaizen methodologies.
* Exposure to ERP systems (SAP, Oracle, Salesforce).
* Project management and cross-functional collaboration skills.
* Innovation mindset for R&D of new processes and products.
* Bilingual communication (English/Spanish) for global operations.
* Experience in cost-saving initiatives with measurable impact.
$55k-74k yearly est. 3d ago
Senior Cluster Site Reliability Engineer
The Voleon Group 4.1
Remote
Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will leverage engineering skills to ensure high degrees of uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform for researchers to focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will leverage IaC, Automation, and SRE principles to refine and hone a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.

Responsibilities
* Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
* Ensure a high degree of cluster uptime (measured in multiple nines), and define + track SLAs to quantify reliability
* Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
* Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
* Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
* Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Requirements
* 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
* Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
* Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
* Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
* Experience with cloud infrastructure (AWS or GCP)
* Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
* Experience with distributed storage technologies (Lustre, Ceph, S3)
* Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
* Bachelor's degree in computer science

Preferred Qualifications
* Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
* Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
* Familiarity with hybrid/on-prem environments
* Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
* Experience with HPC networking (InfiniBand, RDMA)
* Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

The base salary range for this position is $205,000 to $235,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

“Friends of Voleon” Candidate Referral Program
If you have a great candidate in mind for this role and would like to have the potential to earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this form to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program.

Equal Opportunity Employer
The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
$205k-235k yearly Auto-Apply 60d+ ago
Senior Reliability Engineer
CyrusOne Management Services 4.6
Remote
The Senior Reliability Engineer serves as a subject-matter expert and strategic technical authority for infrastructure reliability across a portfolio of mission-critical data center sites. This role leads the design, governance, and continuous improvement of reliability strategies for power, cooling, and control systems, applying advanced engineering judgment, analytics, and risk-based decision-making. The Senior Reliability Engineer independently evaluates complex reliability risks, prioritizes initiatives under uncertainty, and influences operational, maintenance, and capital decisions that materially impact uptime, safety, and lifecycle cost. This role operates with minimal oversight and is expected to shape standards, mentor others, and elevate reliability capability across the organization.

Responsibilities:

Enterprise Reliability Strategy & Asset Care
* Architect and govern portfolio-level, risk-based asset strategies for mission-critical power and cooling infrastructure.
* Apply advanced RCM principles to define maintenance and inspection strategies aligned to failure risk, system criticality, and redundancy posture.
* Evaluate and balance tradeoffs between maintenance investment, operational risk, spares coverage, redundancy, and capital replacement.
* Establish and maintain enterprise PM quality standards, including audits, task effectiveness reviews, and elimination of low-value maintenance.

Operational Governance & Change Risk Management
* Serve as a final technical authority for high-risk SOPs, MOPs, EOPs, and operational change packages.
* Perform system-level risk assessments for planned work, incidents, and abnormal operating conditions.
* Guide site teams in CMMS data integrity, work management maturity, and adherence to approved operating procedures.
* Lead or oversee complex reliability investigations involving multiple systems, teams, or contributing factors.

Advanced Analytics & Condition Monitoring
* Design and mature predictive condition-monitoring programs across the portfolio (oil analysis, thermography, vibration, battery monitoring, controls analytics).
* Develop and interpret leading reliability indicators and degradation trends to anticipate failures before impact.
* Apply statistical analysis, reliability modeling, and engineering judgment to evaluate failure likelihood and consequence.
* Translate analytical insights into strategic maintenance, operational mitigations, or capital recommendations.

Critical Spares & Lifecycle Strategy
* Define and govern enterprise critical spares strategies, accounting for supplier risk, lead times, and system exposure.
* Identify systemic spares gaps and drive remediation plans in partnership with Supply Chain and Operations.
* Lead lifecycle asset assessments to guide long-range capital planning and replacement prioritization.
* Provide data-driven input to business cases supporting capital investments and infrastructure upgrades.

Incident Leadership, RCA & Continuous Improvement
* Lead high-impact post-incident RCAs and FMEAs, ensuring depth of analysis beyond proximate causes.
* Identify and address latent design, procedural, and organizational contributors to reliability events.
* Ensure lessons learned result in durable changes to standards, procedures, maintenance strategies, or training.
* Champion continuous improvement initiatives that measurably reduce risk and failure recurrence across sites.

Technical Leadership & Capability Development
* Act as a mentor and technical escalation point for Reliability Engineers, site engineers, and CE leaders.
* Coach teams on reliability methods, risk-based decision-making, and interpretation of condition-monitoring data.
* Influence and evolve enterprise reliability standards, playbooks, and operating philosophies.
* Partner with leadership to strengthen operator certification, training rigor, and operational discipline.

Qualifications:
* 10+ years of experience in reliability engineering, maintenance engineering, or facilities engineering within mission-critical environments.
* Demonstrated leadership of complex, multi-system reliability programs with measurable business impact.
* Expert-level knowledge of RCM, FMEA, RCA, and maintenance optimization methodologies.
* Deep technical understanding of mission-critical infrastructure, including UPS, generators, switchgear, chillers, cooling towers, CRAH/CRAC, and BMS/EPMS.
* Proven experience governing SOP/MOP/EOP programs and assessing operational change risk in live environments.
* Advanced ability to analyze condition-monitoring, CMMS, and operational datasets and convert insights into strategic actions.
* Proficiency in data analysis and visualization tools (Excel, Power BI, or similar).
* Ability to apply statistical techniques or reliability modeling to support risk-informed decision-making under uncertainty.
* Strong executive-level communication skills; able to influence senior leaders and defend technical positions.

Preferred Experience:
* Experience designing and scaling enterprise critical spares and lifecycle asset management programs.
* Hands-on experience with predictive analytics, failure modeling, or reliability simulations.
* Proficiency with Python, R, or similar tools for advanced reliability analytics.
* Working knowledge of SQL or other data query languages.
* Strong familiarity with NFPA, IEEE, ASHRAE, and other relevant codes and standards.
* Experience presenting reliability risk, capital tradeoffs, and investment recommendations to executive audiences.

Education & Certifications:
* Bachelor's degree in Mechanical, Electrical, or Industrial Engineering (or equivalent experience).
* Preferred: CMRP, CRE, or similar advanced reliability or maintenance certification.

Work Conditions:
* Supports 24×7 mission-critical operations; participates in on-call rotation and may support after-hours events.
* Ability to work safely in energized environments in compliance with LOTO and NFPA 70E.
* Travel to supported sites approximately 25%.

Salary range: $140,000-$170,000

CyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status. CyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume.
$140k-170k yearly Auto-Apply 6d ago
Senior Site Reliability Engineer
Circle Internet Financial 4.5
Remote
Circle is a financial technology company at the epicenter of the emerging internet of money, where value can finally travel like other digital data - globally, nearly instantly and less expensively than legacy settlement systems. This ground-breaking new internet layer opens up previously unimaginable possibilities for payments, commerce and markets that can help raise global economic prosperity and enhance inclusion. Our infrastructure - including USDC, a blockchain-based dollar - helps businesses, institutions and developers harness these breakthroughs and capitalize on this major turning point in the evolution of money and technology.

What you'll be part of:
Circle is committed to visibility and stability in everything we do. As we grow as an organization, we're expanding into some of the world's strongest jurisdictions. Speed and efficiency are motivators for our success and our employees live by our company values: High Integrity, Future Forward, Multistakeholder, Mindful, and Driven by Excellence. We have built a flexible and diverse work environment where new ideas are encouraged and everyone is a stakeholder.

What you'll be responsible for:
The Site Reliability Engineer is responsible for:
* Building and maintaining Circle's common libraries and infrastructure to support the rapid development of software features
* Analyzing requirements, procedures, and problems to improve existing systems and modifying systems
* Building and owning scalable microservices that are responsible for reliable and secure APIs
* Working with SRE to improve software shipping experience and improve the speed and quality of iteration
* Building internal developer platform capabilities
* Collaborating with Product and Engineering teams to design, test, and ship software, including developing and documenting system design procedures, testing procedures, and quality standards
* Troubleshooting program and system malfunctions to restore normal functioning
* Consulting with management to ensure agreement on system principles
* Writing the infrastructure to deliver great development experiences

What you'll bring to Circle:
* 2-4 years of professional software development experience, with a strong foundation in object-oriented programming, preferably in languages such as Java or Golang
* Hands-on experience with major cloud platforms, including AWS, Google Cloud Platform (GCP), and Microsoft Azure
* Proficient with Kubernetes for container orchestration and managing scalable infrastructure
* Skilled in SQL database design, including schema modeling and query optimization
* Experience in the deployment and operation of production-quality, scalable software
* Emphasis on clean, maintainable code with a focus on speed, quality, and high test coverage to support continuous delivery practices
* Adaptable and quick learner, comfortable exploring new languages, frameworks, and technologies as needed
* Computer Science degree or a closely related field (or foreign equivalent)
* Solid understanding of API design and RESTful architecture, with the ability to derive and communicate well-structured designs
* Excellent communicator, able to collaborate effectively across remote teams and clearly present technical ideas and solutions
* Self-motivated with a growth mindset, thrives in fast-paced environments, delivers impactful user-focused software, and continuously seeks to improve without heavy oversight

Circle is on a mission to create an inclusive financial future, with transparency at our core. We consider a wide variety of elements when crafting our compensation ranges and total compensation packages. Starting pay is determined by various factors, including but not limited to: relevant experience, skill set, qualifications, and other business and organizational needs. Please note that compensation ranges may differ for candidates in other locations.

Base Pay Range: $147,500 - $195,000

We are an equal opportunity employer and value diversity at Circle. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Additionally, Circle participates in the E-Verify Program in certain locations, as required by law.

Should you require accommodations or assistance in our interview process because of a disability, please reach out to accommodations@circle.com for support. We respect your privacy and will connect with you separately from our interview process to accommodate your needs.

#LI-Remote
$147.5k-195k yearly Auto-Apply 60d+ ago
Senior Site Reliability Engineer
Circle Internet Financial 4.5
San Francisco, CA jobs
Circle (NYSE: CRCL) is one of the world's leading internet financial platform companies, building the foundation of a more open, global economy through digital assets, payment applications, and programmable blockchain infrastructure. Circle's platform includes the world's largest regulated stablecoin network anchored by USDC, Circle Payments Network for global money movement, and Arc, an enterprise-grade blockchain designed to become the Economic OS for the internet. Enterprises, financial institutions, and developers use Circle to power trusted, internet-scale financial innovation. Learn more at circle.com. What you'll be part of: Circle is committed to visibility and stability in everything we do. As we grow as an organization, we're expanding into some of the world's strongest jurisdictions. Speed and efficiency are motivators for our success and our employees live by our company values: High Integrity, Future Forward, Multistakeholder, Mindful, and Driven by Excellence. We have built a flexible work environment where new ideas are encouraged and everyone is a stakeholder. What you'll be responsible for: As a Site Reliability Engineer at Circle, you'll harness AI-powered insights to build and maintain our production infrastructure estate, enabling us to serve a rapidly expanding global customer base across multiple public cloud regions. You'll apply your experience, technical expertise, and AI-enabled solutions to keep Circle's products and core systems running consistently, reliably, and with top-tier performance. In a fun, collaborative, and ever-evolving environment, you'll have ample opportunities to develop new skills, partner with diverse and cross-function teams across the organization, and stay on the cutting edge of technology. What you'll work on: Empower agile development teams with a high-performance CI/CD pipeline, ensuring fast, high-quality releases with measurable performance and quality metrics. 
Design, maintain, and secure cloud infrastructure using Infrastructure-as-Code tools like Terraform and Crossplane. Automate operational tasks using Go, Python, and serverless solutions (AWS Lambda, Kubernetes Jobs). Manage and monitor Kubernetes clusters for multiple production workloads. Develop and maintain blockchain infrastructure, managing nodes across Ethereum, Solana, Arbitrum, Base, Avalanche, and others. Ensure system reliability and security by participating in on-call rotations, troubleshooting disruptions, conducting root cause analysis, and collaborating with Security teams on security-focused tools and frameworks. Plan, test, and implement disaster recovery strategies for a highly available microservices architecture. Leverage AI-powered solutions for managing infrastructure, analyzing logs, detecting anomalies, capacity planning, maintaining predictively, and optimizing performance. Mentor and support team growth, fostering collaboration and scalability. Here is our team hierarchy for individual contributors: Site Reliability Engineer (II) Senior Site Reliability Engineer (III) What you'll bring to Circle (not all required): Site Reliability Engineer (II) 2+ years in DevOps or SRE roles, with a focus on tooling, automation, and infrastructure on a public cloud provider (AWS, Azure, GCP) 1+ years in CI/CD platform development and microservices support Proficiency in Go, Python, and Shell Eagerness to learn emerging AI tools Excellent communication skills-able to break down technical concepts and foster collaboration Observability, troubleshooting, and performance optimization skills in complex, distributed systems Experience with: Kubernetes clusters at scale, containerization, and Helm charts. Modern CI/CD platforms with seemingly complex gates and workflows Distributed blockchain systems and blockchain full nodes Networking (routing, DNS, load balancing, edge networking) APM, RUM, monitoring, and telemetry tools. 
Database technologies (PostgreSQL, Redis, OpenSearch) Migrating and transforming large, complex datasets from diverse sources, structures, and formats Data warehousing (Apache Airflow, AWS DMS, Snowflake) IaC with Terraform or Crossplane for cloud deployments AI tools (GitHub Copilot, Gemini, and ChatGPT) for productivity and code quality Large Language Models (LLMs) and AI applications in software development and operations. Senior Site Reliability Engineer (III) All the requirements of Site Reliability Engineer (II), and: 4+ years in DevOps or SRE roles 3+ years in CI/CD platform development and microservices support Strong observability, problem-solving, and performance optimization skills in complex, distributed systems Hands-on experience with Blue-Green, Canary, and A/B Testing deployment strategies for services and databases Understanding of multi-region and multi-cloud architectures. Circle is on a mission to create an inclusive financial future, with transparency at our core. We consider a wide variety of elements when crafting our compensation ranges and total compensation packages. Starting pay is determined by various factors, including but not limited to: relevant experience, skill set, qualifications, and other business and organizational needs. Please note that compensation ranges may differ for candidates in other locations. Base Pay Range: $152,500 - $205,000 We are an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status, or any other protected status required by the laws in the locations where we hire. Additionally, Circle participates in the E-Verify Program in certain locations, as required by law. Should you require accommodations or assistance in our interview process because of a disability, please reach out to accommodations@circle.com for support. 
We respect your privacy and will connect with you separately from our interview process to accommodate your needs. #LI-Remote
$152.5k-205k yearly Auto-Apply 3d ago
Senior Site Reliability Engineer
Circle Internet Financial 4.5
Indiana jobs
Circle (NYSE: CRCL) is one of the world's leading internet financial platform companies, building the foundation of a more open, global economy through digital assets, payment applications, and programmable blockchain infrastructure. Circle's platform includes the world's largest regulated stablecoin network anchored by USDC, Circle Payments Network for global money movement, and Arc, an enterprise-grade blockchain designed to become the Economic OS for the internet. Enterprises, financial institutions, and developers use Circle to power trusted, internet-scale financial innovation. Learn more at circle.com. What you'll be part of: Circle is committed to visibility and stability in everything we do. As we grow as an organization, we're expanding into some of the world's strongest jurisdictions. Speed and efficiency are motivators for our success and our employees live by our company values: High Integrity, Future Forward, Multistakeholder, Mindful, and Driven by Excellence. We have built a flexible work environment where new ideas are encouraged and everyone is a stakeholder. What you'll be responsible for: As a Site Reliability Engineer at Circle, you'll harness AI-powered insights to build and maintain our production infrastructure estate, enabling us to serve a rapidly expanding global customer base across multiple public cloud regions. You'll apply your experience, technical expertise, and AI-enabled solutions to keep Circle's products and core systems running consistently, reliably, and with top-tier performance. In a fun, collaborative, and ever-evolving environment, you'll have ample opportunities to develop new skills, partner with diverse and cross-function teams across the organization, and stay on the cutting edge of technology. What you'll work on: Empower agile development teams with a high-performance CI/CD pipeline, ensuring fast, high-quality releases with measurable performance and quality metrics. 
Design, maintain, and secure cloud infrastructure using Infrastructure-as-Code tools like Terraform and Crossplane. Automate operational tasks using Go, Python, and serverless solutions (AWS Lambda, Kubernetes Jobs). Manage and monitor Kubernetes clusters for multiple production workloads. Develop and maintain blockchain infrastructure, managing nodes across Ethereum, Solana, Arbitrum, Base, Avalanche, and others. Ensure system reliability and security by participating in on-call rotations, troubleshooting disruptions, conducting root cause analysis, and collaborating with Security teams on security-focused tools and frameworks. Plan, test, and implement disaster recovery strategies for a highly available microservices architecture. Leverage AI-powered solutions for managing infrastructure, analyzing logs, detecting anomalies, capacity planning, maintaining predictively, and optimizing performance. Mentor and support team growth, fostering collaboration and scalability. Here is our team hierarchy for individual contributors: Site Reliability Engineer (II) Senior Site Reliability Engineer (III) What you'll bring to Circle (not all required): Site Reliability Engineer (II) 2+ years in DevOps or SRE roles, with a focus on tooling, automation, and infrastructure on a public cloud provider (AWS, Azure, GCP) 1+ years in CI/CD platform development and microservices support Proficiency in Go, Python, and Shell Eagerness to learn emerging AI tools Excellent communication skills-able to break down technical concepts and foster collaboration Observability, troubleshooting, and performance optimization skills in complex, distributed systems Experience with: Kubernetes clusters at scale, containerization, and Helm charts. Modern CI/CD platforms with seemingly complex gates and workflows Distributed blockchain systems and blockchain full nodes Networking (routing, DNS, load balancing, edge networking) APM, RUM, monitoring, and telemetry tools. 
Database technologies (PostgreSQL, Redis, OpenSearch) Migrating and transforming large, complex datasets from diverse sources, structures, and formats Data warehousing (Apache Airflow, AWS DMS, Snowflake) IaC with Terraform or Crossplane for cloud deployments AI tools (GitHub Copilot, Gemini, and ChatGPT) for productivity and code quality Large Language Models (LLMs) and AI applications in software development and operations. Senior Site Reliability Engineer (III) All the requirements of Site Reliability Engineer (II), and: 4+ years in DevOps or SRE roles 3+ years in CI/CD platform development and microservices support Strong observability, problem-solving, and performance optimization skills in complex, distributed systems Hands-on experience with Blue-Green, Canary, and A/B Testing deployment strategies for services and databases Understanding of multi-region and multi-cloud architectures. Circle is on a mission to create an inclusive financial future, with transparency at our core. We consider a wide variety of elements when crafting our compensation ranges and total compensation packages. Starting pay is determined by various factors, including but not limited to: relevant experience, skill set, qualifications, and other business and organizational needs. Please note that compensation ranges may differ for candidates in other locations. Base Pay Range: $152,500 - $205,000 We are an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status, or any other protected status required by the laws in the locations where we hire. Additionally, Circle participates in the E-Verify Program in certain locations, as required by law. Should you require accommodations or assistance in our interview process because of a disability, please reach out to accommodations@circle.com for support. 
We respect your privacy and will connect with you separately from our interview process to accommodate your needs. #LI-Remote
$152.5k-205k yearly Auto-Apply 12d ago
Senior Site Reliability Engineer
Circle Internet Financial 4.5
San Francisco, CA jobs
Circle is a financial technology company at the epicenter of the emerging internet of money, where value can finally travel like other digital data - globally, nearly instantly and less expensively than legacy settlement systems. This ground-breaking new internet layer opens up previously unimaginable possibilities for payments, commerce and markets that can help raise global economic prosperity and enhance inclusion. Our infrastructure - including USDC, a blockchain-based dollar - helps businesses, institutions and developers harness these breakthroughs and capitalize on this major turning point in the evolution of money and technology. What you'll be part of: Circle is committed to visibility and stability in everything we do. As we grow as an organization, we're expanding into some of the world's strongest jurisdictions. Speed and efficiency are motivators for our success and our employees live by our company values: High Integrity, Future Forward, Multistakeholder, Mindful, and Driven by Excellence. We have built a flexible and diverse work environment where new ideas are encouraged and everyone is a stakeholder. 
What you'll be responsible for: The Site Reliability Engineer is responsible for building and maintaining Circle's common libraries and infrastructure to support the rapid development of software features; analyzing requirements, procedures, and problems to improve existing systems and modifying systems; building and owning scalable microservices that are responsible for reliable and secure APIs; working with SRE to improve software shipping experience and improve the speed and quality of iteration; building internal developer platform capabilities; collaborating with Product and Engineering teams to design, test, and ship software, including developing and documenting system design procedures, testing procedures, and quality standards; troubleshooting program and system malfunctions to restore normal functioning; consulting with management to ensure agreement on system principles; writing the infrastructure to deliver great development experiences. What you'll bring to Circle: 2-4 years of professional software development experience, with a strong foundation in object-oriented programming, preferably in languages such as Java or Golang Hands-on experience with major cloud platforms, including AWS, Google Cloud Platform (GCP), and Microsoft Azure Proficient with Kubernetes for container orchestration and managing scalable infrastructure Skilled in SQL database design, including schema modeling and query optimization Experience in the deployment and operation of production-quality, scalable software Emphasis on clean, maintainable code with a focus on speed, quality, and high test coverage to support continuous delivery practices Adaptable and quick learner, comfortable exploring new languages, frameworks, and technologies as needed Computer Science degree or a closely related field (or foreign equivalent) Solid understanding of API design and RESTful architecture, with the ability to derive and communicate well-structured designs Excellent communicator, able to 
collaborate effectively across remote teams and clearly present technical ideas and solutions Self-motivated with a growth mindset, thrives in fast-paced environments, delivers impactful user-focused software, and continuously seeks to improve without heavy oversight. Circle is on a mission to create an inclusive financial future, with transparency at our core. We consider a wide variety of elements when crafting our compensation ranges and total compensation packages. Starting pay is determined by various factors, including but not limited to: relevant experience, skill set, qualifications, and other business and organizational needs. Please note that compensation ranges may differ for candidates in other locations. Base Pay Range: $147,500 - $195,000 We are an equal opportunity employer and value diversity at Circle. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Additionally, Circle participates in the E-Verify Program in certain locations, as required by law. Should you require accommodations or assistance in our interview process because of a disability, please reach out to accommodations@circle.com for support. We respect your privacy and will connect with you separately from our interview process to accommodate your needs. #LI-Remote
$147.5k-195k yearly Auto-Apply 60d+ ago
Senior Cluster Site Reliability Engineer
The Voleon Group 4.1
Berkeley, CA jobs
Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future. As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will leverage engineering skills to ensure high degrees of uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform for researchers to focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will leverage IaC, Automation, and SRE principles to refine and hone a product that operates 24/7 to support Voleon. The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs. Responsibilities Be a first responder in the event of cluster outages or issues. 
Triage and resolve urgent issues as they arise Ensure a high degree of cluster uptime (measured in multiple nines), and define + track SLAs to quantify reliability Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability Requirements 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod) Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.) 
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible). Experience with cloud infrastructure (AWS or GCP). Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry). Experience with distributed storage technologies (Lustre, Ceph, S3). Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation. Bachelor's degree in computer science. Preferred Qualifications: Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark). Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed). Familiarity with hybrid/on-prem environments. Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments. Experience with HPC networking (InfiniBand, RDMA). Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust). The base salary range for this position is $205,000 to $235,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match. “Friends of Voleon” Candidate Referral Program: If you have a great candidate in mind for this role and would like to have the potential to earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this form to submit your referral. 
For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program. Equal Opportunity Employer: The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
$205k-235k yearly Auto-Apply 60d+ ago
Senior Site Reliability Engineer
Morningstar 4.5
Chicago, IL jobs
Our Team: Technology drives our business. Our team is made up of talented software engineers, infrastructure engineers, leaders and UX professionals. We care about technology as a craft and a differentiator. We bring our global products to market with a mix of software, cloud, data centers, infrastructure, design and grit. The Role: Senior Site Reliability Engineer with extensive experience in automation and programming across the software development lifecycle (SDLC). In this role, you will leverage your technical expertise to develop and scale cloud-based solutions. You'll tackle complex challenges, collaborating closely with development and IT operations teams to enhance system visibility, improve communication, and deliver measurable business value. By optimizing feedback loops and building observable systems, you'll drive customer satisfaction and support revenue growth. This role is based in our Chicago office, and we follow a hybrid policy in order to foster continuous collaboration. Job Responsibilities: Create and implement system enhancements that improve the performance and reliability of the system. Be a leading contributor, both individually and as a team member, providing direction and mentoring to others. Work with a highly skilled team of engineers to bring improvements to the cloud and scale them. Own deployment, availability, reliability, and performance of the systems. Proactively identify and reduce issues by designing, testing, and implementing software-based solutions. Be a strong contributor to development of platform services including architecture, provisioning, configuration, deployment, and support. Requirements: A bachelor's degree in computer science or a related field, with 3 to 5 years of software DevOps experience. 
Strong scripting or programming skills for automating repetitive tasks DevOps (CI/CD, Docker and Terraform) Experience Experience architecting and automating cloud-native technologies, deploying applications, and provisioning infrastructure. Hands-on Experience with Infrastructure as Code, using CloudFormation, Terraform, or other tools. Experience architecting cloud native CI/CD workflows and tools, such as Code Deploy (AWS) Hands-on Experience with microservices and distributed application architecture, such as containers, and/or serverless technology. Good Debugging and troubleshooting skills Experience with the software development lifecycle and delivery using Agile practices. Experience with DB deployment and writing SQL/PLSQL queries Compensation and Benefits At Morningstar we believe people are at their best when they are at their healthiest. That's why we champion your wellness through a wide-range of programs that support all stages of your personal and professional life. Here are some examples of the offerings we provide: Financial Health 75% 401k match up to 7% Stock Ownership Potential Company provided life insurance - 1x salary + commission Physical Health Comprehensive health benefits (medical/dental/vision) including potential premium discounts and company-provided HSA contributions (up to $500-$2,000 annually) for specific plans and coverages Additional medical Wellness Incentives - up to $300-$600 annual Company-provided long- and short-term disability insurance Emotional Health Trust-Based Time Off 6-week Paid Sabbatical Program 6-Week Paid Family Caregiving Leave Competitive 8-24 Week Paid Parental Bonding Leave Adoption Assistance Leadership Coaching & Formal Mentorship Opportunities Annual Education Stipend Tuition Reimbursement Social Health Charitable Matching Gifts program Dollars for Doers volunteer program Paid volunteering days 15+ Employee Resource & Affinity Groups Total Cash Compensation Range $114,100.00 - 193,975.00 USD Annual 
Inclusive of annual base salary and target incentive. If you receive and accept an offer from us, we require that personal and any related investments be disclosed confidentially to our Compliance team (days vary by region). These investments will be reviewed to ensure they meet Code of Ethics requirements. If any conflicts of interest are identified, then you will be required to liquidate those holdings immediately. In addition, depending on your department and location of work, certain employee accounts must be held with an approved broker (for example, all U.S. employee accounts). If this applies and your account(s) are not with an approved broker, you will be required to move your holdings to an approved broker. Morningstar's hybrid work environment gives you the opportunity to collaborate in-person each week as we've found that we're at our best when we're purposely together on a regular basis. In most of our locations, our hybrid work model is four days in-office each week. A range of other benefits are also available to enhance flexibility as needs change. No matter where you are, you'll have tools and resources to engage meaningfully with your global colleagues. 002_MstarAssocLLC Morningstar Investment Management LLC Legal Entity
$114.1k-194k yearly Auto-Apply 2d ago
Senior Reliability Engineer - PCBA, Harness & Connectors
Figure 4.5
San Jose, CA jobs
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA.
We are looking for a Senior Reliability Engineer to develop and execute reliability test plans for the printed circuit board assemblies (PCBAs), harnesses, and connectors to ensure they meet our humanoid robot product reliability targets. We need someone who can derive reliability targets and test specs with incomplete field usage knowledge, give data-driven recommendations to the design and manufacturing teams on what to improve, and advocate for design for reliability in the design phase.
Responsibilities: Work with cross-functional teams and own hardware reliability requirements and validation strategy. Develop and execute accelerated life tests for PCBAs, electronic components, electrical harnesses, and connectors. Lead DFMEA efforts with design engineers to assess design risks, impacts, controls, and corrective actions. Design reliability test flows and procedures, and communicate with internal and external/CM teams to execute tests and report results. Work with test engineers to design the setups and fixtures used in reliability testing. Guide and support PCBA, harness, and connector failure analysis, design of experiments (DOEs), and corrective action processes with cross-functional teams. Analyze field data, assess field risks, and design tests that correlate to field usage conditions.
Requirements: 5+ years of experience in relevant reliability engineering areas. Bachelor's degree or higher in relevant science and engineering fields. Strong knowledge of environmental reliability test principles, models, and methodologies, such as high-temperature/high-humidity, thermal cycle/shock, and mechanical vibration/shock. Strong knowledge of industry test standards such as AEC-Q, JEDEC, and IPC. Strong knowledge of electrical circuits, PCBA design, and relevant software tools (e.g. Altium). Strong knowledge of PCBA, harness, and connector failure modes, mechanisms, and FA techniques. Hands-on experience with field reliability risk analysis and failure prediction methods. Hands-on experience with Weibull++, JMP, or other reliability statistical analysis software. Hands-on experience with electronic circuit debugging and relevant tools, e.g. source meter, oscilloscope. Hands-on experience with a 3D CAD tool (e.g. CATIA).
Bonus Qualifications: Experience shipping reliable robotics, consumer, or automotive products. Knowledge of PCBA, harness, and connector manufacturing processes and quality control practices. Hands-on experience with CATIA V6 CAD. Hands-on experience with finite element analysis (FEA) software, e.g. Ansys. Hands-on experience with FA tools and techniques such as SEM/EDS, CT X-ray, cross-section, and FTIR. Hands-on experience with accelerometers, load cells, strain gauges, and relevant setup and data acquisition techniques.
The US base salary range for this full-time position is between $150,000 and $225,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
$150k-225k yearly Auto-Apply 60d+ ago
Staff Site Reliability Engineer
Figure 4.5
San Jose, CA jobs
Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA.
We are looking for a Site Reliability Engineer to own our internal systems infrastructure. This role is responsible for setting up and managing cloud and on-prem infrastructure to deliver highly available, reliable, and automated systems.
Responsibilities: Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more. Migrate SaaS to self-hosted solutions to enhance security and reliability. Implement monitoring and alerting systems, and define incident response plans and runbooks. Reduce human workload by automating deployment and scaling. Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives. Use a data-driven approach to demonstrate service robustness and track optimization work. Partner with the security team to ensure that security remediations and updates are applied in a timely manner.
Requirements:
* Strong experience with Linux/Unix systems administration
* Proficiency in programming/scripting
* Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
* Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
* Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
* Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
* Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
* Experience defining Service Level Objectives (SLOs), developing runbooks and incident response plans, facilitating post-mortems, and managing system assets
* Ability to work in cross-functional teams with developers, infra, and product teams
* Excellent verbal and written communication skills
The US base salary range for this full-time position is between $175,000 and $250,000 annually. The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.
$175k-250k yearly Auto-Apply 36d ago
Staff Site Reliability Engineer
CME Group 4.4
Chicago, IL jobs
We're looking for a Staff Site Reliability Engineer to join our team, focusing on the core systems that power global financial markets. This isn't just about keeping the lights on; it's about pioneering the future of financial technology. As a member of our Clearing department, you'll be on the front lines, ensuring the integrity and performance of mission-critical systems that facilitate billions of dollars in daily transactions. If you're a builder at heart, driven by a passion for creating ultra-reliable and resilient systems, you'll thrive here. This is a hybrid role. You must be in our office 2+ days a week.
What You'll Get
* A supportive environment fostering career progression, continuous learning, and an inclusive culture.
* Broad exposure to CME's diverse products, asset classes, and cross-functional teams.
* A competitive salary and comprehensive benefits package.
Learn more about our career opportunities here.
What You'll Do
As a Staff Site Reliability Engineer, you'll be a visionary builder of our resilient infrastructure. You'll move beyond conventional operations to apply software engineering principles to every facet of our clearing systems.
* Pioneer solutions to guarantee the reliability, performance, and availability of our CME clearing and risk systems, where every millisecond and every transaction counts.
* Architect and implement cutting-edge solutions for application resiliency and fault tolerance.
* Drive automation and continuous improvement across the entire system lifecycle, eliminating manual toil and enhancing operational excellence.
* Integrate SRE principles directly into the software development lifecycle, embedding reliability from day one.
* Collaborate with cross-functional development and platform teams, providing expert-level guidance to deploy and maintain critical applications.
* Innovate and lead efforts to prevent incidents, enhance operational processes, and automate solutions at a global scale.
* Spearhead the adoption of observability and performance testing, guiding teams to a "build with SRE mindset" culture.
* Own the end-to-end operational integrity of products, understanding and contributing to the bigger picture of the organization.
What You'll Bring
* A strong academic background: Bachelor's degree in Engineering, Computer Science, Information Technology, or a related field is strongly preferred.
* Cloud expertise: Hands-on experience deploying and operating applications using IaaS and PaaS on major cloud providers, preferably Google Cloud Services.
* Coding fluency: Proficiency in one or more of the following languages: Java, Python, Bash, or Go. TypeScript and/or Rust are a significant plus.
* Infrastructure as Code (IaC) mastery: Experience with tools such as GKE, Terraform, CloudFormation, and Chef.
* Proven reliability engineering skills: Deep knowledge of SRE and security best practices, with a track record of implementing them into workflows. A solid understanding of performance testing tools is essential, along with the ability to help teams resolve complex performance issues.
* Automation prowess: Demonstrated experience with automation, CI/CD, orchestration, and configuration management.
* Observability knowledge: Familiarity with logging and observability platforms such as OpenTelemetry and Prometheus.
* A security-first mindset: Strong understanding of security and compliance frameworks.
* Problem-solving abilities: Excellent written and verbal communication skills, with the ability to convey complex technical concepts clearly to both technical and non-technical audiences.
* Strong collaboration skills: An agile team player who is self-motivated and can work with minimal supervision while juggling multiple concurrent projects.
* A passion for innovation: A continuous desire to learn and stay up-to-date with the latest technologies and industry trends.
CME Group is committed to offering a competitive total rewards package for our employees that recognizes their contributions to the business and reflects our long-term investment in their future. The pay range for this role is $128,500-$214,100. Actual salary offered will be dependent on a wide array of factors including but not limited to: relevant experience, skills, education and comparison to internal employees (where relevant). Our compensation program also includes an annual target bonus opportunity for all employees, as well as the opportunity to become an owner in the company through our broad-based equity program. Through our benefits program, we strive to offer flexibility, value and choice. From comprehensive health coverage, to a retirement package that includes both a 401(k) and an active pension plan, to highly competitive education reimbursement provisions, paid time off and a mental health benefit, CME Group offers a holistic benefits package for our team and their dependents.
CME Group: Where Futures are Made
CME Group is the world's leading derivatives marketplace. But who we are goes deeper than that. Here, you can impact markets worldwide. Transform industries. And build a career by shaping tomorrow. We invest in your success and you own it - all while working alongside a team of leading experts who inspire you in ways big and small. Problem solvers, difference makers, trailblazers. Those are our people. And we're looking for more. At CME Group, we embrace our employees' unique experiences and skills to ensure that everyone's perspectives are acknowledged and valued. As an equal-opportunity employer, we consider all potential employees without regard to any protected characteristic.
Important Notice: Recruitment fraud is on the rise, with scammers using misleading promises of job offers and interviews to solicit money and personal information from job seekers. CME Group adheres to established procedures designed to maintain trust, confidence and security throughout our recruitment process. Learn more here.
$128.5k-214.1k yearly 60d+ ago
Site Reliability Engineer - Capital Markets
Jefferies Financial Group Inc. 4.8
New York, NY jobs
Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting the Equity front-office trading application and the risk and middle-office real-time products developed and used for the Equity Cash and ETS applications. As part of the wider platform engineering team, you will work closely with business users throughout the day, along with technical, analysis, and testing colleagues. Investigating and resolving the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business, and middle-office areas.
Responsibilities:
* Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users.
* Build monitoring tools for application and infrastructure components.
* Implement and manage scalable infrastructure using cloud-native technologies and tools.
* Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
* Partner with business, development, and infrastructure teams to improve services through rigorous testing and release procedures.
* Develop and maintain CI/CD pipelines to streamline deployment processes.
* Deploy new systems expediently. Perform capacity planning and platform management, and support increasing volumes and business growth.
* Create sustainable systems and services through automation.
* Collaborate with the Application team to establish and enforce production and development standards.
* Document procedures, best practices, and troubleshooting FAQs.
* Resolve complex application and technical problems.
* Debug the system and fix production-related issues.
* Escalate and follow up on permanent fixes for development-related issues.
* Lead incident response efforts and post-mortem analysis to prevent future occurrences.
* Handle complex operational tasks and recommend process and technology changes.
* Provide global support, including weekend availability to troubleshoot production-related issues and perform checkouts.
* Ability to work both independently and in groups in an energetic, diverse environment.
* Participate in on-call rotations to ensure 24/7 system availability and support.
* Support compliance and legal queries.
Qualifications:
* Strong experience in Windows and Linux/Unix services.
* Strong experience in scripting languages such as PowerShell, Python, and SQL.
* Strong knowledge of monitoring tools: Nagios, Splunk, OTEL, Datadog.
* Strong knowledge of the FIX protocol.
* Strong domain skills: must have working experience in Capital Markets across modules and instruments, especially Cash, ETS, Bonds, Options, Futures, and Swaps products.
* Experience in BFSI (Banking, Financial Services, and Insurance) domain applications with a proper understanding of the Trade Lifecycle.
* Excellent communication, time management, and project management skills.
Full-time position. Salary range: $175,000 - $200,000
$175k-200k yearly Auto-Apply 42d ago
Site Reliability Engineer - Capital Markets
Jefferies Financial Group Inc. 4.8
Jersey City, NJ jobs
Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting the Equity front-office trading application and the risk and middle-office real-time products developed and used for the Equity Cash and ETS applications. As part of the wider platform engineering team, you will work closely with business users throughout the day, along with technical, analysis, and testing colleagues. Investigating and resolving the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business, and middle-office areas.
Responsibilities:
* Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users.
* Build monitoring tools for application and infrastructure components.
* Implement and manage scalable infrastructure using cloud-native technologies and tools.
* Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
* Partner with business, development, and infrastructure teams to improve services through rigorous testing and release procedures.
* Develop and maintain CI/CD pipelines to streamline deployment processes.
* Deploy new systems expediently. Perform capacity planning and platform management, and support increasing volumes and business growth.
* Create sustainable systems and services through automation.
* Collaborate with the Application team to establish and enforce production and development standards.
* Document procedures, best practices, and troubleshooting FAQs.
* Resolve complex application and technical problems.
* Debug the system and fix production-related issues.
* Escalate and follow up on permanent fixes for development-related issues.
* Lead incident response efforts and post-mortem analysis to prevent future occurrences.
* Handle complex operational tasks and recommend process and technology changes.
* Provide global support, including weekend availability to troubleshoot production-related issues and perform checkouts.
* Ability to work both independently and in groups in an energetic, diverse environment.
* Participate in on-call rotations to ensure 24/7 system availability and support.
* Support compliance and legal queries.
Qualifications:
* Strong experience in Windows and Linux/Unix services.
* Strong experience in scripting languages such as PowerShell, Python, and SQL.
* Strong knowledge of monitoring tools: Nagios, Splunk, OTEL, Datadog.
* Strong knowledge of the FIX protocol.
* Strong domain skills: must have working experience in Capital Markets across modules and instruments, especially Cash, ETS, Bonds, Options, Futures, and Swaps products.
* Experience in BFSI (Banking, Financial Services, and Insurance) domain applications with a proper understanding of the Trade Lifecycle.
* Excellent communication, time management, and project management skills.
Full-time position. Salary range: $175,000 - $200,000
$175k-200k yearly Auto-Apply 60d ago