Reliability engineer jobs in Newark, NJ - 816 jobs

Software Engineer II - Site Reliability Engineer
Walt Disney Co 4.6
Reliability engineer job in New York, NY
Technology is at the heart of Disney's past, present, and future. Disney Entertainment and ESPN Product & Technology is a global organization of engineers, product developers, designers, technologists, data scientists, and more - all working to build and advance the technological backbone for Disney's media business globally. The team marries technology with creativity to build world-class products, enhance storytelling, and drive velocity, innovation, and scalability for our businesses. We are Storytellers and Innovators. Creators and Builders. Entertainers and Engineers. We work with every part of The Walt Disney Company's media portfolio to advance the technological foundation and consumer media touch points serving millions of people around the world. Here are a few reasons why we think you'd love working here: Building the future of Disney's media: Our Technologists are designing and building the products and platforms that will power our media, advertising, and distribution businesses for years to come. Reach, Scale & Impact: More than ever, Disney's technology and products serve as a signature doorway for fans' connections with the company's brands and stories. Disney+. Hulu. ESPN. ABC. ABC News…and many more. These products and brands - and the unmatched stories, storytellers, and events they carry - matter to millions of people globally. Innovation: We develop and implement groundbreaking products and techniques that shape industry norms, and solve complex and distinctive technical problems. Product Engineering is a unified team responsible for the engineering of Disney Entertainment & ESPN digital and streaming products and platforms. This includes product engineering, media engineering, quality assurance, engineering behind personalization, commerce, lifecycle, and identity. Job Summary: As a Software Engineer on the COPEX team, you'll design and build the foundational backend systems that directly power the Hulu & Disney+ streaming experience. 
You will architect mission-critical, high-throughput services for API and content recommendation delivery, while also building the platforms that empower our entire engineering organization to ship code with speed and confidence. You will join a talented team of engineers who build the software that: * Delivers foundational APIs and serves personalized streaming experiences to millions of users daily. * Enables our engineering organization to define, provision, and manage cloud infrastructure programmatically and at scale. * Allows teams to deploy changes to production swiftly and safely through sophisticated, automated CI/CD pipelines. * Provides deep insight into application performance via powerful, self-service observability and testing platforms. * Optimizes system capacity and cloud costs by engineering data-driven, automated solutions. Responsibilities and Duties of the Role: * Architect, build, and scale foundational backend services for API delivery and content recommendation, focusing on high availability, low latency, and massive throughput. * Design, build, and evolve our CI/CD solutions, writing clean, scalable code to automate the entire build, test, and deployment lifecycle. * Architect and develop robust, scalable test automation frameworks that product teams will use for load, integration, and functional testing. * Write software to abstract and automate infrastructure provisioning, creating a seamless, self-service experience for engineering teams using Infrastructure as Code (IaC). * Develop the core software, libraries, and services that form our observability platform, enabling engineers to easily build reliable and performant applications. * Proactively improve system architecture and build software-based solutions to reduce toil, minimize incidents, and automate remediation. Required Education, Experience/Skills/Training: Basic Qualifications * Minimum 3 years of professional experience * Experience in a DevOps or SRE role. 
* Experience with IaC * Experience with incident response * Experience with containerization * Experience with CI/CD tools * Experience programming in Java or a JVM language * Experience working on cross-team projects. * An ability to work both independently and collaboratively * Strong communication skills and a desire to share and learn Required Education * Bachelor's degree in computer science, Computer Engineering, Information Technology, or a related technical field. The hiring range for this position in New York is $120,300 to $161,300 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered. About Disney Entertainment and ESPN Product & Technology: At Disney Entertainment and ESPN Product & Technology, we're blending imagination and innovation to reimagine the ways people experience and engage with the world's most beloved stories and products. Our work is wide-ranging and deeply sophisticated. We create amazing experiences, transform the future of media, and build products and platforms that enable the connection between people everywhere and the stories and sports they love. Disney's ability to marry world-class technology with one-of-a-kind creativity makes us unique. It is at the heart of our past, present, and future. We are Storytellers and Innovators. Creators and Builders. Entertainers and Engineers. About The Walt Disney Company: The Walt Disney Company, together with its subsidiaries and affiliates, is a leading diversified international family entertainment and media enterprise that includes three core business segments: Disney Entertainment, ESPN, and Disney Experiences. 
From humble beginnings as a cartoon studio in the 1920s to its preeminent name in the entertainment industry today, Disney proudly continues its legacy of creating world-class stories and experiences for every member of the family. Disney's stories, characters and experiences reach consumers and guests from every corner of the globe. With operations in more than 40 countries, our employees and cast members work together to create entertainment experiences that are both universally and locally cherished. This position is with Disney Streaming Technology LLC, which is part of a business we call Disney Entertainment and ESPN Product & Technology. Disney Streaming Technology LLC is an equal opportunity employer. Applicants will receive consideration for employment without regard to race, religion, color, sex, sexual orientation, gender, gender identity, gender expression, national origin, ancestry, age, marital status, military or veteran status, medical condition, genetic information or disability, or any other basis prohibited by federal, state or local law. Disney champions a business environment where ideas and decisions from all people help us grow, innovate, create the best stories and be relevant in a constantly evolving world.
$120.3k-161.3k yearly 58d ago

Site Reliability Engineer
Kalshi
Reliability engineer job in New York, NY
Role Roadmap We are building a next-generation financial ecosystem (think NYSE or CME from scratch). We are a small team, which means your responsibilities scale very rapidly, and your contributions are clear and visible, not marginal. There is still a lot of green field at Kalshi and a lot of it (including entire systems) can be yours. What you'll do * Improve observability, reliability and availability by defining and measuring key metrics. * Build automation and improve systems to eliminate toil and operations work. * Collaborate with our core infrastructure team to performance tune and optimize our cloud deployments. (Think Docker, Terraform, Kubernetes, EC2, etc.) * Collaborate with product teams to reduce service disruptions and automate incident response. * Proactively find and analyze reliability problems across our business units and stack, then design and implement software to create step-function improvements. * Educate, mentor and hold accountable the engineering team to improve the reliability of our systems and make reliability a core value of the Kalshi engineering culture. * Write high quality, well tested code to meet the needs of your customers. * Debug extremely difficult technical problems, and make systems and products both work better and easier to deploy, own, operate and diagnose. * Review all feature designs within your product area and across the company for cross-cutting projects. * Be an owner of the security, safety, scale, operational integrity, and architectural clarity of these designs. * Build integrations with 3rd party vendors. * Participate in an on-call support rotation to provide timely troubleshooting and resolution of urgent issues. What we're looking for Attributes: * You have at least 4 years of experience in software engineering. * You've designed, built, scaled and maintained production services, and know how to compose a service-oriented architecture. 
* You write high quality, well tested code to meet the needs of your customers. * You're passionate about building an open financial system that brings the world together. * You possess strong technical skills for system design and coding. * Excellent written and verbal communication skills, and a bias toward open, transparent cultural practices. * Strong skills around observability, debugging and performance tuning. * Strong interpersonal skills working with engineers from junior to principal levels * Demonstrated critical thinking under pressure. * A willingness to dive into understanding, debugging, and improving any layer of the stack. * On-call availability to ensure swift resolution of issues. Bonus points * Experience designing and building reliable systems capable of handling high throughput and low latency. * Experience with Datadog. * Experience with Rust, Go and Terraform. * Experience with AWS, GCP, or Azure. * Experience working in a highly regulated environment. * Experience writing company-facing blog posts and training materials. Our Culture Meritocracy is at our core, and we value people who take ownership and figure (usually hard) things out. We dream big. We love our craft deeply and are proud of what we put out in the world. We are committed to our vision of building something big… but also useful: a product that brings more truth through the power of markets. Kalshians are Kalshi's most important asset: we pick Kalshians carefully, so we trust them fully on day 1. NYC Pay Transparency Disclosure: Salary Range: $100,000 to $250,000 annually plus equity and benefits. This salary range is based on the current available market data and represents the expected salary range for this role. Kalshi has minimal hierarchy and few titles, but a broad range of experience is represented within roles. Should you have compensation expectations that exceed these bands, we'd love to hear from you and would welcome you to reach out to discuss further. 
Commitment to Equal Opportunity Kalshi is committed to creating a culture of inclusion and belonging, and we are proud to be an equal opportunity employer. We believe it is our collective responsibility to uphold these values and encourage candidates from all backgrounds to join us in our mission. All qualified applicants will be treated with respect and receive equal consideration for employment without regard to race, color, creed, religion, sex, gender identity, sexual orientation, national origin, disability, uniform service, veteran status, age, or any other protected characteristic per federal, state, or local law. If you are passionate about what you do and want to use your talents to support our mission and values, we'd love to hear from you.
$100k-250k yearly Auto-Apply 60d+ ago
Reliability Engineer
GE Vernova
Reliability engineer job in Parsippany-Troy Hills, NJ
Summary: As the Reliability Engineer for Metem, a GE Vernova business, you will be an active contributor to the success of the organization by improving the reliability, availability, and performance of our equipment and processes. You will analyze failure data, develop maintenance strategies, and work cross-functionally to implement proactive measures that reduce downtime and increase efficiency. Job Description What you'll do Develop and implement reliability improvement strategies using industry best practices such as RCM (Reliability-Centered Maintenance), FMEA (Failure Mode and Effects Analysis), and Root Cause Analysis (RCA). Monitor and analyze equipment performance and failure data to identify trends and areas for improvement. Collaborate with maintenance, operations, engineering, and safety teams to design and implement preventive and predictive maintenance programs. Establish key performance indicators (KPIs) for equipment reliability, and track progress against targets. Drive continuous improvement initiatives aimed at reducing equipment downtime and maintenance costs. Lead investigations into equipment failures and chronic issues, identifying root causes and implementing long-term solutions. Provide technical support for asset management, including equipment life cycle analysis and spare parts optimization. Participate in the design and installation of new equipment, ensuring reliability is considered from the outset (Design for Reliability). Eligibility Requirements This role requires use of technical data subject to U.S. Government export restrictions and this posting is only for U.S. Persons (U.S. Citizens, lawful permanent residents and protected individuals (e.g., certain refugees and asylees)). GE will require proof of status prior to employment. This is an onsite position based in Parsippany, NJ. Must be open to travel requirements. 
Ability to travel to the Allentown, PA facility approximately 1 time per week and to Hungary an average of 2 times a year. What you'll bring (Basic Qualifications) Bachelor's degree from an accredited university in Mechanical, Electrical, or Industrial Engineering. Minimum of 7 years of experience in reliability, maintenance, or engineering. Strong knowledge of reliability engineering tools and methodologies (e.g., FMEA, RCA, Weibull analysis, MTBF/MTTR). Strong knowledge of engineering concepts and maintenance repair methods. Ability to interpret blueprints, specifications, drawings, and schematics. Experience with Maintenance Management Systems. Project Management skills and experience. What will make you stand out You have completed your CMRP certification. You have a Six Sigma certification. You are detail oriented with good organizational skills. You have excellent verbal and written communication skills. You have experience in chemical manufacturing operations and/or CNC machining facilities. You have a Process Safety Management background. This role requires access to U.S. export-controlled information. If applicable, final offers will be contingent on ability to obtain authorization for access to U.S. export-controlled information from the U.S. Government. Additional Information GE Vernova offers a great work environment, professional development, challenging careers, and competitive compensation. GE Vernova is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, national or ethnic origin, sex, sexual orientation, gender identity or expression, age, disability, protected veteran status or other characteristics protected by law. GE Vernova will only employ those who are legally authorized to work in the United States for this opening. Any offer of employment is conditioned upon the successful completion of a drug screen (as applicable). Relocation Assistance Provided: Yes For candidates applying to a U.S. 
based position, the pay range for this position is between $123,700.00 and $206,200.00. The Company pays a geographic differential of 110%, 120% or 130% of salary in certain areas. The specific pay offered may be influenced by a variety of factors, including the candidate's experience, education, and skill set. Bonus eligibility: discretionary annual bonus. This posting is expected to remain open for at least seven days after it was posted on December 01, 2025. Available benefits include medical, dental, vision, and prescription drug coverage; access to Health Coach from GE Vernova, a 24/7 nurse-based resource; and access to the Employee Assistance Program, providing 24/7 confidential assessment, counseling and referral services. Retirement benefits include the GE Vernova Retirement Savings Plan, a tax-advantaged 401(k) savings opportunity with company matching contributions and company retirement contributions, as well as access to Fidelity resources and financial planning consultants. Other benefits include tuition assistance, adoption assistance, paid parental leave, disability benefits, life insurance, 12 paid holidays, and permissive time off. GE Vernova Inc. or its affiliates (collectively or individually, “GE Vernova”) sponsor certain employee benefit plans or programs. GE Vernova reserves the right to terminate, amend, suspend, replace, or modify its benefit plans and programs at any time and for any reason, in its sole discretion. No individual has a vested right to any benefit under a GE Vernova welfare benefit plan or program. This document does not create a contract of employment with any individual.
$123.7k-206.2k yearly Auto-Apply 60d+ ago
Staff Site Reliability Engineer
Altana
Reliability engineer job in New York, NY
Altana is the network for trusted trade. Our AI-powered product network empowers governments and businesses to build a more resilient and secure global economy while keeping trade flowing. The Opportunity at Altana At Altana, we believe that software that ships must be reliable and efficient. As a Staff Site Reliability Engineer, you will be instrumental in ensuring the availability, performance, and scalability of Altana's critical production services, with a strong focus on our cloud-native environments and data pipelines. You will apply Google-style SRE principles, embedding reliability into our architecture and operations through automation, proactive monitoring, and a commitment to reducing toil. You will work hands-on with engineering teams, influencing system design for operability and contributing to the development of robust, self-healing infrastructure. This role emphasizes a deep understanding of observability practices to gain comprehensive insights into system behavior, proactive incident prevention, and efficient incident response. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency and reliability. Your Responsibilities Reliability Engineering: Champion and implement SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services. Drive initiatives to improve system reliability, availability, performance, and efficiency. Observability & Monitoring: Design, implement, and maintain advanced monitoring, logging, and tracing solutions for our cloud-native applications and infrastructure (e.g., Kubernetes, microservices). Develop dashboards, alerts, and runbooks that provide deep insights into system health and behavior. Automation & Toil Reduction: Identify and automate repetitive operational tasks and manual processes across our production environment. 
Develop tools and scripts to enhance system operations, deployment pipelines, and incident response. Incident Management & Postmortems: Actively participate in the incident response lifecycle, including detection, triage, mitigation, and resolution of production issues. Lead thorough blameless postmortems to identify root causes and implement preventative measures and lasting improvements. System Design & Optimization: Collaborate closely with development teams to influence the design of new services, ensuring they are built for operability, reliability, and cost-efficiency. Proactively identify and address performance bottlenecks and architectural weaknesses. On-Call Rotation: Participate in a periodic on-call rotation, responding to critical alerts and ensuring rapid resolution of production incidents. Data Reliability: Implement and maintain reliability and observability for critical data pipelines and data infrastructure, ensuring data integrity, availability, and timely processing. About You 5+ years of hands-on experience in a Site Reliability Engineering (SRE), DevOps, or equivalent role focusing on production system reliability and operations. Strong understanding and practical application of Site Reliability Engineering (SRE) principles, including SLOs, error budgets, toil reduction, and blameless culture. Expertise in designing, implementing, and managing observability platforms for cloud-native environments (e.g., Prometheus, Grafana, Datadog, ELK stack, OpenTelemetry, Jaeger). Proficiency in at least one programming/scripting language (e.g., Python, Go) for automation and tool development. Extensive hands-on experience with cloud platforms (AWS, Azure, or GCP), including their compute, networking, and database services. Demonstrated experience with containerization technologies (Docker) and container orchestration platforms (Kubernetes). 
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, OpenTofu, CloudFormation) for managing cloud resources. Proven experience participating in and improving incident management processes for critical systems. Knowledge of modern software delivery paradigms, including microservices architectures and CI/CD pipelines. Excellent problem-solving, analytical, and troubleshooting skills in complex distributed systems. Strong communication and collaboration skills, with the ability to work effectively across engineering teams. Experience with data engineering concepts, including building or operating reliable data pipelines, data streaming technologies, or managing large-scale data infrastructure. This role can be based in New York City, or the San Francisco Bay Area with an expectation of occasional travel as needed. US Salary Range and Benefits $170,000 - $220,000 The salary range, to the extent specified for this role, is a good faith statement of the minimum and maximum levels of the annual base salary for the position. The base salary offered to a successful candidate will depend on a wide range of compensation factors, including, but not limited to, work experience, education and/or training, critical skills, and/or business considerations. Competitive equity grants are included in the majority of full time offers and are considered part of Altana's total compensation package. Altana also offers either a discretionary bonus or a variable compensation plan depending on the role. Additionally, Altana offers top-tier benefits for full-time employees, including: Flexible Time Off: Altana operates with a Flexible Time Off (FTO) policy that gives you agency over your own time off so you can maximize your work-life balance. Parental Leave: We offer industry leading Paid Parental Leave (PPL), providing 14 weeks of leave for non-birthing, adoptive, and foster parents and up to 26 weeks of leave for birthing parents, all paid at 100% of your base salary. 
Health Benefits: We have a full suite of medical, vision, and dental benefits with generous employer contributions, designed to give you flexibility and choice for your individual health situation. Our high deductible health plan is 100% employer paid for employees and supplemented with an employer contribution to your Health Savings Account (HSA). There is also a Flexible Spending Account (FSA) option. Supplemental Benefits: Altana provides life, short- and long-term disability, and AD&D insurance coverage, all at no cost to you, so you know that you and your loved ones are covered in case of an emergency. 401(k) Savings: Save for and invest in your future using our Guideline 401(k) retirement savings program. Commuter Benefits: Save money on your commute by setting aside pre-tax funds for public transit or parking! Wellness: Because we value mental and emotional health, every Altana employee has access to a free premium subscription to Calm, the #1 app for meditation, sleep, and mindfulness. Pet Insurance: Pets are family too! Keep them healthy with Wishbone insurance and/or our Total Pet vet service and telehealth discount plan. Employee Assistance Program: Free access to confidential personal support. Dependent Care FSA: You will have access to a Dependent Care FSA, which allows you to set aside pre-tax funds for childcare expenses. The recruiter assigned to this role can share more information about the specific compensation and benefit details associated with this role during the hiring process. Equal Opportunity Statement At Altana, we believe that a diverse workforce enables greater creativity, performance, and adaptability. We're proud to be an equal opportunity employer and welcome you to join us as you are. 
Our employment opportunities and decisions are based on business needs and individual qualifications, without regard to race, color, religious creed, national origin, ancestry, age, physical or mental disability, medical condition, marital status, sexual orientation, gender identity or expression, genetic information, family care or medical leave status, military or veteran status, or any other characteristic protected by the laws or regulations in the areas in which we operate. We prohibit discrimination and harassment of any type, in any situation. Offers related to employment at Altana will come from an Altana.ai email address. We will never ask for payment as part of the interview or onboarding process. Our Values Our values are the core beliefs that shape who we are, what we stand for, and how we behave. They form the foundation of Altana's culture and integrity and guide how we hire, design, build, and connect with each other and our customers. Trust: Our customers and partners entrust us with missions of the highest importance. We honor that by keeping our word, meeting commitments, and ensuring every action we take reinforces confidence in us. We rely on each other to deliver, to speak openly, and to hold ourselves accountable. Resilience: In a world of uncertainty and complexity, our work must withstand challenges, evolve with conditions, and ensure reliability over time. Resilience is both how we operate and what we deliver. It's how we respond when things don't go to plan - we adapt, we support each other, and we keep moving forward. Stewardship: We are stewards of every mission we touch. Because our work impacts lives and futures, we hold ourselves accountable to delivering mission impact and never compromising. Our responsibility extends beyond individual projects to the broader system of global trade. We believe that stewardship starts from within so that we can bring focus, creativity, and excellence to our work. 
Each of us is personally responsible for fostering a workplace where people can thrive. And we are stewards of the greater good of the company. By holding ourselves and each other accountable, we build a culture of innovation and collective success that reflects the scale of our mission. Courage: Courage is what unlocks the seemingly impossible for our customers. It's the core value that drives us to make bold moves and take on big, complicated network problems, the ones others avoid. We know success isn't guaranteed, but we have the audacious vision to believe a solution is possible and to build it. Courage fuels our growth mindset. It means embracing challenges that make us stronger, and it's demonstrated by how we approach hard conversations and complex projects.
$170k-220k yearly Auto-Apply 60d+ ago
Staff Site Reliability Engineer
Gradle 4.1
Reliability engineer job in New York, NY
Who We Are Develocity is a first-of-its-kind toolchain observability and acceleration platform that helps software teams adopt and improve DORA capabilities (including continuous delivery) in order to achieve software delivery excellence. It combines build and test acceleration with deep observability for builds and tests with Gradle Build Tool, Apache Maven™, sbt, npm, and Python, and applies to both CI and local builds and tests. Ultimately, Develocity provides an operational layer across an organization's toolchains to speed up, troubleshoot, and optimize local developer and remote CI feedback loops. Our software is used by some of the world's leading software organizations, such as Netflix, Airbnb, SAP, several top ten banks, and many other major customers across all verticals. We regularly collaborate with these and other users to make our products continuously better. We have partnered with the Apache Software Foundation, the Commonhaus Foundation, the Scala Center, the Micronaut Foundation, and other OSS projects like Spring, Quarkus, Kotlin, JUnit, AndroidX, and many more to bring the values of Develocity also to the OSS Community. Our Values Seek to Understand: Everything starts with listening and understanding, and we strive to understand different viewpoints, problems, and motivations. Before we take action, we ensure we truly grasp the challenges, perspectives, and goals. Know the Why: We approach our work with a clear sense of purpose, ensuring every step is deliberate and focused. We take meaningful action with urgency, but never at the expense of thoughtful consideration. Innovate & Iterate: We embrace challenges and are not afraid to try new things, even if they might fail. With deep understanding and a clear purpose, we can develop creative and bold solutions to tackle challenges. Own the Outcome: We are empowered to take initiative and we maintain transparency in our work and its outcomes. 
When we execute, we take responsibility for our decisions, measure the success of our innovations, and learn from the results. Who You Are We're building a new SRE team and looking for founding members to help shape how we operate. As a Lead SRE, you'll be a technical and operational leader for reliability across Develocity. You'll help define our SRE vision, set standards for how we operate production services, and mentor other SREs as the team grows. This is a hands-on role with broad influence across engineering, cloud platform, and customer-facing teams. The SRE team will be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries. You'll work on our internally-built Cloud Application Platform, Kubernetes on AWS, and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well. You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential. Responsibilities Operate and maintain all Develocity instances and supporting services in production. Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs. Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents. Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements. 
Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets. Identify systemic reliability risks and continuously evolve Develocity's SaaS operations as the platform and customer base grow. Lead and influence architectural and design reviews to ensure reliability, scalability, and operability. Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows. Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting. Own disaster recovery, backups, and business continuity planning and execution. Partner with engineering leadership to balance feature delivery with reliability and operational excellence. Mentor and coach SREs, supporting technical growth and strong operational practices. Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity. Communicate clearly with customers during incidents and maintenance windows. Optimize performance, resource utilization, and operational costs. Minimum qualifications 7+ years in SRE, DevOps, or an equivalent role operating production services at scale. Experience leading reliability initiatives across multiple teams or services. Demonstrated ability to influence technical direction without direct authority. Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost. Strong Kubernetes experience in production environments. Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2). Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform). Track record of incident management and response in a 24/7 on-call environment. Scripting proficiency (Python, Bash) for automation. Strong written and verbal English communication skills. 
Preferred qualifications Experience as a founding or early SRE establishing practices in a growing SaaS organization. Familiarity with Develocity. JVM language experience (Java, Kotlin). Experience with customer-facing and executive-level incident communications. What We Offer A ground-floor role in a new SRE team - you'll shape how we do things, not inherit someone else's decisions. Real ownership of production systems used by engineers at companies you've heard of. Direct interaction with customers when things go wrong (and when they go right). A culture that values automation over heroics. In-person meetings, such as our annual company offsite and team meetings. Work from home in a remote-first environment. Competitive salaries and equity grants. Compensation The US salary range for this position is $180-220k which reflects the target ranges for all US locations. Within this range, individual pay is determined by geographic location and additional factors including but not limited to experience, relevant skills, qualifications, seniority, performance, and travel requirements. Our recruiting team can share more information about the specific salary range for your location during the hiring process. Location Remote from anywhere in EST timezone. While our team works remotely and is spread across the globe, we deeply value daily interactions and collaboration.
$180k-220k yearly Auto-Apply 12d ago
Principal Site Reliability Engineer
Jpmorgan Chase & Co 4.8
Reliability engineer job in Jersey City, NJ
JobID: 210642025 JobSchedule: Full time JobShift: Base Pay/Salary: Palo Alto, CA $204,250.00-$285,000.00; Jersey City, NJ $204,250.00-$285,000.00 Join a globally recognized financial organization and advance your profession to new heights by contributing to revolutionary projects. You've discovered the perfect environment to have a major impact. As a Principal Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology, AI/ML & Data Platforms division, you will utilize your expertise to create innovative solutions that improve critical incident management and streamline the software development lifecycle throughout the organization. Your role will involve overseeing, designing, and deploying infrastructure components to enhance reliability and ensure operational efficiency. Job responsibilities * Architect and implement observability platforms and tools for proactive detection and continuous improvement. * Lead the design and development of core observability services, including metrics pipelines and log aggregation. * Leverage modern technologies such as OpenTelemetry and AI/ML for anomaly detection and automated insights. * Collaborate with engineering and SRE teams to define service-level objectives (SLOs) and error budgets. * Provide technical leadership and mentorship to engineering teams, ensuring best practices in system design. * Champion observability as a first-class concern in the software development lifecycle. * Influence platform strategy and roadmap through deep technical insight and alignment with business priorities. * Write advanced documentation and create executive presentations that translate technical issues into business impact. * Participate in industry professional forums and monitor relevant industry technologies and standards. * Lead medium to large projects by bringing together the proper perspective and integrating feedback from team members. 
* Participate in support responsibilities for coverage of critical applications. Required qualifications, capabilities, and skills * Formal training or certification on site reliability engineering concepts and 10+ years applied experience. * Ability to determine how each system relates to each other and build automation to improve reliability. * Experience with translating research, analysis, and tests into business recommendations. * Ability to balance and be accountable for the work of multiple architects and designers. * Understands and leads partnerships across job functions to develop efficient systems. * Engages team members and expresses complex ideas with appropriate level of detail, while providing constructive feedback. * Self-motivated and able to work well under pressure with minimal supervision. * Ability to tackle a problem by using a logical, systematic, sequential approach. Preferred qualifications, capabilities, and skills * Experience with cloud-native instrumentation and streaming data platforms. * Influence technology and policy decisions while fostering commitment and confidence in team members. * Develop effective solutions and analyze competitive positions by considering market trends. * Support the introduction of innovative methods and communicate clearly to persuade audiences. * Demonstrate concern and meet the needs of both internal and external customers. #LI-RB3
$204.3k-285k yearly Auto-Apply 60d+ ago
Site Reliability Engineer - Capital Markets
Jefferies Financial Group Inc. 4.8
Reliability engineer job in Jersey City, NJ
Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting the Equity front office trading application and the risk and middle office real-time products developed and used for the Equity Cash and ETS applications. As part of the wider platform engineering team, you will be working closely with the Business users interactively throughout the day, along with technical, analysis and testing colleagues. Investigation and resolution of the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business and middle office areas. Responsibilities: Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users. Build monitoring tools for application and infrastructure components. Implement and manage scalable infrastructure using cloud-native technologies and tools. Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding. Partner with business, development and infrastructure teams to improve services through rigorous testing and release procedures. Develop and maintain CI/CD pipelines to streamline deployment processes. Expedient deployment of new systems. Capacity planning, platform management, and support for increasing volumes and business growth. Create sustainable systems and services through automation. Collaborate with the Application team to establish and enforce production and development standards. Document procedures, best practices and troubleshooting FAQs. Resolve complex application and technical problems. Debug the system and fix production-related issues. Escalate and follow up on permanent fixes for development-related issues. Lead incident response efforts and post-mortem analysis to prevent future occurrences. 
Handle complex operational tasks and recommend process and technology changes. Provide global support, including weekend availability to troubleshoot production-related issues and perform checkouts. Ability to work both independently and in groups in an energetic, diverse environment. Participate in on-call rotations to ensure 24/7 system availability and support. Support compliance and legal queries. Qualifications: Strong experience in Windows and Linux/Unix services. Strong experience in scripting languages such as PowerShell, Python and SQL. Strong knowledge of monitoring tools: Nagios, Splunk, OTEL, Datadog. Strong knowledge of the FIX protocol. Strong domain skills: must have working experience in Capital Markets across modules and instruments, especially CASH, ETS, Bonds, Options, Futures and Swaps products. Experience in BFSI (Banking and Financial Industry) domain applications with a proper understanding of the Trade Lifecycle. Excellent communication, time management and project management skills. Primary Location Full Time Salary Range of $175,000 - $200,000
$175k-200k yearly Auto-Apply 41d ago
Site Reliability Engineer
Quanta Search
Reliability engineer job in New York, NY
Our client, a boutique US hedge fund, is seeking an SRE who will sit at the center of trading operations and infrastructure to ensure the automation and delivery of productive and efficient trading systems. This is a newly created position which will provide autonomy to identify gaps and improve the whole life cycle of services, from inception and design through deployment, operation and refinement of our trading systems. The ideal candidate will demonstrate ownership of the DevOps processes through a systematic problem-solving approach and a desire to build robust, cutting-edge and scalable systems. As an SRE, you will: Have a deep understanding of the trading workflow, ensuring its effectiveness across teams. Assist in automating the release and deployment of software. Streamline the build process. Work in cluster environments. Monitor and manage trading workflow, research and trading. Build, streamline, organize and design framework. Requirements 5+ years of experience in a DevOps/Cluster Engineer role. Bachelor's Degree in Computer Science. Knowledge of the Linux operating system, permissions, NFS. Effective verbal and written communication skills. Experience managing entire pipelines, working with tools such as Jenkins, Airflow and Ansible. Highly productive in Python and Bash. Experience in distributed/parallelized computing, clusters running computational jobs, and developing systems in this area. Experience in data recording, storage, and maintenance (backups, redundancy, compression and archiving). Knowledge of CMake. Knowledge of the continuous integration and deployment of code as well as the development toolchain. Hardware and software expertise. Strong problem-solving aptitude. Additional skills/experience that will reflect favorably: Experience installing/configuring the ELK stack, including Logstash input/output plugins. Experience building Grafana/Kibana dashboards. Familiarity with Jupyter notebooks and Docker. 
Trading operations experience, build environment experience, development experience in Python and C++ preferred. Production environment monitoring and maintenance (diagnostics and troubleshooting live trading). Cloud (AWS/GCP/Azure) and distributed computing (Slurm, Spark, etc.). Thank you for illuminating hiring with Quanta Search!
$90k-125k yearly est. 60d+ ago
Staff Site Reliability Engineer
Ro
Reliability engineer job in New York, NY
Ro is a direct-to-patient healthcare company with a mission of helping patients achieve their health goals by delivering the easiest, most effective care possible. Ro is the only company to offer nationwide telehealth, labs, and pharmacy services. This is enabled by Ro's vertically integrated platform that helps patients achieve their goals through a convenient, end-to-end healthcare experience spanning from diagnosis, to delivery of medication, to ongoing care. Since 2017, Ro has helped millions of patients, including one in every county in the United States, and in 98% of primary care deserts. Ro has been recognized as a Fortune Best Workplace in New York and Health Care for four consecutive years (2021-2024). In 2023, Ro was also named Best Workplace for Parents for the third year in a row. In 2022, Ro was listed as a CNBC Disruptor 50. The Role: At Ro, our mission is to provide world-class healthcare by putting patients first - and that mission depends on reliable, secure, and scalable systems. As a Staff SRE on the infrastructure team, you'll sit at the core of that effort: owning the reliability of our production systems, hardening infrastructure and building tools that empower our engineers to ship safely and confidently. You will work across teams to drive uptime, performance and observability - partnering closely with product, platform and security engineers. 
From designing resilient systems to shaping incident response practices, this is a role for engineers who thrive on impact and care deeply about operational excellence. What You'll Do: Design and implement resilient infrastructure to support high availability at scale. Build and contribute to tools and platforms that streamline deployment, monitoring and recovery of systems. Drive incident response and harness learnings, leading efforts to minimize downtime and improve MTTR. Partner with engineering teams to bake best practices for reliability, resilience and observability into services. Automate infrastructure workflows using IaC and other cloud-native tools. Champion a culture of operational excellence, guiding engineers through reliability practices and raising the bar across the engineering org. What You'll Bring to the Team: Deep understanding of systems and infrastructure, with experience operating distributed services in production. We are mostly in AWS and leverage a lot of its primitives - EKS, RDS, Route53, S3, Elasticache to name a few. Strong programming and automation skills using Go (bonus points for Python). Proficiency with infrastructure as code - Terraform / Pulumi. A passion for observability, with hands-on experience in metrics, logging, tracing using Datadog. Strong cross-functional communication, able to collaborate with product, platform, security and other teams. An operational mindset that puts reliability and resilience as a core product requirement. A mission-driven attitude, motivated by the opportunity to make healthcare better. 
We've Got You Covered: Full medical, dental, and vision insurance + OneMedical membership Healthcare and Dependent Care FSA 401(k) with company match Flexible PTO Wellbeing + Learning & Growth reimbursements Paid parental leave + Fertility benefits Pet insurance Student loan refinancing Virtual resources for mindfulness, counseling, and fitness The target base salary for this position ranges from $211,700 to $292,000, in addition to a competitive equity and benefits package (as applicable). When determining compensation, we analyze and carefully consider several factors, including location, job-related knowledge, skills and experience. These considerations may cause your compensation to vary. Ro recognizes the power of in-person collaboration, while supporting the flexibility to work anywhere in the United States. For our Ro'ers in the tri-state (NY) area, you will join us at HQ on Tuesdays and Thursdays. For those outside of the tri-state area, you will be able to join in-person collaborations throughout the year (i.e., during team on-sites). At Ro, we believe that our diverse perspectives are our biggest strengths - and that embracing them will create real change in healthcare. As an equal opportunity employer, we provide equal opportunity in all aspects of employment, including recruiting, hiring, compensation, training and promotion, termination, and any other terms and conditions of employment without regard to race, ethnicity, color, religion, sex, sexual orientation, gender identity, gender expression, familial status, age, disability and/or any other legally protected classification protected by federal, state, or local law. See our California Privacy Policy here.
$90k-125k yearly est. Auto-Apply 60d+ ago
Site Reliability Engineer
Clay Labs
Reliability engineer job in New York, NY
About Clay Our mission is to help organizations turn any growth idea into reality. We see growth as a creative practice, not a formula. Finding and reaching your best-fit customers takes unique ideas and constant iteration, especially in a world where AI rewards the teams who think differently. We're already helping thousands of customers - including Anthropic, Waste Management, Figma, and Ramp - go to market with unique data, signals, and AI research. In 2025, we crossed $100M in revenue and raised a $100M Series C at a $3.1B valuation, backed by world-class investors including Sequoia, CapitalG, and First Round. We also completed our first employee tender offer and launched a community equity round for our customers, agency partners, and club members. Some things to know about us: * Our community includes 11,000+ customers, 150+ integration partners, 125+ agencies, and 50+ Clay clubs. * Our culture is unique inside and outside of work. Our team members are also DJs, activists, writers, clowns, marathoners, skydivers, psychedelic therapists, social workers, and more. * All employees can work for free with world-class coaches who specialize in creativity, management, and more. * Our operating principles - including negative maintenance and non-attached action - guide our work. * Read about us in the NYT, Forbes, First Round Review, and more. Hear from our employees directly on our Glassdoor page! SRE @ Clay In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles. What You'll Do * Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions. 
* Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation. * Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency. * Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues. * Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner. * Participate in an oncall rotation. * Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency. What You'll Bring * 5+ years of experience * Experience with containerization and orchestration tools * Strong understanding of CI/CD concepts and tools * Knowledge of infrastructure automation tools * Experience with oncall and incident response * Proficiency in one or more programming languages * Familiarity with our stack or ability to learn unfamiliar technologies quickly: * Aurora Postgres RDS, Elasticache Redis, Docker + ECS, Lambda, OpenSearch * Terraform and Atlantis * CircleCI, Netlify, Playwright * Cloudwatch, Datadog, Mezmo * Typescript, Python
$90k-125k yearly est. 40d ago
Site Reliability Engineering
Forhyre
Reliability engineer job in New York, NY
Job Description Forhyre is looking for engineers who can bring unique perspectives and innovative ideas to all areas of development and are interested in continuing to improve our platform through the ever-changing technology landscape. To be successful in this role: You'll have the opportunity to design and implement major infrastructure components, systems, and developer-friendly capabilities to improve the availability, scalability, latency, and efficiency of our services. You will provide technical leadership to cross-functional engineering, infrastructure, and product teams, and evangelize cloud best practices while building a culture of reliability and observability. Engage in and improve the end-to-end lifecycle of software development, from inception and design through deployment, operation and refinement of a highly distributed system running in public cloud. Serve as subject matter expert in an SRE mindset, best practices, and cloud-native principles. Scale systems sustainably through automation to improve reliability and velocity. Assist with all aspects of operational security and compliance. Run software performance analysis and system tuning. Design and implement tools to collect data from various sources and provide actionable insights. Participate in critical incident management and timely post-mortems of production incidents to drive practices around blameless analysis, resolution, and continuous improvement work with cross-functional teams. Develop the rest of the team by conducting code reviews, providing mentorship, pairing, and training opportunities. Qualification & Skills: We are looking for a Principal SRE with proven experience in running distributed systems at scale, in production. You have 15+ years of experience in relevant skills gained and developed in the same or similar role. Strong knowledge of container orchestration, preferably Kubernetes, and networking technology. Hands-on experience in one or more languages, such as Node.js, Python, Go, Perl, 
Ruby, and Bash. Experience with SOA, microservices architecture, API management and enterprise system integrations. Strong production experience with cloud infrastructure: AWS, Azure and Google Cloud. Strong sense of ownership, and an ability to drive tasks to completion. Experience developing and monitoring distributed systems. Experience working in an Agile environment with great collaboration skills.
$90k-125k yearly est. 8d ago
Site Reliability Engineer/ Terraform Developer - C67711 6.0 New York, NYC
CapB Infotek
Reliability engineer job in New York, NY
SRE: Perform low-level component design. Develop and build the code. Create confidence and certainty in deployments with immutable infrastructure built and tested using CI/CD. Should have experience with the following technologies: AWS, Terraform, container orchestration (Kubernetes, ECS), configuration management tools (Chef, Puppet), Infrastructure as Code (Terraform, CloudFormation). Experience level: 3-5 yrs.
$90k-125k yearly est. 60d+ ago
Reliability Engineer
Actalent
Reliability engineer job in Allendale, NJ
Join an innovative and dynamic team as a Reliability Engineer, where you will play a pivotal role in maintaining and enhancing the performance of our systems. You will be responsible for ensuring data integrity, supporting quality investigations, and driving continuous improvements in asset management. Responsibilities * Maintain and update P&IDs and AutoCAD drawings for clean rooms, physical tags, BMS systems, and other relevant databases. * Ensure data integrity of all P&ID items through comprehensive asset walkdowns. * Support deviations, CAPAs, audits, quality investigations, and change controls across GMP environments. * Create and standardize site asset management lists with a focus on continuous improvement in planning, tracking, and performance. Essential Skills * Proficiency in AutoCAD and reliability engineering. * Experience with CMMS, Maximo, and Blue Mountain RAM (BMRAM). * GMP experience and documentation control. * Ability to work independently with a strong attention to detail. Additional Skills & Qualifications * Bachelor's degree in Engineering or a related discipline. * Analytical skills with the ability to read and interpret blueprints, plans, and manuals. * Excellent customer service skills with a desire to exceed customer expectations. * Experience in a cGMP or aseptic environment is preferred. Job Type & Location This is a Permanent position based out of Allendale, NJ. Pay and Benefits The pay range for this position is $110,000.00 - $120,000.00/yr. * (1) week of paid sick time * (2) weeks of paid vacation + accrued paid time off * Paid Federal Holidays + (4) floating holidays paid. * Fidelity for 401k plan with a 6%-7% match program. Workplace Type This is a fully onsite position in Allendale, NJ. Application Deadline This position is anticipated to close on Jan 2, 2026. About Actalent Actalent is a global leader in engineering and sciences services and talent solutions. 
We help visionary companies advance their engineering and science initiatives through access to specialized experts who drive scale, innovation and speed to market. With a network of almost 30,000 consultants and more than 4,500 clients across the U.S., Canada, Asia and Europe, Actalent serves many of the Fortune 500. The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law. If you would like to request a reasonable accommodation, such as the modification or adjustment of the job application process or interviewing due to a disability, please email actalentaccommodation@actalentservices.com for other accommodation options.
$110k-120k yearly 13d ago
Site Reliability Engineer
Nexuscorp
Reliability engineer job in New Providence, NJ
CTH 6-12 months. Client: Fiserv. MUST be a USC or GCH. Must be local to Berkeley Heights, NJ. Description: What does a successful Site Reliability Engineer do at Fiserv? A successful Site Reliability Engineer at Fiserv blends software engineering principles with operational discipline to create high-performing, reliable software systems. They design and implement tools, processes, and systems to improve the reliability, scalability, and performance of large-scale applications and services. Requirements Automate operational tasks and health checks to create sustainable systems and services. Monitor the production environment to ensure system health using observability tools like Dynatrace and Splunk. Identify reliability gaps through process reengineering and analyze performance metrics. Collaborate with development operations for system design consulting, platform management, and capacity planning. Create and maintain detailed documentation, including SOPs, configurations, and infrastructure maps. What you will need to have: 5+ years of experience in Site Reliability Engineering (SRE) within a Fintech or product organization. 4+ years of experience with automation tools like Python, Java, Ansible, or PowerShell. 4+ years of experience with observability and monitoring tools such as Dynatrace, Splunk, Moogsoft, or Grafana. Bachelor's degree in computer science or related technical field and/or 7+ years of relevant work experience. What would be great to have: Experience managing CI/CD pipelines and automation tools like GitLab, Harness, Nexus, Terraform, or SonarQube. Strong problem-solving and critical thinking skills for root cause analysis and proactive solution implementation. Effective communication skills for collaboration with cross-functional teams and customer interactions.
$87k-121k yearly est. 60d+ ago
Reliability Engineer
Mini-Circuits 4.1
Reliability engineer job in New York, NY
Description Mini-Circuits designs, manufactures and distributes integrated circuits, modules, and sub-systems for high-performance radio frequency (RF) and microwave applications. With design, sales and manufacturing locations in over 30 countries, Mini-Circuits' products are used in a range of wired and wireless communications applications. Our products are also used in detection, measurement and imaging applications, including military communication, guidance and electronic countermeasure systems, commercial, scientific, military land, sea and aircraft; automotive systems, medical systems, and industrial test equipment. Mini-Circuits' sells its products to over 20,000 customers globally through our direct sales force, applications engineering staff, sales representatives, as well as through our extensive website. Position Summary: The Reliability Engineer is responsible for conducting reliability studies of existing products and coordinating new product qualifications prior to market release. The candidate will work in collaboration with various teams including Reliability, Design Engineering, Product Engineering, Failure Analysis and Project Management teams. Salary Range: $99,000 - $117,000 per year Job Function: Participate in product development meetings and guide the team to develop reliable products that meet internal specifications and customer requirements. Develop qualification plans for new products, primarily MMICs but also support other product lines including but not limited to Low Temperature Co-Fired Ceramics, PCBA products, RF accessories and Core & Wire Products. Analyze new products for similarity with existing released products in terms of package, die process and design to determine Qualification by Similarity, thus streamlining qualification testing. Design and execute both device level and package level qualification tests including but not limited to MSL pre-conditioning, Thermal cycling, UHAST, HTSL, ESD and Life Tests. 
Define ESD Human Body Model (HBM) and Charged Device Model (CDM) tests as per JEDEC standards. Collaborate with Engineering Test Teams to execute Accelerated Life Tests and High Temperature Operating Life Tests. Execute mechanical stresses such as Vibration, Mechanical Shock, Constant Acceleration & Bend Testing. Coordinate with external labs for outsourced tests. Review RF test data before and after stresses to analyze changes in performance. Collaborate with Failure Analysis teams to understand the root cause of failures. Identify and record any non-conformities. Monitor solution implementations to verify effectiveness of corrective actions. Ensure on-time completion of Qualification activities and escalate any potential delays. Present Qualification results to all relevant stakeholders to help Design teams initiate changes to improve reliability performance. Prepare written reports summarizing the results of product performance and failure analysis for both internal purposes as well as customer review. Interface with customers and suppliers on product reliability as required. Interface with suppliers to purchase lab equipment. Support reliability assessments originating from production of released products or customer returns. Make decisions within area of specialty and manage medium to large projects. Promote ISO9001/AS9100 Quality. The duties, responsibilities and expectations described above are not a comprehensive list and additional tasks may be assigned to the member, within the scope of the position. Qualifications: BS in Mechanical Engineering, Electrical Engineering, Materials, Reliability, Industrial Engineering or Physics. Advanced degree preferred. 3-5 years' experience as a Reliability Engineer in Semiconductor or equivalent industry. Familiarity with common industry standards including JEDEC, MIL-STD-883, MIL-STD-202 and AEC-Q. Experience with Reliability Qualification by Similarity. Experience with Environmental, Mechanical and ESD stresses. 
Experience with problem solving methodologies and leading root cause analysis. Experience with customer returns failure analysis support. Must have familiarity with failure analysis techniques including Scanning Acoustic Microscopy (SAM), Radiographic Inspection (X-Ray), Cross-Section methods. Familiarity with MTTF, MTBF Calculations. Experience with Reliability prediction modeling and tools like Weibull++ (or equivalent reliability software). Experience with Data analysis tools including Advanced Excel, JMP, Minitab. Ability to analyze component performance data in reliability tests, including large variety of test parts and multiple design variations. Experience with Design of Experiments, FMEA, product design reviews and DFM. Excellent written and oral communication skills. Physical Demands: The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. While performing the duties of this job, the employee is regularly required to talk and hear. The employee frequently is required to stand, walk, sit and use hands to operate a computer keyboard. The employee is occasionally required to reach with hands and arms. The employee must occasionally lift and/or move up to 10 pounds. Specific vision abilities required by this job include close vision, and ability to adjust focus. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions. Cultural Focus: Displays enthusiasm and Passion for their work. Works to the standard of Uncompromising Quality by meeting or exceeding stated objectives and embracing continuous improvement. Exercises sound Business Judgement, ensuring that efforts are on track with the Company's goals. Operates with the mindset of Customer Obsession - by meeting or exceeding expectations to both internal and external customers. 
Takes Accountability by taking ownership and accepting responsibility for their conduct and contributions. Demonstrates a strong sense of urgency and accomplishes tasks with Speed and attention to detail. Cooperates, collaborates and contributes to shared goals with a strong sense of Teamwork. Conducts themselves with Honesty & Integrity and treats all members with Trust & Respect. Additional Requirements/Skills: Comply, understand, and support corporate safety initiatives to ensure a safe work environment. Ability and willingness to abide by Company's Code of Conduct. Occasional travel, some overnight, as required (15%). Benefits: We offer a comprehensive package of benefits including [paid time off, medical/dental/vision insurance and 401(k)] to eligible employees. Comprehensive Medical, Dental and Vision Plans; 401k and Profit-Sharing Programs; Disability Insurance; Life Insurance; Employer-Sponsored Wellness Plans; Commuter Benefits; Hospital & Accident Indemnity Insurance; Employee Benefit Advocate & Employee Assistance Program. Disclaimer: The listed qualifications and requirements for each position are intended as guidelines. Mini-Circuits reserves the right to hire outside of these guidelines at Management's discretion. Mini-Circuits is an Equal Opportunity Employer and does not discriminate on the basis of actual or perceived age, race, creed, color, national origin, sexual orientation, military status, sex, disability, predisposing genetic characteristics, marital status, familial status, gender identity, gender dysphoria, pregnancy-related condition, and domestic violence victim status or protected class characteristic, or any other protected characteristic as established by federal or state law.
$99k-117k yearly Auto-Apply 60d+ ago
Cloud Site Reliability Engineer
Ayr Global It Solutions 3.4
Reliability engineer job in New York, NY
AYR Global IT Solutions is a national staffing firm focused on cloud, cyber security, web application services, ERP, and BI implementations by providing proven and experienced consultants to our clients. Our competitive, transparent pricing model and industry experience make us a top choice of Global System Integrators and enterprise customers with federal and commercial projects supported nationwide. Job Description Role: Cloud Site Reliability Engineer Location: NYC or Boston Duration: Full-time, Permanent Qualifications Description: The Cloud Site Reliability Engineer will deploy solutions in the public cloud (e.g., AWS) using configuration, provisioning and management tools (e.g., AWS CloudFormation). He or she is required to design configuration templates that are used to provision infrastructure components (e.g., AWS EFS, EC2, RDS) in a scalable manner. The configuration developer works closely with the scrum master to understand project requirements within an agile software development environment. 
Responsibilities:
•Provide inputs to major architectural designs to ensure consistency, security, maintainability and flexibility with respect to the overall system architecture
•Support architects in designing highly scalable and automated deployments for a wide range of applications
•Templatize configurations using architecture blueprints
•Develop stable and scalable services across public cloud environments like AWS and GCP
•Configure and assess overall compliance of infrastructure resources against policy rules
•Make recommendations to improve process efficiency and effectiveness
•Handle support escalations from developers requiring troubleshooting with existing configuration details
•Fulfill requests from developers for designing new configuration templates as per their needs
Requirements:
•Prior experience in scripting and creating configuration templates using cloud provider tools (e.g., AWS CloudFormation)
•Experience as an infrastructure and / or platform developer with scripting languages (e.g., Python, Ruby)
•General knowledge of the following: IT concepts, strategies and methodologies, IT architectures and technical standards
•General knowledge of layered systems architectures
•General understanding of shared software concepts
•General knowledge of cloud-centric architectures and technical standards
•General knowledge of agile software development concepts and processes
•Consultative skills, including the ability to understand and assist in applying customer requirements
•Ability to draw out unforeseen implications and make recommendations for design
•Ability to define design reasoning with an understanding of potential impacts to design requirements
•Familiarity with developing stable and scalable services in public and private cloud environments
•Bachelor's degree in a computer-related field
•5+ years' experience
Additional Information If you are interested, please share your resume at *************************** or contact me directly at ************
$104k-148k yearly est. Easy Apply 12h ago
Site Reliability Engineer III
JPMC
Reliability engineer job in Jersey City, NJ
There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Technology division, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job Responsibilities
Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate
Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines
Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications
Implements infrastructure, configuration, and network as code for the applications and platforms in your remit
Collaborates with technical experts, key stakeholders, and team members to resolve complex problems
Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers
Supports the adoption of site reliability engineering best practices within the team
Required qualifications, capabilities and skills
Formal training or certification on site reliability concepts and 3+ years applied experience
Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or platform
Proficient in at least one programming language such as Python, Java/Spring Boot, and .Net
Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.)
Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker
Familiarity with troubleshooting common networking technologies and issues
Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision
Ability to proactively recognize road blocks and demonstrates interest in learning technology that facilitates innovation
Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
$87k-121k yearly est. Auto-Apply 60d+ ago
Site Reliability Engineer
Rapinno Tech
Reliability engineer job in Piscataway, NJ
Role: Site Reliability Engineer
Duration: Long Term Contract
Domain: Largest Enterprise Telecom Client
Middleware tech WebSphere/WebLogic/Tomcat, Shell scripting, AWS, Ansible/Jenkins - Must have
Some Production support experience
Description
As a member of the Platform as a Service team, you will be responsible for the design and development of medium to highly complex systems. This includes the design and implementation of infrastructure from specifications, configuration and deployment of applications, connecting to back-end resources, and advanced troubleshooting of moderately complex software applications. Deployment, middleware administration and operational support of (production, staging, test and development) environments for multiple projects using WebSphere, WebLogic, and Tomcat Application Server. Monitors systems capacity and performance, plans and executes disaster recovery procedures, and provides Tier 2 technical support. In addition, this role requires the candidate to be highly flexible in hours of work because of its customer-facing, highly available infrastructure requirements. Work closely with Dev, QA and production support team members to align and orchestrate resolutions on open issues/defects. Provides high level written communications to upper management regarding production issues.
Required Skills
3-5 years managing and administrating middleware technologies (WebLogic, WebSphere, Tomcat).
3+ years hands-on experience with Solaris, Linux (RHEL, CentOS, Ubuntu), in bare-metal and Cloud-based infrastructure (AWS, OpenStack)
Experience with cloud platforms AWS (Auto scaling, AVI, security, EC2, EFS, EBS, S3, KMS)
Strong experience with installing IBM WebSphere MQ and creating multi-instance Queue managers in AWS by using EBS/EFS volumes, creating MQ objects, clusters, channels etc.
Experience with configuring the clustered Queue managers for HA and load-balancing as well as troubleshooting in a clustered environment
Installing open source RabbitMQ on AWS EC2 instances with the use of CFTs/Ansible and automating it by using Jenkins. Also creating a Classic Load Balancer to distribute traffic among those RabbitMQ instances
Experience with migrating applications from monolithic to Kubernetes container platform
Experience with APIGEE Proxy configurations and troubleshooting
Hands-on experience with CI/CD tools such as Jenkins, Ansible
Working knowledge of monitoring tools like CA Wily, New Relic, and Datadog
Experience with Elasticsearch, Kibana, and Logstash
Execution on all release engineering aspects of DevOps including the configuration management, Build and Deployment Management, Continuous Integration and Delivery
Ansible based deployment and configuration automation solutions.
Experience with web based services and protocols (HTTP, HTTPS, REST, Apache, Tomcat)
Experience with micro-service architectures and deployment.
Knowledge of L2/L3 protocols, IPv4/IPv6 and TCP/IP stack.
Proficiency in high level script languages (Python preferred) as well as script environments like bash
Experience with DevOps workflow automation (Jenkins, Ansible, Puppet)
Strong analytical & troubleshooting skills.
Experience with tools like JIRA, Confluence, Stash
M.S. or relevant experience required.
Preferred to have: AWS Certification
$87k-121k yearly est. 60d+ ago
Staff Site Reliability Engineer
Ro 4.0
Reliability engineer job in New York, NY
Job Description Ro is a direct-to-patient healthcare company with a mission of helping patients achieve their health goals by delivering the easiest, most effective care possible. Ro is the only company to offer nationwide telehealth, labs, and pharmacy services. This is enabled by Ro's vertically integrated platform that helps patients achieve their goals through a convenient, end-to-end healthcare experience spanning from diagnosis, to delivery of medication, to ongoing care. Since 2017, Ro has helped millions of patients, including one in every county in the United States, and in 98% of primary care deserts. Ro has been recognized as a Fortune Best Workplace in New York and Health Care for four consecutive years (2021-2024). In 2023, Ro was also named Best Workplace for Parents for the third year in a row. In 2022, Ro was listed as a CNBC Disruptor 50. The Role: At Ro, our mission is to provide world-class healthcare by putting patients first - and that mission depends on reliable, secure, and scalable systems. As a Staff SRE on the infrastructure team, you'll sit at the core of that effort: owning the reliability of our production systems, hardening infrastructure and building tools that empower our engineers to ship safely and confidently. You will work across teams to drive uptime, performance and observability - partnering closely with product, platform and security engineers. 
From designing resilient systems to shaping incident response practices, this is a role for engineers who thrive on impact and care deeply about operational excellence.
What You'll Do:
Design and implement resilient infrastructure to support high availability at scale
Build and contribute to tools and platforms that streamline deployment, monitoring and recovery of systems
Drive incident response and harness learnings, leading efforts to minimize downtime and improve MTTR
Partner with engineering teams to bake best practices for reliability, resilience and observability into services
Automate infrastructure workflows using IaC and other cloud native tools
Champion a culture of operational excellence, guiding engineers through reliability practices and raising the bar across the engineering org
What You'll Bring to the Team:
Deep understanding of systems and infrastructure, with experience operating distributed services in production. We are mostly in AWS and leverage a lot of its primitives - EKS, RDS, Route53, S3, Elasticache to name a few
Strong programming and automation skills using Go (bonus points for Python)
Proficiency with infrastructure as code - Terraform / Pulumi
A passion for observability, with hands-on experience in metrics, logging, tracing using Datadog
Strong cross-functional communication, able to collaborate with product, platform, security and other teams
An operational mindset that puts reliability and resilience as a core product requirement
A mission-driven attitude, motivated by the opportunity to make healthcare better.
We've Got You Covered: Full medical, dental, and vision insurance + OneMedical membership Healthcare and Dependent Care FSA 401(k) with company match Flexible PTO Wellbeing + Learning & Growth reimbursements Paid parental leave + Fertility benefits Pet insurance Student loan refinancing Virtual resources for mindfulness, counseling, and fitness The target base salary for this position ranges from $211,700 to $292,000, in addition to a competitive equity and benefits package (as applicable). When determining compensation, we analyze and carefully consider several factors, including location, job-related knowledge, skills and experience. These considerations may cause your compensation to vary. Ro recognizes the power of in-person collaboration, while supporting the flexibility to work anywhere in the United States. For our Ro'ers in the tri-state (NY) area, you will join us at HQ on Tuesdays and Thursdays. For those outside of the tri-state area, you will be able to join in-person collaborations throughout the year (i.e., during team on-sites). At Ro, we believe that our diverse perspectives are our biggest strengths - and that embracing them will create real change in healthcare. As an equal opportunity employer, we provide equal opportunity in all aspects of employment, including recruiting, hiring, compensation, training and promotion, termination, and any other terms and conditions of employment without regard to race, ethnicity, color, religion, sex, sexual orientation, gender identity, gender expression, familial status, age, disability and/or any other legally protected classification protected by federal, state, or local law.
$80k-104k yearly est. 17d ago
Senior Engineer, Validation and Reliability
Apidel Technologies 4.1
Reliability engineer job in Mahwah, NJ
Job Description The Reliability Engineer will be responsible for developing and implementing strategic asset reliability and maintenance programs across the manufacturing plant. The position also serves as a subject matter expert (SME), focusing on developing and sustaining standards and best practices. These efforts aim to ensure consistency and standardization in reliability program operations throughout the site, working closely with local plant engineering, operations, continuous improvement, and facilities maintenance counterparts to facilitate the achievement of organizational objectives. Improve the performance of physical assets and establish strategies to enhance maintenance and reliability performance using KPIs. Collaborate with plant leadership, engineering teams, corporate functions, external partners, and contractors to achieve these goals. Create and implement reliability strategies, including Asset Criticality Assessments, FMEAs, Proactive Maintenance (PM) strategies, and Spare Parts analyses. Develop and maintain KPIs, dashboards, and reporting systems to monitor and improve asset performance. Align proactive maintenance strategies (e.g., preventive and predictive tasks) with asset criticality to ensure efficient and effective processes. Conduct and facilitate Root Cause Failure Analysis (RCFA) for significant or recurring failures, including FMEA when necessary, and implement strategy improvements to prevent recurrence. Ensure best practices in CMMS utilization, including maintaining accurate preventive and predictive maintenance procedures as well as asset master data, hierarchy, and configuration. Analyze CMMS data to enhance reliability and reduce maintenance costs while ensuring visibility of performance metrics to site managers. 
Drive the implementation and sustainability of effective workflow processes, including work request management, work request authorization, work order planning, kitting, scheduling, execution, and documentation/closure. Help lead TPM events to improve equipment performance and identify preventative maintenance improvements and opportunities to enhance autonomous maintenance activities.
$108k-151k yearly est. 22d ago

Learn more about reliability engineer jobs

How much does a reliability engineer earn in Newark, NJ?

The average reliability engineer in Newark, NJ earns between $75,000 and $140,000 annually. This compares to the national average reliability engineer range of $76,000 to $144,000.

Average reliability engineer salary in Newark, NJ

$102,000

10th percentile: $75,000

Median: $102,000

90th percentile: $140,000

What are the biggest employers of Reliability Engineers in Newark, NJ?

The biggest employers of Reliability Engineers in Newark, NJ are:
