Senior Reliability Engineer jobs at MRI Software - 688 jobs
Site Reliability Engineer / Lead - Required Exp: 12+ years | Arlington, TX (Hybrid, LOCALS only)
Prudent Technologies and Consulting, Inc. 4.3
Arlington, TX jobs
Role: Site Reliability Engineer Lead - (Required Exp: 12 - 15 years)
Job Type: Contract-to-Hire *
The client would look to convert these resources to full-time employees within 6 months (potentially sooner) or within 12 months; the timeline could vary.
Location: 2 days onsite in Arlington, TX (4001 Embarcadero Dr, Arlington, TX 76014) | 3 days remote (LOCALS only)
Skills & Experience
Strong software development background in languages such as Java, C#, and scripting (PowerShell, Bash).
Expertise in DevOps, pipeline automation, and cloud platforms (Azure).
Familiarity with AI-assisted development tools and practices.
Experience in Agile environments with a track record of driving improvements.
Role Requirements:
SREs must have a software development background: they must fully understand design patterns and development architecture, be able to troubleshoot broadly, understand architectural design patterns (microservices), etc.
It would be an added bonus for candidates to already present strong practical, hands-on experience in Site Reliability Engineering; not simply surface knowledge (e.g. not just knowing "blue-green deployment" but how to implement it).
The client's screening will focus on actual hands-on experience, not checklists or surface knowledge.
Interview Process:
Codility Test:
Interview Rounds (Video + In person).
$83k-115k yearly est. 20h ago
Remote Site Reliability Engineer - Windows, AD & ITIL
Iron Mountain 4.3
Boston, MA jobs
A global leader in information management is looking for a talented Systems Engineer in Boston. This role requires U.S. Citizenship and the ability to obtain government clearance. Responsibilities include troubleshooting, providing support, and performing system documentation. Ideal candidates will have a Bachelor's degree, strong technical skills, and experience with Windows Server and Linux. The expected salary range is between $93,400 and $124,500, offering opportunities for professional growth in a fast-paced environment.
$93.4k-124.5k yearly 3d ago
Remote Site Reliability Engineer - Build Resilient Systems
Booz Allen Hamilton 4.9
Washington, DC jobs
A leading consulting firm in the U.S. is seeking a Site Reliability Engineer skilled in building resilient infrastructure and automating processes. You will lead teams, optimize systems, and implement monitoring tools. The ideal candidate has extensive experience in cloud technologies, Unix/Linux, and application troubleshooting, along with a master's degree or equivalent experience. This role offers a competitive salary range between $99,000 and $225,000 annually, with a flexible work model.
$99k-225k yearly 1d ago
Principal - Quality Engineering
Dell 4.8
Texas jobs
Quality Senior Principal Engineer (Supportability) Quality is everything at Dell Technologies. In Quality Engineering we play a central role by developing, implementing as well as maintaining technical quality assurance and control systems. We look for ways to improve the inspection, verification and validation of our finished products. Our team defines strategy - not to mention governance - for the implementation of quality standards and methodology. In addition to participating in the review of engineering designs, we assist customers or engineers in gathering and analyzing data.
Join us to do the best work of your career and make a profound social impact as a Quality Senior Principal Engineer (Supportability) on our Supportability team in Round Rock, Texas, United States.
As a Quality Senior Principal Engineer (Supportability), you will be responsible for the quality performance of assigned platforms within the PowerEdge LOB through data analysis, dispatch profiling, working with other core team functions for root cause investigation, and containment of possible issues. Direct interfacing with customers is possible in this role.
Monitor field performance of systems shipped in the PowerEdge AI portfolio to identify issues and allow for early investigation and mitigation
Assist investigation, capture root cause failure analysis, lead containment and corrective actions, and document lessons learned to prevent issue recurrence
Act as liaison between engineering, quality, and services to improve field performance by increasing reliability and implementing fixes for shipped systems
Prepare detailed documentation for the implementation of fixes for both hardware and software-related items on PowerEdge AI systems
Experience and knowledge of servers and related equipment; Strong computer skills (Excel proficiency required, PowerBI and SQL a plus)
Strong data analysis skills, statistical knowledge a plus
Ability to work effectively with global team members and strong communication, leadership, project management, and change management skills
BA/BS Degree in the Engineering field (electrical, mechanical, systems, computer) with 8+ years of related experience or MS degree with 4+ years of related experience
Advanced understanding of AI system architecture, statistics, and data analytics and experience utilizing Machine Learning, statistical modeling, or pattern recognition for predictive failure analysis and anomaly detection
If you're looking for an opportunity to grow your career with some of the best minds and most advanced tech in the industry, we're looking for you.
Dell Technologies is a unique family of businesses that helps individuals and organizations transform how they work, live and play. Application closing date: 30 January 2026
$110k-143k yearly est. 1d ago
Packaging Engineer
Enterprise Solutions Inc. 4.1
Beachwood, OH jobs
Job Title: Supplier Audit Specialist - Packaging
Pay Rate: $35-45/hr
Job Type: Contract
Industry Requirement: Packaging
Lead supplier audits and risk assessments for packaging suppliers to ensure quality, compliance, and reliable supply. This role focuses on onsite audits, supplier performance evaluation, risk management, and cross-functional collaboration to drive continuous improvement.
Key Responsibilities
Audit Strategy & Execution
Plan and conduct onsite supplier audits for packaging commodities.
Evaluate supplier quality systems, processes, capacity, and compliance.
Risk Assessment
Develop and maintain supplier risk registers.
Apply FMEA / Supplier Risk Management tools to identify and mitigate risks.
Material, Process & Packaging Validation
Assess materials and processes for packaging commodities such as Cloth, Liners, Paper, and Fiber.
Supplier Performance Management
Use scorecards to track OTD, DPPM, NCRs, audit results, response time, and certifications.
Drive supplier performance improvement and maturity.
Compliance & Governance
Ensure suppliers meet standards such as ISO 9001, ISO 13485, IATF, and related requirements.
Stakeholder Collaboration
Work closely with internal teams and suppliers to resolve issues and improve outcomes.
Required Qualifications
Bachelor's degree in Mechanical, Production, Packaging, Industrial Engineering, or equivalent.
8-10 years of experience in supplier auditing or supplier quality within the packaging industry.
Proven track record of supplier audits with measurable quality and performance improvements.
Core Skills & Competencies
Onsite supplier assessments (Quality, Supply Chain, Capacity).
Value Stream Mapping; material and process flow analysis.
Root cause analysis (5-Why, Pareto, 8D), SPC, and trend analysis.
Risk modeling (FMEA/SRM) and dashboard reporting.
Strong supplier communication, negotiation, documentation, and stakeholder management.
Confident decision-making in fast-paced or ambiguous situations.
$35-45 hourly 3d ago
Senior Site Reliability Engineer
Gradle 4.1
San Francisco, CA jobs
Who We Are
Develocity is a first-of-its-kind toolchain observability and acceleration platform that helps software teams adopt and improve DORA capabilities (including continuous delivery) in order to achieve software delivery excellence. It combines build and test acceleration with deep observability for builds and tests with Gradle Build Tool, Apache Maven™, sbt, npm, and Python, and applies to both CI and local builds and tests. Ultimately, Develocity provides an operational layer across an organization's toolchains to speed up, troubleshoot, and optimize local developer and remote CI feedback loops.
Our software is used by some of the world's leading software organizations, such as Netflix, Airbnb, SAP, several top ten banks, and many other major customers across all verticals. We regularly collaborate with these and other users to make our products continuously better.
We have partnered with the Apache Software Foundation, the Commonhaus Foundation, the Scala Center, the Micronaut Foundation, and other OSS projects like Spring, Quarkus, Kotlin, JUnit, AndroidX, and many more to bring the values of Develocity also to the OSS Community.
Our Values
Seek to Understand: Everything starts with listening and understanding, and we strive to understand different viewpoints, problems, and motivations. Before we take action, we ensure we truly grasp the challenges, perspectives, and goals.
Know the Why: We approach our work with a clear sense of purpose, ensuring every step is deliberate and focused. We take meaningful action with urgency, but never at the expense of thoughtful consideration.
Innovate & Iterate: We embrace challenges and are not afraid to try new things, even if they might fail. With deep understanding and a clear purpose, we can develop creative and bold solutions to tackle challenges.
Own the Outcome: We are empowered to take initiative and we maintain transparency in our work and its outcomes. When we execute, we take responsibility for our decisions, measure the success of our innovations, and learn from the results.
Who You Are
We're building a new SRE team and looking for founding members to help shape how we operate. You'll be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries.
You'll work on our internally-built Cloud Application Platform, Kubernetes on AWS, and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well.
You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential.
Responsibilities
Operate and maintain all Develocity instances and supporting services.
Participate in a follow-the-sun on-call rotation, owning incident response and troubleshooting issues across the stack.
Drive automation across application deployment, upgrades, monitoring, self-healing, and recovery.
Build and maintain observability for all managed services (logging, metrics, tracing, and alerting).
Work with engineering teams to build reliability into features from the start.
Run incident response and retrospectives, and make sure we learn from them.
Own disaster recovery, backups, and business continuity.
Communicate with customers during incidents and maintenance windows.
Optimize performance, resource usage, and costs.
Help evolve our SaaS operations as we grow.
Minimum qualifications
5+ years in SRE, DevOps, or equivalent role operating production services at scale.
Strong Kubernetes experience in production environments.
Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2).
Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform).
Track record of incident management and response.
Knowledge of SRE best practices (SLAs, SLOs).
Scripting proficiency (Python, Bash) for automation.
Experience with 24/7 on-call rotations.
Strong written and verbal English communication.
Preferred qualifications
Experience operating SaaS platforms at scale.
Familiarity with Develocity.
JVM language experience (Java, Kotlin).
Disaster recovery planning and execution experience.
Customer-facing incident communication skills.
Experience establishing SRE practices in new or growing teams.
What We Offer
A ground-floor role in a new SRE team: you'll shape how we do things, not inherit someone else's decisions.
Real ownership of production systems used by engineers at companies you've heard of.
Direct interaction with customers when things go wrong (and when they go right).
A culture that values automation over heroics.
In-person meetings, such as our annual company offsite and team meetings.
Work from home in a remote-first environment.
Competitive salaries and equity grants.
Compensation
The US salary range for this position is $150-190k, which reflects the target ranges for all US locations. Within this range, individual pay is determined by geographic location and additional factors including but not limited to experience, relevant skills, qualifications, seniority, performance, and travel requirements. Our recruiting team can share more information about the specific salary range for your location during the hiring process.
Location
Remote from anywhere in PST timezone.
While our team works remotely and is spread across the globe, we deeply value daily interactions and collaboration.
$150k-190k yearly Auto-Apply 36d ago
Senior Site Reliability Engineer- Central Platforms
Intralinks 4.7
Remote
As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology.
Job Description
We are seeking a Site Reliability Engineer (SRE) to join our Internal Platform Services team, responsible for the reliability, scalability, and performance of the core services that power our internal engineering ecosystem. You will work at the intersection of development and operations, enabling product teams to move quickly and safely by building and maintaining robust, self-service infrastructure components like Kubernetes clusters, internal databases, CI/CD pipelines, observability tools, and cloud APIs.
Key Responsibilities
Ensure reliability, scalability, and performance of services through SLIs/SLOs, capacity planning, and incident response.
Drive automation of infrastructure operations to minimize toil.
Develop and support monitoring, alerting, and observability systems to support proactive issue detection.
Partner with internal engineering teams to define service-level objectives, improve deployment workflows, and integrate infrastructure with development needs.
Contribute to on-call rotations and incident management, helping ensure high availability of services.
Drive post-incident reviews and blameless retrospectives to improve reliability.
Stay current with emerging technologies and recommend improvements to existing systems and practices.
Qualifications
Required:
3+ years of experience as an SRE, DevOps Engineer, or Infrastructure Engineer.
Solid experience with Kubernetes administration and tooling (e.g., Helm, ArgoCD, Kustomize).
Strong expertise in cloud platforms (e.g., AWS, GCP, or Azure).
Experience managing databases in production environments (e.g., backups, replication, tuning).
Proficiency in programming or scripting (e.g., Go, Python, Bash).
Deep understanding of CI/CD pipelines and infrastructure automation.
Familiarity with monitoring/observability tools (e.g., Prometheus, Grafana).
Strong communication skills and ability to collaborate with software engineering teams.
Preferred:
Experience in multi-tenant infrastructure environments.
Exposure to compliance and security best practices in infrastructure environments.
Why Join Us
Be a key driver of internal engineering productivity and reliability.
Work with modern, cloud-native technologies in a high-impact environment.
Join a collaborative, learning-focused team where your ideas shape the platform.
Competitive compensation, flexible work arrangements, and ongoing professional growth.
Unless explicitly requested or approached by SS&C Technologies, Inc. or any of its affiliated companies, the company will not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.
SS&C offers excellent benefits including health, dental, 401k plan, tuition and professional development reimbursement plan.
SS&C Technologies is an Equal Employment Opportunity employer and does not discriminate against any applicant for employment or employee on the basis of race, color, religious creed, gender, age, marital status, sexual orientation, national origin, disability, veteran status or any other classification protected by applicable discrimination laws.
Salary is determined by various factors including, but not limited to, relevant work experience, job related knowledge, skills, abilities, business needs, and geographic regions. Jersey City, NJ: Salary range for the position: 175000 USD to 185000 USD. NY: Salary range for the position: 175000 USD to 185000 USD. Washington: Salary range for the position: 175000 USD to 185000 USD. California: Salary range for the position: 175000 USD to 185000 USD. Colorado: Salary range for the position: 175000 USD to 185000 USD.
$102k-148k yearly est. Auto-Apply 9d ago
Senior Site Reliability Engineer
Sciencelogic 4.5
Remote
To comply with U.S. federal government requirements, U.S. citizenship is required for this position.
Who we are...
ScienceLogic is redefining IT operations for the modern enterprise. Our AIOps platform empowers organizations to achieve Autonomic IT - where systems are self-healing, self-optimizing, and seamlessly aligned with business outcomes. We help enterprises and service providers gain unified visibility across hybrid and multi-cloud environments, automate workflows, and unlock performance at scale.
We're accelerating digital transformation through the power of automation, AI, and analytics - giving IT and business leaders the tools to deliver superior customer experiences, drive efficiency, and innovate with confidence.
What we're looking for…
We are looking for a Senior or Principal Site Reliability Engineer who is well versed in building cloud technologies in a secure manner, has an automation mindset, and is an ardent follower of the SRE discipline. If this sounds like you, then our team will benefit from your skillset!
What you'll be doing…
Lead design reviews and buildout of secure systems for delivering a new Artificial Intelligence product in SaaS, aiming for 99.99% uptime.
Design, automate, test, and monitor the use of cloud native technologies as a foundation for a service platform.
Spend 75% of your time on forward-looking priorities, designing and building SaaS systems, while continuing to support the operations and maintenance of the current SaaS infrastructure.
Investigate and resolve customer and operational issues with the mentality of fixing and not just mitigating issues.
Identify and automate measurement of operations SLAs and SLOs.
Triage incident response, document SOPs and runbooks, and train NOC team members.
Write automation that can be easily supported and extended by others.
Collaborate across the organization to design, build and operationalize SaaS services conforming to various security standards like FedRAMP, SOC2, ISO etc.
Participate in the on-call rotation as assigned.
Take full responsibility for the availability and performance of the platform.
Work on special projects as assigned.
Qualities you possess…
8-12 years of site reliability engineering, cloud operations or equivalent experience
Proven experience in managing complex Kubernetes environments in multiple Production systems.
Experience working with cloud automation tools like CloudFormation, Terraform, and aws-cli/CDK.
Scripting languages like Python, Bash, Perl etc.
Exposure to Linux administration skills.
Proven track record of operating production SaaS environments within security standards like FedRAMP, SOC2, ISO, PCI.
Skilled at problem solving, algorithms, and data structures conforming to the modern SaaS security requirements.
Building tools and scripting frameworks from scratch.
Familiarity with basic networking, security and cloud engineering concepts
Highly collaborative with effective written and verbal communication skills
Ability to work against tight deadlines and occasionally during off-hours, and to participate in the weekly on-call schedule.
Bachelor's or Master's degree in Computer Science, Information Systems or similar field.
Benefits & Perks
A remote flexible workplace.
Comprehensive medical, dental and vision plans.
401(k) plan with employer match.
Flexible Paid Time Off (FTO) so that you can take the time that you need to re-energize.
Volunteer Time Off (VTO) - take two days off per calendar year to volunteer with your preferred charitable organization.
5-year Service Milestone Sabbatical.
Paid parental leave.
Generous employee referral bonus program.
Pet insurance.
HQ Office centrally located in Reston Town Center featuring a well-stocked kitchen with rotating snacks and beverages, and catered lunch on Thursdays.
Regular virtual company-wide events, including cooking classes, yoga, meditation and more.
The opportunity to learn and develop from some of the best and brightest minds in the industry!
Don't meet every single requirement? Studies have shown that women and people of color are less likely to apply to jobs unless they meet every single qualification. At ScienceLogic, we are dedicated to building a diverse, inclusive and authentic workplace, so if you're excited about this role but your past experience doesn't align perfectly with every qualification in the job description, we encourage you to apply anyway. You may be just the right candidate for this or other roles.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which you are applying
About ScienceLogic
ScienceLogic is a leader in IT Operations Management, providing modern IT operations with actionable insights to resolve and predict problems faster in a digital, ephemeral world. Its solution sees everything across cloud and distributed architectures, contextualizes data through relationship mapping, and acts on this insight through integration and automation.
All ScienceLogic employees have the responsibility to protect information assets, adhere to access controls, report suspicious activity, and comply with security and privacy policies.
$106k-135k yearly est. Auto-Apply 60d+ ago
Senior Site Reliability Engineer (AWS, AI/ML, & APM)
Granicus 4.3
Remote
The Company
Serving the People Who Serve the People
Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. We are on a mission to support our customers with meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive. Granicus has consistently appeared on the GovTech 100 list over the past 5 years and has been recognized as one of the best companies to work for on BuiltIn.
Over the last 25 years, we have served 5,500 federal, state, and local government agencies, and more than 300 million citizen subscribers power an unmatched Subscriber Network that uses our digital solutions to make the world a better place. With comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services, Granicus empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada. By simplifying interactions with residents while disseminating critical information, Granicus brings governments closer to the people they serve, driving meaningful change for communities around the globe.
Job Summary
Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our SRE team. As a Senior SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts in building and maintaining a robust infrastructure, automating processes, and guiding the team to implement best practices in site reliability.
What Your Impact Will Look Like
On-call Production Support: Provide production support on a shift according to the team on-call roster.
Work on tickets raised by customers and internal engineering/implementation teams while not on call for production support. For example, a client may request a correction to data on the database server that cannot be made through the web interface.
Work on SRE backlog items.
Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
Security: Implement and adhere to security best practices to protect our systems and data.
You Will Love This Job If You Have
5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems. Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock is highly desirable.
Expertise in Linux/Unix systems, and cloud platforms (AWS, Azure, or Google Cloud).
Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.
Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability.
Experience with configuration management tools (Ansible, Chef, Puppet).
Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks.
Certifications: Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning - Specialty, Google Cloud Professional DevOps Engineer, or similar are a plus.
Pay Range: USD $80,000.00 - USD $100,000.00 /Yr.
About Us
Don't have all the skills/experience mentioned above? At Granicus, we are trying to build diverse, inclusive teams. We do not have degree requirements for most of our roles. If you don't meet every requirement above but are excited to learn more, we encourage you to apply. We might just be able to find another role that could be a perfect fit!
Security and Privacy Requirements
Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program.
Responsible for ensuring the data privacy of our employees and customers, their data, as well as taking all required privacy training in a timely manner, in accordance with company policies.
The Team
We are a remote-first company with a globally distributed workforce across the United States, Canada, United Kingdom, India, Armenia, Australia, and New Zealand.
The Culture
At Granicus, we are building a transparent, inclusive, and safe space for everyone who wants to be a part of our journey.
A few culture highlights include:
Employee Resource Groups to encourage diverse voices.
Coffee with Mark sessions - Our employees get to interact with our CEO on very important and sometimes difficult issues ranging from mental health to work-life balance and current affairs.
Microsoft Teams communities focused on wellness, art, furbabies, family, parenting, and more.
We bring in special guests from time to time to discuss issues that impact our employee population.
The Impact
We are proud to serve dynamic organizations around the globe that use our digital solutions to make the world a better place - quite literally. We have so many powerful success stories that illustrate how our solutions are impacting the world. See more of our impact here.
The Benefits
At Granicus, we offer a comprehensive and flexible benefits package designed to support your well-being, growth, and work-life balance, starting from day one.
Here's what you can expect as a U.S.-based team member:
Flexibility & Balance
Flexible Time Off - Take the time you need to rest, recharge, and live your life.
Company-Wide Wellbeing Days - Paid days off to unplug and focus on your mental health.
Work From Home Reimbursement - Support a productive home office environment.
Health & Wellness
Multiple Health Plan Options - Including a 100% employer-paid plan.
Employer HSA Contributions - When enrolled in a High-Deductible Health Plan.
Fitness Reimbursement Program - Stay active, your way.
On-Demand Mental Health Support - Access to Headspace and other wellness tools.
Family & Future
Paid Parental Leave - For both birthing and non-birthing parents.
Traditional & Roth 401(k) - With a generous company match.
Life & AD&D Insurance - 100% employer-paid coverage for peace of mind.
Growth & Recognition
Online Learning Platforms - Fuel your professional development.
Competitive Salary & Bonuses - Your contributions are valued and rewarded.
Equal Opportunity Employer
Granicus is committed to providing equal employment opportunities. All qualified applicants and employees will be considered for employment and advancement without regard to race, color, religion, creed, national origin, ancestry, sex, gender, gender identity, gender expression, physical or mental disability, age, genetic information, sexual or affectional orientation, marital status, status with regard to public assistance, familial status, military or veteran status or any other status protected by applicable law.
$80k-100k yearly Auto-Apply 60d+ ago
Senior Site Reliability Engineer
Bumble 4.8
Austin, TX jobs
Inclusion at Bumble Inc.
Bumble Inc. is an equal opportunity employer and we strongly encourage people of all ages, colour, lesbian, gay, bisexual, transgender, queer and non-binary people, veterans, parents, people with disabilities, and neurodivergent people to apply. We're happy to make any reasonable adjustments that will help you feel more confident throughout the process, please don't hesitate to let us know how we can help. In your application, please feel free to note which pronouns you use (for example: she/her, he/him, they/them, etc.).
We are looking for an experienced engineer with strong Linux and system-level expertise who can operate autonomously in complex production environments. You must be able to independently troubleshoot incidents, lead and support post-incident service recovery, and drive improvements to overall system stability, performance, and observability. We are looking for a hands-on Site Reliability Engineer (SRE) with a strong background in Linux infrastructure and third-party system operations.
This role focuses on managing and optimizing large-scale environments (5,000+ hosts) running technologies like Kafka, Redis, and Kubernetes.
The position does not involve application development but requires deep operational expertise and solid troubleshooting skills.
Qualifications
5+ years of experience in Linux system administration or SRE roles
Proven experience managing large-scale infrastructure environments
Experience with cloud infrastructure (Google Cloud)
Strong troubleshooting and performance tuning skills at the infrastructure level
Basic scripting/automation experience (Bash, Python)
Familiarity with IaC tools (e.g., Ansible, Puppet)
Knowledge of distributed systems and container orchestration (Kafka, Kubernetes, etc.)
Excellent communication and problem-solving skills
Location
This role is based in Austin, and we ask that you're within a commutable distance to this office, so that you're able to come onsite regularly to collaborate across engineering teams.
We have a hybrid environment that requires you to be in the office 3x/week
On-call rotation: one week every 4-5 weeks (24x7 coverage).
Regular maintenance outside of business hours is generally not expected.
Please note: We are unable to offer Visa sponsorship at this time
Why join us?
Own meaningful projects that directly impact millions of Bumble users.
Learn and grow in a high-performing engineering team committed to mentorship and learning.
Be part of a culture that values respect, excellence, curiosity, courage and joy.
Enjoy competitive compensation, equity, and world-class benefits.
$185,000 - $225,000 a year
For base compensation, we set standard ranges for all roles based on function, level, and geographic location. This position is also typically eligible to participate in our short- and long-term incentive programs. Benefits include Medical, Dental, Vision, 401(k) match, Unlimited Paid Time Off Policy.
Maven Fertility: $10,000 lifetime benefit for fertility, adoption, abortion care, and more.
26 Weeks Parental Leave: For both primary and secondary caregivers.
Family & Compassionate Leave: Inclusive of domestic violence recovery.
Unlimited Paid Time Off: Take the time you need.
Company-wide Week Off: Annual collective rest for the entire company.
Focus Fridays: No meetings, emails, or deadlines, just deep work.
About Us
Bumble Inc. is the parent company of Bumble Date, BFF, and Badoo. The Bumble platform enables people to build healthy and equitable relationships, through Kind Connections. Founded by Whitney Wolfe Herd in 2014, Bumble was one of the first dating apps built with women at the center and connects people across dating (Bumble Date) and friendship (BFF). BFF is a friendship app where people in all stages of life can meet people nearby and create meaningful platonic connections and community based on shared interests. Badoo, which was founded in 2006, is one of the pioneers of web and mobile dating products.
AI in Bumble Inc. Hiring
At Bumble, we may use AI tools to support parts of our recruitment process - such as helping us record, transcribe, and summarize conversations, and supporting job alignment by comparing resumes and job descriptions to highlight skills and potential roles that may be a good match. These tools help us work more efficiently and stay focused on you during our conversations. Importantly, all hiring decisions are made by people. AI is used only to support our team's efficiency and improve the candidate experience - not to evaluate or decide on your candidacy. Participation in AI-supported interviews and conversations is completely voluntary and will not impact your candidacy. If you'd prefer to opt out, simply let your recruiter or interviewer know at the start of a call, or anytime during the interview or conversation. Summaries and related data are retained only as long as needed in line with our internal data retention policies. If at any point you'd like a transcription or summary deleted, please contact your recruiter directly. For further information on how we hold and manage your data, please refer to our Privacy Policy.
$185k-225k yearly Auto-Apply 60d+ ago
Senior Reliability Engineer
Trane Technologies 4.7
Tyler, TX jobs
At Trane Technologies™ and through our businesses including Trane and Thermo King, we create innovative climate solutions for buildings, homes, and transportation that challenge what's possible for a sustainable world. We're a team that dares to look at the world's challenges and see impactful possibilities. We believe in a better future when we uplift others and enable our people to thrive at work and at home. We boldly go.
**What's in it for you:**
Be a part of our mission! As a world leader in creating comfortable, sustainable, and efficient environments, it's our responsibility to put the planet first. For us at Trane Technologies, sustainability is not just how we do business-it is our business. Do you dare to look at the world's challenges and see impactful possibilities? Do you want to contribute to making a better future? If the answer is yes, we invite you to consider joining us in boldly challenging what's possible for a sustainable world.
**Thrive at work and at home:**
+ **Benefits** kick in on **DAY ONE** for you and your family, including health insurance and holistic wellness programs that include generous incentives - **WE DARE TO CARE!**
+ Family building benefits include fertility coverage and adoption/surrogacy assistance.
+ **401K** match up to 6%, plus an additional 2% core contribution = up to **8%** company contribution.
+ **Paid time off**, including in support of **volunteer** and **parental leave** needs.
+ Educational and training opportunities through company programs along with tuition **assistance** and student debt support.
+ Learn more about our benefits here!
**Where is the work:**
From Monday to Thursday, work on-site with your colleagues. On Fridays, choose your work location, balancing what your work requires!
As a **Senior Reliability Engineer**, you will play a key role in ensuring the reliable performance of our commercial HVAC products. Your main focus will be to develop and implement strategies that enhance the reliability and performance of our products and critical components. You will collaborate closely with engineering teams from design, quality, and test labs to identify potential reliability risks and optimize product life through thorough analysis and testing.
**What you will do:**
+ Develop and implement reliability strategies to enhance the performance, availability, and lifespan of HVAC products.
+ Support failure mode and effects analysis (FMEA) to identify potential failures and recommend proactive solutions to mitigate risks.
+ Perform root cause and corrective action (RCCA) using an 8D or 9 Step approach on product and component failures, identify corrective actions, and implement long-term fixes to prevent recurrence.
+ Analyze product life data from development testing, quality reports, and field records to identify trends, predict failures, and develop life improvement strategies.
+ Collaborate with maintenance and operations teams to develop and refine preventative and predictive maintenance programs.
+ Monitor key performance indicators (KPIs) for equipment reliability, including mean time between failures (MTBF) and mean time to failure (MTTF).
+ Lead continuous improvement projects aimed at reducing failures and increasing reliability of products.
+ Ensure compliance with industry standards and regulations for equipment reliability and safety.
+ Provide reliability expertise to the design engineering and program management teams on developing new products.
+ Analyze performance data of products and propose new Design for Reliability projects to ensure products meet or exceed life expectations.
+ Train and mentor team members on reliability best practices.
+ Work cross-functionally across other departments.
+ Potential overnight travel up to 25%.
**What you will bring:**
+ A Bachelor's Degree in Engineering and at least 5 years of engineering work experience are required.
+ At least some work experience in reliability testing and/or casting product life is required.
+ Strong knowledge of reliability principles, tools, and methodologies (e.g., DFR, FMEA, DVPR, RCCA).
+ Strong analytical and data-driven mindset to assess equipment performance and make informed decisions.
**Compensation:**
**Annual Salary Range: $80,000 - $145,000**
Disclaimer: This pay range is based on US national averages. Actual base pay may vary based on seniority, merit, and the geographic location where the work is performed.
**Equal Employment Opportunity:**
We offer competitive compensation and comprehensive benefits and programs. We are an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, pregnancy, age, marital status, disability, status as a protected veteran, or any legally protected status.
$80k-145k yearly 14d ago
Senior Reliability Engineer
Trane Technologies 4.7
Grand Rapids, MI jobs
At Trane Technologies™ and through our businesses including Trane and Thermo King, we create innovative climate solutions for buildings, homes, and transportation that challenge what's possible for a sustainable world. We're a team that dares to look at the world's challenges and see impactful possibilities. We believe in a better future when we uplift others and enable our people to thrive at work and at home. We boldly go.
**What's in it for you:**
Be a part of our mission! As a world leader in creating comfortable, sustainable, and efficient environments, it's our responsibility to put the planet first. For us at Trane Technologies, sustainability is not just how we do business-it is our business. Do you dare to look at the world's challenges and see impactful possibilities? Do you want to contribute to making a better future? If the answer is yes, we invite you to consider joining us in boldly challenging what's possible for a sustainable world.
**Thrive at work and at home:**
+ **Benefits** kick in on **DAY ONE** for you and your family, including health insurance and holistic wellness programs that include generous incentives - **WE DARE TO CARE!**
+ Family building benefits include fertility coverage and adoption/surrogacy assistance.
+ **401K** match up to 6%, plus an additional 2% core contribution = up to **8%** company contribution.
+ **Paid time off**, including in support of **volunteer** and **parental leave** needs.
+ Educational and training opportunities through company programs along with tuition **assistance** and student debt support.
+ Learn more about our benefits here!
**Where is the work:**
From Monday to Thursday, work on-site with your colleagues. On Fridays, choose your work location, balancing what your work requires!
As a **Senior Reliability Engineer**, you will play a key role in ensuring the reliable performance of our commercial HVAC products. Your main focus will be to develop and implement strategies that enhance the reliability and performance of our products and critical components. You will collaborate closely with engineering teams from design, quality, and test labs to identify potential reliability risks and optimize product life through thorough analysis and testing.
**What you will do:**
+ Develop and implement reliability strategies to enhance the performance, availability, and lifespan of HVAC products.
+ Support failure mode and effects analysis (FMEA) to identify potential failures and recommend proactive solutions to mitigate risks.
+ Perform root cause and corrective action (RCCA) using an 8D or 9 Step approach on product and component failures, identify corrective actions, and implement long-term fixes to prevent recurrence.
+ Analyze product life data from development testing, quality reports, and field records to identify trends, predict failures, and develop life improvement strategies.
+ Collaborate with maintenance and operations teams to develop and refine preventative and predictive maintenance programs.
+ Monitor key performance indicators (KPIs) for equipment reliability, including mean time between failures (MTBF) and mean time to failure (MTTF).
+ Lead continuous improvement projects aimed at reducing failures and increasing reliability of products.
+ Ensure compliance with industry standards and regulations for equipment reliability and safety.
+ Provide reliability expertise to the design engineering and program management teams on developing new products.
+ Analyze performance data of products and propose new Design for Reliability projects to ensure products meet or exceed life expectations.
+ Train and mentor team members on reliability best practices.
+ Work cross-functionally across other departments.
+ Potential overnight travel up to 25%.
**What you will bring:**
+ A Bachelor's Degree in Engineering and at least 5 years of engineering work experience are required.
+ At least some work experience in reliability testing and/or casting product life is required.
+ Strong knowledge of reliability principles, tools, and methodologies (e.g., DFR, FMEA, DVPR, RCCA).
+ Strong analytical and data-driven mindset to assess equipment performance and make informed decisions.
**Compensation:**
**Annual Salary Range: $80,000 - $145,000**
Disclaimer: This pay range is based on US national averages. Actual base pay may vary based on seniority, merit, and the geographic location where the work is performed.
**Equal Employment Opportunity:**
We offer competitive compensation and comprehensive benefits and programs. We are an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, pregnancy, age, marital status, disability, status as a protected veteran, or any legally protected status.
$80k-145k yearly 14d ago
Senior Reliability Engineer
Trane Technologies Plc 4.7
Grand Rapids, MI jobs
At Trane Technologies™ and through our businesses including Trane and Thermo King, we create innovative climate solutions for buildings, homes, and transportation that challenge what's possible for a sustainable world. We're a team that dares to look at the world's challenges and see impactful possibilities. We believe in a better future when we uplift others and enable our people to thrive at work and at home. We boldly go.
What's in it for you:
Be a part of our mission! As a world leader in creating comfortable, sustainable, and efficient environments, it's our responsibility to put the planet first. For us at Trane Technologies, sustainability is not just how we do business-it is our business. Do you dare to look at the world's challenges and see impactful possibilities? Do you want to contribute to making a better future? If the answer is yes, we invite you to consider joining us in boldly challenging what's possible for a sustainable world.
Thrive at work and at home:
* Benefits kick in on DAY ONE for you and your family, including health insurance and holistic wellness programs that include generous incentives - WE DARE TO CARE!
* Family building benefits include fertility coverage and adoption/surrogacy assistance.
* 401K match up to 6%, plus an additional 2% core contribution = up to 8% company contribution.
* Paid time off, including in support of volunteer and parental leave needs.
* Educational and training opportunities through company programs along with tuition assistance and student debt support.
* Learn more about our benefits here!
Where is the work:
From Monday to Thursday, work on-site with your colleagues. On Fridays, choose your work location, balancing what your work requires!
As a Senior Reliability Engineer, you will play a key role in ensuring the reliable performance of our commercial HVAC products. Your main focus will be to develop and implement strategies that enhance the reliability and performance of our products and critical components. You will collaborate closely with engineering teams from design, quality, and test labs to identify potential reliability risks and optimize product life through thorough analysis and testing.
What you will do:
* Develop and implement reliability strategies to enhance the performance, availability, and lifespan of HVAC products.
* Support failure mode and effects analysis (FMEA) to identify potential failures and recommend proactive solutions to mitigate risks.
* Perform root cause and corrective action (RCCA) using an 8D or 9 Step approach on product and component failures, identify corrective actions, and implement long-term fixes to prevent recurrence.
* Analyze product life data from development testing, quality reports, and field records to identify trends, predict failures, and develop life improvement strategies.
* Collaborate with maintenance and operations teams to develop and refine preventative and predictive maintenance programs.
* Monitor key performance indicators (KPIs) for equipment reliability, including mean time between failures (MTBF) and mean time to failure (MTTF).
* Lead continuous improvement projects aimed at reducing failures and increasing reliability of products.
* Ensure compliance with industry standards and regulations for equipment reliability and safety.
* Provide reliability expertise to the design engineering and program management teams on developing new products.
* Analyze performance data of products and propose new Design for Reliability projects to ensure products meet or exceed life expectations.
* Train and mentor team members on reliability best practices.
* Work cross-functionally across other departments.
* Potential overnight travel up to 25%.
What you will bring:
* A Bachelor's Degree in Engineering and at least 5 years of engineering work experience are required.
* At least some work experience in reliability testing and/or casting product life is required.
* Strong knowledge of reliability principles, tools, and methodologies (e.g., DFR, FMEA, DVPR, RCCA).
* Strong analytical and data-driven mindset to assess equipment performance and make informed decisions.
Compensation:
Annual Salary Range: $80,000 - $145,000
Disclaimer: This pay range is based on US national averages. Actual base pay may vary based on seniority, merit, and the geographic location where the work is performed.
Equal Employment Opportunity:
We offer competitive compensation and comprehensive benefits and programs. We are an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, pregnancy, age, marital status, disability, status as a protected veteran, or any legally protected status.
$80k-145k yearly 10d ago
Senior Site Reliability Engineer
Red Hat 4.6
Remote
Red Hat is seeking a Senior Site Reliability Engineer (SRE) to develop, scale, and operate our OpenShift managed cloud services. OpenShift is Red Hat's enterprise Kubernetes distribution. As an SRE you will contribute to running OpenShift at scale by enabling customer self-service, making our monitoring system more sustainable, and eliminating work through automation.
At Red Hat, our commitment to open source innovation extends beyond our products - it's embedded in how we work and grow. Red Hatters embrace change - especially in our fast-moving technological landscape - and have a strong growth mindset. That's why we encourage our teams to proactively, thoughtfully, and ethically use AI to simplify their workflows, cut complexity, and boost efficiency. This empowers our associates to focus on higher-impact work, creating smart, more innovative solutions that solve our customers' most pressing challenges.
On the SRE team, you will have the opportunity to influence the complex challenges of scale which are unique to Red Hat managed cloud services, while using your skills in coding, operations, and large-scale distributed system design.
Red Hat relies on teamwork and openness for its success. We are a global team and strive to cultivate a transparent environment that makes room for different voices. We learn from our failures in a blameless environment to support the continuous improvement of the team. At Red Hat, your individual contributions have more visibility than most large companies, and visibility means career opportunities and growth.
What you will do
The day-to-day responsibilities of an SRE involve working with live systems and coding automation. As an SRE you will be expected to:
Contribute code to increase the scalability and reliability of the service
Contribute software tests and participate in peer review to increase the quality of our codebase
Help and develop peers' capabilities through knowledge sharing, mentoring, and collaboration
Participate in a regular on-call schedule, including occasional paid weekends and holidays
Practice sustainable incident response and blameless postmortems
Resolve customer issues escalated from the Red Hat Global Support team
Work within a small agile team to develop and improve SRE software, support your peers, plan and self-improve
Collaborate with cross-functional teams to identify opportunities for AI integration within the software development lifecycle, driving continuous improvement and innovation in engineering practices; share use cases for successful experiments with stakeholders for broader use.
What you will bring
A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required. However, hands-on experience that demonstrates your ability and interest in Site Reliability Engineering is valuable to us, and may be considered in lieu of degree requirements. You must have some experience programming in at least one of these languages: Python, Golang, Java, C, C++ or another object-oriented language. You must have experience working with public clouds such as AWS, GCP, or Azure. You must also have the ability to collaboratively troubleshoot and solve problems in a team setting.
As an SRE you will be most successful if you have some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems. Direct experience with Kubernetes or OpenShift is a plus. We like to see a demonstrated ability to debug, optimize code and automate routine tasks. We are Red Hat, so you need a basic understanding of Unix/Linux operating systems.
Desired skills
5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
2+ years of experience delivering a hosted service
Demonstrated ability to quickly and accurately troubleshoot system issues
Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
Solid communications skills and experience working directly with and presenting to customers
1+ year(s) of experience with Kubernetes is a plus
1+ year(s) of experience with docker-based containers is a plus
The salary range for this position is $111,260.00 - $183,580.00. Actual offer will be based on your qualifications.
Pay Transparency
Red Hat determines compensation based on several factors including but not limited to job location, experience, applicable skills and training, external market value, and internal pay equity. Annual salary is one component of Red Hat's compensation package. This position may also be eligible for bonus, commission, and/or equity. For positions with Remote-US locations, the actual salary range for the position may differ based on location but will be commensurate with job duties and relevant work experience.
About Red Hat
Red Hat is the world's leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. Spread across 40+ countries, our associates work flexibly across work environments, from in-office, to office-flex, to fully remote, depending on the requirements of their role. Red Hatters are encouraged to bring their best ideas, no matter their title or tenure. We're a leader in open source because of our open and inclusive environment. We hire creative, passionate people ready to contribute their ideas, help solve complex problems, and make an impact.
Benefits
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave
● Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!
Note: These benefits are only applicable to full time, permanent associates at Red Hat located in the United States.
Inclusion at Red Hat
Red Hat's culture is built on the open source principles of transparency, collaboration, and inclusion, where the best ideas can come from anywhere and anyone. When this is realized, it empowers people from different backgrounds, perspectives, and experiences to come together to share ideas, challenge the status quo, and drive innovation. Our aspiration is that everyone experiences this culture with equal opportunity and access, and that all voices are not only heard but also celebrated. We hope you will join our celebration, and we welcome and encourage applicants from all the beautiful dimensions that compose our global village.
Equal Opportunity Policy (EEO)
Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.
Red Hat does not seek or accept unsolicited resumes or CVs from recruitment agencies. We are not responsible for, and will not pay, any fees, commissions, or any other payment related to unsolicited resumes or CVs except as required in a written contract between Red Hat and the recruitment agency or party requesting payment of a fee.
Red Hat supports individuals with disabilities and provides reasonable accommodations to job applicants. If you need assistance completing our online job application, email application-assistance@redhat.com. General inquiries, such as those regarding the status of a job application, will not receive a reply.
$111.3k-183.6k yearly Auto-Apply 7d ago
Sr. Specialist Site Reliability Engineer
Waystar 4.6
Atlanta, GA jobs
We are looking for a talented and driven Site Reliability Engineering (SRE) Specialist to support our engineering team, which manages the infrastructure and services that power our Waystar products. This role is ideal for an experienced engineer who thrives in data-intensive environments and is passionate about building reliable, scalable systems that ensure data integrity, availability, and performance.
As an SRE Specialist, you'll work closely with engineering, product, and data teams to ensure our data licensing platforms are resilient, observable, and continuously improving.
**WHAT YOU'LL DO**
+ **System Reliability & Performance**
+ Design and implement reliability solutions for data ingestion, processing, and delivery pipelines.
+ Define and maintain SLIs/SLOs for data licensing services and manage error budgets.
+ Build automation for deployment, monitoring, and incident response.
+ **Observability & Monitoring**
+ Enhance system observability through metrics, logging, and tracing.
+ Develop and maintain dashboards and alerts to proactively detect and resolve issues.
+ **Incident Response & Postmortems**
+ Participate in on-call rotations and lead incident response efforts.
+ Conduct root cause analysis and drive post-incident improvements.
+ Maintain runbooks and operational documentation.
+ **Collaboration & Continuous Improvement**
+ Partner with software and data engineers to embed reliability into system design.
+ Contribute to blameless postmortems and reliability reviews.
+ Share knowledge and mentor junior team members.
**WHAT YOU'LL NEED**
+ 7+ years of experience in SRE, DevOps, or infrastructure engineering.
+ Strong understanding of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
+ Experience with observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
+ Familiarity with data platforms, ETL pipelines, and distributed systems.
+ Excellent problem-solving and communication skills.
+ Experience with Python, PowerShell, and similar languages.
+ Active use of artificial intelligence (AI) tools and techniques to enhance performance, drive innovation, and improve decision-making across business functions
+ Ability to leverage AI tools and platforms to streamline workflows, improve decision-making, and drive innovation
+ Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation
Preferred Qualifications
+ Experience with data licensing, data governance, or data compliance frameworks.
+ Exposure to data pipeline tools (e.g., Apache Airflow, Kafka, Spark).
+ Familiarity with regulatory requirements related to data usage and distribution.
**ABOUT WAYSTAR**
Through a smart platform and better experience, Waystar helps providers simplify healthcare payments and yield powerful results throughout the complete revenue cycle.
Waystar's healthcare payments platform combines innovative, cloud-based technology, robust data, and unparalleled client support to streamline workflows and improve financials so providers can focus on what matters most: their patients and communities. Waystar is trusted by 1M+ providers, 1K+ hospitals and health systems, and is connected to over 5K commercial and Medicaid/Medicare payers. We are deeply committed to living out our organizational values: honesty; kindness; passion; curiosity; fanatical focus; best work, always; making it happen; and joyful, optimistic & fun.
Waystar products have won multiple Best in KLAS or Category Leader awards since 2010 and earned multiple #1 rankings from Black Book surveys since 2012. The Waystar platform supports more than 500,000 providers, 1,000 health systems and hospitals, and 5,000 payers and health plans. For more information, visit waystar.com or follow @Waystar on Twitter.
**WAYSTAR PERKS**
+ Competitive total rewards (base salary + bonus, if applicable)
+ Customizable benefits package (3 medical plans with Health Savings Account company match)
+ We offer generous paid time off for our non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays. We also offer flexible time off for our exempt team members + 13 paid holidays
+ Paid parental leave (including maternity + paternity leave)
+ Education assistance opportunities and free LinkedIn Learning access
+ Free mental health and family planning programs, including adoption assistance and fertility support
+ 401(K) program with company match
+ Pet insurance
+ Employee resource groups
Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
**Job Category:** Technology/Engineering
**Job Type:** Full time
**Req ID:** R2880
$99k-128k yearly est. 44d ago
Sr. Specialist Site Reliability Engineer
Waystar 4.6
Atlanta, GA jobs
We are looking for a talented and driven Site Reliability Engineering (SRE) Specialist to support our engineering team, which manages the infrastructure and services that power our Waystar products. This role is ideal for an experienced engineer who thrives in data-intensive environments and is passionate about building reliable, scalable systems that ensure data integrity, availability, and performance.
As an SRE Specialist, you'll work closely with engineering, product, and data teams to ensure our data licensing platforms are resilient, observable, and continuously improving.
WHAT YOU'LL DO
* System Reliability & Performance
* Design and implement reliability solutions for data ingestion, processing, and delivery pipelines.
* Define and maintain SLIs/SLOs for data licensing services and manage error budgets.
* Build automation for deployment, monitoring, and incident response.
* Observability & Monitoring
* Enhance system observability through metrics, logging, and tracing.
* Develop and maintain dashboards and alerts to proactively detect and resolve issues.
* Incident Response & Postmortems
* Participate in on-call rotations and lead incident response efforts.
* Conduct root cause analysis and drive post-incident improvements.
* Maintain runbooks and operational documentation.
* Collaboration & Continuous Improvement
* Partner with software and data engineers to embed reliability into system design.
* Contribute to blameless postmortems and reliability reviews.
* Share knowledge and mentor junior team members.
WHAT YOU'LL NEED
* 7+ years of experience in SRE, DevOps, or infrastructure engineering.
* Strong understanding of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
* Experience with observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
* Familiarity with data platforms, ETL pipelines, and distributed systems.
* Excellent problem-solving and communication skills.
* Experience with Python, PowerShell, and similar languages.
* Active use of artificial intelligence (AI) tools and techniques to enhance performance, drive innovation, and improve decision-making across business functions
* Ability to leverage AI tools and platforms to streamline workflows, improve decision-making, and drive innovation
* Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation
Preferred Qualifications
* Experience with data licensing, data governance, or data compliance frameworks.
* Exposure to data pipeline tools (e.g., Apache Airflow, Kafka, Spark).
* Familiarity with regulatory requirements related to data usage and distribution.
ABOUT WAYSTAR
Through a smart platform and better experience, Waystar helps providers simplify healthcare payments and yield powerful results throughout the complete revenue cycle.
Waystar's healthcare payments platform combines innovative, cloud-based technology, robust data, and unparalleled client support to streamline workflows and improve financials so providers can focus on what matters most: their patients and communities. Waystar is trusted by 1M+ providers, 1K+ hospitals and health systems, and is connected to over 5K commercial and Medicaid/Medicare payers. We are deeply committed to living out our organizational values: honesty; kindness; passion; curiosity; fanatical focus; best work, always; making it happen; and joyful, optimistic & fun.
Waystar products have won multiple Best in KLAS or Category Leader awards since 2010 and earned multiple #1 rankings from Black Book surveys since 2012. The Waystar platform supports more than 500,000 providers, 1,000 health systems and hospitals, and 5,000 payers and health plans. For more information, visit waystar.com or follow @Waystar on Twitter.
WAYSTAR PERKS
* Competitive total rewards (base salary + bonus, if applicable)
* Customizable benefits package (3 medical plans with Health Savings Account company match)
* We offer generous paid time off for our non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays. We also offer flexible time off for our exempt team members + 13 paid holidays
* Paid parental leave (including maternity + paternity leave)
* Education assistance opportunities and free LinkedIn Learning access
* Free mental health and family planning programs, including adoption assistance and fertility support
* 401(K) program with company match
* Pet insurance
* Employee resource groups
Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
$99k-128k yearly est. Auto-Apply 43d ago
Sr. Specialist Site Reliability Engineer
Waystar 4.6
Atlanta, GA jobs
We are looking for a talented and driven Site Reliability Engineering (SRE) Specialist to support our engineering team, which manages the infrastructure and services that power our Waystar products. This role is ideal for an experienced engineer who thrives in data-intensive environments and is passionate about building reliable, scalable systems that ensure data integrity, availability, and performance.
As an SRE Specialist, you'll work closely with engineering, product, and data teams to ensure our data licensing platforms are resilient, observable, and continuously improving.
WHAT YOU'LL DO
System Reliability & Performance
Design and implement reliability solutions for data ingestion, processing, and delivery pipelines.
Define and maintain SLIs/SLOs for data licensing services and manage error budgets.
Build automation for deployment, monitoring, and incident response.
Observability & Monitoring
Enhance system observability through metrics, logging, and tracing.
Develop and maintain dashboards and alerts to proactively detect and resolve issues.
Incident Response & Postmortems
Participate in on-call rotations and lead incident response efforts.
Conduct root cause analysis and drive post-incident improvements.
Maintain runbooks and operational documentation.
Collaboration & Continuous Improvement
Partner with software and data engineers to embed reliability into system design.
Contribute to blameless postmortems and reliability reviews.
Share knowledge and mentor junior team members.
WHAT YOU'LL NEED
7+ years of experience in SRE, DevOps, or infrastructure engineering.
Strong understanding of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
Experience with observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
Familiarity with data platforms, ETL pipelines, and distributed systems.
Excellent problem-solving and communication skills.
Experience with Python, PowerShell, and similar languages.
Active use of artificial intelligence (AI) tools and techniques to enhance performance, drive innovation, and improve decision-making across business functions
Ability to leverage AI tools and platforms to streamline workflows, improve decision-making, and drive innovation
Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation
Preferred Qualifications
Experience with data licensing, data governance, or data compliance frameworks.
Exposure to data pipeline tools (e.g., Apache Airflow, Kafka, Spark).
Familiarity with regulatory requirements related to data usage and distribution.
ABOUT WAYSTAR
Through a smart platform and better experience, Waystar helps providers simplify healthcare payments and yield powerful results throughout the complete revenue cycle.
Waystar's healthcare payments platform combines innovative, cloud-based technology, robust data, and unparalleled client support to streamline workflows and improve financials so providers can focus on what matters most: their patients and communities. Waystar is trusted by 1M+ providers, 1K+ hospitals and health systems, and is connected to over 5K commercial and Medicaid/Medicare payers. We are deeply committed to living out our organizational values: honesty; kindness; passion; curiosity; fanatical focus; best work, always; making it happen; and joyful, optimistic & fun.
Waystar products have won multiple Best in KLAS or Category Leader awards since 2010 and earned multiple #1 rankings from Black Book™ surveys since 2012. The Waystar platform supports more than 500,000 providers, 1,000 health systems and hospitals, and 5,000 payers and health plans. For more information, visit waystar.com or follow @Waystar on Twitter.
WAYSTAR PERKS
Competitive total rewards (base salary + bonus, if applicable)
Customizable benefits package (3 medical plans with Health Savings Account company match)
We offer generous paid time off for our non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays. We also offer flexible time off for our exempt team members + 13 paid holidays
Paid parental leave (including maternity + paternity leave)
Education assistance opportunities and free LinkedIn Learning access
Free mental health and family planning programs, including adoption assistance and fertility support
401(K) program with company match
Pet insurance
Employee resource groups
Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
$99k-128k yearly est. Auto-Apply 16d ago
Senior Site Reliability Engineer, ARO (OpenShift/Kubernetes/Azure, Golang, Linux)
Red Hat 4.6
Remote
Red Hat is seeking a Senior Site Reliability Engineer (SRE) to develop, scale, and operate our Azure Red Hat OpenShift managed cloud services. OpenShift is Red Hat's enterprise Kubernetes distribution. As an SRE you will contribute to running OpenShift at scale by enabling customer self-service, making our monitoring system more sustainable, and eliminating work through automation.
On the SRE team, you will have the opportunity to influence the complex challenges of scale which are unique to Red Hat managed cloud services, while using your skills in coding, operations, and large-scale distributed system design.
Red Hat relies on teamwork and openness for its success. We are a global team and strive to cultivate a transparent environment that makes room for different voices. We learn from our failures in a blameless environment to support the continuous improvement of the team. At Red Hat, your individual contributions have more visibility than at most large companies, and visibility means career opportunities and growth.
**What You Will Do**
The day-to-day responsibilities of an SRE involve working with live systems and coding automation. As an SRE you will be expected to:
+ Contribute code to increase the scalability and reliability of the service
+ Contribute software tests and participate in peer review to increase the quality of our codebase
+ Help and develop peers' capabilities through knowledge sharing, mentoring, and collaboration
+ Participate in a regular on-call schedule, including occasional paid weekends and holidays
+ Practice sustainable incident response and blameless postmortems
+ Resolve customer issues escalated from the Red Hat Global Support team
+ Work within a small agile team to develop and improve SRE software, support your peers, plan and self-improve
+ Explore and experiment with emerging AI technologies relevant to software development, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling
**What You Will Bring**
+ Bachelor's degree in Computer Science, Engineering, or related field; equivalent practical experience will also be considered.
+ Strong experience (5+ years) in at least one programming language (Golang, C, C++, Python, Java) and software life cycles
+ Hands-on experience with public cloud platforms (AWS, GCP, Azure), preferably Azure.
+ Direct experience with Kubernetes or OpenShift is a major plus. We like to see a demonstrated ability to debug, optimize code, and automate routine tasks; 4+ years is desired.
+ Experience with Docker based containers
+ Strong collaboration and problem-solving skills in distributed, team-based environments.
+ Experience troubleshooting as-a-service offerings (SaaS/PaaS) and working with complex distributed systems.
+ Working knowledge of Linux/Unix operating systems.
+ Proven ability to automate repetitive tasks and debug performance issues.
+ Have the ability to collaboratively troubleshoot and solve problems in a remote and distributed team setting.
+ As an SRE you will be most successful if you have some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems.
**The following are considered a plus:**
+ Strong experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Microsoft Azure, Amazon Web Services (AWS) or Google Compute Engine (GCE)
+ Strong experience with enterprise systems monitoring; knowledge of Prometheus is a plus
+ Experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
+ Strong experience delivering a hosted service
+ Demonstrated ability to quickly and accurately troubleshoot system issues
+ Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
+ Solid communications skills and experience working directly with and presenting to customers
Don't quite match the above but have strong developer skills? We'd still love to hear from you. We also welcome mid-level SREs with great potential.
**About Red Hat**
Red Hat is the world's leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. Spread across 40+ countries, our associates work flexibly across work environments, from in-office, to office-flex, to fully remote, depending on the requirements of their role. Red Hatters are encouraged to bring their best ideas, no matter their title or tenure. We're a leader in open source because of our open and inclusive environment. We hire creative, passionate people ready to contribute their ideas, help solve complex problems, and make an impact.
**Inclusion at Red Hat**
Red Hat's culture is built on the open source principles of transparency, collaboration, and inclusion, where the best ideas can come from anywhere and anyone. When this is realized, it empowers people from different backgrounds, perspectives, and experiences to come together to share ideas, challenge the status quo, and drive innovation. Our aspiration is that everyone experiences this culture with equal opportunity and access, and that all voices are not only heard but also celebrated. We hope you will join our celebration, and we welcome and encourage applicants from all the beautiful dimensions that compose our global village.
**Equal Opportunity Policy (EEO)**
Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.
**Red Hat does not seek or accept unsolicited resumes or CVs from recruitment agencies. We are not responsible for, and will not pay, any fees, commissions, or any other payment related to unsolicited resumes or CVs except as required in a written contract between Red Hat and the recruitment agency or party requesting payment of a fee.**
**Red Hat supports individuals with disabilities and provides reasonable accommodations to job applicants. If you need assistance completing our online job application, email** **application-assistance@redhat.com** **. General inquiries, such as those regarding the status of a job application, will not receive a reply.**
$90k-115k yearly est. 60d+ ago
Senior Site Reliability Engineer
Gearbox Software 4.1
Frisco, TX jobs
The Gearbox Entertainment Company is an award-winning creator and distributor of entertainment for people around the world. Gearbox Entertainment develops and publishes products through its subsidiaries, Gearbox Software and Gearbox Publishing. Gearbox Entertainment has become widely known for successful game franchises including Brothers in Arms and Borderlands, as well as acquired properties Duke Nukem and Homeworld. Gearbox's ambition is to entertain the world and its key driving objectives include the pursuit of happiness for our talent, partners and customers, the prioritization of entertainment and creativity and a measured respect for profitability. For more information, visit ****************
To further drive our vision of premier stability and rapid feature delivery, we are looking for a Senior Site Reliability Engineer to join our team. As a Senior SRE, you should feel exceptionally comfortable bringing architectural design proposals to the table for consideration among your colleagues on our platform and infrastructure development teams. You will be one of the principal technical designers helping push our cloud-native platform toward the future. You will be responsible for driving the implementation of flexible cloud architectures with an automation-first emphasis; manual user intervention likely makes you uneasy and maybe even a little twitchy. We would expect a successful candidate for this position to be a self-starter with the ability to complete tasks independently. Though you will have access to technical leadership and senior engineers at your disposal, you should feel well acquainted with tackling complex problems without significant oversight.
Observability is paramount. If we can't measure it, we can't prove it works; if we can't prove it works, it must be assumed it doesn't work. This is a philosophy you hopefully love (and preferably obsess over). If we can't observe how a new feature is behaving, our SRE team is excited to dive into the application code and make the necessary improvements.
Typical Day
TL;DR: You will be deeply immersed in Go and Python observability stacks; plenty of AWS and Terraform sprinkled in as well. This is a very hands-on Senior Engineering role where your days will be filled with building solutions to technical challenges in the observability and availability of our SHiFT online services. You will evangelize for and be obsessed with user experience as it relates to the services you support. You will help manage and orchestrate each of these by leaning heavily on technologies like Go, Terraform, Docker, and Bash.
On any given day, you should expect to spend at least 80% of your time actively engineering and developing solutions; the rest will be a mixture of planning, reviewing code from your colleagues, participating in design meetings, documentation, and self-development. This position will eventually require you to carry a company-paid mobile device and participate in 24/7 on-call rotations alongside your engineering colleagues. Don't worry though, our on-call experience doesn't suck.
Core Responsibilities:
* Design, engineer, and develop solutions for ensuring the observability and reliability of our online platform
* Be a trusted voice in the evangelism of reliability engineering throughout the team with an eagerness for mentoring other developers on the team
* Help define and oversee short and mid-term project roadmaps for the future of our SRE team
* Participate in after-hours on-call support rotations
Must Have (the non-negotiable parts):
* Candidates must have at least 4 years of professional experience instrumenting complex observability stacks in object-oriented programming languages, preferably Go.
* Proficiency in AWS container management, orchestration, and observability features (ECS, Fargate, Aurora, AppConfig, CloudWatch, etc.)
* Professional experience managing AWS access and security services (IAM, KMS, Secrets Manager, WAFv2, etc.)
* Professional experience in Terraform and/or CloudFormation
* Minimum of 2 years of experience with containers in a professional setting, preferably Docker
* Adept understanding of observability stack management (OTel, tracing, monitoring, alerting, structured logging, APM, etc.)
* Comfortable communicator, able to clearly detail designs and implementations on an individual level and in large group settings
Should Have (some wiggle room):
* Extensive hands-on experience with OpenTelemetry
* Hands-on experience developing and maintaining CI/CD pipelines, preferably in git/GitLab
* Understanding of RESTful and WebSocket-based APIs
* Bachelor's degree in computer science, related field, or equivalent training and professional experience
Now you're just showing off:
* Familiarity with Datadog
* Familiarity with Atlassian products (OpsGenie, JIRA, Confluence)
* Experience working with developers in an agile environment
* Experience in the games industry, preferably launching multiple online-enabled AAAs
* Knowledge about Gearbox-owned IPs
Gearbox Entertainment believes that all team members should be able to enjoy a work environment free from all forms of discrimination and harassment. We are committed to reflecting the diversity of the world we strive to entertain. As an Equal Opportunity Employer, we provide fair and equal treatment to all team members and applicants. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, disability, genetic information, pregnancy or maternity, veteran status, or any other status protected by applicable national, federal, state or local law.
$110k-144k yearly est. Auto-Apply 60d+ ago
Site Reliability Engineer (SRE) with strong Middleware expertise
Hexaware Technologies, Inc. 4.2
Remote
What Working at Hexaware offers:
Hexaware is a dynamic and innovative IT organization committed to delivering cutting-edge solutions to our clients worldwide. We pride ourselves on fostering a collaborative and inclusive work environment where every team member is valued and empowered to succeed.
Hexaware provides access to a vast array of tools that enhance, revolutionize, and advance your professional profile. We complete the circle with excellent growth opportunities, chances to collaborate with highly visible customers, chances to work alongside bright minds, and a healthy work-life balance.
With an ever-expanding portfolio of capabilities, we delve deep into and identify the source of our motivation. Although technology is at the core of our solutions, it is still the people and their passion that fuel Hexaware's commitment towards creating smiles.
“At Hexaware we encourage to challenge oneself to achieve full potential and propel growth. We trust and empower to disrupt the status quo and innovate for a better future. We encourage an open and inspiring culture that fosters learning and brings talented, passionate, and caring people together.”
We are always interested in, and want to support, the professional and personal you. We offer a wide array of programs to help expand skills and supercharge careers. We help you discover your passion, the driving force that makes you smile, innovate, create, and make a difference every day.
The Hexaware Advantage: Your Workplace Benefits
· Excellent Health benefits with low-cost employee premium.
· Wide range of voluntary benefits such as Legal, Identity theft and Critical Care Coverage
· Unlimited training and upskilling opportunities through Udemy and Hexavarsity
Who we are?
At Hexaware Technologies, we are a leading global IT Services company, dedicated to driving digital transformation and innovation for businesses around the world. Founded in 1990, Hexaware has grown into a global trusted partner for enterprises, offering comprehensive AI empowered services including IT Consulting, Application Development, Infrastructure and Cloud Management and Business Process services.
At Hexaware we are a community of creative, diverse, and open-minded Hexawarians creating smiles through the power of great people and technology.
We pride ourselves on our people-centric culture and commitment to sustainability. Our diverse team of over 30,000 professionals across 30 countries is driven by a shared passion for innovation and excellence. We foster a collaborative environment where creativity and continuous learning are encouraged, enabling our employees to thrive and grow.
Position: Site Reliability Engineer (SRE) with strong Middleware expertise
Location: Plano, TX (5 days onsite & 24x7 rotational)
Shift: Rotational (Shift 1: 8 AM - 5 PM; Shift 2: 4 PM - 1 AM; Shift 3: 12 AM - 9 AM), including weekends per the roster
Duties and Responsibilities:
Job Summary: We are seeking a Site Reliability Engineer (SRE) with strong Middleware expertise to design, operate, and continuously improve highly available, secure, and scalable enterprise platforms. This role blends deep middleware operations (WebLogic, API gateways, Java platforms) with SRE principles such as automation, observability, SLIs/SLOs, error budgets, and incident reduction. The ideal candidate will partner with application, infrastructure, security, and DevOps teams to ensure platform reliability while driving automation, standardization, and operational excellence.
Key Responsibilities:
Reliability & SRE Practices:
Define, implement, and track SLIs, SLOs, and error budgets for middleware and platform services
Drive MTTR reduction, availability improvements, and operational resilience
Lead incident response, root cause analysis (RCA), and post-incident reviews
Implement proactive monitoring and alerting to reduce noise and prevent outages
Middleware Platform Engineering:
Administer and support enterprise middleware platforms including:
Oracle WebLogic, Apache, NGINX
API Gateways (Apigee Edge / X)
Java application servers and JVM-based services
Perform patching, upgrades, configuration tuning, and capacity planning
Manage certificates, keystores, truststores, and TLS configurations
Ensure platform security, compliance, and performance standards
Observability & Monitoring:
Design and maintain end-to-end observability using tools such as:
Dynatrace, ELK/Kibana, Splunk (or equivalent)
Build executive and operational dashboards for real-time health visibility
Reduce alert fatigue through smart alerting, thresholds, and suppression
Monitor JVM metrics, GC behavior, thread utilization, and API performance
Automation & Infrastructure Efficiency:
Develop automation and self-healing solutions using:
Shell scripting, Python, Ansible, Terraform, or similar tools
Automate routine operational tasks (restarts, validations, health checks)
Enable CI/CD-friendly middleware deployments and configuration management
Standardize environments across DEV / QA / UAT / PROD
Cloud, Containers & Modern Platforms:
Support middleware workloads on:
Kubernetes / OpenShift
Public or hybrid cloud environments (AWS, Azure, GCP)
Integrate platform reliability into containerized and microservices architectures
Collaborate with DevOps teams on deployment pipelines and release strategies
Collaboration & Leadership:
Act as a reliability advisor to application and development teams
Partner with Unix/Linux, Database, Network, and Security teams
Provide mentoring, documentation, and best-practice guidance
Participate in on-call rotations and production support leadership
Required Skills & Experience:
Technical Skills:
6+ years of experience in Middleware / Platform Operations / SRE
Strong expertise in WebLogic, Java middleware, Apache/NGINX
Hands-on experience with observability platforms (Dynatrace, ELK, Splunk)
Solid understanding of Linux/Unix systems and networking fundamentals
Experience with API platforms (Apigee preferred)
Automation and scripting skills (Shell, Python, Ansible, Terraform)
Experience with Kubernetes/OpenShift and containerized workloads
SRE & Operational Excellence:
Practical experience implementing SRE principles in production
Strong troubleshooting skills (thread dumps, heap analysis, GC logs)
Experience with incident management, RCA, and change management
Ability to balance reliability vs delivery velocity
Nice-to-Have:
Experience with cloud-native architectures and service meshes
Knowledge of IAM / Security integrations (OAuth, SAML, mTLS)
Exposure to CI/CD tools (Jenkins, GitHub Actions, GitLab CI)
Experience supporting 24x7 enterprise environments
ITIL or SRE certifications
What you'll get from us:
• Competitive Salary
• Company Pension Scheme
• Comprehensive Health Insurance
• Flexible Work Hours and Hybrid Work Options
• XX days paid annual holidays + public holidays.
• Professional Development and Training Opportunities
• Employee Assistance Program (EAP)
• Diversity, Equity, and Inclusion Initiatives
• Company Events and Team-Building Activities
Equal Opportunities Employer:
Hexaware Technologies is an equal opportunity employer. We are dedicated to providing a work environment free from discrimination and harassment. All employment decisions at Hexaware are based on business needs, job requirements, and individual qualifications. We do not discriminate based on race including colour, nationality, ethnic or national origin, religion or belief, sex, age, disability, marital status, sexual orientation, parental status, gender reassignment, or any other status protected by law. We encourage candidates of all backgrounds to apply.