Senior Reliability Engineer jobs at Planet DDS

- 1107 jobs

Site Reliability Engineer
Matlen Silver 3.7
Columbus, OH jobs
Title: Senior Cloud Security Engineer/Architect. Environment: Onsite. Duration: 6-month contract to hire. Contract pay: $68-$90/hour W2. Conversion salary: $150k-$188k. No C2C. Due to client requirements, US Citizen or GC Holder ONLY. Requirements: 13+ years of professional experience in Cloud Infrastructure, DevOps, or Site Reliability Engineering. Strong Infrastructure as Code (IaC) expertise with Terraform, including hands-on experience creating and managing EKS clusters, repositories, and Terraform modules. Architect, implement, and manage Azure IaaS infrastructure encompassing VNets, subnets, network security groups, VPN gateways, CDNs, Traffic Manager, peering, custom routes, DNS, DHCP, and virtual appliances. Proven proficiency across Azure and/or AWS (multi-cloud experience preferred). Strong security mindset with practical experience in IAM, vulnerability remediation, encryption, and patching. Solid understanding of DNS, Docker, Kubernetes, and containerization best practices. Experience with Windows and Linux/Unix system and network administration (8+ years). Proficiency in one or more programming/scripting languages: Python, Go, Bash, or Ruby. Expertise in Terraform, Ansible, or Chef for automation and configuration management. Hands-on experience with cloud services (AWS, Azure, GCP), including EC2, S3, Kubernetes, and serverless environments. Knowledge of networking fundamentals: DNS, firewalls, load balancing, and VPNs. Experience with container orchestration using Docker, Kubernetes, or OpenShift. Experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, or New Relic. CI/CD pipeline development using Jenkins, GitLab CI, GitHub Actions, or CircleCI. Bonus: Experience with HashiCorp Vault and advanced Terraform module design. Deep understanding of access control, encryption standards, secure coding practices, and regulatory frameworks. Skilled in incident management, root cause analysis, automation, and performance tuning.
Understanding of SLOs/SLAs, system scalability, redundancy, and resilience best practices.
$150k-188k yearly 4d ago
Sr. Site Reliability Engineer (SRE)
Avenue Code 3.5
Mountain View, CA jobs
About the Opportunity: We're seeking an experienced, highly collaborative SRE to partner with product teams and tackle our most critical infrastructure challenges. You'll be hands-on in designing, building, and operating our cloud platform, and in driving the reliability, performance, and security that empower our engineering organization. Responsibilities: Infrastructure as Code & CI/CD: Automate provisioning and deployments with Terraform and integrate best-practice pipelines (GitHub Actions, ArgoCD, etc.). Reliability Engineering: Define SLIs/SLOs, manage error budgets, and build dashboards & alerts to proactively measure and improve system health. Security & Compliance: Enforce least-privilege IAM policies, automate vulnerability scans, and maintain audit logging for compliance. Monitoring & Observability: Instrument services with metrics, logs, and distributed tracing to enable rapid troubleshooting, and aid teams with alerting, custom metrics, and dashboarding. Incident Management: Own on-call rotations, lead real-time incident response, conduct post-mortems, and drive continuous improvements. Cost Optimization: Implement tagging strategies, right-size resources, and leverage concrete data to decide on optimal methods to control cloud spend at scale. Documentation & Mentorship: Author runbooks, standards, and best-practice guides, and coach dev teams on implementing modern DevOps, reliability, and security patterns. Required Qualifications: Have 5+ years of experience running production critical systems. Deep proficiency with the AWS Cloud and Cloud-Native best practices. Experience with Kubernetes (EKS, GKE) and Container Orchestration at scale. Skilled in Terraform to declaratively provision and maintain infrastructure services. Working knowledge of managing and debugging databases like Redis and Postgres. Strong familiarity with VPC, VPN, Load Balancing, and cloud networking components.
Proficiency with Git workflows, branching strategies, and CI/CD system integrations. Solid understanding of web and network protocols and standards (HTTP, REST, TLS, DNS, etc.). Professional proficiency in English (both written and spoken) is required for this role. Nice to Have Skills: Bachelor's degree, or equivalent in Computer Science, Engineering, or a related field. Experience with ArgoCD, GitHub Actions, Jenkins, or other CI/CD pipeline solutions. Working knowledge of Python, Golang, and Helm templating languages. Node.js experience a plus, including running scalable, resilient Node microservices. Grasp of foundational security best practices for cloud infrastructure. Awareness of Terragrunt, managing Terraform state, and optimal project structure. Seasoned in production readiness fundamentals amidst a fast-moving team. Avenue Code reinforces its commitment to privacy and to all the principles guaranteed by the most accurate global data protection laws, such as GDPR, LGPD, CCPA and CPRA. The Candidate data shared with Avenue Code will be kept confidential and will not be transmitted to disinterested third parties, nor will it be used for purposes other than the application for open positions. As a Consultancy company, Avenue Code may share your information with its clients and other Companies from the CompassUol Group to which Avenue Code's consultants are allocated to perform its services.
$144k-188k yearly est. 4d ago
Site Reliability Engineer
Ascendum Solutions 4.5
Cincinnati, OH jobs
On-site role, 5 days/week. Candidates should be eligible to work for any employer in the United States without needing Visa sponsorship. As a Site Reliability Engineer/DevOps Engineer, you will be responsible for ensuring the availability, performance, and reliability of Fulfillment Technology solutions to support an omni-channel strategy. You will work closely with the development, testing, and operations teams to design, implement, and maintain scalable, reliable, and efficient solutions for the production environment. You will also troubleshoot and resolve any issues that may arise in the production systems, using various tools and techniques such as monitoring, logging, alerting, automation, and incident management. You will also contribute to the continuous improvement of the DevOps practices and processes, such as CI/CD, configuration management, infrastructure as code, and cloud computing. You will have a strong background in software engineering, system administration, networking, and cloud technologies. You will also have excellent communication and collaboration skills, as well as a passion for learning new technologies and solving complex problems. Minimum Position Qualifications: 4+ years of experience in cloud SRE/DevOps/Infrastructure or a related field. 4+ years of experience working with databases, web applications and micro-services, event-driven applications, messaging systems, REST APIs and integrations, cloud, support tools, observability and containerization technologies.
Knowledge of Java, Spring Boot, Microservices, Kafka, Cassandra & SQL Server. Proficiency in scripting languages such as Python / Shell scripting. 1 year of experience managing System Observability tools (DynaTrace, ELK, PagerDuty, Datadog, Azure Monitor, Grafana, etc.). Hands-on experience with GitHub Actions for CI/CD automations. Knowledge of Linux architecture, security, administration, performance monitoring/tuning, troubleshooting, and production operations. Demonstrated skill in working in an Agile environment. Demonstrated skill in working with multi-location global teams. Proven ability to think and contribute at the strategic level. Demonstrated knowledge of eCommerce, Fulfillment, or Retail Technology solutions. Demonstrated written, oral and presentation/public speaking communication skills. Desired Previous Experience/Education: 4+ years of experience in designing/working in high volume eCommerce applications. 2+ years of experience configuring and managing cloud infrastructure (Azure, AWS, GCP). 1 year of experience with technologies such as Apache Kafka, Azure Cosmos DB, Apache Cassandra, Ansible, Terraform, Docker and Kubernetes. Experience with Nginx, HAProxy, Squid. Experience with CI/CD pipelines using tools such as Jenkins, Spinnaker, Azure DevOps, TeamCity, etc. Proficient in implementing and managing RoyalTS or similar cross-platform remote management solutions, ensuring secure and efficient remote access and system administration across diverse environments.
Responsibilities: Partner and collaborate with application engineering, observability, and other support teams as well as our business operation partners and third parties (as appropriate) to prioritize, address and drive the resolution of issues and incidents that impact customer pickup or delivery domains. Drive root-cause analysis of critical business and production issues to prevent future occurrences and review/approve potential solutions. Lead Major Incident calls impacting the Pickup Fulfillment domain and provide clear, timely updates on status of service restoration to key stakeholders. Work with the engineering teams to continuously implement and improve reliable and speedy build environments. Increase automation to improve efficiency and quality. Ensure traceability, observability, and retrievability of system behavior. Build logging, monitoring, and alerting systems to identify bottlenecks and assist with debugging, analysis, and optimization in cloud, on-prem and store environments. Craft solid and clearly explained designs, playbooks, and documentation. Participate in an off-hours on-call rotation, and perform periodic off-hours work during maintenance windows.
$75k-99k yearly est. 1d ago
Quality Engineer with Automation (Only W2 resources)
Tek Leaders Inc. 3.9
Cincinnati, OH jobs
Expert-level automation experience in: Selenium for UI validation; Playwright for modern and responsive web interfaces; Karate Framework for REST and SOAP API validation testing (using tools such as Postman, SoapUI, or Rest Assured). Experience testing in Retail Point-of-Sale (POS) environments or similar retail systems. Git, Jenkins, CI/CD pipelines, and integration with test management tools (e.g., QMetry, JIRA, Zephyr). BDD/TDD, data-driven testing, and mocking service layers. Coding/scripting skills in Java, JavaScript, or Python.
$61k-80k yearly est. 4d ago
Quality Validation Engineer (USC/GC/EAD)
Raas Infotek 4.1
San Diego, CA jobs
Hi, hope you are doing well. I have an immediate requirement; please let me know if you are interested in this role. Job Title: Quality Validation Engineer (USC/GC/EAD). Mode: Contract. Type: C2C/W2. Job Description: Product software as a medical device: verification and validation activities for new products (Quality Engineer supporting the R&D team). Medical device R&D product software. The resource shall have a background in the medical/pharma domain. The resource shall have product software validation experience and a minimum of 2 to 3 years of experience in Quality. The resource shall have experience as a Software Quality Engineer or Validation Engineer and Quality Engineer. Job Summary: Provides technical and quality system guidance related to establishing product software as a medical device requirements. Provides quality oversight for product software as a medical device verification and validation activities for new products in accordance with design planning procedures. This includes, but is not limited to, reviewing and approving software test case protocols and reports, review of software development plans, and review of other system and software documentation. Leads meetings to prioritize, review and/or approve action plans for addressing issues captured in problem resolution systems during development. Leads risk evaluation and associated management activities related to product software development including risk assessments (e.g. FMEA), product risk analysis, and mitigation of software issues. Participates in technical and management reviews to ensure design plans, product design and deliverables related to product software are met. Represents the quality engineering function for the review and approval of designated design controls. May provide quality oversight for non-product software validation by assessing the need for validation and preparing and/or supporting protocols, reports and other documentation as required.
May be involved with supporting product cybersecurity assessments in conjunction with a cross-functional team. Complies with US FDA regulations, other country regulatory requirements, company policies, and procedures. Maintains a strong, collaborative partnership with cross-functional team members, especially with the software supplier. Works as an individual contributor and may provide guidance to other QE team members. Thank you, Deepak Singh. Email: ****************************
$78k-103k yearly est. 3d ago
Semiconductor Packaging engineer
Vista Applied Solutions Group Inc. 4.0
Santa Clara, CA jobs
This role is highly specialized in semiconductor packaging design, requiring strong EDA tool proficiency and knowledge of advanced packaging technologies. Responsibilities: Tools & Knowledge: Mentor/Siemens and Cadence tools (especially for Package Layout Automation - PLA). Technical Expertise: Multi-layer package design experience. Understanding of substrate manufacturing Design Rules and Assembly Rules. Familiarity with SIPI (Signal Integrity & Power Integrity) Rules. Flip-chip package design concepts. Tasks: Perform point-to-point connections. Run DRC (Design Rule Checks), identify root causes, and fix issues. Execute design based on provided schematics, including component placement and constraint setup.
$84k-114k yearly est. 5d ago
Senior Site Reliability Engineer
Gradle Inc. 4.1
San Francisco, CA jobs
Job Description. Who We Are: Develocity is a first-of-its-kind toolchain observability and acceleration platform that helps software teams adopt and improve DORA capabilities (including continuous delivery) in order to achieve software delivery excellence. It combines build and test acceleration with deep observability for builds and tests with Gradle Build Tool, Apache Maven™, sbt, npm, and Python, and applies to both CI and local builds and tests. Ultimately, Develocity provides an operational layer across an organization's toolchains to speed up, troubleshoot, and optimize local developer and remote CI feedback loops. Our software is used by some of the world's leading software organizations, such as Netflix, Airbnb, SAP, several top ten banks, and many other major customers across all verticals. We regularly collaborate with these and other users to make our products continuously better. We have partnered with the Apache Software Foundation, the Commonhaus Foundation, the Scala Center, the Micronaut Foundation, and other OSS projects like Spring, Quarkus, Kotlin, JUnit, AndroidX, and many more to bring the values of Develocity also to the OSS Community. Our Values: Seek to Understand: Everything starts with listening and understanding, and we strive to understand different viewpoints, problems, and motivations. Before we take action, we ensure we truly grasp the challenges, perspectives, and goals. Know the Why: We approach our work with a clear sense of purpose, ensuring every step is deliberate and focused. We take meaningful action with urgency, but never at the expense of thoughtful consideration. Innovate & Iterate: We embrace challenges and are not afraid to try new things, even if they might fail. With deep understanding and a clear purpose, we can develop creative and bold solutions to tackle challenges. Own the Outcome: We are empowered to take initiative and we maintain transparency in our work and its outcomes.
When we execute, we take responsibility for our decisions, measure the success of our innovations, and learn from the results. Who You Are: We're building a new SRE team and looking for founding members to help shape how we operate. You'll be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries. You'll work on our internally-built Cloud Application Platform, Kubernetes on AWS, and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well. You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential. Responsibilities: Operate and maintain all Develocity instances and supporting services. Participate in a follow-the-sun on-call rotation, owning incident response and troubleshooting issues across the stack. Drive automation across application deployment, upgrades, monitoring, self-healing, and recovery. Build and maintain observability for all managed services (logging, metrics, tracing, and alerting). Work with engineering teams to build reliability into features from the start. Run incident response and retrospectives, and make sure we learn from them. Own disaster recovery, backups, and business continuity. Communicate with customers during incidents and maintenance windows. Optimize performance, resource usage, and costs. Help evolve our SaaS operations as we grow.
Minimum qualifications: 5+ years in SRE, DevOps, or equivalent role operating production services at scale. Strong Kubernetes experience in production environments. Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2). Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform). Track record of incident management and response. Knowledge of SRE best practices (SLAs, SLOs). Scripting proficiency (Python, Bash) for automation. Experience with 24/7 on-call rotations. Strong written and verbal English communication. Preferred qualifications: Experience operating SaaS platforms at scale. Familiarity with Develocity. JVM language experience (Java, Kotlin). Disaster recovery planning and execution experience. Customer-facing incident communication skills. Experience establishing SRE practices in new or growing teams. What We Offer: A ground-floor role in a new SRE team: you'll shape how we do things, not inherit someone else's decisions. Real ownership of production systems used by engineers at companies you've heard of. Direct interaction with customers when things go wrong (and when they go right). A culture that values automation over heroics. In-person meetings, such as our annual company offsite and team meetings. Work from home in a remote-first environment. Competitive salaries and equity grants. Compensation: The US salary range for this position is $150k-190k, which reflects the target ranges for all US locations. Within this range, individual pay is determined by geographic location and additional factors including but not limited to experience, relevant skills, qualifications, seniority, performance, and travel requirements. Our recruiting team can share more information about the specific salary range for your location during the hiring process. Location: Remote from anywhere in the PST timezone. While our team works remotely and is spread across the globe, we deeply value daily interactions and collaboration.
$150k-190k yearly 2d ago
Senior Reliability Engineer
KLA 4.4
Milpitas, CA jobs
KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays. The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. KLA focuses more than average on innovation and we invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world's leading technology providers to accelerate the delivery of tomorrow's electronic devices. Life here is exciting and our teams thrive on tackling really hard problems. There is never a dull moment with us. We are looking for a Senior Reliability Engineer to join our team in Milpitas! Job Description: Develop reliability program plans, reliability block diagrams and allocation models, FMECA, and FTA, and present them to senior management and engineers. Attend design reviews and provide feedback and assistance regarding design for reliability. Develop test plans and implement testing, including setup and maintenance of reliability test fixtures, in collaboration with design engineers. Predict life of components based on subsystem testing and system reliability based on IRONMAN/Marathons. Conduct failure analysis and suggest design improvements. Drive reliability improvement by monitoring tool performance, leading failure review boards and raising issues for resolution. Work with critical suppliers for reliability. Perform data analysis using appropriate reliability techniques such as Weibull, Duane, Crow-AMSAA, etc.
Responsible for reliability best practices for product. Participate in Product Life Cycle (PLC) management review meetings. Train design engineering community on reliability principles and techniques. Qualifications: Master's degree in mechanical engineering, physics, or other relevant engineering fields. 5+ years of experience in reliability. 3+ years of experience in the semiconductor industry. Experience with reliability of mechanical, opto-mechanical and electronic systems. Experience with laser reliability. Experience with data acquisition and control software. Proficiency in the use of Microsoft Office applications, including Outlook, Excel, PowerPoint, Word. Experience with Reliasoft Applications, e.g. Weibull++, Blocksim, Xfracas, would be a plus. Microsoft Project skills are preferred. Experience with electron beam sources is preferred. Minimum Qualifications: Doctorate (Academic) Degree and related work experience of 3 years; Master's Level Degree and related work experience of 6 years; Bachelor's Level Degree and related work experience of 8 years. Base Pay Range: $134,800.00 - $229,200.00 Annually. Primary Location: USA-CA-Milpitas-KLA. KLA's total rewards package for employees may also include participation in performance incentive programs and eligibility for additional benefits including but not limited to: medical, dental, vision, life, and other voluntary benefits, 401(K) including company matching, employee stock purchase program (ESPP), student debt assistance, tuition reimbursement program, development and career growth opportunities and programs, financial planning benefits, wellness benefits including an employee assistance program (EAP), paid time off and paid company holidays, and family care and bonding leave. Interns are eligible for some of the benefits listed. Our pay ranges are determined by role, level, and location. The range displayed reflects the pay for this position in the primary location identified in this posting.
Actual pay depends on several factors, including state minimum pay wage rates, location, job-related skills, experience, and relevant education level or training. We are committed to complying with all applicable federal and state minimum wage requirements where applicable. If applicable, your recruiter can share more about the specific pay range for your preferred location during the hiring process. KLA is proud to be an Equal Opportunity Employer. We will ensure that qualified individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us at ************************** or at *************** to request accommodation. Be aware of potentially fraudulent job postings or suspicious recruiting activity by persons that are currently posing as KLA employees. KLA never asks for any financial compensation to be considered for an interview, to become an employee, or for equipment. Further, KLA does not work with any recruiters or third parties who charge such fees either directly or on behalf of KLA. Please ensure that you have searched KLA's Careers website for legitimate job postings. KLA follows a recruiting process that involves multiple interviews in person or on video conferencing with our hiring managers. If you are concerned that a communication, an interview, an offer of employment, or that an employee is not legitimate, please send an email to ************************** to confirm the person you are communicating with is an employee. We take your privacy very seriously and confidentially handle your information.
$134.8k-229.2k yearly Auto-Apply 48d ago
Senior Site Reliability Engineer - Federal - 3rd Shift
Servicenow 4.7
San Diego, CA jobs
It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today - ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. But this is just the beginning of our journey. Join us as we pursue our purpose to make the world work better for everyone. Job Description: Please Note: This position will include supporting our US Public Sector customers. This position requires passing a ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards). This includes a credit check, criminal/misdemeanor check and taking a drug test. Any employment is contingent upon passing the screening. Due to Federal requirements, only US citizens, US naturalized citizens or US Permanent Residents, holding a green card, will be considered. The Federal SRE Team has 3 shifts that provide 24x7 production support for our Government Community Cloud infrastructure. Below are some highlights. No on-call rotation. Shift bonuses for 2nd and 3rd shifts. 4-day work week. This is a 3rd shift position and has a 10-hour 4-day work week, Sunday - Wednesday, with working shift hours from 11 p.m. - 10 a.m. PT. What you get to do in this role: The SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability and performance of the ServiceNow infrastructure. The SRE is empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are also tasked with driving forward the operability of the platform to drive down the number of incidents and to reduce MTTR.
To accomplish this, the team combines software development, networking and systems engineering expertise with a strong desire to be challenged by problems of scale and complexity and to make services better for our customers. Do you: Have a passion for DevOps, Automation and Scripting? Know Linux systems in depth? Have a software development background? Understand Observability and Monitoring? Have experience with Cloud technologies - Azure, AWS? Write code to automate repetitive tasks? Smile when you troubleshoot and solve an issue in London from your laptop in San Diego? Have years of experience accumulated in roles like IT Operations, Software Development, or Systems Engineering? Answer 'yes' to these questions and you are in the front row for a position as SRE at ServiceNow. Go ahead, hit the Apply button and let's have a chat about your skills and experiences. Want to know more about us? Now that we have set the pace, keep reading if you want to understand more about ServiceNow as a company and the SRE role. As an Engineer on the SRE team you will: Provide relief and sustainable resolution to issues within our infrastructure. Use your experience in software development, systems engineering, and networking to proactively prevent repeatable issues. Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design. Drive a culture of intolerance to manual activity which results in a highly automated environment delivering scalable solutions. Drive monitoring and automation initiatives. Qualifications: To be successful in this role you have: Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
Typically requires a Bachelor's degree and a minimum of 4+ years of related experience; or an advanced degree with 2+ years of related experience; or equivalent work experience. Deep knowledge of Linux systems. 2+ years of experience with DevOps automation, CI/CD pipelines and agile methodologies. 2+ years of experience with Cloud technologies, preferably Azure. 2+ years of coding in various languages; we normally prefer Python, JavaScript, and Ruby. MySQL database administration, troubleshooting, and performance tuning. Expertise in Observability and Monitoring of applications, services, and networks at scale. Networking skills, IP addressing and routing. Nice to Have: Experience developing on the ServiceNow Platform. GCS-23. For positions in this location, we offer a base pay of $111,200 - $172,400, plus equity (when applicable), variable/incentive compensation and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the base pay shown is a guideline, and individual total compensation will vary based on factors such as qualifications, skill level, competencies, and work location. We also offer health plans, including flexible spending accounts, a 401(k) Plan with company match, ESPP, matching donations, a flexible time away plan and family leave programs. Compensation is based on the geographic location in which the role is located and is subject to change based on work location. Additional Information: Work Personas: We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.
Equal Opportunity Employer ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements. Accommodations We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact [email protected] for assistance. Export Control Regulations For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities. From Fortune. ©2025 Fortune Media IP Limited. All rights reserved. Used under license.
$111.2k-172.4k yearly 30d ago
Senior Site Reliability Engineer - Federal - 3rd Shift
Servicenow, Inc. 4.7
San Diego, CA jobs
It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today - ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. But this is just the beginning of our journey. Join us as we pursue our purpose to make the world work better for everyone. Please Note: This position will include supporting our US Public Sector customers. This position requires passing a **ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards)**. This includes a credit check, criminal/misdemeanor check and taking a drug test. Any employment is contingent upon passing the screening. **Due to Federal requirements, only US citizens, US naturalized citizens or US Permanent Residents, holding a green card, will be considered.** **The Federal SRE Team has 3 shifts that provide 24x7 production support for our Government Community Cloud infrastructure.** Below are some highlights. + No on-call rotation + Shift Bonuses for 2nd and 3rd shifts + 4 Day Work week + This is a **3rd shift position** and has a **10-hour 4-day work week, Sunday - Wednesday** with working shift hours from **11 p.m. - 10 a.m. PT.** **What you get to do in this role:** The SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability and performance of the ServiceNow infrastructure. The SRE is empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are also tasked with driving forward the operability of the platform to drive down the number of incidents and to reduce MTTR. 
To accomplish this the team combines software development, networking and systems engineering expertise with a strong desire to be challenged by problems of scale and complexity and to make services better for our customers. **Do you:** + Have a passion for **DevOps, Automation and Scripting**? + Know **Linux systems** in depth? + Have a **software development** background? + Understand **Observability and Monitoring**? + Have experience with **Cloud technologies** - Azure, AWS? + Write code to **automate** repetitive tasks? + Smile when you **troubleshoot** and solve an issue in London from your laptop in San Diego? + Have years of experience accumulated in roles like IT Operations, Software Development, or Systems Engineering? Answer 'yes' to these questions and you are in the front row for a position as SRE at ServiceNow. Go ahead, hit the Apply button and let's have a chat about your skills and experiences. Want to know more about us? Now that we have set the pace, keep reading if you want to understand more about ServiceNow as a company and the SRE role. **As an Engineer on the SRE team you will:** + Provide relief and sustainable resolution to issues within our infrastructure. + Use your experience in software development, systems engineering, and networking to proactively prevent repeatable issues. + Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design. + Drive a culture of intolerance to manual activity which results in a highly automated environment delivering scalable solutions. + Drive monitoring and automation initiatives **To be successful in this role you have:** + Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry. 
+ Typically requires a Bachelor's degree and a minimum of 4+ years of related experience; or an advanced degree with 2+ years of related experience; or equivalent work experience. + Deep knowledge of Linux systems + 2+ years of experience with DevOps automation, CI/CD pipeline and agile methodologies + 2+ years of experience with Cloud technologies, preferably Azure + 2+ years of Coding in various languages; we normally prefer Python, JavaScript, and Ruby + MySQL database administration, troubleshooting, and performance tuning + Expertise in Observability and Monitoring of applications, services, and networks at scale + Networking skills, IP addressing and routing. Nice to Have: + Experience developing on the ServiceNow Platform GCS-23 For positions in this location, we offer a base pay of $111,200 - $172,400, plus equity (when applicable), variable/incentive compensation and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the base pay shown is a guideline, and individual total compensation will vary based on factors such as qualifications, skill level, competencies, and work location. We also offer health plans, including flexible spending accounts, a 401(k) Plan with company match, ESPP, matching donations, a flexible time away plan and family leave programs. Compensation is based on the geographic location in which the role is located and is subject to change based on work location. **Work Personas** We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. Learn more here. 
To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service. **Equal Opportunity Employer** ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements. **Accommodations** We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact ***************************** for assistance. **Export Control Regulations** For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities. From Fortune. ©2025 Fortune Media IP Limited. All rights reserved. Used under license.
$111.2k-172.4k yearly 30d ago
Senior Site Reliability Engineer - Federal - 3rd Shift
Servicenow 4.7
San Diego, CA jobs
It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today - ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. But this is just the beginning of our journey. Join us as we pursue our purpose to make the world work better for everyone. Job Description Please Note: This position will include supporting our US Public Sector customers. "This position requires passing a ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards). This includes a credit check, criminal/misdemeanor check and taking a drug test. Any employment is contingent upon passing the screening. Due to Federal requirements, only US citizens, US naturalized citizens or US Permanent Residents, holding a green card, will be considered. The Federal SRE Team has 3 shifts that provide 24x7 production support for our Government Community Cloud infrastructure. Below are some highlights. * No on-call rotation * Shift Bonuses for 2nd and 3rd shifts * 4 Day Work week * This is a 3rd shift position and has a 10-hour 4-day work week, Sunday - Wednesday with working shift hours from 11 p.m. - 10 a.m. PT. What you get to do in this role: The SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability and performance of the ServiceNow infrastructure. The SRE is empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are also tasked with driving forward the operability of the platform to drive down the number of incidents and to reduce MTTR. 
To accomplish this the team combines software development, networking and systems engineering expertise with a strong desire to be challenged by problems of scale and complexity and to make services better for our customers. Do you: * Have a passion for DevOps, Automation and Scripting? * Know Linux systems in depth? * Have a software development background? * Understand Observability and Monitoring? * Have experience with Cloud technologies - Azure, AWS? * Write code to automate repetitive tasks? * Smile when you troubleshoot and solve an issue in London from your laptop in San Diego? * Have years of experience accumulated in roles like IT Operations, Software Development, or Systems Engineering? Answer 'yes' to these questions and you are in the front row for a position as SRE at ServiceNow. Go ahead, hit the Apply button and let's have a chat about your skills and experiences. Want to know more about us? Now that we have set the pace, keep reading if you want to understand more about ServiceNow as a company and the SRE role. As an Engineer on the SRE team you will: * Provide relief and sustainable resolution to issues within our infrastructure. * Use your experience in software development, systems engineering, and networking to proactively prevent repeatable issues. * Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design. * Drive a culture of intolerance to manual activity which results in a highly automated environment delivering scalable solutions. * Drive monitoring and automation initiatives Qualifications To be successful in this role you have: * Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry. 
* Typically requires a Bachelor's degree and a minimum of 4+ years of related experience; or an advanced degree with 2+ years of related experience; or equivalent work experience. * Deep knowledge of Linux systems * 2+ years of experience with DevOps automation, CI/CD pipeline and agile methodologies * 2+ years of experience with Cloud technologies, preferably Azure * 2+ years of Coding in various languages; we normally prefer Python, JavaScript, and Ruby * MySQL database administration, troubleshooting, and performance tuning * Expertise in Observability and Monitoring of applications, services, and networks at scale * Networking skills, IP addressing and routing. Nice to Have: * Experience developing on the ServiceNow Platform GCS-23 For positions in this location, we offer a base pay of $111,200 - $172,400, plus equity (when applicable), variable/incentive compensation and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the base pay shown is a guideline, and individual total compensation will vary based on factors such as qualifications, skill level, competencies, and work location. We also offer health plans, including flexible spending accounts, a 401(k) Plan with company match, ESPP, matching donations, a flexible time away plan and family leave programs. Compensation is based on the geographic location in which the role is located and is subject to change based on work location. Additional Information Work Personas We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. Learn more here. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service. 
Equal Opportunity Employer ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements. Accommodations We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact ***************************** for assistance. Export Control Regulations For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities. From Fortune. 2025 Fortune Media IP Limited. All rights reserved. Used under license.
$111.2k-172.4k yearly 30d ago
Senior/Lead Site Reliability Engineer - Federal
C3 Ai 3.7
Redwood City, CA jobs
C3 AI (NYSE: AI) is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications; C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital transformation of organizations globally; and C3 Generative AI, a suite of domain-specific generative AI offerings for the enterprise. C3 AI is seeking a Senior/Lead Site Reliability Engineer - Federal to join our team in Tysons, VA or Redwood City, CA. This role requires US Citizenship. Active US Government Security Secret clearance or higher is required (Top Secret or higher is preferred).
Responsibilities:
* Work with Federal customers to design and implement customized installations of the C3 AI Platform that meet unique access and security requirements of Federal environments
* Maximize system uptime and availability, ensuring functional and performance SLAs
* Establish end-to-end monitoring and alerting on all critical aspects
* Solve complex problems for critical services and build automation to prevent problem recurrence
* Initiate and lead scripting and automation to streamline system updates and upgrades
* Set up critical infrastructure, tools, and framework to streamline the deployment cycle
* Work cross-functionally with Services and Engineering teams
* Travel to customer site (up to 50%)
Qualifications:
* Bachelor's degree in a Science, Technology, Engineering or Mathematics (STEM), or comparable area of study
* An active U.S. Government security clearance (Top Secret preferred)
* Demonstrated experience in deploying, managing, and operating scalable and fault-tolerant Kubernetes-based infrastructure in AWS and Azure clouds; on-premise deployment experience preferred
* Expertise in Linux Operating Systems, Networking, and Database concepts
* Expertise in cloud providers, such as Amazon Web Services, Azure, and GCP
* Experience with Infrastructure-as-Code configurations such as Terraform, Ansible, or Puppet
* Experience in Ruby, Bash, or Python to automate and monitor systems
* Excellent problem-solving, critical thinking, and communication skills
* Experience as a DevOps engineer or sys admin supporting commercial SaaS solutions. Customer-facing experience is a plus.
Candidates must be authorized to work in the United States without the need for current or future company sponsorship. C3 AI provides excellent benefits, a competitive compensation package and generous equity plan. California Base Pay Range: $159,000-$230,000 USD. C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status.
$159k-230k yearly Auto-Apply 60d+ ago
Senior Site Reliability Engineer
TP-Link Systems Inc. 3.9
Irvine, CA jobs
Job Description At the forefront of the future of connected living, TP-Link Systems Inc.'s R&D Center in Irvine, Southern California's innovation hub, spearheads research and development of next-generation networking, IoT smart home products, and software services. Our team of passionate engineers is constantly innovating, engineering solutions that transform the end user experience with simpler, smarter, and more reliable connectivity. We're looking for a passionate and experienced Senior Site Reliability Engineer to join our team and play a crucial role in ensuring our cloud platform's security, reliability, scalability, and operational excellence. About Us: Headquartered in the United States, TP-Link Systems Inc. is a global provider of reliable networking devices and smart home products, consistently ranked as the world's top provider of Wi-Fi devices. The company is committed to delivering innovative products that enhance people's lives through faster, more reliable connectivity. With a commitment to excellence, TP-Link serves customers in over 170 countries and continues to grow its global footprint. We believe technology changes the world for the better! At TP-Link Systems Inc., we are committed to crafting dependable, high-performance products to connect users worldwide with the wonders of technology. Embracing professionalism, innovation, excellence, and simplicity, we aim to assist our clients in achieving remarkable global performance and enable consumers to enjoy a seamless, effortless lifestyle. Responsibilities: Serve as technical SME for implementing and operating Microservices on Kubernetes cloud-based platforms. Collaborate with the Cloud Technical Development and DevOps teams to deploy services to the Multi-Cloud Platform. Perform load tests and chaos tests to ensure the scalability and reliability of microservices. Build Observability for Microservices and cloud platforms like AWS, OCI, Azure, and GCP. 
Write and Execute the Disaster recovery plans in collaboration with the Development and DevOps team. Analyze and resolve production risks caused by insufficient resources, such as node groups, CPU, memory, HPA scheduling, JVM pre-warming, etc. Write and maintain scripts for automation using languages like Python, Go, or Bash. Define and maintain the KPIs (SLA/SLO/SLI) for all cloud microservices with development teams to better understand the business. Create and maintain technical documentation, including architecture diagrams, design documents, and standard operating procedures. Guarantee adherence to security and compliance standards, including ISO27001, SOC2, and GDPR. Lead incident response efforts to troubleshoot and resolve production issues quickly. Perform post-incident analysis to identify root causes and potential workarounds/solutions. Assist with product/technology selection, including implementation of POCs Be fluid and open to change and evolving processes and tools Help to mentor and train less senior members of the team Ability to be part of On-call rotation and provide support after work hours and on weekends. Other duties as assigned Requirements Bachelor's degree in Computer Science, Information Technology, or a related field. 5+ years of experience as a Site Reliability Engineer. Proficiency in programming and scripting languages like Java, Python, Bash, or PowerShell. Hands-on experience in SRE, DevOps, cloud operations, and cloud security best practices. Strong knowledge of security technologies, including Identity and access management, Network security, Application security, and Data protection. Strong problem-solving and analytical skills, with the ability to work independently and as part of a team. Experience in developing and maintaining technical documentation and implementing compliance requirements. 
Additional Skills (Preferred): Expert-level cloud certifications such as AWS Solutions Architect - Professional, Azure Solutions Architect Expert, or GCP Professional Cloud Architect. Experience with container orchestration technologies (e.g., Kubernetes). Benefits Salary range: $140,000 - $180,000 Free snacks and drinks, and provided lunch on Fridays Fully paid medical, dental, and vision insurance (partial coverage for dependents) Contributions to 401k funds Bi-annual reviews, and annual pay increases Health and wellness benefits, including free gym membership Quarterly team-building events At TP-Link Systems Inc., we are continually searching for ambitious individuals who are passionate about their work. We believe that diversity fuels innovation and collaboration, and drives our entrepreneurial spirit. As a global company, we highly value diverse perspectives and are committed to cultivating an environment where all voices are heard, respected, and valued. We are dedicated to providing equal employment opportunities to all employees and applicants, and we prohibit discrimination and harassment of any kind based on race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. Beyond compliance, we strive to create a supportive and growth-oriented workplace for everyone. If you share our passion and connection to this mission, we welcome you to apply and join us in building a vibrant and inclusive team at TP-Link Systems Inc. Please, no third-party agency inquiries, and we are unable to offer visa sponsorships at this time.
$140k-180k yearly 29d ago
Senior Site Reliability Engineer
TP-Link Systems 3.9
Irvine, CA jobs
At the forefront of the future of connected living, TP-Link Systems Inc.'s R&D Center in Irvine, Southern California's innovation hub, spearheads research and development of next-generation networking, IoT smart home products, and software services. Our team of passionate engineers is constantly innovating, engineering solutions that transform the end user experience with simpler, smarter, and more reliable connectivity. We're looking for a passionate and experienced Senior Site Reliability Engineer to join our team and play a crucial role in ensuring our cloud platform's security, reliability, scalability, and operational excellence. About Us: Headquartered in the United States, TP-Link Systems Inc. is a global provider of reliable networking devices and smart home products, consistently ranked as the world's top provider of Wi-Fi devices. The company is committed to delivering innovative products that enhance people's lives through faster, more reliable connectivity. With a commitment to excellence, TP-Link serves customers in over 170 countries and continues to grow its global footprint. We believe technology changes the world for the better! At TP-Link Systems Inc., we are committed to crafting dependable, high-performance products to connect users worldwide with the wonders of technology. Embracing professionalism, innovation, excellence, and simplicity, we aim to assist our clients in achieving remarkable global performance and enable consumers to enjoy a seamless, effortless lifestyle. Responsibilities: Serve as technical SME for implementing and operating Microservices on Kubernetes cloud-based platforms. Collaborate with the Cloud Technical Development and DevOps teams to deploy services to the Multi-Cloud Platform. Perform load tests and chaos tests to ensure the scalability and reliability of microservices. Build Observability for Microservices and cloud platforms like AWS, OCI, Azure, and GCP. 
Write and Execute the Disaster recovery plans in collaboration with the Development and DevOps team. Analyze and resolve production risks caused by insufficient resources, such as node groups, CPU, memory, HPA scheduling, JVM pre-warming, etc. Write and maintain scripts for automation using languages like Python, Go, or Bash. Define and maintain the KPIs (SLA/SLO/SLI) for all cloud microservices with development teams to better understand the business. Create and maintain technical documentation, including architecture diagrams, design documents, and standard operating procedures. Guarantee adherence to security and compliance standards, including ISO27001, SOC2, and GDPR. Lead incident response efforts to troubleshoot and resolve production issues quickly. Perform post-incident analysis to identify root causes and potential workarounds/solutions. Assist with product/technology selection, including implementation of POCs Be fluid and open to change and evolving processes and tools Help to mentor and train less senior members of the team Ability to be part of On-call rotation and provide support after work hours and on weekends. Other duties as assigned Requirements Bachelor's degree in Computer Science, Information Technology, or a related field. 5+ years of experience as a Site Reliability Engineer. Proficiency in programming and scripting languages like Java, Python, Bash, or PowerShell. Hands-on experience in SRE, DevOps, cloud operations, and cloud security best practices. Strong knowledge of security technologies, including Identity and access management, Network security, Application security, and Data protection. Strong problem-solving and analytical skills, with the ability to work independently and as part of a team. Experience in developing and maintaining technical documentation and implementing compliance requirements. 
Additional Skills (Preferred): Expert-level cloud certifications such as AWS Solutions Architect - Professional, Azure Solutions Architect Expert, or GCP Professional Cloud Architect. Experience with container orchestration technologies (e.g., Kubernetes). Benefits Salary range: $140,000 - $180,000 Free snacks and drinks, and provided lunch on Fridays Fully paid medical, dental, and vision insurance (partial coverage for dependents) Contributions to 401k funds Bi-annual reviews, and annual pay increases Health and wellness benefits, including free gym membership Quarterly team-building events At TP-Link Systems Inc., we are continually searching for ambitious individuals who are passionate about their work. We believe that diversity fuels innovation and collaboration, and drives our entrepreneurial spirit. As a global company, we highly value diverse perspectives and are committed to cultivating an environment where all voices are heard, respected, and valued. We are dedicated to providing equal employment opportunities to all employees and applicants, and we prohibit discrimination and harassment of any kind based on race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. Beyond compliance, we strive to create a supportive and growth-oriented workplace for everyone. If you share our passion and connection to this mission, we welcome you to apply and join us in building a vibrant and inclusive team at TP-Link Systems Inc. Please, no third-party agency inquiries, and we are unable to offer visa sponsorships at this time.
$140k-180k yearly Auto-Apply 60d+ ago
Senior Site Reliability Engineer
TP-Link Corp 3.9
Irvine, CA jobs
At the forefront of the future of connected living, TP-Link Systems Inc.'s R&D Center in Irvine, Southern California's innovation hub, spearheads research and development of next-generation networking, IoT smart home products, and software services. Our team of passionate engineers is constantly innovating, engineering solutions that transform the end user experience with simpler, smarter, and more reliable connectivity. We're looking for a passionate and experienced Senior Site Reliability Engineer to join our team and play a crucial role in ensuring our cloud platform's security, reliability, scalability, and operational excellence. About Us: Headquartered in the United States, TP-Link Systems Inc. is a global provider of reliable networking devices and smart home products, consistently ranked as the world's top provider of Wi-Fi devices. The company is committed to delivering innovative products that enhance people's lives through faster, more reliable connectivity. With a commitment to excellence, TP-Link serves customers in over 170 countries and continues to grow its global footprint. We believe technology changes the world for the better! At TP-Link Systems Inc., we are committed to crafting dependable, high-performance products to connect users worldwide with the wonders of technology. Embracing professionalism, innovation, excellence, and simplicity, we aim to assist our clients in achieving remarkable global performance and enable consumers to enjoy a seamless, effortless lifestyle. Responsibilities: * Serve as technical SME for implementing and operating Microservices on Kubernetes cloud-based platforms. * Collaborate with the Cloud Technical Development and DevOps teams to deploy services to the Multi-Cloud Platform. * Perform load tests and chaos tests to ensure the scalability and reliability of microservices. * Build Observability for Microservices and cloud platforms like AWS, OCI, Azure, and GCP. 
* Write and Execute the Disaster recovery plans in collaboration with the Development and DevOps team. * Analyze and resolve production risks caused by insufficient resources, such as node groups, CPU, memory, HPA scheduling, JVM pre-warming, etc. * Write and maintain scripts for automation using languages like Python, Go, or Bash. * Define and maintain the KPIs (SLA/SLO/SLI) for all cloud microservices with development teams to better understand the business. * Create and maintain technical documentation, including architecture diagrams, design documents, and standard operating procedures. * Guarantee adherence to security and compliance standards, including ISO27001, SOC2, and GDPR. * Lead incident response efforts to troubleshoot and resolve production issues quickly. * Perform post-incident analysis to identify root causes and potential workarounds/solutions. * Assist with product/technology selection, including implementation of POCs * Be fluid and open to change and evolving processes and tools * Help to mentor and train less senior members of the team * Ability to be part of On-call rotation and provide support after work hours and on weekends. * Other duties as assigned
$134k-176k yearly est. 9d ago
Sr Specialist Site Reliability Engineer
Waystar 4.6
Atlanta, GA jobs
We are seeking a highly skilled and proactive Senior Specialist, Site Reliability Engineering (SRE) to help drive reliability, scalability, and performance across our critical platforms. This role is ideal for a senior-level engineer who combines deep technical expertise with a passion for automation, observability, and operational excellence. As a Senior Specialist, you'll work on complex reliability challenges, lead technical initiatives, and collaborate across engineering, product, and infrastructure teams to ensure our systems are resilient and efficient. WHAT YOU'LL DO * Reliability Engineering * Architect and implement solutions to improve system reliability, scalability, and performance. * Define and manage SLIs/SLOs and error budgets across services. * Lead efforts to automate operational tasks and improve system observability. * Incident Management & Root Cause Analysis * Serve as a technical lead during major incidents and drive resolution. * Conduct deep root cause analyses and implement long-term fixes. * Champion blameless postmortems and continuous improvement. * Technical Leadership * Lead cross-functional reliability initiatives and mentor junior engineers. * Influence system design and architecture to embed reliability from the ground up. * Collaborate with software engineers to optimize deployment pipelines and infrastructure. * Monitoring & Tooling * Enhance observability through metrics, logging, and tracing. * Develop and maintain dashboards, alerts, and automated recovery systems. WHAT YOU'LL NEED * 7+ years of experience in SRE, DevOps, or infrastructure engineering. * Deep expertise in cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation). * Strong proficiency in observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines. * Proven track record of solving complex reliability challenges in distributed systems. 
* Excellent communication and collaboration skills. * Experience in Python, PowerShell, or other similar languages * Active use of artificial intelligence (AI) tools and techniques to enhance performance, drive innovation, and improve decision-making across business functions * Ability to leverage AI tools and platforms to streamline workflows, improve decision-making, and drive innovation * Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation Preferred Qualifications * Experience in regulated or high-availability environments (e.g., financial services, healthcare). * Familiarity with chaos engineering, performance tuning, and capacity planning. * Background in software development with strong coding skills (e.g., Python, Go, Bash). ABOUT WAYSTAR Through a smart platform and better experience, Waystar helps providers simplify healthcare payments and yield powerful results throughout the complete revenue cycle. Waystar's healthcare payments platform combines innovative, cloud-based technology, robust data, and unparalleled client support to streamline workflows and improve financials so providers can focus on what matters most: their patients and communities. Waystar is trusted by 1M+ providers, 1K+ hospitals and health systems, and is connected to over 5K commercial and Medicaid/Medicare payers. We are deeply committed to living out our organizational values: honesty; kindness; passion; curiosity; fanatical focus; best work, always; making it happen; and joyful, optimistic & fun. Waystar products have won multiple Best in KLAS or Category Leader awards since 2010 and earned multiple #1 rankings from Black Book surveys since 2012. The Waystar platform supports more than 500,000 providers, 1,000 health systems and hospitals, and 5,000 payers and health plans. For more information, visit waystar.com or follow @Waystar on Twitter. 
WAYSTAR PERKS * Competitive total rewards (base salary + bonus, if applicable) * Customizable benefits package (3 medical plans with Health Savings Account company match) * Generous paid time off for our non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays, and flexible time off for our exempt team members + 13 paid holidays * Paid parental leave (including maternity + paternity leave) * Education assistance opportunities and free LinkedIn Learning access * Free mental health and family planning programs, including adoption assistance and fertility support * 401(k) program with company match * Pet insurance * Employee resource groups Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
$99k-128k yearly est. Auto-Apply 54d ago
Sr. Specialist Site Reliability Engineer
Waystar 4.6
Atlanta, GA jobs
We are looking for a talented and driven Site Reliability Engineering (SRE) Specialist to support our engineering team, which manages the infrastructure and services that power our Waystar products. This role is ideal for an experienced engineer who thrives in data-intensive environments and is passionate about building reliable, scalable systems that ensure data integrity, availability, and performance. As an SRE Specialist, you'll work closely with engineering, product, and data teams to ensure our data licensing platforms are resilient, observable, and continuously improving. WHAT YOU'LL DO * System Reliability & Performance * Design and implement reliability solutions for data ingestion, processing, and delivery pipelines. * Define and maintain SLIs/SLOs for data licensing services and manage error budgets. * Build automation for deployment, monitoring, and incident response. * Observability & Monitoring * Enhance system observability through metrics, logging, and tracing. * Develop and maintain dashboards and alerts to proactively detect and resolve issues. * Incident Response & Postmortems * Participate in on-call rotations and lead incident response efforts. * Conduct root cause analysis and drive post-incident improvements. * Maintain runbooks and operational documentation. * Collaboration & Continuous Improvement * Partner with software and data engineers to embed reliability into system design. * Contribute to blameless postmortems and reliability reviews. * Share knowledge and mentor junior team members. WHAT YOU'LL NEED * 7+ years of experience in SRE, DevOps, or infrastructure engineering. * Strong understanding of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation). * Experience with observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines. * Familiarity with data platforms, ETL pipelines, and distributed systems. 
* Excellent problem-solving and communication skills. * Experience with Python, PowerShell, and other similar languages * Active use of artificial intelligence (AI) tools and techniques to enhance performance, drive innovation, and improve decision-making across business functions * Ability to leverage AI tools and platforms to streamline workflows, improve decision-making, and drive innovation * Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation Preferred Qualifications * Experience with data licensing, data governance, or data compliance frameworks. * Exposure to data pipeline tools (e.g., Apache Airflow, Kafka, Spark). * Familiarity with regulatory requirements related to data usage and distribution. ABOUT WAYSTAR Through a smart platform and better experience, Waystar helps providers simplify healthcare payments and yield powerful results throughout the complete revenue cycle. Waystar's healthcare payments platform combines innovative, cloud-based technology, robust data, and unparalleled client support to streamline workflows and improve financials so providers can focus on what matters most: their patients and communities. Waystar is trusted by 1M+ providers, 1K+ hospitals and health systems, and is connected to over 5K commercial and Medicaid/Medicare payers. We are deeply committed to living out our organizational values: honesty; kindness; passion; curiosity; fanatical focus; best work, always; making it happen; and joyful, optimistic & fun. Waystar products have won multiple Best in KLAS or Category Leader awards since 2010 and earned multiple #1 rankings from Black Book surveys since 2012. The Waystar platform supports more than 500,000 providers, 1,000 health systems and hospitals, and 5,000 payers and health plans. For more information, visit waystar.com or follow @Waystar on Twitter. 
WAYSTAR PERKS * Competitive total rewards (base salary + bonus, if applicable) * Customizable benefits package (3 medical plans with Health Savings Account company match) * Generous paid time off for our non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays, and flexible time off for our exempt team members + 13 paid holidays * Paid parental leave (including maternity + paternity leave) * Education assistance opportunities and free LinkedIn Learning access * Free mental health and family planning programs, including adoption assistance and fertility support * 401(k) program with company match * Pet insurance * Employee resource groups Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
$99k-128k yearly est. Auto-Apply 9d ago
SR Site Reliability Engineer
F5 Networks 4.6
Washington jobs
At F5, we strive to bring a better digital world to life. Our teams empower organizations across the globe to create, secure, and run applications that enhance how we experience our evolving digital world. We are passionate about cybersecurity, from protecting consumers from fraud to enabling companies to focus on innovation. Everything we do centers around people. That means we obsess over how to make the lives of our customers, and their customers, better. And it means we prioritize a diverse F5 community where each individual can thrive. Job Title: SR SRE Engineer Job Family Name: SRE Business Title: SR SRE Engineer Date: 10/7/2025 Reports to (Title): Rick Mitchell, Sr Manager, Engineering Direct Reports: N/A Our employees are valued and empowered, collaborative and team oriented, innovative in their approach, and passionate about their work. They are reliable, trustworthy, and open with a high level of integrity. They value diversity, are inclusive, and are committed to a global mindset. Common Engineering Development Infrastructure (CEDI) has an opening for a Site Reliability Engineer on our team. Our engineering team drives quality through resilient and high performing development services. Our team continuously drives improvement and a high level of service for our F5 team members. If you have a passion for development, automation for infrastructure, service reliability and monitoring, we look forward to meeting with you. This specific role will be supporting, improving and deploying complex Kubernetes configurations. This role will require in-depth Kubernetes knowledge and management, especially using Kubernetes in conjunction with GitHub. 
The SR Site Reliability Engineer is responsible for ensuring operations for infrastructure and application solutions are reliable and sustainable by incorporating automation towards installation and configuration of entire technology stack as well as the deployment of code throughout the development life cycle. The engineer is passionate about full stack visibility, a broad knowledge in application, databases, servers, virtualization, networking and requires extensive automation. The engineer will be responsible for all aspects of environment planning including capacity management, monitoring, scalability, auditing, disaster recovery, and interoperability. The role is responsible for assisting with design and implementation of new technologies to replace legacy systems and processes. Bring your passion and dedication to working with an amazing team to quickly turn ideas into great product. ROLE DESCRIPTION Develop high-quality services, lead design discussions, execute development against design for development teams to utilize in a self-service model Coordinate with product and platform teams on regular maintenance, improve availability, scalability, and performance of the CI/CD environment. Collaborate with product teams and work cross-functionally with F5 IT department and vendors to implement the services and automation required to support application use cases. Actively engage with internal teams to develop tooling, framework to drive full observability and automation of the environment. Ensure adherence to architecture standards and roadmaps. Drive digital innovation by leveraging innovative new technologies and approaches to renovate, extend, and transform the existing core technology base. Ensure that post-production operational processes / deliverables are well designed and implemented prior to the project moving into the solution support phase. 
Define and create development procedures, processes, and scripts to drive a standard software development lifecycle. Assist in the evaluation, selection, and implementation of new technologies with product teams to ensure adherence to architecture guidelines for new technology introduction. Provide technical leadership on establishing standards and guidelines. Facilitate collaboration between development and operations teams throughout the application lifecycle. Partner with Corporate Information Security to ensure all security policies and audit inquiries are addressed. Coordinate and align all other technology teams to ensure operational delivery processes are governed and monitored to expedite issue remediation. REQUIREMENTS 3 to 5 years of experience developing and implementing CI/CD automation, performance tuning, and scaling applications. Direct experience with automation to deploy, manage and maintain complex Kubernetes installations. 3 to 5 years of experience with open-source technologies and cloud services, preferably Azure. Experience with microservice architecture and development. Hands-on development experience with one or more general purpose programming languages including but not limited to: Python, JavaScript, or Go. Infrastructure deployment experience using technologies such as Terraform and Ansible. Excellent working knowledge of system environments - operating systems, networking, applications, platforms, and databases. Experience with branching strategies, test-driven development, release management, Agile methodologies, Unix, and Linux. Familiarity with common database technologies such as MS SQL Server and PostgreSQL. Experience with configuration management systems (Puppet, Chef, Ansible, etc.). Knowledge of development methodologies (Agile, Kanban, Scrum) across various technologies. Experience with continuous integration methodologies and tools such as GitLab. 
Self-motivated individual who possesses excellent time management and organizational skills. Strong sense of personal responsibility and accountability for delivering high quality work, both personally and at a team level. Great communication skills, as this role will face internal and external users. Excellent written and verbal communication skills. MINIMUM QUALIFICATION/EDUCATION B.S. or M.S. in Computer Science, Software Engineering, or comparable experience. 3 years direct deployment, optimization and management of Kubernetes clusters. 8+ years of experience in technical support and troubleshooting of multiple systems including: cloud native applications, interface engines, and complex distributed systems. 2+ years of experience developing automation for Cloud applications, preferably Azure. Comfortable mentoring team members with different skill sets and technical areas of focus and expertise. The Job Description is intended to be a general representation of the responsibilities and requirements of the job. However, the description may not be all-inclusive, and responsibilities and requirements are subject to change. The annual base pay for this position is: $147,200.00 - $220,800.00. F5 maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, geographic locations, and market conditions, as well as to reflect F5's differing products, industries, and lines of business. The pay range referenced is as of the time of the job posting and is subject to change. You may also be offered incentive compensation, bonus, restricted stock units, and benefits. More details about F5's benefits can be found at the following link: ******************************************* . F5 reserves the right to change or terminate any benefit plan without notice. Please note that F5 only contacts candidates through an F5 email address (ending with @f5.com) or auto email notification from Workday (ending with f5.com or @myworkday.com). 
Equal Employment Opportunity It is the policy of F5 to provide equal employment opportunities to all employees and employment applicants without regard to unlawful considerations of race, religion, color, national origin, sex, sexual orientation, gender identity or expression, age, sensory, physical, or mental disability, marital status, veteran or military status, genetic information, or any other classification protected by applicable local, state, or federal laws. This policy applies to all aspects of employment, including, but not limited to, hiring, job assignment, compensation, promotion, benefits, training, discipline, and termination. F5 offers a variety of reasonable accommodations for candidates. Requesting an accommodation is completely voluntary. F5 will assess the need for accommodations in the application process separately from those that may be needed to perform the job. Request by contacting accommodations@f5.com.
$147.2k-220.8k yearly Auto-Apply 9d ago
Sr. Site Reliability Engineer - GIS Real-Time Data
Esri 4.4
Redlands, CA jobs
Join us to work collaboratively with our talented team of dynamic and passionate engineers to deliver capabilities that enable our customers to make a difference. You'll deploy and operate ArcGIS Velocity and ArcGIS Workflow Manager SaaS solutions. You will also have the opportunity to design, deploy, and operate next-generation real-time and big data GIS software-as-a-service (SaaS) capabilities for thousands of cloud users worldwide. Our teams have a broad mix of experience levels and tenures that support an environment which promotes professional development. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future. Our team also puts a high value on work-life balance, and we understand that striking a healthy balance between your personal and professional life is crucial to your happiness and success here. We offer a flexible hybrid schedule so you can have a more productive and well-balanced life both in and outside of work. The Professional Services division is the consulting and implementation arm of Esri. We break ground in new markets, push the technology envelope and ultimately deliver transformational solutions to high profile clients worldwide. The Professional Services organization is comprised of nearly 1,000 talented business and technical professionals who strive every day to help our users be successful. This product team works closely with solution delivery consultants to envision and then 'make real' the emerging requirements in industry, government, and other user communities. 
Responsibilities * Collaborate with a team of SRE engineers to operate SaaS capabilities across multiple regions on the cloud platform * Design, implement, configure, and utilize monitoring systems to monitor the health of SaaS products * Manage infrastructure used for ArcGIS Velocity and ArcGIS Workflow Manager, respond to alerts, and troubleshoot problems to resolution * Develop, implement, and maintain automation solutions for repetitive operational tasks, such as deployment pipelines, incident resolution, and scaling processes * Design and implement the deployment and upgrade of containerized microservice components that, when combined, power Esri's SaaS offerings * Create and automate Git workflows to simplify code integration, testing, and infrastructure deployments * Participate in technical spike efforts, bringing new innovative ideas to future versions of our software * Troubleshoot system incidents and provide root cause analysis reports * Provide rotational on-call technical support
Requirements * 5+ years of experience managing Kubernetes (EKS), logging and monitoring (ELK, Prometheus), and container technologies (Docker, ECS) * Proficient in using Terraform for automating infrastructure provisioning and management * Ability to design and automate Git workflows for streamlined code integration, testing, and infrastructure deployment * Ability to write scripts to deploy infrastructure and/or applications (Bash, Python, Terraform) * High level of understanding and experience with cloud computing platforms (AWS) * Strong knowledge of Linux operating system administration, including troubleshooting, performance tuning, and shell scripting * Proficient in cloud networking, including VPCs, subnets, security groups, and VPNs in platforms like AWS * Skilled in identifying and resolving system and application issues through effective troubleshooting and root cause analysis * Working knowledge of a source control and issue management system, preferably GitHub * Working knowledge of authoring, deploying, and troubleshooting Java applications on AWS Lambda * Bachelor's in computer science, computer engineering, GIS, or information systems
Recommended Qualifications * 5+ years of experience designing, administering, and/or maintaining cloud environments, such as AWS, supporting 24×7 high-availability production environments * Interest in working with GitOps principles to automate the deployment of applications on Kubernetes clusters * Certifications: AWS Certified Solutions Architect Associate, CKA/CKAD, or similar * Experience managing OpenSearch (datastore or logstore) and Kafka for managing distributed data streams and ensuring high availability in large-scale systems * Ability to work with continuous integration and delivery best practices * Knowledge of operating resilient, highly available, scalable, and performant SaaS capabilities * Knowledge of Esri ArcGIS or other web mapping technologies * Master's in computer science, computer engineering, GIS, or information systems
$94k-120k yearly est. Auto-Apply 25d ago