Site Reliability Engineer (Genetec)
Reliability engineer job in Englewood Cliffs, NJ
STAND 8 provides end-to-end IT solutions to enterprise partners across the United States, with offices in Los Angeles, New York, New Jersey, Atlanta, and more, as well as internationally in Mexico and India. Our Technology Solutions team is seeking an experienced and highly skilled Site Reliability Engineer (Genetec) to join our team in support of a global Media & Entertainment client. Ideally, you will have advanced knowledge of Genetec Security Center and related security software, with the ability to set up servers, test software, create staging environments, and build connections to other systems. This role requires hands-on expertise in software development, system configuration, and network security. We are looking for team players who can add value with their contributions and enjoy working with a great group of people.
This role requires working onsite 5 days per week in Englewood, NJ. Our team is setting up interviews immediately, and if you'd like to work in a dynamic Entertainment environment with a lot of action, you'll love this job!
Responsibilities
Conduct regular system checks to ensure optimal performance, including verifying communication links.
Monitor and support the Genetec unified system, managing access control systems such as readers, panels, turnstiles, and other I/O devices, as well as video surveillance systems including cameras, LPR, and video analytics modules.
Perform initial diagnostic troubleshooting to identify and resolve simple network and system issues, escalating as necessary.
Uphold high standards and best practices, including testing, documentation, process application, and solution deployment.
Generate monthly metrics reports to track system availability and firmware status.
Maintain integrations with third-party solutions using RESTful APIs, IIS, and SQL querying for reporting modules.
Handle initial escalations from system failures and coordinate with third parties as needed.
Support and monitor facility systems such as Continuum, ScheduAll (resource/asset scheduling), EOMS (asset management), EPMS (Electric Power Monitoring System), and TripShot (Shuttle service portal).
Participate in system upgrades and patching activities.
Provide support during major incidents and outages as required.
Collaborate with other engineering teams to integrate new tools and automate existing applications.
Interface with various Technology organization teams, including cyber, networking, firewall, Wintel, database, and data warehouse teams.
Manage access control and video devices for multiple campuses, both domestic and international.
Partner with Security Engineering team for system knowledge transfer.
Oversee vendor management.
Monitor and support Global Risk and Incident response applications and mass communication systems.
Requirements
Minimum of 3 years' experience with Genetec systems V5.11+ and knowledge of high availability enterprise architectural designs.
At least 3 years' experience designing scalable and reliable enterprise solutions, considering multi-tier software architectures, networking, and security.
3+ years of experience in managing and maintaining facility management systems.
System administration experience with Windows and Linux.
Effective communication skills, both written and verbal, with stakeholders and engineers.
Experience with change management methodologies.
Proficiency with identity access systems.
Strong problem-solving abilities.
Ability to work within a fast-paced agile process.
Ability to build effective cross-functional relationships to deliver enterprise-wide solutions.
Self-starter, results-oriented.
Experience creating network and system diagrams for both established and new proposed deployments.
Familiarity with third-party systems and custom applications.
Provide system support during business hours and after hours on an on-call basis.
Preferred Qualifications
Current and active Genetec Enterprise certifications.
Fundamental understanding of Networking (Network+ or CCNA).
2+ years' experience with cloud-based infrastructure and platform services (Azure, AWS) preferred.
3+ years of experience in developing, deploying, and operating facility management systems.
3+ years of experience in support of Global Risk and response systems.
Experience with network protocols such as TCP/IP, SNMP, Modbus, and BACnet.
Ability to read and interpret drawings, wiring diagrams, and device data sheets.
Experience with automation and motivation to leverage tools.
Additional Details
The base salary range for this position is $65.00 - $75.00 per hour, depending on experience.
Our pay ranges are determined by role, level, and location. The range displayed on each job posting reflects the minimum and maximum target for new hires of this position across all US locations. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training.
Benefits
Medical coverage and Health Savings Account (HSA) through Anthem
Dental/Vision/Various Ancillary coverages through Unum
401(k) retirement savings plan
Company-paid Employee Assistance Program (EAP)
Discount programs through ADP WorkforceNow
About Us
STAND 8 provides end-to-end IT solutions to enterprise partners across the United States and globally with offices in Los Angeles, Atlanta, New York, Mexico, Japan, India, and more. STAND 8 focuses on the "bleeding edge" of technology and leverages automation, process, marketing, and over fifteen years of success and growth to provide a world-class experience for our customers, partners, and employees.
Our mission is to impact the world positively by creating success through PEOPLE, PROCESS, and TECHNOLOGY.
Check out more at ************** and reach out today to explore opportunities to grow together!
Site Reliability Engineer
Reliability engineer job in Jersey City, NJ
*We are presently unable to sponsor visas; please apply only if you are authorized to work without sponsorship.* (W2 only)
Below are a few details of the opportunity.
Job Title: Software Engineering (SRE/DevOps/Windows Eng)
Location: Jersey City, NJ 07310 - Onsite
Duration: Contract to Hire
Job Description:
About Candidate:
End-to-end development, deployment, automation, and monitoring using automated CI/CD pipelines
Working with SQL Server and Oracle
Most apps are deployed on Windows servers (Windows stack: front-end web servers, application servers, and database servers)
Manage vendor applications
Experience with reporting
Observability is key - Grafana, dashboards, Dynatrace, SQL monitoring
Agile
Skills (required) -
Windows
PowerShell - scripting / APIs (Postman, Swagger)
Automation - (jewls PL), this is a CI/CD process
Repair Quality Engineer
Reliability engineer job in Englewood, NJ
Hanwha Vision America (HVA) is an affiliate of the Hanwha Group, a Fortune Global 500 company. HVA is an industry-leading provider of advanced network video surveillance products, including IP cameras, storage devices, and video management systems, founded on world-class technologies. We offer end-to-end security solutions and have achieved global success across a wide range of industry verticals, including retail, transportation, education, banking, healthcare, hospitality, and airports.
Hanwha Vision America (HVA) is seeking a Repair/Quality Engineer to support HTCC's engineering and repair operations by performing intake screening, basic diagnostics, quality checks, and documentation.
The role ensures that incoming units are properly evaluated, repair processes run efficiently, and completed products meet quality standards before shipment. This position combines repair-support responsibilities with quality assurance activities to improve workflow efficiency, accuracy, and overall service performance.
Major Functions / Accountabilities
Perform initial screening and basic functional checks on incoming units
Identify obvious issues or simple conditions that can be resolved before repair
Support repair workflow by preparing units, organizing information, and performing basic diagnostics
Conduct quality checks on completed repair units to ensure they meet internal standards
Document inspection results and update system records accurately
Assist with failure analysis for repeated issues and provide feedback to engineering
Inspect packaging quality and verify final shipment readiness
Collaborate with repair staff, engineering, logistics, and warehouse teams as needed
Maintain checklists, guidelines, and standard procedures for inspection work
Support process improvements related to efficiency, quality, and documentation compliance
Knowledge, Skills & Requirements
Preferred background: Electronics, Electrical Engineering, Computer Engineering, or related field
Basic understanding of electronic components (e.g., resistors, capacitors, diodes)
Ability to use multimeters and basic diagnostic tools
Strong attention to detail and problem-solving skills
Ability to follow technical checklists and standardized procedures
Proficiency with Microsoft Office and basic system data entry
Bilingual (Korean/English) preferred but not required
Site Reliability Engineer, Payments - USDS
Reliability engineer job in New York, NY
Team Intro: The Global Payment team of the US Tech Service department of TikTok provides all-round payment solutions for the company's USA products, overseas commercialization, and the company's overseas travel and procurement, including channel access, product order design, user interaction, capital management, tax and exchange optimization, settlement reconciliation and so on. In this role, you'll have the opportunity to develop and manage the complex challenges of scale with your expertise in large-scale system design.
On-site presence across teams allows the company to operate with greater speed, alignment, and agility - especially in areas like real-time decision-making, team development, and integrated execution. As such, the company is shifting from a hybrid work model to a fully in-person schedule. The specific requirements may vary from team to team.
Responsibilities:
* Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement.
* Support services before they go live through activities such as capacity planning and launch reviews.
* Support and maintain services by measuring and monitoring availability, latency and overall system health.
* Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Minimum Qualifications:
* BS or MS degree in Computer Science, Electrical Engineering, Computer Engineering or related areas.
* 3+ years of experience in one or more programming languages such as Go, Java, C++, Python etc.
* Good problem-solving and analytical thinking capabilities, and exceptional attention to detail.
* Good communication and collaboration skills.
Preferred Qualifications:
* Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
* Ability to debug, optimize code, and automate routine tasks.
* Proficiency working with algorithms, data structures and production troubleshooting.
* Expertise in problem solving and analyzing global scale distributed systems.
* Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
Software Engineer II - Site Reliability Engineer
Reliability engineer job in New York, NY
Technology is at the heart of Disney's past, present, and future. Disney Entertainment and ESPN Product & Technology is a global organization of engineers, product developers, designers, technologists, data scientists, and more - all working to build and advance the technological backbone for Disney's media business globally.
The team marries technology with creativity to build world-class products, enhance storytelling, and drive velocity, innovation, and scalability for our businesses. We are Storytellers and Innovators. Creators and Builders. Entertainers and Engineers. We work with every part of The Walt Disney Company's media portfolio to advance the technological foundation and consumer media touch points serving millions of people around the world.
Here are a few reasons why we think you'd love working here:
* Building the future of Disney's media: Our Technologists are designing and building the products and platforms that will power our media, advertising, and distribution businesses for years to come.
* Reach, Scale & Impact: More than ever, Disney's technology and products serve as a signature doorway for fans' connections with the company's brands and stories. Disney+. Hulu. ESPN. ABC. ABC News…and many more. These products and brands - and the unmatched stories, storytellers, and events they carry - matter to millions of people globally.
* Innovation: We develop and implement groundbreaking products and techniques that shape industry norms, and solve complex and distinctive technical problems.
Product Engineering is a unified team responsible for the engineering of Disney Entertainment & ESPN digital and streaming products and platforms. This includes product engineering, media engineering, quality assurance, engineering behind personalization, commerce, lifecycle, and identity.
Job Summary:
As a Software Engineer on the COPEX team, you'll design and build the foundational backend systems that directly power the Hulu & Disney+ streaming experience. You will architect mission-critical, high-throughput services for API and content recommendation delivery, while also building the platforms that empower our entire engineering organization to ship code with speed and confidence.
You will join a talented team of engineers who build the software that:
* Delivers foundational APIs and serves personalized streaming experiences to millions of users daily.
* Enables our engineering organization to define, provision, and manage cloud infrastructure programmatically and at scale.
* Allows teams to deploy changes to production swiftly and safely through sophisticated, automated CI/CD pipelines.
* Provides deep insight into application performance via powerful, self-service observability and testing platforms.
* Optimizes system capacity and cloud costs by engineering data-driven, automated solutions.
Responsibilities and Duties of the Role:
* Architect, build, and scale foundational backend services for API delivery and content recommendation, focusing on high availability, low latency, and massive throughput.
* Design, build, and evolve our CI/CD solutions, writing clean, scalable code to automate the entire build, test, and deployment lifecycle.
* Architect and develop robust, scalable test automation frameworks that product teams will use for load, integration, and functional testing.
* Write software to abstract and automate infrastructure provisioning, creating a seamless, self-service experience for engineering teams using Infrastructure as Code (IaC).
* Develop the core software, libraries, and services that form our observability platform, enabling engineers to easily build reliable and performant applications.
* Proactively improve system architecture and build software-based solutions to reduce toil, minimize incidents, and automate remediation.
Required Education, Experience/Skills/Training:
Basic Qualifications
* Minimum 3 years of professional experience
* Experience in a DevOps or SRE role.
* Experience with IaC
* Experience with incident response
* Experience with containerization
* Experience with CI/CD tools
* Experience programming in Java or a JVM language
* Experience working on cross-team projects.
* An ability to work both independently and collaboratively
* Strong communication skills and a desire to share and learn
Required Education
* Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related technical field.
The hiring range for this position in New York is $120,300 to $161,300 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.
About Disney Entertainment and ESPN Product & Technology:
At Disney Entertainment and ESPN Product & Technology, we're blending imagination and innovation to reimagine the ways people experience and engage with the world's most beloved stories and products. Our work is wide-ranging and deeply sophisticated. We create amazing experiences, transform the future of media, and build products and platforms that enable the connection between people everywhere and the stories and sports they love.
Disney's ability to marry world-class technology with one-of-a-kind creativity makes us unique. It is at the heart of our past, present, and future. We are Storytellers and Innovators. Creators and Builders. Entertainers and Engineers.
About The Walt Disney Company:
The Walt Disney Company, together with its subsidiaries and affiliates, is a leading diversified international family entertainment and media enterprise that includes three core business segments: Disney Entertainment, ESPN, and Disney Experiences. From humble beginnings as a cartoon studio in the 1920s to its preeminent name in the entertainment industry today, Disney proudly continues its legacy of creating world-class stories and experiences for every member of the family. Disney's stories, characters and experiences reach consumers and guests from every corner of the globe. With operations in more than 40 countries, our employees and cast members work together to create entertainment experiences that are both universally and locally cherished.
This position is with Disney Streaming Technology LLC, which is part of a business we call Disney Entertainment and ESPN Product & Technology.
Disney Streaming Technology LLC is an equal opportunity employer. Applicants will receive consideration for employment without regard to race, religion, color, sex, sexual orientation, gender, gender identity, gender expression, national origin, ancestry, age, marital status, military or veteran status, medical condition, genetic information or disability, or any other basis prohibited by federal, state or local law. Disney champions a business environment where ideas and decisions from all people help us grow, innovate, create the best stories and be relevant in a constantly evolving world.
Site Reliability Engineer
Reliability engineer job in New York, NY
Role Roadmap We are building a next-generation financial ecosystem (think NYSE or CME from scratch). We are a small team, which means your responsibilities scale very rapidly, and your contributions are clear and visible, not marginal. There is still a lot of green field at Kalshi and a lot of it (including entire systems) can be yours.
What you'll do
* Improve observability, reliability and availability by defining and measuring key metrics.
* Build automation and improve systems to eliminate toil and operations work.
* Collaborate with our core infrastructure team to performance tune and optimize our cloud deployments. (Think Docker, Terraform, Kubernetes, EC2, etc.)
* Collaborate with product teams to reduce service disruptions and automate incident response.
* Proactively find and analyze reliability problems across our business units and stack, then design and implement software to create step-function improvements.
* Educate, mentor and hold accountable the engineering team to improve the reliability of our systems and make reliability a core value of the Kalshi engineering culture.
* Write high quality, well tested code to meet the needs of your customers.
* Debug extremely difficult technical problems, and make systems and products work better and easier to deploy, own, operate and diagnose.
* Review all feature designs within your product area and across the company for cross-cutting projects.
* Be an owner of the security, safety, scale, operational integrity, and architectural clarity of these designs.
* Build integrations with 3rd party vendors.
* Participate in an on-call support rotation to provide timely troubleshooting and resolution of urgent issues.
What we're looking for
Attributes:
* You have at least 4 years of experience in software engineering.
* You've designed, built, scaled and maintained production services, and know how to compose a service oriented architecture.
* You write high quality, well tested code to meet the needs of your customers.
* You're passionate about building an open financial system that brings the world together.
* You possess strong technical skills for system design and coding.
* Excellent written and verbal communication skills, and a bias toward open, transparent cultural practices.
* Strong skills around observability, debugging and performance tuning.
* Strong interpersonal skills working with engineers from junior to principal levels
* Demonstrated critical thinking under pressure.
* A willingness to dive into understanding, debugging, and improving any layer of the stack.
* On-call availability to ensure swift resolution of issues.
Bonus points
* Experience designing and building reliable systems capable of handling high throughput and low latency.
* Experience with Datadog.
* Experience with Rust, Go and Terraform.
* Experience with AWS, GCP, or Azure.
* Experience working in a highly regulated environment.
* Experience writing company-facing blog posts and training materials.
Our Culture
Meritocracy is at our core, and we value people who take ownership and figure (usually hard) things out. We dream big. We love our craft deeply and are proud of what we put out in the world. We are committed to our vision of building something big… but also useful: a product that brings more truth through the power of markets.
Kalshians are Kalshi's most important asset: we pick Kalshians carefully, so we trust them fully on day 1.
NYC Pay Transparency Disclosure:
Salary Range: $100,000 to $250,000 annually plus equity and benefits.
This salary range is based on the current available market data and represents the expected salary range for this role. Kalshi has minimal hierarchy and few titles, but a broad range of experience is represented within roles. Should you have compensation expectations that exceed these bands, we'd love to hear from you and would welcome you to reach out to discuss further.
Commitment to Equal Opportunity
Kalshi is committed to creating a culture of inclusion and belonging, and we are proud to be an equal opportunity employer. We believe it is our collective responsibility to uphold these values and encourage candidates from all backgrounds to join us in our mission. All qualified applicants will be treated with respect and receive equal consideration for employment without regard to race, color, creed, religion, sex, gender identity, sexual orientation, national origin, disability, uniform service, veteran status, age, or any other protected characteristic per federal, state, or local law. If you are passionate about what you do and want to use your talents to support our mission and values, we'd love to hear from you.
Reliability Engineer
Reliability engineer job in Parsippany-Troy Hills, NJ
Summary
As the Reliability Engineer for Metem, a GE Vernova business, you will be an active contributor to the success of the organization by improving the reliability, availability, and performance of our equipment and processes. You will analyze failure data, develop maintenance strategies, and work cross-functionally to implement proactive measures that reduce downtime and increase efficiency.
Job Description
What you'll do
Develop and implement reliability improvement strategies using industry best practices such as RCM (Reliability-Centered Maintenance), FMEA (Failure Mode and Effects Analysis), and Root Cause Analysis (RCA).
Monitor and analyze equipment performance and failure data to identify trends and areas for improvement.
Collaborate with maintenance, operations, engineering, and safety teams to design and implement preventive and predictive maintenance programs.
Establish key performance indicators (KPIs) for equipment reliability, and track progress against targets.
Drive continuous improvement initiatives aimed at reducing equipment downtime and maintenance costs.
Lead investigations into equipment failures and chronic issues, identifying root causes and implementing long-term solutions.
Provide technical support for asset management, including equipment life cycle analysis and spare parts optimization.
Participate in the design and installation of new equipment, ensuring reliability is considered from the outset (Design for Reliability).
Eligibility Requirements
This role requires use of technical data subject to U.S. Government export restrictions and this posting is only for U.S. Persons (U.S. Citizens, lawful permanent residents and protected individuals (e.g., certain refugees and asylees)). GE will require proof of status prior to employment.
This is an onsite position based in Parsippany, NJ.
Must be open to travel requirements: ability to travel to the Allentown, PA facility approximately once per week and to Hungary an average of 2 times a year.
What you'll bring (Basic Qualifications)
Bachelor's degree from an accredited university in Mechanical, Electrical, or Industrial Engineering.
Minimum of 7 years of experience in reliability, maintenance, or engineering
Strong knowledge of reliability engineering tools and methodologies (e.g., FMEA, RCA, Weibull analysis, MTBF/MTTR).
Strong knowledge of engineering concepts and maintenance repair methods.
Ability to interpret blueprints, specifications, drawings, and schematics.
Experience with Maintenance Management Systems.
Project Management skills and experience.
What will make you stand out
You have completed your CMRP certification.
You have a Six Sigma certification.
You are detail oriented with good organizational skills.
You have excellent verbal and written communication skills.
You have experience in chemical manufacturing operations and/or CNC machining facilities.
You have a Process Safety Management background.
This role requires access to U.S. export-controlled information. If applicable, final offers will be contingent on ability to obtain authorization for access to U.S. export-controlled information from the U.S. Government.
Additional Information
GE Vernova offers a great work environment, professional development, challenging careers, and competitive compensation. GE Vernova is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, national or ethnic origin, sex, sexual orientation, gender identity or expression, age, disability, protected veteran status or other characteristics protected by law.
GE Vernova will only employ those who are legally authorized to work in the United States for this opening. Any offer of employment is conditioned upon the successful completion of a drug screen (as applicable).
Relocation Assistance Provided: Yes
For candidates applying to a U.S. based position, the pay range for this position is between $123,700.00 and $206,200.00. The Company pays a geographic differential of 110%, 120% or 130% of salary in certain areas. The specific pay offered may be influenced by a variety of factors, including the candidate's experience, education, and skill set.
Bonus eligibility: discretionary annual bonus.
This posting is expected to remain open for at least seven days after it was posted on November 18, 2025.
Available benefits include medical, dental, vision, and prescription drug coverage; access to Health Coach from GE Vernova, a 24/7 nurse-based resource; and access to the Employee Assistance Program, providing 24/7 confidential assessment, counseling and referral services. Retirement benefits include the GE Vernova Retirement Savings Plan, a tax-advantaged 401(k) savings opportunity with company matching contributions and company retirement contributions, as well as access to Fidelity resources and financial planning consultants. Other benefits include tuition assistance, adoption assistance, paid parental leave, disability benefits, life insurance, 12 paid holidays, and permissive time off.
GE Vernova Inc. or its affiliates (collectively or individually, “GE Vernova”) sponsor certain employee benefit plans or programs. GE Vernova reserves the right to terminate, amend, suspend, replace, or modify its benefit plans and programs at any time and for any reason, in its sole discretion. No individual has a vested right to any benefit under a GE Vernova welfare benefit plan or program. This document does not create a contract of employment with any individual.
Site Reliability Engineer
Reliability engineer job in New York, NY
Superblocks is reimagining software development for a billion builders. Our mission is to help every team build, deploy, and manage AI-powered software with full control and flexibility.
Why Join Us?
We're one of the fastest-growing AI startups, backed by top-tier investors and widely adopted by companies like Instacart, SoFi, Betterment, and Carrier. Our team comes from Uber, Stripe, Datadog, Confluent, Elastic, and Google, and has founded/architected systems like Kafka, Kibana, Debezium, Datadog APM, and more.
Since launching Clark, our AI builder, the response has been overwhelming with strong adoption from enterprises across different industries.
We're fully in-person at our NYC HQ near Union Square and are looking for exceptional engineers who are passionate about creating great products.
The Role
You'll play a key role in designing and developing the core systems that power and manage hundreds of thousands of AI applications. If you're interested in building and operating complex infrastructure in production, innovating new AI agent architectures, and building with some of the sharpest engineers, this is the place for you.
Responsibilities:
* Architect and operate scalable production systems supporting both multi-tenant cloud and on-premise deployments.
* Design and develop a real-time distributed execution engine that powers all AI applications, workflows, and agents.
* Build, deploy, and optimize AI agent architecture, guardrails and evals.
* Partner with product and customers to define the roadmap and bring new builder and AI experiences to life
Must haves:
* 3+ years of experience managing cloud-based production apps with deep knowledge of containers, VMs, caches, task queues, networking, and OS.
* Designed and deployed infrastructure in production at scale with containerized solutions like Docker, Kubernetes (k8s), ECS/EKS, Firecracker etc.
* Strong product sense focused on great user experiences and strategic thinking to meet market and customer needs.
Nice to haves:
* Built and operated production AI systems and are familiar with AI inference techniques
* Optimized language runtimes and enabled cross-language integration (e.g., Go, Python, C), including customizing or building WASM compilers and runtimes.
* Experience with machine learning algorithms, platforms, and frameworks like PyTorch and TensorFlow.
Compensation
The base salary ranges between $175,000-$225,000+ USD, plus a generous equity package and benefits. Final comp will be based on experience and skills.
If you're excited to build the core infrastructure powering the next billion AI-powered apps, let's talk.
Lead Site Reliability Engineer
Reliability engineer job in New York, NY
Help us use technology to make a big green dent in the universe!
Kraken powers some of the most innovative global developments in energy.
We're a technology company focused on creating a smart, sustainable energy system. From optimising renewable generation, creating a more intelligent grid and enabling utilities to provide excellent customer experiences, our operating system for energy is transforming the industry around the world in a way that benefits everyone.
It's a really exciting time in energy. Help us make a real impact on shaping a better, more sustainable future.
Our Global Platform Engineering Reliability group is responsible for architecting, developing, and maintaining the resilient and scalable infrastructure that powers and supports our platforms.
As a Lead Site Reliability Engineer within the newly created ‘Product Reliability' team, you'll be responsible for ensuring the availability, performance, and scalability of the products on our platform. Your proficiency in leading technical teams that support products serving millions of customers will ensure stability and high performance for our brands and clients.
You will keep up with best practices in building products for scale. Your communication skills and attention to detail will be indispensable as you pinpoint areas for enhancement, ensure optimal product performance, and continuously improve our platform's reliability and efficiency.
What you'll do:
Team leadership
Have ownership of the Product Reliability team within Platform, working closely with the Director and Heads of Platform Engineering to define strategic objectives and team direction
Manage team priorities and ensure initiatives are completed within deadlines
Collaborate regularly and effectively with the Staff Platform Engineer in your functional team to deliver the technical implementation of the team's strategic priorities
Lead delivery of major initiatives on clear timelines
Partner effectively in the wider Platform Engineering team to deliver outcomes
Build a strong culture of open communication where teammates can ask questions without fear, promoting a positive and inclusive team environment
People management
Line-manage the engineers in the Product Reliability team
Set clear performance expectations and goals for team members
Regularly review individual and team performance, offering actionable insights and constructive feedback to support and grow team members
Technical delivery
Deliver technical improvements such as small features and bug fixes
Support team delivery through code reviews, technology research and architectural guidance
Provide support for service offerings owned by your team
Help solve interesting and difficult problems. There's a great opportunity for disruption in the global energy market
What you'll have:
Excellent communication skills, working effectively with developers, product managers and other business stakeholders to understand and deliver impactful projects and reliability improvements
Record of successfully and consistently delivering critical path projects, on time and at scale
Meticulous organisation and planning skills
Experience of mentoring and coaching a team to perform at a high level of quality
Experience managing and supporting large-scale internet-facing distributed systems, for millions of customers
Good experience with AWS and a programming language. We use a lot of different AWS services and not just the standard few
Knowledge of security best-practices, security and CI/CD tooling, and methodologies
We're hiring this role in New York City, but would also consider remote candidates who are based in the EST timezone. We cannot consider any applicants outside this region
What will help:
Previous experience in leading technical delivery for small, highly-autonomous teams
Previous experience as a technical individual contributor, preferably as a Site Reliability Engineer
Track-record of effective collaboration with other teams and departments to drive holistic outcomes
A proactive, innovative mindset with the ability to drive continuous improvement
Previous experience working in a remote-first asynchronous global team
Familiarity with some of our tech stack:
- PostgreSQL, or a similar RDBMS, particularly in Amazon RDS at scale
- Docker and Kubernetes, we use Amazon EKS in production
- Python
- Datadog, or a similar logging/monitoring tool
- Messaging queues, event-driven async processing or similar technologies - we use RabbitMQ
- Terraform, or a similar infrastructure-as-code tool
- Experience with a Linux distribution
Why you'll love it here:
Great medical, dental, and vision insurance options including FSAs.
Paid time off - we know working hard means also being able to recharge as needed, we trust our employees to get the work done and take the time they need.
401(k) plan with employer match.
Parental leave. Biological, adoptive and foster parents are all eligible.
Pre-tax commuter benefits.
Flexible working environment: you need to shift around your schedule? You do you, we genuinely believe in work/life balance.
Equity Options: every Octopus employee owns part of the business. We're a team, working together towards huge goals. Every person is crucial to our success, you should be rewarded as such.
Modern office or co-working spaces depending on location.
We hire a wide range of experience levels into our platform team. The salary for this role in the US ranges on average from $180,000-$220,000 depending on relevant experience, role alignment, and performance throughout the interview process. While the broad salary range is listed, not all candidates will be placed at the top of the range; this will be determined by the overall fit for the position. If you have questions about this, just ask! Our recruiters are happy to provide more context.
We are hiring this role remotely in North America and require candidates to be based in the EST timezone. We cannot consider candidates outside this region.
Kraken is a certified Great Place to Work in France, Germany, Spain, Japan and Australia. In the UK we are one of the Best Workplaces on Glassdoor with a score of 4.7. Check out our Welcome to the Jungle site (FR/EN) to learn more about our teams and culture.
Are you ready for a career with us? We want to ensure you have all the tools and environment you need to unleash your potential. If you have any specific accommodations or a unique preference, please contact us at ********************* and we'll do what we can to customise your interview process for comfort and maximum magic!
Studies have shown that some groups of people, like women, are less likely to apply to a role unless they meet 100% of the job requirements. Whoever you are, if you like one of our jobs, we encourage you to apply as you might just be the candidate we hire. Across Kraken, we're looking for genuinely decent people who are honest and empathetic. Our people are our strongest asset and the unique skills and perspectives people bring to the team are the driving force of our success. As an equal opportunity employer, we do not discriminate on the basis of any protected attribute. We consider all applicants without regard to race, colour, religion, national origin, age, sex, gender identity or expression, sexual orientation, marital or veteran status, disability, or any other legally protected status. U.S. based candidates can learn more about their EEO rights here.
Our (i) Applicant and Candidate Privacy Notice and Artificial Intelligence (AI) Notice, (ii) Website Privacy Notice and (iii) Cookie Notice govern the collection and use of your personal data in connection with your application and use of our website. These policies explain how we handle your data and outline your rights under applicable laws, including, but not limited to, the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Depending on your location, you may have the right to access, correct, or delete your information, object to processing, or withdraw consent. By applying, you acknowledge that you've read, understood and consent to these terms.
Staff Site Reliability Engineer
Reliability engineer job in New York, NY
Who We Are
Develocity is a first-of-its-kind toolchain observability and acceleration platform that helps software teams adopt and improve DORA capabilities (including continuous delivery) in order to achieve software delivery excellence. It combines build and test acceleration with deep observability for builds and tests with Gradle Build Tool, Apache Maven™, sbt, npm, and Python, and applies to both CI and local builds and tests. Ultimately, Develocity provides an operational layer across an organization's toolchains to speed up, troubleshoot, and optimize local developer and remote CI feedback loops.
Our software is used by some of the world's leading software organizations, such as Netflix, Airbnb, SAP, several top ten banks, and many other major customers across all verticals. We regularly collaborate with these and other users to make our products continuously better.
We have partnered with the Apache Software Foundation, the Commonhaus Foundation, the Scala Center, the Micronaut Foundation, and other OSS projects like Spring, Quarkus, Kotlin, JUnit, AndroidX, and many more to bring the values of Develocity also to the OSS Community.
Our Values
Seek to Understand: Everything starts with listening and understanding, and we strive to understand different viewpoints, problems, and motivations. Before we take action, we ensure we truly grasp the challenges, perspectives, and goals.
Know the Why: We approach our work with a clear sense of purpose, ensuring every step is deliberate and focused. We take meaningful action with urgency, but never at the expense of thoughtful consideration.
Innovate & Iterate: We embrace challenges and are not afraid to try new things, even if they might fail. With deep understanding and a clear purpose, we can develop creative and bold solutions to tackle challenges.
Own the Outcome: We are empowered to take initiative and we maintain transparency in our work and its outcomes. When we execute, we take responsibility for our decisions, measure the success of our innovations, and learn from the results.
Who You Are
We're building a new SRE team and looking for founding members to help shape how we operate. As a Lead SRE, you'll be a technical and operational leader for reliability across Develocity. You'll help define our SRE vision, set standards for how we operate production services, and mentor other SREs as the team grows. This is a hands-on role with broad influence across engineering, cloud platform, and customer-facing teams.
The SRE team will be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries.
You'll work on our internally-built Cloud Application Platform, Kubernetes on AWS, and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well.
You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential.
Responsibilities
Operate and maintain all Develocity instances and supporting services in production.
Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs.
Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents.
Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements.
Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets.
Identify systemic reliability risks and continuously evolve Develocity's SaaS operations as the platform and customer base grow.
Lead and influence architectural and design reviews to ensure reliability, scalability, and operability.
Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows.
Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting.
Own disaster recovery, backups, and business continuity planning and execution.
Partner with engineering leadership to balance feature delivery with reliability and operational excellence.
Mentor and coach SREs, supporting technical growth and strong operational practices.
Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity.
Communicate clearly with customers during incidents and maintenance windows.
Optimize performance, resource utilization, and operational costs.
Minimum qualifications
7+ years in SRE, DevOps, or an equivalent role operating production services at scale.
Experience leading reliability initiatives across multiple teams or services.
Demonstrated ability to influence technical direction without direct authority.
Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost.
Strong Kubernetes experience in production environments.
Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2).
Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform).
Track record of incident management and response in a 24/7 on-call environment.
Scripting proficiency (Python, Bash) for automation.
Strong written and verbal English communication skills.
Preferred qualifications
Experience as a founding or early SRE establishing practices in a growing SaaS organization.
Familiarity with Develocity.
JVM language experience (Java, Kotlin).
Experience with customer-facing and executive-level incident communications.
What We Offer
A ground-floor role in a new SRE team - you'll shape how we do things, not inherit someone else's decisions.
Real ownership of production systems used by engineers at companies you've heard of.
Direct interaction with customers when things go wrong (and when they go right).
A culture that values automation over heroics.
In-person meetings, such as our annual company offsite and team meetings.
Work from home in a remote-first environment.
Competitive salaries and equity grants.
Compensation
The US salary range for this position is $180-220k which reflects the target ranges for all US locations. Within this range, individual pay is determined by geographic location and additional factors including but not limited to experience, relevant skills, qualifications, seniority, performance, and travel requirements. Our recruiting team can share more information about the specific salary range for your location during the hiring process.
Location
Remote from anywhere in EST timezone.
While our team works remotely and is spread across the globe, we deeply value daily interactions and collaboration.
Staff Site Reliability Engineer
Reliability engineer job in New York, NY
Altana is the network for trusted trade. Our AI-powered product network empowers governments and businesses to build a more resilient and secure global economy while keeping trade flowing.
The Opportunity at Altana
At Altana, we believe that software that ships must be reliable and efficient. As a Staff Site Reliability Engineer, you will be instrumental in ensuring the availability, performance, and scalability of Altana's critical production services, with a strong focus on our cloud-native environments and data pipelines. You will apply Google-style SRE principles, embedding reliability into our architecture and operations through automation, proactive monitoring, and a commitment to reducing toil.
You will work hands-on with engineering teams, influencing system design for operability and contributing to the development of robust, self-healing infrastructure. This role emphasizes a deep understanding of observability practices to gain comprehensive insights into system behavior, proactive incident prevention, and efficient incident response. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency and reliability.
Your Responsibilities
Reliability Engineering: Champion and implement SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services. Drive initiatives to improve system reliability, availability, performance, and efficiency.
Observability & Monitoring: Design, implement, and maintain advanced monitoring, logging, and tracing solutions for our cloud-native applications and infrastructure (e.g., Kubernetes, microservices). Develop dashboards, alerts, and runbooks that provide deep insights into system health and behavior.
Automation & Toil Reduction: Identify and automate repetitive operational tasks and manual processes across our production environment. Develop tools and scripts to enhance system operations, deployment pipelines, and incident response.
Incident Management & Postmortems: Actively participate in the incident response lifecycle, including detection, triage, mitigation, and resolution of production issues. Lead thorough blameless postmortems to identify root causes and implement preventative measures and lasting improvements.
System Design & Optimization: Collaborate closely with development teams to influence the design of new services, ensuring they are built for operability, reliability, and cost-efficiency. Proactively identify and address performance bottlenecks and architectural weaknesses.
On-Call Rotation: Participate in a periodic on-call rotation, responding to critical alerts and ensuring rapid resolution of production incidents.
Data Reliability: Implement and maintain reliability and observability for critical data pipelines and data infrastructure, ensuring data integrity, availability, and timely processing.
About You
5+ years of hands-on experience in a Site Reliability Engineering (SRE), DevOps, or equivalent role focusing on production system reliability and operations.
Strong understanding and practical application of Site Reliability Engineering (SRE) principles, including SLOs, error budgets, toil reduction, and blameless culture.
Expertise in designing, implementing, and managing observability platforms for cloud-native environments (e.g., Prometheus, Grafana, Datadog, ELK stack, OpenTelemetry, Jaeger).
Proficiency in at least one programming/scripting language (e.g., Python, Go) for automation and tool development.
Extensive hands-on experience with cloud platforms (AWS, Azure, or GCP), including their compute, networking, and database services.
Demonstrated experience with containerization technologies (Docker) and container orchestration platforms (Kubernetes).
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, OpenTofu, CloudFormation) for managing cloud resources.
Proven experience participating in and improving incident management processes for critical systems.
Knowledge of modern software delivery paradigms, including microservices architectures and CI/CD pipelines.
Excellent problem-solving, analytical, and troubleshooting skills in complex distributed systems.
Strong communication and collaboration skills, with the ability to work effectively across engineering teams.
Experience with data engineering concepts, including building or operating reliable data pipelines, data streaming technologies, or managing large-scale data infrastructure.
This role can be based in New York City, or the San Francisco Bay Area with an expectation of occasional travel as needed.
US Salary Range and Benefits
$170,000 - $220,000
The salary range, to the extent specified for this role, is a good faith statement of the minimum and maximum levels of the annual based salary for the position. The base salary offered to a successful candidate will depend on a wide range of compensation factors, including, but not limited to, work experience, education and/or training, critical skills, and/or business considerations. Competitive equity grants are included in the majority of full time offers; and are considered part of Altana's total compensation package. Altana also offers either a discretionary bonus or a variable compensation plan depending on the role. Additionally, Altana offers top-tier benefits for full-time employees, including:
Flexible Time Off: Altana operates with a Flexible Time Off (FTO) policy that gives you agency over your own time off so you can maximize your work-life balance.
Parental Leave: We offer industry leading Paid Parental Leave (PPL), providing 14 weeks of leave for non-birthing, adoptive, and foster parents and up to 26 weeks of leave for birthing parents, all paid at 100% of your base salary.
Health Benefits: We have a full suite of medical, vision, and dental benefits with generous employer contributions, designed to give you flexibility and choice for your individual health situation. Our high deductible health plan is 100% employer paid for employees and supplemented with an employer contribution to your Health Savings Account (HSA). There is also a Flexible Spending Account (FSA) option.
Supplemental Benefits: Altana provides life, short- and long-term disability, and AD&D insurance coverage, all at no cost to you, so you know that you and your loved ones are covered in case of an emergency.
401(k) Savings: Save for and invest in your future using our Guideline 401(k) retirement savings program.
Commuter Benefits: Save money on your commute by setting aside pre-tax funds for public transit or parking!
Wellness: Because we value mental and emotional health, every Altana employee has access to a free premium subscription to Calm, the #1 app for meditation, sleep, and mindfulness.
Pet Insurance: Pets are family too! Keep them healthy with Wishbone insurance and / or our Total Pet vet service and telehealth discount plan.
Employee Assistance Program: Free access to confidential personal support.
Dependent Care FSA: You will have access to a Dependent Care FSA, which allows you to set aside pre-tax funds for childcare expenses
The recruiter assigned to this role can share more information about the specific compensation and benefit details associated with this role during the hiring process.
Equal Opportunity Statement
At Altana, we believe that a diverse workforce enables greater creativity, performance, and adaptability. We're proud to be an equal opportunity employer and welcome you to join us as you are. Our employment opportunities and decisions are based on business needs and individual qualifications, without regard to race, color, religious creed, national origin, ancestry, age, physical or mental disability, medical condition, marital status, sexual orientation, gender identity or expression, genetic information, family care or medical leave status, military or veteran status, or any other characteristic protected by the laws or regulations in the areas in which we operate. We prohibit discrimination and harassment of any type, in any situation.
Offers related to employment at Altana will come from an Altana.ai email address. We will never ask for payment as part of the interview or onboarding process.
Our Values
Our values are the core beliefs that shape who we are, what we stand for, and how we behave. They form the foundation of Altana's culture and integrity and guide how we hire, design, build, and connect with each other and our customers.
Trust: Our customers and partners entrust us with missions of the highest importance. We honor that by keeping our word, meeting commitments, and ensuring every action we take reinforces confidence in us. We rely on each other to deliver, to speak openly, and to hold ourselves accountable.
Resilience: In a world of uncertainty and complexity, our work must withstand challenges, evolve with conditions, and ensure reliability over time. Resilience is both how we operate and what we deliver. It's how we respond when things don't go to plan - we adapt, we support each other, and we keep moving forward.
Stewardship: We are stewards of every mission we touch. Because our work impacts lives and futures, we hold ourselves accountable to delivering mission impact and never compromising. Our responsibility extends beyond individual projects to the broader system of global trade. We believe that stewardship starts from within so that we can bring focus, creativity, and excellence to our work. Each of us is personally responsible for fostering a workplace where people can thrive. And we are stewards of the greater good of the company. By holding ourselves and each other accountable, we build a culture of innovation and collective success that reflects the scale of our mission.
Courage: Courage is what unlocks the seemingly impossible for our customers. It's the core value that drives us to make bold moves and take on big, complicated network problems, the ones others avoid. We know success isn't guaranteed, but we have the audacious vision to believe a solution is possible and to build it. Courage fuels our growth mindset. It means embracing challenges that make us stronger, and it's demonstrated by how we approach hard conversations and complex projects.
Site Reliability Engineer NAM (F/M/D)
Reliability engineer job in New York, NY
Flowdesk's mission is to build a global financial institution for digital assets, one designed from the ground up for market integrity and efficiency.
To achieve this in a rapidly evolving market, we apply a disciplined, first-principles approach to everything we do. This approach is embedded in our core services, from institutional liquidity provision, trading solutions, OTC execution to our comprehensive treasury management offerings. This is how we cut through the noise and build robust and scalable systems across all our business lines.
We seek individuals who are driven by this systematic approach. Joining Flowdesk means you will be a key contributor in building and scaling a more transparent and efficient financial markets infrastructure.
You will be part of the infrastructure team. Reporting to Flowdesk's Lead of Infrastructure and in permanent interaction with the Engineering/Trading/Data teams, your mission is to improve, scale, and maintain Flowdesk's infrastructures.
The Role
As a Core SRE Engineer at Flowdesk, you will be at the heart of our infrastructure operations, responsible for the reliability, performance, and scalability of our global high-frequency trading platform.
Reporting to Flowdesk's head of Infrastructure and working in close collaboration with the Engineering, Trading, and Data teams, you will play a crucial role in maintaining, enhancing, and scaling the critical systems that power our operations.
Your Mission
Infrastructure Management & Reliability
Maintain and optimize critical infrastructures including NATS (ultra-low latency networking stack), cloud infrastructures, data pipelines, and core trading systems.
Monitor and optimize network infrastructure to ensure peak performance and minimal downtime.
Ensure reliability, security, scalability, and performance of the company's essential systems.
Participate in day-to-day operational activities and incident management.
Kubernetes & Orchestration
Develop and maintain in-house Kubernetes operators to manage our evolving stack.
Manage containerized workloads and ensure smooth deployments.
Automation & DevOps
Improve and add new features to our CI/CD systems (FluxCD, GitHub Actions, Python scripting)
Implement robust monitoring solutions using Prometheus, Grafana, and similar technologies
Advance DevOps practices, application-level reliability, and observability
Collaboration & Innovation
Engage with Flowdesk's teams to understand their needs and deliver technical solutions
Provide technical support that enhances operational capabilities across teams
Propose improvements and bring innovative ideas to enhance performance and reliability
Explain technical concepts effectively to non-technical stakeholders
Take a look at our stack: stackshare.io/flowdesk/flowdesk
Requirements
Essential Experience
Kubernetes Expertise
Systems & Network Management - Knowledge of network infrastructure administration, especially for ultra-low latency setups and interconnects.
Cloud Infrastructure - Experience with cloud services (AWS, GCP) and infrastructure-as-code tools (Terraform, Ansible).
Coding Skills - Comfortable coding in languages like Golang, Rust, or Python for infrastructure automation and tooling.
Monitoring & Observability - Hands-on experience with monitoring solutions (Prometheus, Grafana) and implementing proactive detection systems.
Highly Valued
Experience with CI/CD pipelines, FluxCD, and GitHub Actions.
DevOps methodologies and practices.
Experience working in a trading or high-frequency trading environment.
Experience with ultra-low latency systems and network optimization.
Knowledge of FinOps and cost optimization strategies.
Performance tuning in high-throughput environments.
Benefits
> International environment (English is the main language)
> 100% Coverage from Justworks Benefits (Medical, Dental, and Vision plans)
> Team events and offsites
The base salary range for this role is between $150,000 - $200,000 in the State of New York. This range is not inclusive of our discretionary bonus. When determining a candidate's compensation, we consider a number of factors including skillset, experience, job scope, and current market data.
Recruitment Process
> Are you interested in this job but feel you haven't ticked all the boxes? Don't hesitate to apply and tell us in the cover letter section why we should meet!
Here's what you can expect if you apply
HR Call with the Tech Talent Acquisition (30')
Technical round with the Lead of the Infrastructure team (60')
Technical interview with the wider Engineering team (60')
CTO Interview (30')
Culture fit with the Lead Talent Acquisition (30')
On the agenda: discussions rather than trick questions! These moments of exchange will allow you to understand how Flowdesk works and its values. But they are also (and above all) an opportunity for you to present your career path and your expectations for your next job!
So... Ready to Join Us?
If you're excited by the opportunity to shape crypto's future and directly impact our cutting-edge infrastructure, we'd love to hear from you! Apply today and let's explore how we can build great things together.
We are committed to an inclusive and accessible recruitment process. If you require any reasonable adjustments or have specific needs to enable you to participate fully in the interview or assessment process (e.g., a sign language interpreter, extra time for a test, or an accessible location), please contact us to discuss how we can support you.
Staff Site Reliability Engineer
Reliability engineer job in New York, NY
Ro is a direct-to-patient healthcare company with a mission of helping patients achieve their health goals by delivering the easiest, most effective care possible. Ro is the only company to offer nationwide telehealth, labs, and pharmacy services. This is enabled by Ro's vertically integrated platform that helps patients achieve their goals through a convenient, end-to-end healthcare experience spanning from diagnosis, to delivery of medication, to ongoing care. Since 2017, Ro has helped millions of patients, including one in every county in the United States, and in 98% of primary care deserts.
Ro has been recognized as a Fortune Best Workplace in New York and Health Care for four consecutive years (2021-2024). In 2023, Ro was also named Best Workplace for Parents for the third year in a row. In 2022, Ro was listed as a CNBC Disruptor 50.
The Role:
At Ro, our mission is to provide world-class healthcare by putting patients first - and that mission depends on reliable, secure, and scalable systems. As a Staff SRE on the infrastructure team, you'll sit at the core of that effort: owning the reliability of our production systems, hardening infrastructure and building tools that empower our engineers to ship safely and confidently.
You will work across teams to drive uptime, performance and observability - partnering closely with product, platform and security engineers.
From designing resilient systems to shaping incident response practices, this is a role for engineers who thrive on impact and care deeply about operational excellence.
What You'll Do:
* Design and implement resilient infrastructure to support high availability at scale
* Build and contribute to tools and platforms that streamline deployment, monitoring and recovery of systems
* Drive incident response and harness learnings, leading efforts to minimize downtime and improve MTTR
* Partner with engineering teams to bake best practices for reliability, resilience and observability into services
* Automate infrastructure workflows using IaC and other cloud native tools
* Champion a culture of operational excellence, guiding engineers through reliability practices and raising the bar across the engineering org
What You'll Bring to the Team:
* Deep understanding of systems and infrastructure, with experience operating distributed services in production. We are mostly in AWS and leverage a lot of its primitives - EKS, RDS, Route 53, S3, and ElastiCache, to name a few
* Strong programming and automation skills using Go (bonus points for Python)
* Proficiency with infrastructure as code - Terraform / Pulumi
* A passion for observability, with hands-on experience in metrics, logging, tracing using Datadog
* Strong cross-functional communication, able to collaborate with product, platform, security and other teams
* An operational mindset that puts reliability and resilience as a core product requirement
* A mission-driven attitude, motivated by the opportunity to make healthcare better.
We've Got You Covered:
* Full medical, dental, and vision insurance + OneMedical membership
* Healthcare and Dependent Care FSA
* 401(k) with company match
* Flexible PTO
* Wellbeing + Learning & Growth reimbursements
* Paid parental leave + Fertility benefits
* Pet insurance
* Student loan refinancing
* Virtual resources for mindfulness, counseling, and fitness
The target base salary for this position ranges from $211,700 to $292,000, in addition to a competitive equity and benefits package (as applicable). When determining compensation, we analyze and carefully consider several factors, including location, job-related knowledge, skills and experience. These considerations may cause your compensation to vary.
Ro recognizes the power of in-person collaboration, while supporting the flexibility to work anywhere in the United States. For our Ro'ers in the tri-state (NY) area, you will join us at HQ on Tuesdays and Thursdays. For those outside of the tri-state area, you will be able to join in-person collaborations throughout the year (i.e., during team on-sites).
At Ro, we believe that our diverse perspectives are our biggest strengths - and that embracing them will create real change in healthcare. As an equal opportunity employer, we provide equal opportunity in all aspects of employment, including recruiting, hiring, compensation, training and promotion, termination, and any other terms and conditions of employment without regard to race, ethnicity, color, religion, sex, sexual orientation, gender identity, gender expression, familial status, age, disability and/or any other legally protected classification protected by federal, state, or local law.
Lead Site Reliability Engineer, AI/ML Platform
Reliability engineer job in Jersey City, NJ
JobID: 210694137
JobSchedule: Full time
JobShift:
Base Pay/Salary: Jersey City, NJ $152,000.00-$215,000.00
Responsibilities:
* Design and implement solutions to enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demands.
* Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing.
* Develop observability, security, automation and fin-ops tools and orchestration.
* Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability and automation frameworks.
* Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems.
* Debug and solve issues in a production environment, identify root cause and remediate.
* Participate in on-call rotations, incident management and escalation workflows.
* Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task.
* Mentor and guide junior engineers.
Required Qualifications:
* Bachelor's degree in Computer Science, Information Technology, or equivalent technical qualification with 5+ years of professional experience.
* Expertise in SRE principles, reliability, scalability and performance of application and infrastructure.
* Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible).
* Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services.
* Experience in architecting distributed systems and cloud-native architecture in AWS.
* Systematic problem-solving and troubleshooting skills in a complex system.
* Excellent communication skills and ability to represent and present business and technical concepts to stakeholders.
* Self-managed, self-motivated with strong sense of ownership, urgency, and drive
Good to have:
* Prior experience working in AI, ML, or Data engineering.
* Prior experience developing AI Ops/AI Agents.
* Multi cloud experience (AWS, GCP, Azure) is a plus
Director, Site Reliability Engineer
Reliability engineer job in New York, NY
Who We Are:
Ordergroove is a dynamic, fast-paced environment where you will be involved in building something of real value from the ground-up. We're looking for bright, talented people who are excited about innovation, growth, and the exciting world of Relationship Commerce. If you're motivated by a desire to solve problems and deliver groundbreaking insights and solutions you'll fit in perfectly!
About the Role:
Ordergroove is looking for an extraordinarily talented, passionate and naturally curious person to join our Engineering Team. Our Engineers are problem solvers who get excited about pushing the boundaries of what we can achieve, love learning and thrive in a fast-paced, collaborative environment. As the Site Reliability Engineer Director, you will join our SRE team, whose primary goal is to infinitely scale and secure our cloud-based hosted platform and accelerate time to market of code deployments while supporting the biggest brands in the world.
What You Will Do:
You will define and execute the vision for continuous delivery, cloud deployment strategies, and operational excellence.
You will spearhead our reliability, scalability, and automation efforts to ensure our platform operates securely and efficiently.
Work closely with engineering, security, and QA teams to enhance deployment and release processes.
You will help us scale how we push high-quality code, securely and efficiently.
You will guide, mentor, and collaborate with an awesome group of highly passionate engineers.
About You:
8+ years working in DevOps, SRE, or similar capacity managing cloud/private or hybrid environments.
Passionate about automation, leveraging best-of-breed technologies, and eager to learn new skills.
Experience working with automation tools (we use Terraform, Ansible and Chef).
Experience with Kubernetes, Docker, and container orchestration best practices.
Comfortable with Linux systems administration (CentOS / RHEL / Debian).
Experience configuring CI/CD pipelines using Jenkins, GitHub Actions, or GitOps.
Fluent with Apache/Nginx configurations.
Comfortable with MySQL/MongoDB administration and scaling strategies.
Experience designing high-availability/fault-tolerant systems.
Comfortable working with development teams to understand their pain points and come up with creative solutions.
Excellent communication skills.
Service-oriented mentality.
Great critical thinking skills.
Ability to quickly adapt to change.
Bonus Points For:
Python or Go experience.
Managing GCP hosted environments.
Previous eCommerce experience.
Experience going through PCI1 and SOC2 compliance approval processes.
Start-up and SaaS experience strongly preferred.
Familiarity with monitoring tools such as Prometheus, Grafana, or New Relic.
Bachelor's or Master's degree in Computer Science, Engineering, or a related field preferred.
If you don't meet 100% of the qualifications outlined above - that's okay, nobody's perfect! We encourage you to apply if you think this is a role that would make you excited to come into work every day.
About Ordergroove:
Ordergroove powers recurring revenue for the world's largest and most innovative retailers including L'Oreal, Dollar Shave Club, La Colombe Coffee, Bonafide Health, BarkBox and more. As a direct result, more than 11% of adult Americans have a subscription powered by Ordergroove. Our technology makes seamless, one-of-a-kind subscriber and membership experiences possible to turn one-time transactions into profitable recurring customer relationships.
Ordergroove's powerful platform empowers merchants with highly customizable options such as flexible promotions, bundling, and analytics to bolster their bottom line while making customers' lives easier. We recently achieved a milestone year with 152% year over year new business growth, and were rated best-in-class subscription technology by CB Insights and eCommerce Platform of the Year by RetailTech Breakthrough Awards.
Our company values celebrate collaboration, different perspectives, and curiosity with the goal of getting to the right answer, no matter who came up with it. At Ordergroove we are committed to creating a welcoming and supportive environment for all people. We encourage people with different backgrounds and experiences to join our growing team so that we gain different perspectives and build the best team possible. We demand the best of ourselves and each other and never miss an opportunity to celebrate our successes.
With a fully flexible work from anywhere culture, staying connected and supporting each other are always top of mind. We build our tight-knit community through small group events like trivia night, cooking classes, and book clubs. We encourage cross-functional relationships through virtual coffees and we stay close to the business through weekly team updates and quarterly all-hands meetings.
At Ordergroove, we focus on flexibility and empowering our team to make the right decisions for themselves. We have flexible PTO and a totally remote (anywhere in the US) workforce, and an annual personal development budget that you use for what matters to you (wellness, career development, productivity at home, etc). And of course, that is on top of the basics like competitive compensation (including stock options) and incredible, affordable benefits. Come join our amazing team while we enable the fastest-growing segment of commerce that makes life easier for millions of consumers every day!
At Ordergroove, we want to hire, develop and retain the best talent, making Ordergroove a top destination to grow your career.
The pay transparency law is a way of narrowing the gender pay gap and fostering an engaged and positive working environment. It is also a way to share what we think is a reasonable, equitable and competitive compensation structure for the roles on our team.
The total compensation range for this role starts at $175,000.
Site Reliability Engineer
Reliability engineer job in New Providence, NJ
CTH 6-12 months
Client Fiserv
MUST be a USC or GCH
Must be local to Berkeley Heights, NJ
Description:
What does a successful Site Reliability Engineer do at Fiserv?
A successful Site Reliability Engineer at Fiserv blends software engineering principles with operational discipline to create high-performing, reliable software systems. They design and implement tools, processes, and systems to improve the reliability, scalability, and performance of large-scale applications and services.
Requirements
Automate operational tasks and health checks to create sustainable systems and services.
Monitor the production environment to ensure system health using observability tools like Dynatrace and Splunk.
Identify reliability gaps through process reengineering and analyze performance metrics.
Collaborate with development operations for system design consulting, platform management, and capacity planning.
Create and maintain detailed documentation, including SOPs, configurations, and infrastructure maps.
What you will need to have:
5+ years of experience in Site Reliability Engineering (SRE) within a Fintech or product organization.
4+ years of experience with automation tools like Python, Java, Ansible, or PowerShell.
4+ years of experience with observability and monitoring tools such as Dynatrace, Splunk, Moogsoft, or Grafana.
Bachelor's degree in computer science or related technical field and/or 7+ years of relevant work experience.
What would be great to have:
Experience managing CI/CD pipelines and automation tools like GitLab, Harness, Nexus, Terraform, or SonarQube.
Strong problem-solving and critical thinking skills for root cause analysis and proactive solution implementation.
Effective communication skills for collaboration with cross-functional teams and customer interactions.
Job Type: Contract
Employment Type: C2C
Work Authorization: US Citizen; GC
Salary: 65/hr
Location: Berkeley Heights, New Jersey 07922
Reliability Engineer
Reliability engineer job in Allendale, NJ
Join an innovative and dynamic team as a Reliability Engineer, where you will play a pivotal role in maintaining and enhancing the performance of our systems. You will be responsible for ensuring data integrity, supporting quality investigations, and driving continuous improvements in asset management.
Responsibilities
* Maintain and update P&IDs and AutoCAD drawings for clean rooms, physical tags, BMS systems, and other relevant databases.
* Ensure data integrity of all P&ID items through comprehensive asset walkdowns.
* Support deviations, CAPAs, audits, quality investigations, and change controls across GMP environments.
* Create and standardize site asset management lists with a focus on continuous improvement in planning, tracking, and performance.
Essential Skills
* Proficiency in AutoCAD and reliability engineering.
* Experience with CMMS, Maximo, and Blue Mountain RAM (BMRAM).
* GMP experience and documentation control.
* Ability to work independently with a strong attention to detail.
Additional Skills & Qualifications
* Bachelor's degree in Engineering or a related discipline.
* Analytical skills with the ability to read and interpret blueprints, plans, and manuals.
* Excellent customer service skills with a desire to exceed customer expectations.
* Experience in a cGMP or aseptic environment is preferred.
Job Type & Location
This is a Permanent position based out of Allendale, NJ.
Pay and Benefits
The pay range for this position is $110,000.00 - $120,000.00/yr.
* (1) week of paid sick time
* (2) weeks of paid vacation + accrued paid time off
* Paid Federal Holidays + (4) paid floating holidays
* Fidelity 401k plan with a 6%-7% match program
Workplace Type
This is a fully onsite position in Allendale, NJ.
Application Deadline
This position is anticipated to close on Jan 2, 2026.
About Actalent
Actalent is a global leader in engineering and sciences services and talent solutions. We help visionary companies advance their engineering and science initiatives through access to specialized experts who drive scale, innovation and speed to market. With a network of almost 30,000 consultants and more than 4,500 clients across the U.S., Canada, Asia and Europe, Actalent serves many of the Fortune 500.
The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.
If you would like to request a reasonable accommodation, such as the modification or adjustment of the job application process or interviewing due to a disability, please email actalentaccommodation@actalentservices.com for other accommodation options.
Site Reliability Engineer
Reliability engineer job in New York, NY
Our client, a boutique US hedge fund, is seeking an SRE who will sit at the center of trading operations and infrastructure to ensure the automation and delivery of productive and efficient trading systems. This is a newly created position that provides the autonomy to identify gaps and improve the whole life cycle of services, from inception and design through deployment, operation, and refinement of our trading systems. The ideal candidate will demonstrate ownership of the DevOps processes through a systematic problem-solving approach and a desire to build robust, cutting-edge, and scalable systems.
As an SRE, you will:
Have a deep understanding of the trading workflow, ensuring its effectiveness across teams.
Assist in automating the release and deployment of software.
Streamline the build process.
Work in cluster environments.
Monitor and manage trading workflow, research and trading.
Build, streamline, organize and design framework.
Requirements
5+ years of experience in a DevOps/Cluster Engineer role.
Bachelor's Degree in Computer Science.
Knowledge of the Linux operating system, permissions, NFS.
Effective verbal and written communication skills.
Experience managing entire pipelines, working with tools such as Jenkins, Airflow and Ansible.
Highly productive in Python and Bash.
Experience in distributed/parallelized computing, running computational jobs on clusters, and developing systems in this area.
Experience in data recording, storage, and maintenance (backups, redundancy, compression and archiving).
Knowledge of CMake.
Knowledge of the continuous integration and deployment of code as well as the development toolchain.
Hardware and software expertise.
Strong problem solving aptitude.
Additional skills/experience that will reflect favorably
Experience installing/configuring the ELK stack, including Logstash input/output plugins.
Experience building Grafana/Kibana dashboards.
Familiarity with Jupyter notebooks and Docker.
Trading operations experience, build environment experience, development experience in Python and C++ preferred.
Production environment monitoring and maintenance (diagnostics and troubleshooting of live trading).
Cloud (AWS/GCP/Azure) and distributed computing (Slurm, Spark, etc.).
Thank you for illuminating hiring with Quanta Search!
Site Reliability Engineer
Reliability engineer job in New York, NY
The AI platform for investors and bankers that generates alpha and drives upside.
Founded in 2020 by George Sivulka and backed by Peter Thiel and Andreessen Horowitz, Hebbia powers investment decisions for BlackRock, KKR, Carlyle, Centerview, and 40% of the world's largest asset managers. Our flagship product, Matrix, delivers industry-leading accuracy, speed, and transparency in AI-driven analysis. It is trusted to help manage over $15 trillion in assets globally.
We deliver the intelligence that gives finance professionals a definitive edge. Our AI uncovers signals no human could see, surfaces hidden opportunities, and accelerates decisions with unmatched speed and conviction. We do not just streamline workflows. We transform how capital is deployed, how risk is managed, and how value is created across markets.
Hebbia is not a tool. Hebbia is the competitive advantage that drives performance, alpha, and market leadership.
The Role
As a highly skilled Site Reliability Engineer (SRE), you will contribute to building systems that optimize the uptime and reliability of our platform, and support the management and optimization of our DevOps and infrastructure operations. You will be responsible for owning our deployment pipelines, building and maintaining our continuous integration and continuous deployment (CI/CD) systems, ensuring the reliability and performance of our services, enhancing our observability, supporting our local development environments, and bolstering our security posture. Your technical expertise and problem-solving skills will contribute to the success of our AI products and shape the future of our technology stack.
Responsibilities
Assist in managing deployment pipelines to facilitate smooth and efficient software releases.
Help implement and maintain observability solutions for monitoring system performance and reliability.
Support local development environments to optimize developer workflows.
Work with development teams to ensure infrastructure aligns with project requirements.
Contribute to improving the security of our infrastructure by assisting with proactive measures and audits.
Assist in developing and maintaining automation scripts and tools to enhance operational efficiency.
Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations.
Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure.
Who You Are
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years software development experience at a venture-backed startup or top technology firm.
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
Strong expertise in managing CI/CD pipelines and deployment automation.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.
Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar.
Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
Familiarity with security best practices and tools for infrastructure and application security.
Excellent problem-solving skills and the ability to troubleshoot complex issues.
Strong communication skills and the ability to work effectively in a collaborative environment.
A proactive and self-motivated approach to learning and adopting new technologies.
Passion for continuous improvement and operational excellence.
Compensation
The salary range for this role is $160,000 to $300,000. This range may be inclusive of several career levels at Hebbia and will be narrowed during the interview process based on the candidate's experience and qualifications. Adjustments outside of this range may be considered for candidates whose qualifications significantly differ from those outlined in the job description.
Life @ Hebbia
PTO: Unlimited
Insurance: Medical + Dental + Vision + 401K
Eats: Catered lunch daily + DoorDash dinner credit if you ever need to stay late
Parental leave policy: 3 months non-birthing parent, 4 months for birthing parent
Fertility benefits: $15k lifetime benefit
New hire equity grant: competitive equity package with unmatched upside potential
#LI-Onsite
Site Reliability Engineer
Reliability engineer job in New York, NY
About Clay
Clay is a creative tool for growth. Our mission is to help businesses grow - without huge investments in tooling or manual labor. We're already helping over 100,000 people grow their business with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly translate any idea that you have for growing your company into reality.
We believe that modern GTM teams win by finding GTM alpha - a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We're looking for sharp, low-ego people to help teams find their GTM alpha.
Why is Clay the best place to work?
* Customers love the product (100K+ users and growing)
* We're growing a lot (6x YoY last year, and 10x YoY the two years before that)
* Incredible culture (our customers keep applying to work here)
* Well-resourced - We raised a $100M Series C in 2025 at a $3.1B valuation and are backed by world-class investors like CapitalG (Google), Sequoia and Meritech
Read more about why people love working at Clay here and explore our wall of love to learn more about the product.
SRE @ Clay
In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles.
What You'll Do
* Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
* Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
* Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
* Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
* Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
* Participate in an on-call rotation.
* Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.
What You'll Bring
* 5+ years of experience
* Experience with containerization and orchestration tools
* Strong understanding of CI/CD concepts and tools
* Knowledge of infrastructure automation tools
* Experience with on-call and incident response
* Proficiency in one or more programming languages
* Familiarity with our stack or ability to learn unfamiliar technologies quickly:
  * Aurora Postgres RDS, ElastiCache Redis, Docker + ECS, Lambda, OpenSearch
  * Terraform and Atlantis
  * CircleCI, Netlify, Playwright
  * CloudWatch, Datadog, Mezmo
  * TypeScript, Python