Project Quality Engineer
Reliability engineer job in Yonkers, NY
Job Title: Project Quality Engineer
Shift: 1st Shift (Monday - Friday)
Pay Rate: $75,000-$95,000 annually (commensurate with experience)
Type: Direct Hire
Reports To: QA Manager
Dept.: Quality Assurance
Job Description
The Project Quality Engineer supports the Quality Assurance Manager in overseeing the Quality Assurance program for assigned rail car manufacturing projects. This role ensures compliance with contract requirements, technical specifications, and industry standards across production, acceptance, warranty, and modification phases.
Responsibilities include creating Master Test and Inspection Plans, First Article Inspection schedules, Project Quality Plans, and audit procedures. The Project Quality Engineer coordinates closely with customers, vendors, and internal Kawasaki divisions to align project requirements, resolve quality issues, and support continuous improvement initiatives.
This position also monitors documentation, leads corrective action activities, conducts contract reviews, and provides weekly and monthly quality reports. The engineer will serve as a primary Quality liaison between internal teams, subcontractors, and customer Resident Inspectors, ensuring timely communication, follow-up, and delivery of all quality-related commitments.
Candidate Fit Summary
This candidate is an excellent fit for organizations in the rail, aerospace, transportation, and heavy manufacturing sectors where strict compliance, technical quality standards, and contractual requirements are essential.
They bring strong experience supporting complex production programs, managing supplier and customer interfaces, and developing detailed quality documentation. Skilled in FAI, FMEA, audits, and ISO 9001 processes, they excel in environments requiring strict quality controls, cross-functional coordination, and schedule accountability. Their ability to lead inspections, manage customer quality requirements, and drive corrective actions makes them a strong match for production-focused, project-driven engineering organizations.
Essential Functions
Implement and maintain QA programs for assigned contracts.
Develop Master Test and Inspection Plans, Project Quality Plans, FAI schedules, and audit procedures.
Attend project meetings and provide detailed quality status updates and reports.
Analyze quality issues, identify root causes, and drive corrective actions.
Coordinate with customers, suppliers, and internal teams across production and warranty phases.
Manage project quality schedules and interface with Resident Inspectors.
Ensure compliance with customer specifications, contract terms, and Kawasaki quality standards.
Review and approve subcontractor/supplier documentation (PSI, FAI, audits, drawings, field reports).
Monitor and report deviations, implement process improvements, and update procedures.
Support Configuration Management planning, execution, and product delivery.
Assist with subcontractor activity quality review and documentation.
Travel domestically/internationally up to 30% to support project quality functions.
Job Specifications
Bachelor's Degree in Engineering (Master's preferred).
Minimum five (5) years' experience in rail, aerospace, transportation, or heavy manufacturing.
Knowledge of FAI, FMEA, ISO 9001, and source inspection processes.
Strong communication, analytical, reporting, and computer skills.
Ability to plan, coordinate, and manage workloads across multiple concurrent projects.
Capable of working in both office and field/manufacturing environments.
Work Environment
Office and manufacturing floor settings.
Frequent interaction with engineering, production, and customer teams.
PPE required in production areas; must adhere to all safety protocols.
Candidate Fit
This candidate is a strong fit for Project Quality Engineering roles in complex manufacturing environments like rail, aerospace, automotive, and heavy industrial production. They have demonstrated capability in quality planning, regulatory compliance, supplier oversight, and customer interface management.
With experience leading FAIs, audits, and corrective actions while supporting production schedules, they excel in driving continuous improvement, ensuring contract compliance, and maintaining high standards of safety, product quality, and documentation integrity. Their structured approach, technical acumen, and ability to manage project-based workloads make them a key contributor to high-complexity engineering programs.
Company Overview
Founded in 2010, Top Prospect Group was created with a focus on matching high-quality candidates with top-tier clients while fostering an environment where success is shared by all. In 2023, the company was acquired by HW Staffing Solutions, expanding its service offerings to include technology, engineering, and professional services.
Qualified candidates are encouraged to apply immediately!
Please include a clean copy of your resume, salary expectations, and availability with your application.
Production Engineer
Reliability engineer job in New York, NY
A client of Insight Global in the Bronx, NY is seeking a Production Engineer to join their team! This individual will be responsible for leading manufacturing improvements by optimizing packaging line performance and minimizing downtime through data-driven analysis. They will partner with cross-functional teams to implement sustainable process enhancements and uphold quality standards, and apply Lean Six Sigma methodologies to drive efficiency and support continuous improvement. This is an onsite position; candidates are required to be onsite 5 days per week.
Required Skills & Experience
5-7 years of experience as an engineer in a manufacturing environment
Bachelor's degree in engineering (mechanical, chemical, or biomedical preferred)
Experience partnering with cross-functional teams
Strong understanding of Lean and Six Sigma principles
Site Reliability Engineer - AML Global Recommendation - USDS
Reliability engineer job in New York, NY
About the Team: Site Reliability Engineering (SRE) of the AML (Applied Machine Learning) team combines system engineering and the art of machine learning to develop and run a massively distributed AI/ML recommendation system for the United States and all around the world.
On the SRE team, you'll have the opportunity to sharpen your expertise in coding, performance analysis, and large-scale systems operation. Join us and you'll have the chance to shape the future of AML systems and make a real, tangible impact on TikTok users.
In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities:
* Design, build, and maintain highly available, scalable, and fault-tolerant systems.
* Monitor and analyze system performance, identifying and resolving issues before they cause user impact.
* Develop and maintain automated monitoring, alerting, and incident response systems.
* Collaborate closely with software engineering teams to ensure that applications are designed with reliability, scalability, and performance in mind.
* Implement and maintain security best practices and ensure compliance with regulatory requirements.
* Participate in on-call rotations and respond to issues and incidents within and outside of normal business hours.
* Conduct root cause analysis of incidents, hold post-mortem reviews with stakeholders, and implement preventative measures to minimize the risk of similar incidents occurring in the future.
Minimum Qualifications:
* Expertise in analyzing and troubleshooting Linux-based distributed systems.
* Bachelor's/Master's degree in Computer Science, Computer Engineering, or equivalent years of experience in an SRE or software engineering role.
* Experience programming with at least one commonly used language (C, C++, Python, Go).
* Strong understanding of data structures and algorithms.
* Competent knowledge of relational database systems.
Preferred Qualifications:
* Ability to design and maintain large-scale systems.
* Strong understanding of code optimization and routine task automation.
* Proficiency in at least one machine learning framework: TensorFlow, PyTorch, MXNet or PaddlePaddle
As a condition of employment, all successful candidates must be able to establish authorization to work in the United States. For this position, the Company does not provide sponsorship for any immigration-related benefits.
Software Engineer II - Site Reliability Engineer
Reliability engineer job in New York, NY
Technology is at the heart of Disney's past, present, and future. Disney Entertainment and ESPN Product & Technology is a global organization of engineers, product developers, designers, technologists, data scientists, and more - all working to build and advance the technological backbone for Disney's media business globally.
The team marries technology with creativity to build world-class products, enhance storytelling, and drive velocity, innovation, and scalability for our businesses. We are Storytellers and Innovators. Creators and Builders. Entertainers and Engineers. We work with every part of The Walt Disney Company's media portfolio to advance the technological foundation and consumer media touch points serving millions of people around the world.
Here are a few reasons why we think you'd love working here:
**Building the future of Disney's media:** Our Technologists are designing and building the products and platforms that will power our media, advertising, and distribution businesses for years to come.
**Reach, Scale & Impact:** More than ever, Disney's technology and products serve as a signature doorway for fans' connections with the company's brands and stories. Disney+. Hulu. ESPN. ABC. ABC News...and many more. These products and brands - and the unmatched stories, storytellers, and events they carry - matter to millions of people globally.
**Innovation:** We develop and implement groundbreaking products and techniques that shape industry norms, and solve complex and distinctive technical problems.
Product Engineering is a unified team responsible for the engineering of Disney Entertainment & ESPN digital and streaming products and platforms. This includes product engineering, media engineering, quality assurance, engineering behind personalization, commerce, lifecycle, and identity.
**Job Summary:**
As a Software Engineer on the COPEX team, you'll design and build the foundational backend systems that directly power the Hulu & Disney+ streaming experience. You will architect mission-critical, high-throughput services for API and content recommendation delivery, while also building the platforms that empower our entire engineering organization to ship code with speed and confidence.
You will join a talented team of engineers who build the software that:
+ Delivers foundational APIs and serves personalized streaming experiences to millions of users daily.
+ Enables our engineering organization to define, provision, and manage cloud infrastructure programmatically and at scale.
+ Allows teams to deploy changes to production swiftly and safely through sophisticated, automated CI/CD pipelines.
+ Provides deep insight into application performance via powerful, self-service observability and testing platforms.
+ Optimizes system capacity and cloud costs by engineering data-driven, automated solutions.
**Responsibilities and Duties of the Role:**
+ Architect, build, and scale foundational backend services for API delivery and content recommendation, focusing on high availability, low latency, and massive throughput.
+ Design, build, and evolve our CI/CD solutions, writing clean, scalable code to automate the entire build, test, and deployment lifecycle.
+ Architect and develop robust, scalable test automation frameworks that product teams will use for load, integration, and functional testing.
+ Write software to abstract and automate infrastructure provisioning, creating a seamless, self-service experience for engineering teams using Infrastructure as Code (IaC).
+ Develop the core software, libraries, and services that form our observability platform, enabling engineers to easily build reliable and performant applications.
+ Proactively improve system architecture and build software-based solutions to reduce toil, minimize incidents, and automate remediation.
**Required Education, Experience/Skills/Training:**
Basic Qualifications
+ Minimum 3 years of professional experience
+ Experience in a DevOps or SRE role.
+ Experience with IaC
+ Experience with incident response
+ Experience with containerization
+ Experience with CI/CD tools
+ Experience programming in Java or a JVM language
+ Experience working on cross-team projects.
+ An ability to work both independently and collaboratively
+ Strong communication skills and a desire to share and learn
Required Education
+ Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related technical field.
The hiring range for this position in New York is $120,300 to $161,300 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.
**Job ID:** 10133879
**Location:** New York, New York
**Job Posting Company:** Disney Entertainment and ESPN Product & Technology
The Walt Disney Company and its Affiliated Companies are Equal Employment Opportunity employers and welcome all job seekers including individuals with disabilities and veterans with disabilities. If you have a disability and believe you need a reasonable accommodation in order to search for a job opening or apply for a position, email Candidate.Accommodations@Disney.com with your request. This email address is not for general employment inquiries or correspondence. We will only respond to those requests that are related to the accessibility of the online application system due to a disability.
Site Reliability Engineer
Reliability engineer job in New York, NY
Role Roadmap
We are building a next-generation financial ecosystem (think NYSE or CME from scratch). We are a small team, which means your responsibilities scale very rapidly, and your contributions are clear and visible, not marginal. There is still a lot of green field at Kalshi and a lot of it (including entire systems) can be yours.
What you'll do
Improve observability, reliability and availability by defining and measuring key metrics.
Build automation and improve systems to eliminate toil and operations work.
Collaborate with our core infrastructure team to performance tune and optimize our cloud deployments. (Think Docker, Terraform, Kubernetes, EC2, etc.)
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems across our business units and stack, then design and implement software to create step-function improvements.
Educate, mentor and hold accountable the engineering team to improve the reliability of our systems and make reliability a core value of the Kalshi engineering culture.
Write high quality, well tested code to meet the needs of your customers.
Debug extremely difficult technical problems, and make systems and products both work better and easier to deploy, own, operate and diagnose.
Review all feature designs within your product area and across the company for cross-cutting projects.
Be an owner of the security, safety, scale, operational integrity, and architectural clarity of these designs.
Build integrations with 3rd party vendors.
Participate in an on-call support rotation to provide timely troubleshooting and resolution of urgent issues.
What we're looking for
Attributes:
You have at least 4 years of experience in software engineering.
You've designed, built, scaled and maintained production services, and know how to compose a service oriented architecture.
You write high quality, well tested code to meet the needs of your customers.
You're passionate about building an open financial system that brings the world together.
You possess strong technical skills for system design and coding.
Excellent written and verbal communication skills, and a bias toward open, transparent cultural practices.
Strong skills around observability, debugging and performance tuning.
Strong interpersonal skills working with engineers from junior to principal levels
Demonstrated critical thinking under pressure.
A willingness to dive into understanding, debugging, and improving any layer of the stack.
On-call availability to ensure swift resolution of issues.
Bonus points
Experience designing and building reliable systems capable of handling high throughput and low latency.
Experience with Datadog.
Experience with Rust, Go and Terraform.
Experience with AWS, GCP, or Azure.
Experience working in a highly regulated environment.
Experience writing company-facing blog posts and training materials.
Our Culture
Meritocracy is at our core, and we value people who take ownership and figure (usually hard) things out. We dream big. We love our craft deeply and are proud of what we put out in the world. We are committed to our vision of building something big… but also useful: a product that brings more truth through the power of markets.
Kalshians are Kalshi's most important asset: we pick Kalshians carefully, so we trust them fully on day 1.
NYC Pay Transparency Disclosure:
Salary Range: $100,000 to $250,000 annually plus equity and benefits.
This salary range is based on the current available market data and represents the expected salary range for this role. Kalshi has minimal hierarchy and few titles, but a broad range of experience is represented within roles. Should you have compensation expectations that exceed these bands, we'd love to hear from you and would welcome you to reach out to discuss further.
Commitment to Equal Opportunity
Kalshi is committed to creating a culture of inclusion and belonging, and we are proud to be an equal opportunity employer. We believe it is our collective responsibility to uphold these values and encourage candidates from all backgrounds to join us in our mission. All qualified applicants will be treated with respect and receive equal consideration for employment without regard to race, color, creed, religion, sex, gender identity, sexual orientation, national origin, disability, uniform service, veteran status, age, or any other protected characteristic per federal, state, or local law. If you are passionate about what you do and want to use your talents to support our mission and values, we'd love to hear from you.
Staff Site Reliability Engineer
Reliability engineer job in New York, NY
Ro is a direct-to-patient healthcare company with a mission of helping patients achieve their health goals by delivering the easiest, most effective care possible. Ro is the only company to offer nationwide telehealth, labs, and pharmacy services. This is enabled by Ro's vertically integrated platform that helps patients achieve their goals through a convenient, end-to-end healthcare experience spanning from diagnosis, to delivery of medication, to ongoing care. Since 2017, Ro has helped millions of patients, including one in every county in the United States, and in 98% of primary care deserts.
Ro has been recognized as a Fortune Best Workplace in New York and Health Care for four consecutive years (2021-2024). In 2023, Ro was also named Best Workplace for Parents for the third year in a row. In 2022, Ro was listed as a CNBC Disruptor 50.
The Role:
At Ro, our mission is to provide world-class healthcare by putting patients first - and that mission depends on reliable, secure, and scalable systems. As a Staff SRE on the infrastructure team, you'll sit at the core of that effort: owning the reliability of our production systems, hardening infrastructure and building tools that empower our engineers to ship safely and confidently.
You will work across teams to drive uptime, performance and observability - partnering closely with product, platform and security engineers.
From designing resilient systems to shaping incident response practices, this is a role for engineers who thrive on impact and care deeply about operational excellence.
What You'll Do:
Design and implement resilient infrastructure to support high availability at scale
Build and contribute to tools and platforms that streamline deployment, monitoring and recovery of systems
Drive incident response and harness learnings, leading efforts to minimize downtime and improve MTTR
Partner with engineering teams to bake best practices for reliability, resilience and observability into services
Automate infrastructure workflows using IaC and other cloud native tools
Champion a culture of operational excellence, guiding engineers through reliability practices and raising the bar across the engineering org
What You'll Bring to the Team:
Deep understanding of systems and infrastructure, with experience operating distributed services in production. We are mostly in AWS and leverage a lot of its primitives - EKS, RDS, Route 53, S3, and ElastiCache, to name a few.
Strong programming and automation skills using Go (bonus points for Python)
Proficiency with infrastructure as code - Terraform / Pulumi
A passion for observability, with hands-on experience in metrics, logging, tracing using Datadog
Strong cross-functional communication, able to collaborate with product, platform, security and other teams
An operational mindset that puts reliability and resilience as a core product requirement
A mission-driven attitude, motivated by the opportunity to make healthcare better.
We've Got You Covered:
Full medical, dental, and vision insurance + OneMedical membership
Healthcare and Dependent Care FSA
401(k) with company match
Flexible PTO
Wellbeing + Learning & Growth reimbursements
Paid parental leave + Fertility benefits
Pet insurance
Student loan refinancing
Virtual resources for mindfulness, counseling, and fitness
The target base salary for this position ranges from $202,000 to $243,000, in addition to a competitive equity and benefits package (as applicable). When determining compensation, we analyze and carefully consider several factors, including location, job-related knowledge, skills and experience. These considerations may cause your compensation to vary.
Ro recognizes the power of in-person collaboration, while supporting the flexibility to work anywhere in the United States. For our Ro'ers in the tri-state (NY) area, you will join us at HQ on Tuesdays and Thursdays. For those outside of the tri-state area, you will be able to join in-person collaborations throughout the year (i.e., during team on-sites).
At Ro, we believe that our diverse perspectives are our biggest strengths - and that embracing them will create real change in healthcare. As an equal opportunity employer, we provide equal opportunity in all aspects of employment, including recruiting, hiring, compensation, training and promotion, termination, and any other terms and conditions of employment without regard to race, ethnicity, color, religion, sex, sexual orientation, gender identity, gender expression, familial status, age, disability and/or any other legally protected classification protected by federal, state, or local law.
Site Reliability Engineer
Reliability engineer job in New York, NY
The Company
Cape was founded in early 2022 by Palantir and Anduril alums with deep expertise in privacy and national security. While running Palantir's US national security business, our CEO became passionate about privacy and security on mobile devices. Our mission is to be a force for good in global wireless.
At Cape, we are not just another cellular service provider; we are the architects of a privacy-centric movement that starts with the devices in your pocket. We are building a cellular network that helps citizens, including those responsible for our nation's security, regain control of their own data.
We believe that where we are, where we go, and whom we are with are among our most personal information and should be kept private. Privacy is not something you achieve by limiting yourself or by doing less; it is a set of features to be built so you can do more. We have raised money from Andreessen Horowitz and other top-tier VCs, and are excited to grow the team.
The Team
We are relentless builders, constantly pushing the boundaries of what's possible and bringing to life ideas that have never before existed. Innovation is at the core of everything we do. At Cape, we trust our team to deliver greatness and empower them to make a profound impact. As a member of our team, you will collaborate seamlessly with our diverse group of talented engineers and other team members, enjoying dynamic interactions with colleagues from across the organization.
The Role
To join our team, you should be excited to:
* Dive into a well-funded but early-stage startup. We're in a scrappy phase, so be comfortable getting a little uncomfortable.
* Reclaim some of the personal privacy we have all sacrificed as smartphone adoption has grown.
* Flex your technical skills on hard, important problems with serious implications for consumer privacy and national security.
* Push the envelope - we are using new technology in novel ways.
* Work on greenfield problems: start new projects from the ground up, shape the stack and practices, and get the opportunity to try new tools and technologies.
* Work in person! There are no facetime requirements or set hours here, and we all take work from home days. But our default work location is our DC or NY office, and we enjoy the informal culture and serendipity that in-person work enables.
We're offering competitive salary, benefits, and equity with early-stage upside.
What you'll do
* Be responsible for the full lifecycle development of our privacy-focused telecommunications and deployment infrastructure.
* Build, integrate, and maintain our instrumentation and monitoring infrastructure and tooling for improving the reliability, availability, and performance of our system.
* Help solve problems proactively before they become incidents.
* Build new or integrate with existing telecommunications infrastructure and components.
* Own the technical accreditation and compliance process end-to-end for FedRAMP.
* Shape and influence what great software engineering practices look like.
* Balance short term critical business needs with long term product vision and roadmap.
Qualifications
Although we list out what we generally look for, we are likely missing other attributes and skills that you have that could make you a great fit, but are not currently listed. It doesn't hurt to take a chance and apply!
Preferred
* 4+ years of software engineering or SRE experience.
* Strong familiarity with AWS.
* Fluency in Golang, Rust, Java/Kotlin, Python, or similar language.
* Experience with building, deploying, and using monitoring infrastructure & tools.
* Experience designing, building, and delivering high availability systems and infrastructure.
* Passion for privacy and national security.
* A desire to work on software that has real-world impact.
Nice to have
* Familiarity with Azure and/or GCP in addition to AWS.
* Familiarity with mobile telecommunications technologies such as IP media subsystem (IMS), 4G/5G mobile core network functions, and/or multimedia protocols.
* Experience managing redundant, high availability multi-site deployments.
* Experience organizing documentation to support accreditation processes such as FedRAMP and ATOs.
The salary range for this role is $150,000-$230,000 a year + equity + 401K match. Within the range, individual pay is determined by location, experience, relevant education, and/or training.
Our Culture
* We are builders, and we choose to spend our time building things that matter. Many of our people have backgrounds in Defense Tech as well as the defense and intelligence community. We build to win.
* We hire excellent people, give them outsized responsibility, and trust them to execute at a high level. Everyone here has a track record of solving hard problems throughout their careers.
* We believe that personal privacy and national security interests are not inherently at odds, and can be reconciled via strong technology.
* We believe that companies exist to build awesome things and take care of their people. Our benefits reflect that: top-tier health care, 401(k) matching, and a generous vacation policy (that we actually use).
* We hire candidates of any race, color, ancestry, religion, sex, national origin, sexual orientation, gender identity, age, marital or family status, disability, Veteran status, and any other status. Achieving diversity across these categories will serve to make our company stronger and our product better.
How to apply
Click the link below to apply.
We reserve the right to make use of any unsolicited resumes received from outside recruiting agencies and / or individual recruiters without being responsible for payment of any fees asserted from the use of unsolicited resumes.
Staff Site Reliability Engineer
Reliability engineer job in New York, NY
AI can be a powerful tool for good in the world - at Altana we apply AI to the world's largest organized body of supply chain data to power a more resilient, more secure, and more sustainable model of global commerce. Our customers connect to the Altana network to build resilience for critical industries and infrastructure, automate and safeguard cross-border trade, transform insurance underwriting, protect national security, combat modern slave labor, disrupt fentanyl trafficking, and ensure that their products are sustainable.
Altana is backed by leading investors and used by the world's most important organizations, including Lloyd's, Maersk, multiple government agencies across the US, UK, EU, Singapore, and Australia, General Atomics, Boston Scientific, and more. We are building a global platform connecting the public and private sectors into an AI-powered network for building trusted supply chains. We operate in accordance with our values: we focus on value creation, not capture; we foster diversity and embrace difference; we embrace reality; we get things done; we amaze our clients. When you join Altana, you'll be joining a vibrant, collaborative team working together to solve complex problems with the potential for global societal impact.
The Opportunity at Altana
At Altana, we believe that software that ships must be reliable and efficient. As a Staff Site Reliability Engineer, you will be instrumental in ensuring the availability, performance, and scalability of Altana's critical production services, with a strong focus on our cloud-native environments and data pipelines. You will apply Google-style SRE principles, embedding reliability into our architecture and operations through automation, proactive monitoring, and a commitment to reducing toil.
You will work hands-on with engineering teams, influencing system design for operability and contributing to the development of robust, self-healing infrastructure. This role emphasizes a deep understanding of observability practices to gain comprehensive insights into system behavior, proactive incident prevention, and efficient incident response. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency and reliability.
Your Responsibilities
Reliability Engineering: Champion and implement SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services. Drive initiatives to improve system reliability, availability, performance, and efficiency.
Observability & Monitoring: Design, implement, and maintain advanced monitoring, logging, and tracing solutions for our cloud-native applications and infrastructure (e.g., Kubernetes, microservices). Develop dashboards, alerts, and runbooks that provide deep insights into system health and behavior.
Automation & Toil Reduction: Identify and automate repetitive operational tasks and manual processes across our production environment. Develop tools and scripts to enhance system operations, deployment pipelines, and incident response.
Incident Management & Postmortems: Actively participate in the incident response lifecycle, including detection, triage, mitigation, and resolution of production issues. Lead thorough blameless postmortems to identify root causes and implement preventative measures and lasting improvements.
System Design & Optimization: Collaborate closely with development teams to influence the design of new services, ensuring they are built for operability, reliability, and cost-efficiency. Proactively identify and address performance bottlenecks and architectural weaknesses.
On-Call Rotation: Participate in a periodic on-call rotation, responding to critical alerts and ensuring rapid resolution of production incidents.
Data Reliability: Implement and maintain reliability and observability for critical data pipelines and data infrastructure, ensuring data integrity, availability, and timely processing.
About You
5+ years of hands-on experience in a Site Reliability Engineering (SRE), DevOps, or equivalent role focusing on production system reliability and operations.
Strong understanding and practical application of Site Reliability Engineering (SRE) principles, including SLOs, error budgets, toil reduction, and blameless culture.
Expertise in designing, implementing, and managing observability platforms for cloud-native environments (e.g., Prometheus, Grafana, Datadog, ELK stack, OpenTelemetry, Jaeger).
Proficiency in at least one programming/scripting language (e.g., Python, Go) for automation and tool development.
Extensive hands-on experience with cloud platforms (AWS, Azure, or GCP), including their compute, networking, and database services.
Demonstrated experience with containerization technologies (Docker) and container orchestration platforms (Kubernetes).
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, OpenTofu, CloudFormation) for managing cloud resources.
Proven experience participating in and improving incident management processes for critical systems.
Knowledge of modern software delivery paradigms, including microservices architectures and CI/CD pipelines.
Excellent problem-solving, analytical, and troubleshooting skills in complex distributed systems.
Strong communication and collaboration skills, with the ability to work effectively across engineering teams.
Experience with data engineering concepts, including building or operating reliable data pipelines, data streaming technologies, or managing large-scale data infrastructure.
This role can be based in New York City or the San Francisco Bay Area, with an expectation of occasional travel as needed.
US Salary Range and Benefits
$170,000 - $220,000
The salary range, to the extent specified for this role, is a good faith statement of the minimum and maximum levels of the annual base salary for the position. The base salary offered to a successful candidate will depend on a wide range of compensation factors, including, but not limited to, work experience, education and/or training, critical skills, and/or business considerations. Competitive equity grants are included in the majority of full-time offers and are considered part of Altana's total compensation package. Altana also offers either a discretionary bonus or a variable compensation plan depending on the role. Additionally, Altana offers top-tier benefits for full-time employees, including:
Flexible Time Off: Altana operates with a Flexible Time Off (FTO) policy that gives you agency over your own time off so you can maximize your work-life balance.
Parental Leave: We offer industry-leading Paid Parental Leave (PPL), providing 14 weeks of leave for non-birthing, adoptive, and foster parents and up to 26 weeks of leave for birthing parents, all paid at 100% of your base salary.
Health Benefits: We have a full suite of medical, vision, and dental benefits with generous employer contributions, designed to give you flexibility and choice for your individual health situation. Our high deductible health plan is 100% employer paid for employees and supplemented with an employer contribution to your Health Savings Account (HSA). There is also a Flexible Spending Account (FSA) option.
Supplemental Benefits: Altana provides life, short- and long-term disability, and AD&D insurance coverage, all at no cost to you, so you know that you and your loved ones are covered in case of an emergency.
401(k) Savings: Save for and invest in your future using our Guideline 401(k) retirement savings program.
Commuter Benefits: Save money on your commute by setting aside pre-tax funds for public transit or parking!
Wellness: Because we value mental and emotional health, every Altana employee has access to a free premium subscription to Calm, the #1 app for meditation, sleep, and mindfulness.
Pet Insurance: Pets are family too! Keep them healthy with Wishbone insurance and / or our Total Pet vet service and telehealth discount plan.
Employee Assistance Program: Free access to confidential personal support.
Dependent Care FSA: You will have access to a Dependent Care FSA, which allows you to set aside pre-tax funds for childcare expenses.
The recruiter assigned to this role can share more information about the specific compensation and benefit details associated with this role during the hiring process.
Equal Opportunity Statement
At Altana, we believe that a diverse workforce enables greater creativity, performance, and adaptability. We're proud to be an equal opportunity employer and welcome you to join us as you are. Our employment opportunities and decisions are based on business needs and individual qualifications, without regard to race, color, religious creed, national origin, ancestry, age, physical or mental disability, medical condition, marital status, sexual orientation, gender identity or expression, genetic information, family care or medical leave status, military or veteran status, or any other characteristic protected by the laws or regulations in the areas in which we operate. We prohibit discrimination and harassment of any type, in any situation.
Offers related to employment at Altana will come from an Altana.ai email address. We will never ask for payment as part of the interview or onboarding process.
Why it's great to work at Altana
We love to collaborate, and we win as a team!
We are committed to engineering excellence
We value personal and professional development
We learn from diverse backgrounds and perspectives
We impact the world, from enabling developing countries to identifying drug traffickers
Lead Site Reliability Engineer
Reliability engineer job in New York, NY
Help us use technology to make a big green dent in the universe! Kraken powers some of the most innovative global developments in energy. We're a technology company focused on creating a smart, sustainable energy system. From optimising renewable generation, creating a more intelligent grid and enabling utilities to provide excellent customer experiences, our operating system for energy is transforming the industry around the world in a way that benefits everyone.
It's a really exciting time in energy. Help us make a real impact on shaping a better, more sustainable future.
Our Global Platform Engineering Reliability group is responsible for architecting, developing, and maintaining the resilient and scalable infrastructure that powers and supports our platforms.
As a Lead Site Reliability Engineer within the newly created 'Product Reliability' team, you'll be responsible for ensuring the availability, performance, and scalability of the products on our platform. Your proficiency in leading technical teams that support products serving millions of customers will ensure stability and high performance for our brands and clients.
You will keep up with best practices in building products for scale. Your communication skills and attention to detail will be indispensable as you pinpoint areas for enhancement, ensure optimal product performance, and continuously improve our platform's reliability and efficiency.
What you'll do:
Team leadership
Have ownership of the Product Reliability team within Platform, working closely with the Director and Heads of Platform Engineering to define strategic objectives and team direction
Manage team priorities and ensure initiatives are completed within deadlines
Collaborate regularly and effectively with the Staff Platform Engineer in your functional team to deliver the technical implementation of the team's strategic priorities
Lead delivery of major initiatives on clear timelines
Partner effectively in the wider Platform Engineering team to deliver outcomes
Build a strong culture of open communication where teammates can ask questions without fear, promoting a positive and inclusive team environment
People management
Line-manage the engineers in the Product Reliability team
Set clear performance expectations and goals for team members
Regularly review individual and team performance, offering actionable insights and constructive feedback to support and grow team members
Technical delivery
Deliver technical improvements such as small features and bug fixes
Support team delivery through code reviews, technology research and architectural guidance
Provide support for service offerings owned by your team
Help solve interesting and difficult problems. There's a great opportunity for disruption in the global energy market
What you'll have:
Excellent communication skills, working effectively with developers, product managers and other business stakeholders to understand and deliver impactful projects and reliability improvements
Record of successfully and consistently delivering critical path projects, on time and at scale
Meticulous organisation and planning skills
Experience of mentoring and coaching a team to perform at a high level of quality
Experience managing and supporting large-scale, internet-facing distributed systems serving millions of customers
Good experience with AWS and a programming language. We use a lot of different AWS services and not just the standard few
Knowledge of security best practices, security and CI/CD tooling, and methodologies
We're hiring this role in New York City, but would also consider remote candidates who are based in the EST timezone. We cannot consider any applicants outside this region
What will help:
Previous experience in leading technical delivery for small, highly-autonomous teams
Previous experience as a technical individual contributor, preferably as a Site Reliability Engineer
Track-record of effective collaboration with other teams and departments to drive holistic outcomes
A proactive, innovative mindset with the ability to drive continuous improvement
Previous experience working in a remote-first asynchronous global team
Familiarity with some of our tech stack:
- PostgreSQL, or a similar RDBMS, particularly in Amazon RDS at scale
- Docker and Kubernetes, we use Amazon EKS in production
- Python
- Datadog, or a similar logging/monitoring tool
- Messaging queues, event-driven async processing or similar technologies - we use RabbitMQ
- Terraform, or a similar infrastructure-as-code tool
- Experience with a Linux distribution
Why you'll love it here:
Great medical, dental, and vision insurance options including FSAs.
Paid time off - we know working hard also means being able to recharge as needed, and we trust our employees to get the work done and take the time they need.
401(k) plan with employer match.
Parental leave. Biological, adoptive and foster parents are all eligible.
Pre-tax commuter benefits.
Flexible working environment: you need to shift around your schedule? You do you, we genuinely believe in work/life balance.
Equity Options: every Octopus employee owns part of the business. We're a team, working together towards huge goals. Every person is crucial to our success, you should be rewarded as such.
Modern office or co-working spaces depending on location.
We hire a wide range of experience levels into our platform team. The salary for this role in the US ranges on average from $170,000 to $200,000 depending on relevant experience, role alignment, and performance throughout the interview process. While the broad salary range is listed, not all candidates will be placed at the top of the range; this will be determined by the overall fit for the position. If you have questions about this, just ask! Our recruiters are happy to provide more context.
We are hiring this role in our New York City office, but would also consider candidates who are remote within the East Coast region/timezone. We cannot consider candidates outside this region.
Kraken is a certified Great Place to Work in France, Germany, Spain, Japan and Australia. In the UK we are one of the Best Workplaces on Glassdoor with a score of 4.7. Check out our Welcome to the Jungle site (FR/EN) to learn more about our teams and culture.
Are you ready for a career with us? We want to ensure you have all the tools and environment you need to unleash your potential. If you have any specific accommodations or a unique preference, please contact us at [email protected] and we'll do what we can to customise your interview process for comfort and maximum magic!
Studies have shown that some groups of people, like women, are less likely to apply to a role unless they meet 100% of the job requirements. Whoever you are, if you like one of our jobs, we encourage you to apply as you might just be the candidate we hire. Across Kraken, we're looking for genuinely decent people who are honest and empathetic. Our people are our strongest asset and the unique skills and perspectives people bring to the team are the driving force of our success. As an equal opportunity employer, we do not discriminate on the basis of any protected attribute. We consider all applicants without regard to race, colour, religion, national origin, age, sex, gender identity or expression, sexual orientation, marital or veteran status, disability, or any other legally protected status. U.S. based candidates can learn more about their EEO rights here.
Our (i) Applicant and Candidate Privacy Notice and Artificial Intelligence (AI) Notice, (ii) Website Privacy Notice and (iii) Cookie Notice govern the collection and use of your personal data in connection with your application and use of our website. These policies explain how we handle your data and outline your rights under applicable laws, including, but not limited to, the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Depending on your location, you may have the right to access, correct, or delete your information, object to processing, or withdraw consent. By applying, you acknowledge that you've read, understood and consent to these terms.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Director, Site Reliability Engineer
Reliability engineer job in New York, NY
Who We Are:
Ordergroove is a dynamic, fast-paced environment where you will be involved in building something of real value from the ground-up. We're looking for bright, talented people who are excited about innovation, growth, and the exciting world of Relationship Commerce. If you're motivated by a desire to solve problems and deliver groundbreaking insights and solutions you'll fit in perfectly!
About the Role:
Ordergroove is looking for an extraordinarily talented, passionate and naturally curious person to join our Engineering Team. Our Engineers are problem solvers who get excited about pushing the boundaries of what we can achieve, love learning and thrive in a fast-paced, collaborative environment. As the Site Reliability Engineer Director, you will be joining our SRE team, whose primary goal is to infinitely scale and secure our cloud-based hosted platform and accelerate time to market of code deployments while supporting the biggest brands in the world.
What You Will Do:
You will define and execute the vision for continuous delivery, cloud deployment strategies, and operational excellence.
You will spearhead our reliability, scalability, and automation efforts to ensure our platform operates securely and efficiently.
Work closely with engineering, security, and QA teams to enhance deployment and release processes.
You will help us scale how we push high-quality code, securely and efficiently.
You will guide, mentor, and collaborate with an awesome group of highly passionate engineers.
About You:
8+ years working in DevOps, SRE, or similar capacity managing cloud/private or hybrid environments.
Passionate about automation, leveraging best-of-breed technologies, and eager to learn new skills.
Experience working with automation tools (we use Terraform, Ansible and Chef).
Experience with Kubernetes, Docker, and container orchestration best practices.
Comfortable with Linux systems administration (CentOS / RHEL / Debian).
Experience configuring CI/CD pipelines using Jenkins, GitHub Actions, or GitOps.
Fluent with Apache/Nginx configurations.
Comfortable with MySQL/MongoDB administration and scaling strategies.
Experience designing high-availability/fault-tolerant systems.
Comfortable working with development teams to understand their pain points and come up with creative solutions.
Excellent communication skills.
Service-oriented mentality.
Great critical thinking skills.
Ability to quickly adapt to change.
Bonus Points For:
Python or Go experience.
Managing GCP hosted environments.
Previous eCommerce experience.
Experience going through PCI1 and SOC2 compliance approval processes.
Start-up and SaaS experience strongly preferred.
Familiarity with monitoring tools such as Prometheus, Grafana, or New Relic.
Bachelor's or Master's degree in Computer Science, Engineering, or a related field preferred.
If you don't meet 100% of the qualifications outlined above - that's okay, nobody's perfect! We encourage you to apply if you think this is a role that would make you excited to come into work every day.
About Ordergroove:
Ordergroove powers recurring revenue for the world's largest and most innovative retailers including L'Oreal, Dollar Shave Club, La Colombe Coffee, Bonafide Health, BarkBox and more. As a direct result, more than 11% of adult Americans have a subscription powered by Ordergroove. Our technology makes seamless, one-of-a-kind subscriber and membership experiences possible to turn one-time transactions into profitable recurring customer relationships.
Ordergroove's powerful platform empowers merchants with highly customizable options such as flexible promotions, bundling, and analytics to bolster their bottom line while making customers' lives easier. We recently achieved a milestone year with 152% year over year new business growth, and were rated best-in-class subscription technology by CB Insights and eCommerce Platform of the Year by RetailTech Breakthrough Awards.
Our company values celebrate collaboration, different perspectives, and curiosity with the goal of getting to the right answer, no matter who came up with it. At Ordergroove we are committed to creating a welcoming and supportive environment for all people. We encourage people with different backgrounds and experiences to join our growing team so that we gain different perspectives and build the best team possible. We demand the best of ourselves and each other and never miss an opportunity to celebrate our successes.
With a fully flexible work from anywhere culture, staying connected and supporting each other are always top of mind. We build our tight-knit community through small group events like trivia night, cooking classes, and book clubs. We encourage cross-functional relationships through virtual coffees and we stay close to the business through weekly team updates and quarterly all-hands meetings.
At Ordergroove, we focus on flexibility and empowering our team to make the right decisions for themselves. We have flexible PTO and a totally remote (anywhere in the US) workforce, and an annual personal development budget that you use for what matters to you (wellness, career development, productivity at home, etc). And of course, that is on top of the basics like competitive compensation (including stock options) and incredible, affordable benefits. Come join our amazing team while we enable the fastest-growing segment of commerce that makes life easier for millions of consumers every day!
At Ordergroove, we want to hire, develop and retain the best talent, making Ordergroove a top destination to grow your career.
The pay transparency law is a way of narrowing the gender pay gap and fostering an engaged and positive working environment. It is also a way to share what we think is a reasonable, equitable and competitive compensation structure for the roles on our team.
The total compensation range for this role starts at $175,000.
Site Reliability Engineer - Capital Markets
Reliability engineer job in Jersey City, NJ
Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting Equity front-office trading applications and real-time risk and middle-office products developed and used for Equity Cash and ETS applications.
As part of the wider platform engineering team, you will be working closely with the Business users interactively throughout the day, along with technical, analysis and testing colleagues. Investigation and resolution of the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business and middle office areas.
Responsibilities:
Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users.
Build monitoring tools for application and infrastructure components.
Implement and manage scalable infrastructure using cloud-native technologies and tools.
Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
Partner with business, development and infrastructure teams to improve services through rigorous testing and release procedures.
Develop and maintain CI/CD pipelines to streamline deployment processes.
Expedient deployment of new systems. Capacity planning, Platform Management, and support for increasing volumes and business growth.
Create sustainable systems and services through automation.
Collaborate with Application team to establish and enforce production and development standards.
Document procedures, best practices and troubleshooting FAQs.
Resolve complex application and technical problems.
Debug the system and fix production-related issues.
Escalate and follow up on permanent fixes for development-related issues.
Lead incident response efforts and post-mortem analysis to prevent future occurrences.
Handle complex operational tasks and recommend process and technology changes.
Provide global support, including weekend availability to troubleshoot production-related issues and perform checkouts.
Ability to work both independently and in groups in an energetic, diverse environment.
Participate in on-call rotations to ensure 24/7 system availability and support.
Support compliance and legal queries.
Qualifications:
Strong experience in Windows and Linux/Unix services.
Strong experience in scripting languages such as PowerShell, Python, and SQL.
Strong Knowledge of monitoring tools - Nagios, Splunk, OTEL, Datadog
Strong Knowledge of FIX protocol
Strong Domain skills - Must have working experience in Capital Markets across modules and instruments especially - CASH, ETS, Bonds, Options, Futures, Swaps products
Experience in BFSI (Banking, Financial Services, and Insurance) domain applications with a proper understanding of the Trade Lifecycle.
Excellent communication, time management and project management skills.
Primary Location Full Time Salary Range of $175,000 - $200,000
Site Reliability Engineer III- Kafka Platform Engineering
Reliability engineer job in Jersey City, NJ
JobID: 210662270
JobSchedule: Full time
Base Pay/Salary: Jersey City, NJ $133,000.00-$185,000.00
There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within the Infrastructure Platforms, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities
* Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate.
* Demonstrates deep knowledge of Kafka technology, the Kafka Connect framework, and distributed systems technologies, with the ability to operate in and migrate across public and private clouds.
* Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines
* Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications
* Implements infrastructure, configuration, and network as code for the applications and platforms in your remit.
* Collaborates with technical experts, key stakeholders, and team members to resolve complex problems.
* Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers.
* Contributes to the development of technical documentation, including service APIs using Swagger, ensuring robust logging, auditability, security, and monitoring features.
* Supports the adoption of site reliability engineering best practices within your team.
* Engages in periodic on-call rotation shifts, providing client support and ensuring thorough monitoring of the platform.
Required qualifications, capabilities, and skills
* Formal training or certification on computer science and reliability concepts and 3+ years applied experience.
* Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or platform
* Proficient in at least one programming language such as Java/Spring Boot or Python.
* Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.)
* Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
* Experience with public cloud platforms like AWS, GCP or Azure.
* Experience with Kafka ecosystem products: Kafka, Kafka Connect, Kafka Streams.
* Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform.
* Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker.
* Familiarity with troubleshooting common networking technologies and issues.
* Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision
* Ability to proactively recognize roadblocks and demonstrate interest in learning technology that facilitates innovation
* Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
* Ability to initiate and implement ideas to solve business problems.
Preferred qualifications, capabilities, and skills
* Familiarity with running Apache Flink.
* Understanding of authentication and authorization technologies (e.g., OAuth, Kerberos).
* Experience with AWS cloud services and Kubernetes platform orchestration.
DevOps & Site Reliability Engineer - AWS/Terraform/PHP
Reliability engineer job in New York, NY
DevOps / Site Reliability Engineer (Remote)
Reports to: CTO, SportsRecruits
SportsRecruits is the leading sports recruiting network, connecting athletes, clubs, events, and college coaches in the recruiting process. The company's network and tools are trusted by sports organizations such as the IWLCA, IMLCA, NFHCA, and Junior Volleyball Association. Every year, millions of connections are made on the network, resulting in commitments to the best academic and athletic institutions.
SportsRecruits is part of IMG Academy, the world's leading sports education brand. IMG Academy provides a holistic education model that empowers student-athletes to win their future, preparing them for college and for life. IMG Academy provides growth opportunities for all student-athletes through an innovative suite of on-campus and online experiences:
Boarding school and camps, via a state-of-the-art campus in Bradenton, Fla.
Online coaching via IMG Academy+, with a focus on personal development through the lens of sport and performance
Online college recruiting, via NCSA and SportsRecruits, which provide unmatched college recruiting education and services to student-athletes and their families, club coaches, and event operators, and are the premier service for college coaches.
SportsRecruits is an equal opportunity employer and embraces diversity and equal opportunity on our team. Just like the student-athletes we support, we are trying to get better and stronger as a team every day. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We truly believe that the more inclusive our team is, the better we can serve all student-athletes, as well as their families and coaches, who are pursuing their dreams.
About the Team
We are a product development team full of fun, intelligent, happy, and hardworking engineers, designers and product managers distributed across the United States. We are scaling our network and building innovative tools to empower student athletes, college coaches, and event operators. Our tools are built on top of technologies that span mobile and web applications, computer vision, and LLMs.
We're looking for a DevOps / Site Reliability Engineer to join our team. You will play a key role in ensuring that our systems are efficient, reliable, and scalable, while helping us improve developer productivity and application performance. You'll collaborate closely with developers, QA, product, and our cloud security engineer to streamline builds and deployments, maintain application infrastructure, and proactively solve issues before they impact our users.
Our stack includes:
Laravel + PHP8 backend APIs
Vue.js (v2 and v3) + Inertia.js + Tailwind frontend
React Native mobile applications
Python for internal tools and ML/LLM-based features
Infrastructure as code managed by Terraform
AWS ECS Fargate, AWS RDS, AWS ECS, SQS, MediaConvert, and more
Cloudflare DNS and workers
We emphasize performance, security, and maintainability, and we love solving problems that have real-world impact on student-athletes, coaches, and partners.
About the Position
What You'll Do
CI/CD & Deployments
Configure, manage, and improve Bitbucket pipelines for deploying our applications to staging and production.
Improve CI pipeline speed, reliability, and security in collaboration with our Cloud Security Engineer.
Assist developers and QA teams with deployments.
Work with Docker and AWS ECR for container builds and deployment workflows.
Monitoring & Incident Response
Review and investigate system issues flagged by Sentry, New Relic, and CloudWatch.
Monitor application performance, identify bottlenecks, and propose solutions.
Respond to production and staging issues, including database latency, unresponsive resources, or failed jobs.
Environment & Infrastructure Management
Maintain and support non-production environments used by developers and QA.
Maintain and improve AWS infrastructure and Terraform resources.
Perform updates and upgrades to AWS services as needed to ensure reliability and ability to scale.
Collaboration & Continuous Improvement
Partner with engineers to design systems that are scalable, observable, and resilient.
Work closely with our cloud security engineer to ensure secure configurations in CI/CD, AWS, and containerized workloads.
Contribute ideas and improvements to workflows, automation, and monitoring strategies.
About You
Must-Haves:
3+ years of experience in DevOps, SRE, or related engineering roles.
Strong experience configuring CI/CD pipelines (Bitbucket Pipelines, GitHub Actions, or similar).
Experience configuring, debugging and deploying PHP applications
Hands-on experience with Docker and AWS ECR for container builds and deployments.
Strong experience with AWS services (EC2, RDS, ECS, Lambda, etc.) and Terraform for infrastructure as code.
Familiarity with monitoring and observability tools such as New Relic, Sentry, CloudWatch, or similar.
Strong troubleshooting skills for debugging performance issues in databases, applications, and distributed systems.
Experience with modern software development workflows (agile teams, code reviews, branching strategies).
Strong scripting and automation skills (Bash, Python, or similar).
Excellent communication skills and a collaborative mindset.
Nice-to-Haves:
Experience with ECS orchestration.
Familiarity with PHP/Laravel or JavaScript/React/Vue applications.
Previous experience supporting high-traffic SaaS platforms.
Laravel, Vue, or TailwindCSS experience
Familiarity with containerized deployments (Docker, ECS, etc.)
Experience working with 3rd-party APIs and async job queues (SQS, Redis)
Knowledge of AI tooling, LLM integration, or computer vision
Why Join Us?
Meaningful Work: Help shape a platform that impacts thousands of student-athletes' futures.
Modern Stack: Work with Laravel, Vue, React Native, Python, and AWS, backed by great tooling and infrastructure.
Growth-Oriented Culture: We prioritize learning, experimentation, and continuous improvement.
Remote Flexibility: We're a distributed team with asynchronous workflows and clear communication practices.
Benefits & Compensation
Competitive salary: $100,000 - $140,000 per year
Remote-first team culture
Health, dental, and vision coverage
401(k) with company match
Unlimited vacation policy
Site Reliability Engineer
Reliability engineer job in New York, NY
The AI platform for investors and bankers that generates alpha and drives upside.
Founded in 2020 by George Sivulka and backed by Peter Thiel and Andreessen Horowitz, Hebbia powers investment decisions for BlackRock, KKR, Carlyle, Centerview, and 40% of the world's largest asset managers. Our flagship product, Matrix, delivers industry-leading accuracy, speed, and transparency in AI-driven analysis. It is trusted to help manage over $15 trillion in assets globally.
We deliver the intelligence that gives finance professionals a definitive edge. Our AI uncovers signals no human could see, surfaces hidden opportunities, and accelerates decisions with unmatched speed and conviction. We do not just streamline workflows. We transform how capital is deployed, how risk is managed, and how value is created across markets.
Hebbia is not a tool. Hebbia is the competitive advantage that drives performance, alpha, and market leadership.
The Role
As a highly skilled Site Reliability Engineer (SRE), you will contribute to building systems that optimize the uptime and reliability of our platform, and support the management and optimization of our DevOps and infrastructure operations. You will be responsible for owning our deployment pipelines, building and maintaining our continuous integration and continuous deployment (CI/CD) systems, ensuring the reliability and performance of our services, enhancing our observability, supporting our local development environments, and bolstering our security posture. Your technical expertise and problem-solving skills will contribute to the success of our AI products and shape the future of our technology stack.
Responsibilities
Assist in managing deployment pipelines to facilitate smooth and efficient software releases.
Help implement and maintain observability solutions for monitoring system performance and reliability.
Support local development environments to optimize developer workflows.
Work with development teams to ensure infrastructure aligns with project requirements.
Contribute to improving the security of our infrastructure by assisting with proactive measures and audits.
Assist in developing and maintaining automation scripts and tools to enhance operational efficiency.
Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations.
Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure.
Who You Are
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years software development experience at a venture-backed startup or top technology firm.
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
Strong expertise in managing CI/CD pipelines and deployment automation.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.
Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar.
Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
Familiarity with security best practices and tools for infrastructure and application security.
Excellent problem-solving skills and the ability to troubleshoot complex issues.
Strong communication skills and the ability to work effectively in a collaborative environment.
A proactive and self-motivated approach to learning and adopting new technologies.
Passion for continuous improvement and operational excellence.
Compensation
The salary range for this role is $160,000 to $300,000. This range may be inclusive of several career levels at Hebbia and will be narrowed during the interview process based on the candidate's experience and qualifications. Adjustments outside of this range may be considered for candidates whose qualifications significantly differ from those outlined in the job description.
Life @ Hebbia
PTO: Unlimited
Insurance: Medical + Dental + Vision + 401K
Eats: Catered lunch daily + DoorDash dinner credit if you ever need to stay late
Parental leave policy: 3 months non-birthing parent, 4 months for birthing parent
Fertility benefits: $15k lifetime benefit
New hire equity grant: competitive equity package with unmatched upside potential
#LI-Onsite
Site Reliability Engineering
Reliability engineer job in New York, NY
Job Description
Forhyre is looking for engineers who can bring unique perspectives and innovative ideas to all areas of development and are interested in continuing to improve our platform through the ever-changing technology landscape.
To be successful in this role
You'll have the opportunity to design and implement major infrastructure components, systems, and developer-friendly capabilities to improve the availability, scalability, latency, and efficiency of our services
You will provide technical leadership to cross-functional engineering, infrastructure, and product teams, and evangelize cloud best practices while building a culture of reliability and observability
Engage in and improve the end-to-end lifecycle of software development, from inception and design through deployment, operation, and refinement of a highly distributed system running in public cloud
Serve as subject matter expert in an SRE mindset, best practices, and cloud-native principles
Scale systems sustainably through automation to improve reliability and velocity
Assist with all aspects of operational security and compliance
Run software performance analysis and system tuning
Design and implement tools to collect data from various sources and provide actionable insights
Participate in critical incident management and timely post-mortems of production incidents to drive practices around blameless analysis, resolution, and continuous improvement work with cross-functional teams
Develop the rest of the team by conducting code reviews, providing mentorship, pairing, and training opportunities
Qualification & Skills
We are looking for a Principal SRE with proven experience in running distributed systems at scale, in production
You have 15+ years of experience in relevant skills gained and developed in the same or similar role
Strong knowledge of container orchestration, preferably Kubernetes and networking technology
Hands-on experience in one or more languages, such as Node JS, Python, Go, Perl, Ruby, and Bash
Experience with SOA, Microservices architecture, API Management & Enterprise system Integrations
Strong production experience with cloud infrastructure, AWS, Azure & Google Cloud
Strong sense of ownership, and an ability to drive tasks to completion
Experience developing and monitoring distributed systems
Experience working in an Agile Environment with great collaboration skills
Site Reliability Engineer
Reliability engineer job in New York, NY
About Clay
Clay is a creative tool for growth. Our mission is to help businesses grow - without huge investments in tooling or manual labor. We're already helping over 100,000 people grow their business with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly translate any idea that you have for growing your company into reality.
We believe that modern GTM teams win by finding GTM alpha: a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We're looking for sharp, low-ego people to help teams find their GTM alpha.
Why is Clay the best place to work?
Customers love the product (100K+ users and growing)
We're growing a lot (6x YoY last year, and 10x YoY the two years before that)
Incredible culture (our customers keep applying to work here)
Well-resourced - We raised a $100M Series C in 2025 at a $3.1B valuation and are backed by world-class investors like Capital G (Google), Sequoia and Meritech
Read more about why people love working at Clay here and explore our wall of love to learn more about the product.
SRE @ Clay
In this role, you'll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We're looking for someone who's excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you'll need to be comfortable taking on a variety of roles.
What You'll Do
Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
Participate in an oncall rotation.
Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.
What You'll Bring
5+ years of experience
Experience with containerization and orchestration tools
Strong understanding of CI/CD concepts and tools
Knowledge of infrastructure automation tools
Experience with oncall and incident response
Proficiency in one or more programming languages
Familiarity with our stack or ability to learn unfamiliar technologies quickly:
Aurora Postgres RDS, Elasticache Redis, Docker + ECS, Lambda, OpenSearch
Terraform and Atlantis
CircleCI, Netlify, Playwright
Cloudwatch, Datadog, Mezmo
Typescript, Python
Site Reliability Engineer
Reliability engineer job in New York, NY
Our client, a boutique US hedge fund, is seeking an SRE who will sit at the center of trading operations and infrastructure to ensure the automation and delivery of productive and efficient trading systems. This is a newly created position that provides autonomy to identify gaps and improve the whole life cycle of services, from inception and design through deployment, operation, and refinement of our trading systems. The ideal candidate will demonstrate ownership of the DevOps processes through a systematic problem-solving approach and a desire to build robust, cutting-edge, and scalable systems. As an SRE, you will:
Have a deep understanding of the trading workflow, ensuring its effectiveness across teams.
Assist in automating the release and deployment of software.
Streamline the build process.
Work in cluster environments.
Monitor and manage trading workflow, research and trading.
Build, streamline, organize and design framework.
Requirements
5+ years of experience in a DevOps/Cluster Engineer role.
Bachelor's Degree in Computer Science.
Knowledge of the Linux operating system, permissions, NFS.
Effective verbal and written communication skills.
Experience managing entire pipelines, working with tools such as Jenkins, Airflow and Ansible.
Highly productive in Python and Bash.
Experience in distributed/parallelized computing, clusters running computational jobs, and developing systems in this area.
Experience in data recording, storage, and maintenance (backups, redundancy, compression and archiving).
Knowledge of CMake.
Knowledge of the continuous integration and deployment of code as well as the development toolchain.
Hardware and software expertise.
Strong problem solving aptitude.
Additional skills/experience that will reflect favorably
Experience installing/configuring the ELK stack, including Logstash input/output plugins.
Experience building Grafana/Kibana dashboards.
Familiarity with Jupyter notebooks and Docker.
Trading operations experience, build environment experience, development experience in Python and C++ preferred.
Production environment monitoring and maintenance (diagnostics and troubleshooting live trading).
Cloud (AWS/GCP/Azure) and distributed computing (Slurm, Spark, etc.).
Thank you for illuminating hiring with Quanta Search!
Site Reliability Engineer (Python & Go)
Reliability engineer job in New York, NY
Site Reliability Engineer - (Linux & Python/Go)
New York, NY (Hybrid, 3 days in office)
Highly competitive compensation package
Join an elite technology and research group at the forefront of global finance, where world-class engineering and quantitative research converge to solve some of the most complex problems in any industry. Their teams are composed of passionate problem-solvers who operate in a dynamic, large-scale IT environment. We are seeking a visionary engineer to lead critical reliability and automation initiatives, ensuring the firm's complex trading and research platforms operate with maximum performance, scalability, and resilience.
The Role:
We are seeking a deeply experienced SRE to act as a Tech Lead for key infrastructure initiatives. This is a crucial, hands-on role for a hybrid systems and software engineer who thrives on solving complex problems at scale. You will be a key technical leader responsible for architecting and building the robust, automated systems that underpin the firm's critical operations. You will act as a force multiplier for the engineering organization by leading high-impact projects, mentoring other engineers, and setting the standard for technical excellence in reliability and performance.
Responsibilities:
Lead the design and execution of high-impact projects focused on improving the reliability, scalability, and performance of their core infrastructure.
Architect, build, and maintain mission-critical tools and automation in Python or Go to eliminate operational toil and enhance system capabilities.
Serve as a senior escalation point for complex Linux systems issues, diagnosing and resolving deep technical challenges related to performance, configuration, and stability.
Drive the architecture for scalable, resilient, and performant infrastructure, making key design decisions for production environments.
Mentor and guide other engineers, championing best practices in software development, infrastructure management and site reliability.
Your experience:
7+ years of experience in a senior site reliability, infrastructure, or software engineering role with a track record of success in complex, large-scale environments.
Expert-level proficiency in Python or Go, with a proven track record of engineering libraries, tools, or applications (not just scripting).
Deep, hands-on expertise with the Linux operating system, including performance tuning, troubleshooting, and systems administration in a large-scale environment.
Demonstrated experience leading technical projects, driving architectural decisions, and mentoring other engineers.
Strong knowledge of CI/CD, infrastructure-as-code (Ansible, Terraform), and containerization (Docker, Kubernetes).
Exceptional communication skills, with the ability to articulate complex technical concepts to a variety of audiences.
Lead Site Reliability Engineer
Reliability engineer job in Jersey City, NJ
Assume a critical role in defining the future of a globally recognized firm and have a direct and significant effect in a realm tailored for top achievers in site reliability.
As a Lead Site Reliability Engineer at JPMorgan Chase within Employee Platforms, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. Take lead and conduct resiliency design reviews, break up complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Job responsibilities
Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team
Leads initiatives to improve the reliability and stability of your team's applications and platforms using data-driven analytics to improve service levels
Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise
Acts as the main point of contact during major incidents for your application and demonstrates the skills to identify and solve issues quickly to avoid financial losses
Documents and shares knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
Formal training or certification on site reliability engineering concepts and 5+ years of applied experience
Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET)
Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
Proficiency and experience in observability such as white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform, etc.)
Experience with container and container orchestration (e.g., ECS, Kubernetes, Docker, etc.)
Experience with troubleshooting common networking technologies and issues
Ability to identify and solve problems related to complex data structures and algorithms
Drive to self-educate and evaluate new technology
Preferred qualifications, capabilities, and skills
Experience with Splunk, Azure, and Microsoft 365 infrastructure optimization, especially migrating from monolith to distributed services.
Experience with cloud infrastructure management.
Demonstrated achievements with automation of operational excellence, especially proactive monitoring.
Ability to teach new programming languages to team members
Ability to expand and collaborate across different levels and stakeholder groups
#LI-ID1
Reliability Engineer
Reliability engineer job in New York, NY
Mini-Circuits designs, manufactures and distributes integrated circuits, modules, and sub-systems for high-performance radio frequency (RF) and microwave applications. With design, sales and manufacturing locations in over 30 countries, Mini-Circuits' products are used in a range of wired and wireless communications applications. Our products are also used in detection, measurement and imaging applications, including military communication, guidance and electronic countermeasure systems, commercial, scientific, military land, sea and aircraft; automotive systems, medical systems, and industrial test equipment.
Mini-Circuits sells its products to over 20,000 customers globally through our direct sales force, applications engineering staff, sales representatives, as well as through our extensive website.
Position Summary: The Reliability Engineer is responsible for conducting reliability studies of existing products and coordinating new product qualification prior to market release. The candidate will work in collaboration with various teams including Reliability, Design Engineering, Product Engineering, Failure Analysis and Project Management teams.
Salary Range: $99,000 - $117,000 per year
Job Function:
Participate in the product development meetings and guide the team to develop reliable products that meet internal specifications and customer requirements.
Develop qualification plans for new products, primarily MMICs but also support other product lines including but not limited to Low Temperature Co-Fired Ceramics, PCBA products, RF accessories and Core & Wire Products.
Analyze new products for similarity with existing released products in terms of package, die process and design to determine Qualification by Similarity, thus streamlining qualification testing.
Design and execute both device level and package level qualification tests including but not limited to MSL pre-conditioning, Thermal cycling, UHAST, HTSL, ESD and Life Tests.
Define ESD Human Body Model (HBM) and Charged Device Model (CDM) tests as per JEDEC standards.
Collaborate with Engineering Test Teams to execute Accelerated Life Tests and High Temperature Operating Life Tests.
Execute Mechanical stresses such as Vibration, Mechanical Shock, Constant Acceleration & Bend Testing.
Co-ordinate with external labs for outsourced tests.
Review RF Test data before and after stresses to analyze changes in performance.
Collaborate with Failure Analysis teams to understand the root cause of failures.
Identify and record any non-conformities. Monitor solution implementations to verify effectiveness of corrective actions.
Ensure On Time Completion of Qualification activities and escalate any potential delays.
Present Qualification results with all relevant stakeholders to help Design teams initiate changes to improve reliability performance.
Prepare written reports summarizing the results of product performance and failure analysis for both internal purposes as well as customer review.
Interface with customers and suppliers on product reliability as required.
Interface with supplier to purchase lab equipment.
Support reliability assessments originating from production of released products or customer returns.
Makes decisions within area of specialty, manages medium to large projects.
Promotes ISO9001/AS9100 Quality.
The duties, responsibilities and expectations described above are not a comprehensive list and additional tasks may be assigned to the member, within the scope of the position.
Qualifications:
BS in Mechanical Engineering, Electrical Engineering, Materials, Reliability, Industrial Engineering or Physics. Advanced degree preferred.
3-5 years' experience as a Reliability Engineer in Semiconductor or equivalent industry.
Familiarity with common industry standards including JEDEC, MIL-STD-883, MIL-STD-202 and AEC-Q.
Experience with Reliability Qualification by Similarity.
Experience with Environmental, Mechanical and ESD stresses.
Experience with problem solving methodologies and leading root cause analysis.
Experience with customer returns failure analysis support.
Must have familiarity with failure analysis techniques including Scanning Acoustic Microscopy (SAM), Radiographic Inspection (X-Ray), Cross-Section methods.
Familiarity with MTTF, MTBF Calculations.
Experience with Reliability prediction modeling and tools like Weibull++ (or equivalent reliability software)
Experience with Data analysis tools including Advanced Excel, JMP, Minitab.
Ability to analyze component performance data in reliability tests, including large variety of test parts and multiple design variations.
Experience with Design of Experiments, FMEA, product design reviews and DFM.
Excellent written and oral communication skills.
Physical Demands:
The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. While performing the duties of this job, the employee is regularly required to talk and hear. The employee frequently is required to stand, walk, sit and use hands to operate a computer keyboard. The employee is occasionally required to reach with hands and arms. The employee must occasionally lift and/or move up to 10 pounds. Specific vision abilities required by this job include close vision, and ability to adjust focus. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
Additional Requirements/Skills:
Comply with, understand, and support corporate safety initiatives to ensure a safe work environment.
Ability and willingness to abide by Company's Code of Conduct.
Occasional travel, some overnight, as required (up to 10%).
Disclaimer: The listed qualifications and requirements for each position are intended as guidelines. Mini-Circuits reserves the right to hire outside of these guidelines at Management's discretion.
Mini-Circuits is an Equal Opportunity Employer and does not discriminate on the basis of actual or perceived age, race, creed, color, national origin, sexual orientation, military status, sex, disability, predisposing genetic characteristics, marital status, familial status, gender identity, gender dysphoria, pregnancy-related condition, and domestic violence victim status or protected class characteristic, or any other protected characteristic as established by federal or state law.