Reliability Engineer jobs at Tradeweb - 97 jobs

  • Site Reliability Engineer

    MIO Partners 4.5 company rating

    New York, NY jobs

    MIO Partners, Inc. (MIO) provides proprietary investment products to McKinsey's retirement plan and partners and offers independent, high-quality financial advice to McKinsey's partners. We manage a wide array of investment vehicles with significant expertise and a long and successful track record in alternative strategies, including hedge funds and private equity. We have a multibillion-dollar portfolio of assets under management, and we manage assets for and advise only McKinsey-related clients; we do not accept outside or third-party investments. MIO is a values-based organization that is strongly aligned with our investors' interests. MIO measures success as performance relative to a market-based benchmark. MIO, a 250+ person registered investment adviser, provides ample opportunities for somebody with an entrepreneurial drive to shine. We strive to meet the highest professional standards and build an organization that attracts, develops, and retains exceptional people. MIO is a wholly owned subsidiary of McKinsey, but our activities are kept entirely separate from those of the consulting Firm.

    Primary responsibilities

    The successful candidate will have extensive technical experience working with AWS cloud technologies, preferably for financial services firms, such as asset managers, hedge funds, and/or broker/dealers. The new hire must lead by example and work collaboratively to:

    * Design and maintain monitoring systems and dashboards
    * Architect and manage cloud infrastructure (AWS, Azure) with security, stability, and cost in mind
    * Implement CI/CD pipelines for reliable software delivery
    * Establish infrastructure as code practices using CDK, GitLab, AWS developer tools
    * Contribute to MIO application codebase to follow resiliency and performance best practices
    * Ensure application architectures follow cloud best practices for reliability, security, performance, and efficiency
    * Work with development teams to improve deployment processes and system reliability
    * Collaborate with business owners to translate business requirements into technical solutions with an eye toward technology consistency and best practices
    * Work with engineers, business users, and other stakeholders to understand their needs and ensure solutions align with business goals
    * Maintain detailed documentation for reference architectures, design patterns, and system configurations
    * Raise the bar on our development capabilities, standards, and processes
    * Synthesize requirements gathered from various teams within/outside of IT and suggest creative solutions; where appropriate, guiding MIO to “do it the right way”
    * Following a scrum methodology, organize with end users, business analysts, and other architects and developers
    * Recommend positive steps toward standardizing development processes, including technology selection, deployment steps, code reviews, and IT tools
    * Partner with development, QA, and AppSecOps teams to promote standardization, consistency, and improved security posture

    Our applications are primarily developed using Python/Django and libraries such as Pandas, NumPy, PL/SQL. In addition, we utilize SQL Server, MySQL, Elasticsearch, Redis, Kafka, Tableau, and various third-party APIs and data sources. Our applications are hosted in AWS using Docker containers on ECS/EC2 platforms.

    Primary responsibilities estimated percentage allocation

    * 25% Technology Leadership: design, mentoring
    * 15% Relationship Building: requirements
    * 60% Heads Down Development

    Desired background

    Please note applicants must be authorized to work in the U.S. without current or future visa sponsorship.

    * 8+ years of hands-on experience in DevOps, SRE, or platform engineering roles
    * Bachelor of science in computer science or other related discipline (although strong experience with a less directly related degree will be considered)
    * Strong experience in AWS Cloud technologies
    * Knowledge of CI/CD pipeline tools (GitLab pipelines, Jenkins, etc.)
    * Understanding of monitoring and observability tools (ELK, Dynatrace, Datadog, etc.)
    * Experience with microservices, serverless architectures, and containerization
    * Proficiency in AWS cloud platform including infrastructure-as-code and CI/CD pipelines
    * Formal problem-solving and/or analytical training/experience a plus, as is experience working with management consultants
    * Good intuition for end-user requirements gathering; iterative and collaborative approach to design
    * Strong client relationship management skills and excellent written/verbal communication skills to interact at all levels

    MIO Partners, Inc. (MIO) is an equal opportunity employer. MIO will consider all applicants regardless of race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status. MIO has adopted a flexible, hybrid model that supports a blend of in-office and remote work. Our office is in New York City.

    Certain US states require MIO Partners, Inc. to include a reasonable estimate of the salary range for this role. Actual salaries may vary and may be above or below the range based on various factors, including, but not limited to, an individual's assigned office location, experience, and expertise. Certain roles are also eligible for bonuses, subject to MIO's discretion and based on factors such as individual and/or organizational performance. Additionally, MIO offers a comprehensive benefits package, including medical, dental and vision coverage, telemedicine services, life, accident and disability insurance, parental leave and family planning benefits, caregiving resources, a generous retirement program, financial guidance, and paid time off.

    Base salary range: $175,000-$200,000 USD

    We are committed to protecting your privacy. Please review our Applicant Privacy Policy for a detailed explanation of how we collect, use, and protect your personal information.
    $175k-200k yearly Auto-Apply 58d ago
  • Site Reliability Engineer

    The Voleon Group 4.1 company rating

    Remote

    Voleon is a technology company that applies state-of-the-art AI and machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying AI/ML to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future. Your colleagues will include internationally recognized experts in artificial intelligence and machine learning research as well as highly experienced finance and technology professionals. The people who shape our company come from other backgrounds, including concert music performances, humanitarian aid, opera singing, sports writing, and BMX racing. You will be part of a team that loves to succeed together. In addition to our enriching and collegial working environment, we offer highly competitive compensation and benefits packages, technology talks by our experts, a beautiful modern office, daily catered lunches, and more.

    As a Site Reliability Engineer (SRE), you will work at the intersection of production operations and software development as you improve, manage, and monitor production-critical infrastructure and data pipelines. At Voleon, many SREs serve together on a Production Operations team tasked with improving shared production infrastructure. Others are embedded with teams of software engineers to improve specific production systems owned by those teams. Voleon SREs work on important real-world problems and collaborate with passionate and talented colleagues in an empowering, results-driven environment. This role is a way to make a real difference: your contributions will make our critical systems more reliable, lower operational risk, and increase the efficiency of our engineering effort.

    Responsibilities

    * Improve fault-tolerance and maintainability of code in proprietary data pipelines and trading systems
    * Diagnose and fix bugs in code
    * Lead complex deployments
    * Automate manual workflows
    * Track and prioritize outstanding production-related issues
    * Share an on-call rotation responding to incidents to ensure the continuous operation of production-critical systems

    Requirements

    * Experience with coding and debugging Python
    * Experience with Linux
    * Familiarity with Relational Databases & SQL
    * Sharp analytical and problem-solving skills and a persistent drive to make things work (better)
    * Strong growth mindset and a passion for learning
    * Strong technical communication skills
    * Attention to detail
    * 2 years of relevant industry experience
    * An undergraduate degree or comparable training in a quantitative field or equivalent, relevant industry experience

    Preferred Qualifications

    * Familiarity with best practices concerning code maintainability, documentation, quality assurance, continuous integration and deployment
    * Experience supporting production systems
    * Experience with any of the following: gRPC microservices, Postgres, Pandas, Golang, R, Git, Jenkins, Bazel, Prometheus, Grafana, Airflow, Kubernetes

    The base salary for this position is $120,000 to $160,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

    “Friends of Voleon” Candidate Referral Program

    If you have a great candidate in mind for this role and would like to have the potential to earn $7,500 - $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this form to submit your referral. For more details regarding eligibility, terms and conditions please make sure to review the Voleon Referral Bonus Program.

    Equal Opportunity Employer

    The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $120k-160k yearly Auto-Apply 50d ago
  • Principal Site Reliability Engineer - Remote

    Donnelley Financial Solutions 4.8 company rating

    Remote

    Join a dynamic team at the pulse of global markets, where we deliver innovative software and service solutions for essential financial reporting and capital markets transactions. At DFIN, we are a values-driven organization that empowers you to build a fulfilling career while bringing your authentic self to work every day. Our "Win as One" mentality ensures that our team's success is directly linked to Client, Shareholder and Employee Satisfaction. Recognized as one of AMERICA'S MOST LOVED WORKPLACES for five consecutive years and a Built In Best Places to Work for six years, we are committed to our employees' total well-being. Enjoy competitive compensation, a flexible workplace, comprehensive benefits, and opportunities for professional growth. Bring your passion and talents to DFIN - because being YOU thrives here.

    Summary:

    We are looking for technical team members at all levels who want to push themselves to deliver best-in-market SaaS solutions. We offer a challenging environment where you will have to grow, adapt and use your skills consistently. Our customers rely on us in the moments that matter. Engineering delivers on that promise.

    The Principal Site Reliability Engineer - Cloud is responsible for designing, building, securing, monitoring and maintaining our SaaS product cloud infrastructure so it is fast, cost effective, stable and optimized for our customers. SREs at DFIN take on availability, performance, managing change, monitoring, response and are guardians of non-functional requirements. You either have a SaaS cloud infrastructure background in Azure or AWS with a programmatic, automated mindset or are someone that comes with a software engineering background with SaaS cloud infrastructure experience in Azure or AWS. The SRE goal is to build automated systems that reduce or eliminate manual work to keep our products up and running and performing optimally. We are looking for someone who thrives on collaboration within the team and across other groups and can lead colleagues independently to deliver solutions to complex problems.

    Responsibilities:

    * Champion and implement a culture to maintain performant, reliable, secure, cost-effective platform cloud infrastructure in DFIN SaaS products based on operationalized processes you define
    * Champion security of our cloud infrastructure collaborating with Security and Governance teams and using static and dynamic tooling
    * Champion and implement application and cloud infrastructure monitoring and alerting to prevent client impacting issues by ensuring system availability, performance and scalability to maintain SLOs and SLAs
    * Optimize cloud infrastructure and application performance at scale while maintaining effective cost controls
    * Automate cloud infrastructure buildout and maintenance including system operational runbooks
    * Dive deep into technology and stay on the forefront of the latest tools, technologies, and strategies; help evaluate, prototype, and integrate them into operationalized work processes
    * Perform with broad independence and deliver on project milestones and tasks you define on schedule while communicating progress regularly
    * Build strong relationships with SRE team members and software engineering teams to hold each other accountable for quality expectations
    * Learn continuously and apply lessons learned
    * Evangelize best practices, eliminate bottlenecks, and improve process
    * Participate in on-call duties 24/7/365 and lead the triage and RCA of production incidents

    Qualifications:

    * 8+ years experience designing, building, securing, monitoring and maintaining cloud infrastructure in Azure or AWS
    * 5+ years experience creating, configuring, maintaining and monitoring Kubernetes clusters (AKS or EKS) in cloud infrastructure to optimize application performance and reliability
    * 5+ years building and deploying Infrastructure as Code with Terraform or similar technology
    * 5+ years experience with common cloud networking, firewall and load balancing configuration
    * 5+ years experience writing software in any modern software language such as C# .NET, Java
    * 5+ years experience creating automated deployments with tools such as Harness, Azure DevOps, Ansible or Jenkins to manage Infrastructure as Code and software build and deployment in a continuous integration (CI) / continuous delivery (CD) environment
    * 5+ years experience implementing production performance, availability, and scalability monitoring and alerting using a tool such as New Relic, Dynatrace, DataDog or AppDynamics
    * 5+ years experience supporting public client facing revenue generating systems
    * Experience monitoring and preventing issues with databases and database queries (SQL) using tools like Solarwinds Database Performance Analyzer, Idera SQL Diagnostic Manager, or Redgate SQL Monitor
    * Experience planning, coordinating, developing and executing all stages of post deployment verification test scripts
    * Experience securing Windows or Linux systems in 24x7 production environment
    * BS in Computer Science or equivalent work experience

    It is the policy of Donnelley Financial Solutions to select, place, and manage all its employees without discrimination based on race, color, national origin, gender, age, religion, actual or perceived disability, veteran status, actual or perceived sexual orientation, genetic information or any other protected status.

    If you are a qualified individual with a disability or a disabled veteran, you have the right to request a reasonable accommodation if you are unable or limited in your ability to use or access jobs.dfinsolutions.com as a result of your disability. You can request a reasonable accommodation by sending an email to ***********************************.

    At DFIN, protecting your identity is a top priority. Please be aware of scammers impersonating DFIN recruiters. DFIN recruiters will never request personal information via email or text. You will only receive a text from us if you've already been in contact. All automated messages will come from ***********************************. If you ever have doubts about the legitimacy of any communication from us, please do not hesitate to reach out for verification via *********************************** (this email is for general TA questions and is not used for updates on your application status).

    Job Segment: Cloud, Database, SQL, Testing, Linux, Technology
    $105k-156k yearly est. 30d ago
  • Electronic Trading Reliability Engineer

    Barclays Plc 4.6 company rating

    New York jobs

    Purpose of the role

    To apply software engineering techniques, automation, and best practices in incident response, to ensure the reliability, availability, and scalability of the systems, platforms, and technology through them.

    Accountabilities

    * Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
    * Resolution, analysis and response to system outages and disruptions, and implement measures to prevent similar incidents from recurring.
    * Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
    * Monitoring and optimisation of system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
    * Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations.
    * Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth.

    Assistant Vice President Expectations

    * To advise and influence decision making, contribute to policy development and take responsibility for operational effectiveness. Collaborate closely with other functions/business divisions.
    * Lead a team performing complex tasks, using well developed professional knowledge and skills to deliver on work that impacts the whole business function. Set objectives and coach employees in pursuit of those objectives, appraisal of performance relative to objectives and determination of reward outcomes.
    * If the position has leadership responsibilities, People Leaders are expected to demonstrate a clear set of leadership behaviours to create an environment for colleagues to thrive and deliver to a consistently excellent standard. The four LEAD behaviours are: L - Listen and be authentic, E - Energise and inspire, A - Align across the enterprise, D - Develop others.
    * OR for an individual contributor, they will lead collaborative assignments and guide team members through structured assignments, identify the need for the inclusion of other areas of specialisation to complete assignments. They will identify new directions for assignments and/or projects, identifying a combination of cross functional methodologies or practices to meet required outcomes.
    * Consult on complex issues; providing advice to People Leaders to support the resolution of escalated issues.
    * Identify ways to mitigate risk and develop new policies/procedures in support of the control and governance agenda.
    * Take ownership for managing risk and strengthening controls in relation to the work done.
    * Perform work that is closely related to that of other areas, which requires understanding of how areas coordinate and contribute to the achievement of the objectives of the organisation sub-function.
    * Collaborate with other areas of work, for business aligned support areas to keep up to speed with business activity and the business strategy.
    * Engage in complex analysis of data from multiple sources of information, internal and external sources such as procedures and practices (in other areas, teams, companies, etc.) to solve problems creatively and effectively.
    * Communicate complex information. 'Complex' information could include sensitive information or information that is difficult to communicate because of its content or its audience.
    * Influence or convince stakeholders to achieve outcomes.

    All colleagues will be expected to demonstrate the Barclays Values of Respect, Integrity, Service, Excellence and Stewardship - our moral compass, helping us do what we believe is right. They will also be expected to demonstrate the Barclays Mindset - to Empower, Challenge and Drive - the operating manual for how we behave.

    Embark on a transformative journey as an Electronic Trading Reliability Engineer. At Barclays, our vision is clear - to redefine the future of banking and help craft innovative solutions. In this role, you'll support and enhance electronic trading platforms that underpin critical business operations across the firm. Working within a small, collaborative team, you'll help ensure the stability and performance of fast-moving, low-latency trading systems, while also contributing to modernization initiatives shaping the next generation of our electronic trading infrastructure.

    To be successful as an Electronic Trading Reliability Engineer, you should have experience with:

    * Supporting electronic trading systems, including exchange gateways, trading algorithms, dark pools, and market data platforms, using industry protocols such as FIX across asset classes including Equities, Rates, Futures, or FX
    * Troubleshooting and supporting complex trading environments using Linux platforms, scripting, monitoring tools (e.g., ITRS), and observability practices
    * Developing and maintaining automation scripts to improve operational efficiency, system stability, and incident response

    Some other highly valued skills may include:

    * Programming experience in Java, Python, or C++ to extend automation capabilities and strengthen application reliability
    * Working with KDB to manage, analyze, and troubleshoot high-volume time-series trading data
    * Leveraging SRE and monitoring tooling such as Grafana or Elastic to enhance system observability and operational resilience

    You may be assessed on the key critical skills relevant for success in this role, such as risk and controls, change and transformation, business acumen, strategic thinking, digital and technology, as well as job-specific technical skills.

    This role is located in New York, NY.

    Minimum Salary: $120,000
    Maximum Salary: $175,000

    The minimum and maximum salary/rate information above includes only base salary or base hourly rate. It does not include any other type of compensation or benefits that may be available.
    $120k-175k yearly 7d ago
  • Data Reliability Engineer II

    Zeta 4.4 company rating

    Ridgefield, NJ jobs

    Job Description

    About us

    Build the future of banking.

    Zeta is a next-generation banking technology company providing cloud-native, fully stackable processing and core banking platforms for issuers. With a focus on scalability, compliance, and innovation, Zeta empowers financial institutions to modernize their technology infrastructure and deliver secure, seamless digital banking experiences. Our impact runs at real-world scale. Today, over 25 million cards are live on Zeta-powered platforms across 7 countries, supported by a passionate team of 1,700+ Zetanauts across India, the US, EMEA, and Asia. Backed by SoftBank Vision Fund, Mastercard, and other reputed strategic investors, we reached a valuation of $2 billion in 2025.

    Our focus is on establishing product lines that focus on key outcomes by addressing real customer pain points, modernizing legacy systems, and strengthening core fundamentals. As a result, our systems and platforms support a wide range of banking and payments capabilities, including:

    1. Tachyon, our cloud-native banking stack built for population-scale systems
    2. Cipher, our unified authentication platform for secure, high-volume banking environments
    3. Digital Credit as a Service, enabling banks to launch credit lines on UPI
    4. Elena, our intelligent and conversational AI platform for banking
    5. Pixel, India's first digital-native credit card, launched in partnership with HDFC Bank, for whom we also revamped their PayZapp mobile app: Winner of the Celent Model Bank Award for Payments Innovation 2024
    6. Sparrow, the leading card experience for non-prime cardholders in the US

    …and more across cards, payments, lending, and core banking.

    We are an engineering-first organization that values ownership, bias for action, and long-term thinking. Together, we solve some of the hardest problems in banking tech. Our culture is built around trust, collaboration, and creating the conditions for you to drive impact proportionate to your potential. Reinforcing our commitment to creating an inclusive and supportive workplace, we have been consistently recognized as a Great Place to Work. If you want to build cutting-edge banking tech that enables banks to serve millions reliably, securely, and at a population scale, Zeta is your playground.

    If you would like to learn more about how we have grown and evolved over the years, watch our journey here. You can also explore our website and follow us on LinkedIn, Instagram, YouTube, and X.

    Zeta is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We encourage applicants from all backgrounds, cultures, and communities to apply and believe that a diverse workforce is key to our success.

    Responsibilities

    * Proactively monitor PostgreSQL RDS instances for performance, availability, and resource utilization (CPU, memory, storage, connections) using established monitoring tools (e.g., CloudWatch, Prometheus).
    * Assist in identifying performance bottlenecks in PostgreSQL RDS. Apply basic performance tuning techniques like reviewing query execution plans, adding missing indexes, and recommending parameter adjustments.
    * Monitor the health and performance of Debezium and Kafka Connect connectors, identifying and troubleshooting basic issues related to data capture and delivery.
    * Monitor Apache NiFi data flows for errors, backpressure, and performance issues. Assist in troubleshooting and resolving common NiFi flow failures.
    * Provide support for data related issues and participate in root cause analysis.
    * Monitor the execution of Apache Airflow DAGs, identify failed tasks, and perform troubleshooting and re-runs.
    * Develop and maintain automation scripts and infrastructure as code (IaC) templates (e.g., using Crossplane, Terraform) to automate routine database tasks, deployments, and updates.
    * Participate in on-call rotations to respond to database-related incidents and perform troubleshooting and root cause analysis.
    * Assist in implementing and maintaining security best practices for cloud databases, including access controls, encryption, and compliance with regulatory requirements. Regularly audit and assess database security configurations.
    * Configure and manage database backup and recovery strategies to ensure data integrity and availability in case of failures or data loss.
    * Analyse database query performance and collaborate with developers to optimize SQL queries and schemas.
    * Participate in continuous improvement initiatives to enhance the reliability, scalability, and performance of cloud databases.
    * Assist in the design and optimization of database schemas for cloud environments.

    Skills

    * Familiarity with data pipeline concepts and technologies like Debezium, Kafka Connect, Apache NiFi.
    * Basic understanding of Amazon Redshift and S3.
    * Exposure to Apache Spark for data processing.
    * Basic understanding of Apache Airflow for workflow orchestration.
    * Strong SQL scripting skills for querying and basic data manipulation.
    * Familiarity with scripting languages (e.g., Python, Bash) is a plus.
    * Knowledge of database security best practices, including access controls, encryption, and compliance with regulatory requirements (e.g., GDPR, HIPAA).
    * Having 'AWS Certified Database - Specialty' certification is a plus.

    Experience and Qualifications

    * Bachelor's degree in Computer Science, Information Technology, or a related field.
    * 3-5 years of experience in database administration, with a focus on PostgreSQL.
    * 1-2 years of hands-on experience with PostgreSQL RDS.
    $141k-187k yearly est. 2d ago
  • Data Reliability Engineer II

    Zeta 4.4company rating

    Ridgefield, NJ jobs

    About us Build the future of banking.Zeta is a next-generation banking technology company providing cloud-native, fully stackable processing and core banking platforms for issuers. With a focus on scalability, compliance, and innovation, Zeta empowers financial institutions to modernize their technology infrastructure and deliver secure, seamless digital banking experiences. Our impact runs at real-world scale. Today, over 25 million cards are live on Zeta-powered platforms across 7 countries, supported by a passionate team of 1,700+ Zetanauts across India, the US, EMEA, and Asia. Backed by SoftBank Vision Fund, Mastercard, and other reputed strategic investors, we reached a valuation of $2 billion in 2025. Our focus is on establishing product lines that focus on key outcomes by addressing real customer pain points, modernizing legacy systems, and strengthening core fundamentals. As a result, our systems and platforms support a wide range of banking and payments capabilities, including:1. Tachyon, our cloud-native banking stack built for population-scale systems2. Cipher, our unified authentication platform for secure, high-volume banking environments3. Digital Credit as a Service, enabling banks to launch credit lines on UPI4. Elena, our intelligent and conversational AI platform for banking5. Pixel, India's first digital-native credit card, launched in partnership with HDFC Bank, for whom we also revamped their PayZapp mobile app: Winner of the Celent Model Bank Award for Payments Innovation 20246. Sparrow, the leading card experience for non-prime cardholders in the US …and more across cards, payments, lending, and core banking. We are an engineering-first organization that values ownership, bias for action, and long-term thinking. Together, we solve some of the hardest problems in banking tech. Our culture is built around trust, collaboration, and creating the conditions for you to drive impact proportionate to your potential. 
Reinforcing our commitment to creating an inclusive and supportive workplace, we have been consistently recognized as a Great Place to Work. If you want to build cutting-edge banking tech that enables banks to serve millions reliably, securely, and at a population scale, Zeta is your playground.If you would like to learn more about how we have grown and evolved over the years, watch our journey here. You can also explore our website and follow us on LinkedIn, Instagram,YouTube, and X. Zeta is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We encourage applicants from all backgrounds, cultures, and communities to apply and believe that a diverse workforce is key to our success.Responsibilities Proactively monitor PostgreSQL RDS instances for performance, availability, and resource utilization (CPU, memory, storage, connections) using established monitoring tools (e.g., CloudWatch, Prometheus). Assist in identifying performance bottlenecks in PostgreSQL RDS. Apply basic performance tuning techniques like reviewing query execution plans, adding missing indexes, and recommending parameter adjustments. Monitor the health and performance of Debezium and Kafka Connect connectors, identifying and troubleshooting basic issues related to data capture and delivery. Monitor Apache Nifi data flows for errors, backpressure, and performance issues. Assist in troubleshooting and resolving common Nifi flow failures. Provide support for data related issues and participate in root cause analysis. Monitor the execution of Apache Airflow DAGs, identify failed tasks, and troubleshooting and re-runs. Develop and maintain automation scripts and infrastructure as code (IAC) templates (e.g., using Crossplane, Terraform) to automate routine database tasks, deployments, and updates. Participate in on-call rotations to respond to database-related incidents and perform troubleshooting and root cause analysis. 
Assist in implementing and maintaining security best practices for cloud databases, including access controls, encryption, and compliance with regulatory requirements. Regularly audit and assess database security configurations. Configure and manage database backup and recovery strategies to ensure data integrity and availability in case of failures or data loss. Analyse database query performance and collaborate with developers to optimize SQL queries and schemas. Participate in continuous improvement initiatives to enhance the reliability, scalability, and performance of cloud databases. Assist in the design and optimization of database schemas for cloud environments. Skills: Familiarity with data pipeline concepts and technologies like Debezium, Kafka Connect, Apache NiFi. Basic understanding of Amazon Redshift and S3. Exposure to Apache Spark for data processing. Basic understanding of Apache Airflow for workflow orchestration. Strong SQL scripting skills for querying and basic data manipulation. Familiarity with scripting languages (e.g., Python, Bash) is a plus. Knowledge of database security best practices, including access controls, encryption, and compliance with regulatory requirements (e.g., GDPR, HIPAA). An 'AWS Certified Database - Specialty' certification is a plus. Experience and Qualifications: Bachelor's degree in Computer Science, Information Technology, or a related field. 3-5 years of experience in database administration, with a focus on PostgreSQL. 1-2 years of hands-on experience with PostgreSQL RDS.
    $141k-187k yearly est. Auto-Apply 60d+ ago
  • Lead Site Reliability Engineer, AI/ML Platform

    Jpmorgan Chase 4.8 company rating

    Jersey City, NJ jobs

    Responsibilities: + Design and implement solutions to enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demands. + Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing. + Develop observability, security, automation and fin-ops tools and orchestration. + Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability and automation frameworks. + Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems. + Debug and solve issues in a production environment, identify root cause and remediate. + Participate in on-call rotations, incident management and escalation workflows. + Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task. + Mentor and guide junior engineers. Required Qualifications: + Bachelor's degree in Computer Science, Information Technology, or equivalent technical qualification with 5+ years professional experience. + Expertise in SRE principles, reliability, scalability and performance of application and infrastructure. + Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible). + Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services. + Experience in architecting distributed systems and cloud-native architecture in AWS. + Systematic problem-solving and troubleshooting skills in a complex system. + Excellent communication skills and ability to represent and present business and technical concepts to stakeholders. + Self-managed, self-motivated with strong sense of ownership, urgency, and drive. Good to have: + Prior experience working in AI, ML, or Data engineering. + Prior experience developing AI Ops/AI Agents. 
+ Multi cloud experience (AWS, GCP, Azure) is a plus JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world's most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management. We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process. We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs. 
Visit our FAQs for more information about requesting an accommodation. JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans. **Base Pay/Salary** Jersey City, NJ $152,000.00 - $215,000.00 / year
    $152k-215k yearly 34d ago
  • Site Reliability Engineer (SRE)

    Luma Financial Technologies 3.3 company rating

    New York jobs

    About the role: At Luma, our Site Reliability Engineer (SRE) team keeps our platform reliable, secure, and lightning fast. They own everything from AWS infrastructure and Kubernetes clusters to CI/CD pipelines, monitoring, and alerting. If you're passionate about tackling big challenges, automating at scale, and making systems more resilient, we'd love to have you on the team. Please note: This position is required to work from Luma's Cincinnati, OH or New York, NY office 2-3 days/week. Sponsorship for U.S. work authorization is not available for this opportunity. What you'll do: Collaborate with product engineering teams to design and build the infrastructure their services run on. Keep our Kubernetes clusters on AWS EKS running smoothly, secure, and ready to scale. Design and deliver resilience strategies that cover multi-region architecture, backups, disaster recovery, and failover. Automate infrastructure with Terraform and Infrastructure-as-Code, reducing manual effort and human error. Help teams ship faster by improving CI/CD pipelines and deployment practices. Monitor performance and reliability using modern observability tools. Support on-call rotations and lead incident response with a focus on long-term fixes. What We're Looking For: You code to solve problems and are comfortable in one of the following languages: Python, Bash, Go, Java, or similar. You have strong experience with AWS (RDS, CloudFront, IAM, VPCs), Terraform, and Kubernetes. You are resilience focused, with experience designing and running systems that remain dependable during failures and recover seamlessly. You have hands-on experience improving and operating CI/CD pipelines (e.g., CircleCI, GitHub Actions, or similar) to help teams ship faster with confidence. You stay calm under pressure, bringing incident response expertise and strong root-cause analysis skills. 
Most importantly, you are a team player who brings clear communication, strong collaboration, and a mindset of continuous improvement. Please note: sponsorship for U.S. work authorization is not available for this opportunity.
    $102k-145k yearly est. 13d ago
  • Site Reliability Engineer - Capital Markets

    Jefferies Financial Group Inc. 4.8 company rating

    Jersey City, NJ jobs

    Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting the Equity front-office trading application and the risk and middle-office real-time products developed and used for the Equity Cash and ETS application. As part of the wider platform engineering team, you will be working closely with the Business users interactively throughout the day, along with technical, analysis and testing colleagues. Investigation and resolution of the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business and middle office areas. Responsibilities: Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users. Build monitoring tools for application and infrastructure components. Implement and manage scalable infrastructure using cloud-native technologies and tools. Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding. Partner with business, development and infrastructure teams to improve services through rigorous testing and release procedures. Develop and maintain CI/CD pipelines to streamline deployment processes. Expedient deployment of new systems. Capacity planning, Platform Management, and support for increasing volumes and business growth. Create sustainable systems and services through automation. Collaborate with Application team to establish and enforce production and development standards. Document procedures, best practices and troubleshooting FAQs. Resolve complex application and technical problems. Debug the system and fix production-related issues. Escalate / follow up on permanent fixes for development-related issues. Lead incident response efforts and post-mortem analysis to prevent future occurrences. 
Handle complex operational tasks and recommend process and technology changes. Provide global support, including weekend availability, to troubleshoot production-related issues and perform checkouts. Ability to work both independently and in groups in an energetic, diverse environment. Participate in on-call rotations to ensure 24/7 system availability and support. Support compliance and legal queries. Qualifications: Strong experience in Windows and Linux/Unix services. Strong experience in scripting languages like PowerShell, Python and SQL. Strong knowledge of monitoring tools: Nagios, Splunk, OTEL, Datadog. Strong knowledge of FIX protocol. Strong domain skills: must have working experience in Capital Markets across modules and instruments, especially CASH, ETS, Bonds, Options, Futures, and Swaps products. Experience in BFSI (Banking and Financial Industry) domain applications with a proper understanding of the Trade Lifecycle. Excellent communication, time management and project management skills. Full Time. Salary Range of $175,000 - $200,000
    $175k-200k yearly Auto-Apply 60d+ ago
  • Reliability Engineer

    Tata Consulting Services 4.3 company rating

    Marlborough, MA jobs

    * SRE who can quickly write automations and self-healing scripts, and understand and resolve errors from microservices on essentially any stack (full-stack capable). * Operations skillset with enough attitude to scale to a Reliability Engineer. * Should be able to handle customer communication and coordination with offshore team. TCS Employee Benefits Summary: * Discretionary Annual Incentive. * Comprehensive Medical Coverage: Medical & Health, Dental & Vision, Disability Planning & Insurance, Pet Insurance Plans. * Family Support: Maternal & Parental Leaves. * Insurance Options: Auto & Home Insurance, Identity Theft Protection. * Convenience & Professional Growth: Commuter Benefits & Certification & Training Reimbursement. * Time Off: Vacation, Time Off, Sick Leave & Holidays. * Legal & Financial Assistance: Legal Assistance, 401K Plan, Performance Bonus, College Fund, Student Loan Refinancing. Salary Range - $100,000-$120,000 a year
    $100k-120k yearly 22d ago
  • Reliability Engineer (SRE OMS)

    Tata Consulting Services 4.3 company rating

    Marlborough, MA jobs

    * SRE with Sterling OMS skillset, adaptability to distributed systems, and experience developing automations with AI/GenAI tools. * Operations skillset with enough attitude to scale to a Reliability Engineer. * Should be able to handle customer communication and coordination with offshore team. TCS Employee Benefits Summary: * Discretionary Annual Incentive. * Comprehensive Medical Coverage: Medical & Health, Dental & Vision, Disability Planning & Insurance, Pet Insurance Plans. * Family Support: Maternal & Parental Leaves. * Insurance Options: Auto & Home Insurance, Identity Theft Protection. * Convenience & Professional Growth: Commuter Benefits & Certification & Training Reimbursement. * Time Off: Vacation, Time Off, Sick Leave & Holidays. * Legal & Financial Assistance: Legal Assistance, 401K Plan, Performance Bonus, College Fund, Student Loan Refinancing. Salary Range - $100,000-$120,000 a year
    $100k-120k yearly 22d ago
  • Site Reliability Engineer

    FIS Capital Markets 4.4 company rating

    New York jobs

    We are FIS. Our technology powers the world's economy and our teams bring innovation to life. We champion diversity to deliver the best products and solutions for our colleagues, clients and communities. If you're ready to start learning, growing and making an impact with a career in fintech, we'd like to know: Are you FIS? NOTE: 1: This position is hybrid (3 days onsite) in our FIS Office locations in New York City (New York), Milwaukee (Wisconsin), Jacksonville (Florida) & Atlanta (Georgia). 2: Current and future sponsorship are not available for this position About the Team: This position is under our CTO org to support SRE functions for innovation and growth for the Banking Solutions, Payments and Capital Markets business. This role will report under our Wealth and Retirement team. FIS empowers small to large retirement plan providers around the globe with a comprehensive, integrated suite of retirement solutions. Our industry-leading offering includes an extensive selection of technology and services. From record keeping technology and plan administration and compliance solutions, to financial wellness that supports all aspects of retirement services business and positions our clients for future growth. We are looking for a talented resource, who is comfortable working multiple appropriately prioritized issues and/or projects at a time. One who desires to be a part of this global dynamic retirement technology team where they will grow personally, technically and professionally. What you will be doing: Site Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions, Payments and Capital Markets business. In this role, the candidate will have the opportunity to make a lasting impact on the company's transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive banking, payments and investment landscape. 
Your broad responsibilities will include owning the technical engagement and ultimate success of specific modernization projects; you should have hands-on experience with AWS technologies as well as broad know-how of how applications and services are constructed using the AWS platform. Design and maintain monitoring solutions for infrastructure, application performance, and user experience. Implement automation tools to streamline tasks, scale infrastructure, and ensure seamless deployments. Ensure application reliability, availability, and performance, minimizing downtime and optimizing response times. Lead incident response, including identification, triage, resolution, and post-incident analysis. Conduct capacity planning, performance tuning, and resource optimization. Collaborate with security teams to implement best practices and ensure compliance. Manage deployment pipelines and configuration management for consistent and reliable app deployments. Develop and test disaster recovery plans and backup strategies. Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response processes. Participate in on-call rotations and provide 24/7 support for critical incidents. What you bring: Proficiency in development technologies, architectures, and platforms (web, mobile apps, API). Experience with the AWS cloud platform and IaC tools: AWS CLI (used for a few tasks, such as changing AWS configurations, uploading/downloading files to S3, and downloading the kubectl context); Terraform (defining the infrastructure as code, where all AWS resource configurations get configured or changed); kubectl (for communication with EKS). Knowledge of monitoring tools (Prometheus, Grafana, DataDog) and logging frameworks (Splunk, ELK Stack). Experience in incident management and post-mortem reviews. Strong troubleshooting skills for complex technical issues. Proficiency in scripting languages (Python, Bash) and automation tools (Terraform, Ansible). 
Experience with CI/CD pipelines (Jenkins, GitLab CI/CD). Ownership approach to engineering and product outcomes. Excellent interpersonal communication, negotiation, and influencing skills. Must have the following certifications: AWS Cloud Practitioner; AWS Solutions Architect Associate OR AWS Certified DevOps Engineer. Tech Stack: Linux, Git, Docker, Kubernetes/OpenShift, Helm, Jenkins, Harness, CheckMarx, SonarQube, Maven, Node, Artifactory, FlyWay, Splunk, KeyFactor, HashiCorp Vault, CyberArk, SNOW, Jira, Confluence, Oracle DB, PostgreSQL DB, EC2, EKS, KMS, Secrets Manager (stores passwords and enables auto-rotation), RDS (Postgres DB), Redis (in-memory cache), S3 (object storage), Security Groups (ingress/egress firewall configurations for each service), IAM (access control to AWS resources), Route53, SFTP. A scheduler like Tivoli is nice to know. What we offer you: At FIS, we hire the best. In return, you receive exceptional benefits including: opportunities to innovate in fintech; tools for personal and professional growth; an inclusive and diverse work environment; resources to invest in your community; competitive salary and benefits. FIS is committed to providing its employees with an exciting career opportunity and competitive compensation. The pay range for this full-time position is $170,550.00 - $286,520.00 and reflects the minimum and maximum target for new hire salaries for this position based on the posted role, level, and location. Within the range, actual individual starting pay is determined by additional factors, including job-related skills, experience, and relevant education or training. Any changes in work location will also impact actual individual starting pay. 
Please consult with your recruiter about the specific salary range for your preferred location during the hiring process. Privacy Statement: FIS is committed to protecting the privacy and security of all personal information that we process in order to provide services to our clients. For specific information on how FIS protects personal information online, please see the Online Privacy Notice. EEOC Statement: FIS is an equal opportunity employer. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, marital status, genetic information, national origin, disability, veteran status, and other protected characteristics. The EEO is the Law poster is available here; supplement document available here. For positions located in the US, the following conditions apply. If you are made a conditional offer of employment, you will be required to undergo a drug test. ADA Disclaimer: In developing this job description care was taken to include all competencies needed to successfully perform in this position. However, for Americans with Disabilities Act (ADA) purposes, the essential functions of the job may or may not have been described for purposes of ADA reasonable accommodation. All reasonable accommodation requests will be reviewed and evaluated on a case-by-case basis. Sourcing Model: Recruitment at FIS works primarily on a direct sourcing model; a relatively small portion of our hiring is through recruitment agencies. FIS does not accept resumes from recruitment agencies which are not on the preferred supplier list and is not responsible for any related fees for resumes submitted to job postings, our employees, or any other part of our company.
    $90k-116k yearly est. Auto-Apply 60d+ ago
  • Site Reliability Engineer - Capital Markets

    Jefferies Financial Group Inc. 4.8 company rating

    New York, NY jobs

    Jefferies is seeking a Site Reliability Engineer to play an instrumental role in supporting the Equity front-office trading application and the risk and middle-office real-time products developed and used for the Equity Cash and ETS application. As part of the wider platform engineering team, you will be working closely with the Business users interactively throughout the day, along with technical, analysis and testing colleagues. Investigation and resolution of the work items at hand will require competent technical skills and a keen intellect. The business is a growth area, with current investments taking place in all the technology, business and middle office areas. Responsibilities: * Front-line Site Reliability Engineering and support functions for Equity trading systems used by Jefferies clients as well as internal users. * Build monitoring tools for application and infrastructure components. * Implement and manage scalable infrastructure using cloud-native technologies and tools. * Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding. * Partner with business, development and infrastructure teams to improve services through rigorous testing and release procedures. * Develop and maintain CI/CD pipelines to streamline deployment processes. * Expedient deployment of new systems. Capacity planning, Platform Management, and support for increasing volumes and business growth. * Create sustainable systems and services through automation. * Collaborate with Application team to establish and enforce production and development standards. * Document procedures, best practices and troubleshooting FAQs. * Resolve complex application and technical problems. * Debug the system and fix production-related issues. * Escalate / follow up on permanent fixes for development-related issues. * Lead incident response efforts and post-mortem analysis to prevent future occurrences. 
* Handle complex operational tasks and recommend process and technology changes. * Provide global support, including weekend availability, to troubleshoot production-related issues and perform checkouts. * Ability to work both independently and in groups in an energetic, diverse environment. * Participate in on-call rotations to ensure 24/7 system availability and support. * Support compliance and legal queries. Qualifications: * Strong experience in Windows and Linux/Unix services. * Strong experience in scripting languages like PowerShell, Python and SQL. * Strong knowledge of monitoring tools: Nagios, Splunk, OTEL, Datadog. * Strong knowledge of FIX protocol. * Strong domain skills: must have working experience in Capital Markets across modules and instruments, especially CASH, ETS, Bonds, Options, Futures, and Swaps products. * Experience in BFSI (Banking and Financial Industry) domain applications with a proper understanding of the Trade Lifecycle. * Excellent communication, time management and project management skills. Full Time. Salary Range of $175,000 - $200,000
    $175k-200k yearly Auto-Apply 50d ago
  • Site Reliability Engineer III- Kafka Platform Engineering

    Jpmorgan Chase 4.8 company rating

    Jersey City, NJ jobs

    There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Infrastructure Platforms, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform. **Job responsibilities** + Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate. + Demonstrate deep knowledge of Kafka technology, Kafka connect framework, and distributed systems technologies, with the ability to operate in and migrate across public and private clouds. + Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines + Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications + Implements infrastructure, configuration, and network as code for the applications and platforms in your remit. + Collaborates with technical experts, key stakeholders, and team members to resolve complex problems. + Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers. 
+ Contribute to the development of technical documentation, including service APIs using Swagger, ensuring robust logging, auditability, security, and monitoring features. + Supports the adoption of site reliability engineering best practices within your team. + Engage in periodic on-call rotation shifts, providing client support and ensuring thorough monitoring of the platform. **Required qualifications, capabilities, and skills** + Formal training or certification on computer science and reliability concepts and 3+ years applied experience. + Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or platform. + Proficient in at least one programming language such as Java/Spring Boot or Python. + Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.) + Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc. + Experience with public cloud platforms like AWS, GCP or Azure. + Experience with Kafka ecosystem products: Kafka, Kafka Connect, Kafka Streams. + Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform. + Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker. + Familiarity with troubleshooting common networking technologies and issues. 
+ Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision. + Ability to proactively recognize roadblocks and demonstrate interest in learning technology that facilitates innovation. + Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team. + Ability to initiate and implement ideas to solve business problems. **Preferred qualifications, capabilities, and skills** + Familiarity with running Apache Flink. + Understanding of authentication and authorization technologies (e.g., OAuth, Kerberos). + Experience with AWS cloud services and Kubernetes platform orchestration. JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world's most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management. We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process. 
We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation. JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans. **Base Pay/Salary** Jersey City, NJ $133,000.00 - $185,000.00 / year
    $133k-185k yearly 60d+ ago
  • Site Reliability Engineer III- Kafka Platform Engineering

    Jpmorganchase 4.8 company rating

    Jersey City, NJ jobs

    There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Infrastructure Platforms, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform. Job responsibilities Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate. Demonstrate deep knowledge of Kafka technology, Kafka connect framework, and distributed systems technologies, with the ability to operate in and migrate across public and private clouds. Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications Implements infrastructure, configuration, and network as code for the applications and platforms in your remit. Collaborates with technical experts, key stakeholders, and team members to resolve complex problems. Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers. Contribute to the development of technical documentation, including service APIs using Swagger, ensuring robust logging, auditability, security, and monitoring features. 
Supports the adoption of site reliability engineering best practices within your team. Engages in periodic on-call rotation shifts, providing client support and ensuring thorough monitoring of the platform. Required qualifications, capabilities, and skills Formal training or certification on computer science and reliability concepts and 3+ years applied experience. Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or platform. Proficient in at least one programming language such as Java/Spring Boot or Python. Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.). Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc. Experience with public cloud platforms like AWS, GCP or Azure. Experience with Kafka ecosystem products: Kafka, Kafka Connect, Kafka Streams. Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform. Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker. Familiarity with troubleshooting common networking technologies and issues. Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision. Ability to proactively recognize roadblocks and demonstrate interest in learning technology that facilitates innovation. Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team. Ability to initiate and implement ideas to solve business problems. Preferred qualifications, capabilities, and skills Familiarity with running Apache Flink. 
Understanding of authentication and authorization technologies (e.g., OAuth, Kerberos). Experience with AWS cloud services and Kubernetes platform orchestration.
    $113k-140k yearly est. Auto-Apply 60d+ ago
  • Site Reliability Engineer III - Newton

    Jpmorganchase 4.8 company rating

    Jersey City, NJ jobs

    There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Climate Risk Technology team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform. Job responsibilities Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate. Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines. Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications. Implements infrastructure, configuration, and network as code for the applications and platforms in your remit. Collaborates with technical experts, key stakeholders, and team members to resolve complex problems. Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers. Supports the adoption of site reliability engineering best practices within your team. Required qualifications, capabilities, and skills Formal training or certification on software engineering concepts and applied experience. Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or platform. 
Proficient in at least one programming language such as SQL, Java/Spring Boot, or Python. Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., AWS, Kubernetes, etc.). Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk. Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform. Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision. Ability to proactively recognize roadblocks and demonstrate interest in learning technology that facilitates innovation. Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team. Ability to initiate and implement ideas to solve business problems. Preferred qualifications, capabilities, and skills Knowledge of HDFS, Hadoop, Databricks. Knowledge of Airflow, Control-M. Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker. Familiarity with troubleshooting common networking technologies and issues.
    $113k-140k yearly est. Auto-Apply 16d ago
  • Site Reliability Engineer III

    Jpmorganchase 4.8 company rating

    Jersey City, NJ jobs

    There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the [insert LOB or sub LOB], you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform. Job responsibilities Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate. Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines. Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications. Implements infrastructure, configuration, and network as code for the applications and platforms in your remit. Collaborates with technical experts, key stakeholders, and team members to resolve complex problems. Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers. Supports the adoption of site reliability engineering best practices within your team. Required qualifications, capabilities, and skills Formal training or certification on software engineering concepts and 3+ years applied experience. Proficient in site reliability culture and principles and familiarity with how to implement site reliability within an application or 
platform. Proficient in at least one programming language such as Python, Java/Spring Boot, or .NET. Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.). Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others. Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform. Familiarity with container and container orchestration such as ECS, Kubernetes, and Docker. Familiarity with troubleshooting common networking technologies and issues. Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision. Preferred qualifications, capabilities, and skills Ability to proactively recognize roadblocks and demonstrate interest in learning technology that facilitates innovation. Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team. Ability to initiate and implement ideas to solve business problems.
    $113k-140k yearly est. Auto-Apply 60d+ ago
  • Lead Site Reliability Engineer, AI/ML Platform

    Jpmorganchase 4.8 company rating

    Jersey City, NJ jobs

    Responsibilities: Design and implement solutions to enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demands. Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing. Develop observability, security, automation and fin-ops tools and orchestration. Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability and automation frameworks. Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems. Debug and solve issues in a production environment, identify root cause and remediate. Participate in on-call rotations, incident management and escalation workflows. Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task. Mentor and guide junior engineers. Required Qualifications: Bachelor's degree in Computer Science, Information Technology, or equivalent technical qualification with 5+ years professional experience. Expertise in SRE principles, reliability, scalability and performance of application and infrastructure. Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible). Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services. Experience in architecting distributed systems and cloud-native architecture in AWS. Systematic problem-solving and troubleshooting skills in a complex system. Excellent communication skills and ability to represent and present business and technical concepts to stakeholders. Self-managed, self-motivated with strong sense of ownership, urgency, and drive. Good to have: Prior experience working in AI, ML, or Data engineering. Prior experience developing AI Ops/AI Agents. Multi-cloud experience (AWS, GCP, Azure) is a plus.
    $113k-140k yearly est. Auto-Apply 37d ago
  • Site Reliability Engineer II-2

    Mastercard 4.7 company rating

    Bogota, NJ jobs

    Our Purpose Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we're helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships and networks combine to deliver a unique set of products and services that help people, businesses and governments realize their greatest potential. Title and Summary Site Reliability Engineer II-2 Overview The GBSC EPMS team is looking for a Site Reliability Engineer who can help us solve problems, implement automation, and leverage best practices. * Are you a born problem solver who loves to figure out how something works? * Are you a detail-oriented individual who enjoys complex problem solving? * Do you love determining the correct actions required to fix a problem? * Do you have a low tolerance for manual work and look to automate everything you can? Business Operations is leading the Site Reliability Engineering (SRE) transformation at Mastercard through our tooling and by being an advocate for change & standards throughout the development, quality, release, and product organizations. We need team members with an appetite for change and pushing the boundaries of what can be done with automation. Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must. Responsibilities * Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement. * Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns. * Support services before they go live through activities such as system design consulting, capacity planning and launch reviews. 
* Maintain services once they are live by measuring and monitoring availability, latency and overall system health. * Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. * Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead Mastercard in DevOps automation and best practices. * Practice sustainable incident response and blameless postmortems. * Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover. * Work with a global team spread across tech hubs in multiple geographies and time zones. * Share knowledge and mentor junior resources. All About You * BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience. * Experience with algorithms, data structures, scripting, pipeline management, software design and OLAP systems. * Hands-on experience with understanding custom objects using JavaScript, HTML5, CSS and API integrations. * Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. * Ability to help debug and optimize code and automate routine tasks. * We support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed. * Experience in one or more of the following is preferred: C, C++, Java, Python, Go, Perl, Ruby, MDX. * Interest in designing, analyzing and troubleshooting large-scale distributed systems. * We need team members with an appetite for change and pushing the boundaries of what can be done with automation. Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must. 
Corporate Security Responsibility All activities involving access to Mastercard assets, information, and networks come with an inherent risk to the organization and, therefore, it is expected that every person working for, or on behalf of, Mastercard is responsible for information security and must: * Abide by Mastercard's security policies and practices; * Ensure the confidentiality and integrity of the information being accessed; * Report any suspected information security violation or breach; and * Complete all periodic mandatory security trainings in accordance with Mastercard's guidelines.
    $88k-119k yearly est. Auto-Apply 5d ago
  • Site Reliability Engineer 2

    Drivewealth 4.0 company rating

    New York, NY jobs

    DriveWealth is a global B2B financial technology organization dedicated to democratizing access to financial independence around the world. Our mission is realized through an API-based platform, empowering our partners to offer seamless investing and trading experiences to clients worldwide, all from their mobile devices. Our technology provides partners with a modern, extensible toolkit, enabling traditional investment workflows and innovative techniques like fractional share ownership. DriveWealth has evolved into a global platform offering trading of US equities, mutual funds, ETFs, fixed income, and options. We seek enthusiastic professionals to contribute diverse perspectives and experiences to our Brokerage-as-a-Service platform. Our culture blends the pace and opportunity of a tech start-up with the impact, stability, and significance of Wall Street. We encourage creativity and experimentation while ensuring institutional-grade execution and regulatory compliance in everything we do. We value diversity and inclusion, celebrating the unique differences of our employees as we scale and grow together. We're guided by operating principles grounded in accountability, teamwork, integrity, and solutions built to scale. Join us! About The Role As a Site Reliability Engineer 2, you will enhance the reliability and performance of our Brokerage-as-a-Service platform during critical 24/7 operations. This role demands a proactive approach to managing technical challenges and system optimizations that align with our global operational strategies. What You'll Do Support the SRE team in developing and implementing enhancements to support workflows, focusing on automation and efficiency improvements. Handle technical escalations, troubleshoot complex issues, and actively participate in on-call rotations to ensure rapid response and resolution during non-traditional hours. Adhere to and administer incident and change management policies. 
Coordinate incident resolution efforts and implement change management protocols to maintain and enhance system reliability, especially during critical system operations at night. Work closely with the New York office to ensure smooth operation and alignment of SRE practices across time zones. What You'll Need 3+ years in an SRE role or a similar position, demonstrating deep knowledge and expertise in site reliability engineering and operations. Working knowledge of REST APIs and understanding of API integration. Python proficiency in scripting for automation and system management, with a track record of developing and implementing automation solutions. SQL and database expertise in transactional databases, including querying and troubleshooting. Analytical and troubleshooting skills with a demonstrated ability to perform root cause analysis of technical issues. Availability for flexible work hours and willingness to cover US markets trading sessions, including L2 on-call coverage. Knowledge of Change Management Process and Risk Management. 
Nice to Have, But Not Required Experience in the brokerage or financial industry. Proficient with cloud services, particularly AWS, and knowledgeable about cloud architecture best practices, including IAM, EC2, S3, and DynamoDB. Experience maintaining and supporting containerized systems, with familiarity in orchestration tools. Knowledge of Infrastructure as Code (IaC) practices and tools such as Terraform or CloudFormation. Ability to manage and troubleshoot job scheduling tools like Rundeck or Apache Airflow. Advanced skills in managing containerized environments using Kubernetes and OpenShift. Practical experience with Confluent Cloud for event streaming architectures. Experience with Java applications and a basic understanding of using the browser developer console for front-end debugging. Additional Notes: This role is critical for our continuous operations and requires a commitment to nighttime hours, aligning with the global nature of our financial services. Candidates must be prepared for intense collaboration periods and proactive communication across global teams. Applicants must be authorized to work for any employer in the U.S. DriveWealth is unable to sponsor or take over sponsorship of an employment visa at this time. Compensation Compensation package offerings are based on candidate experience and technical qualifications, as they relate to the role. These are identified and determined throughout your interviewing experience. Please note: this role is expected to come into our office on a cadence set by the Hiring Manager/Team. 
New York, NY (Hybrid) Pay Range: $70,000-$120,000 USD Benefits Competitive medical, dental, and vision insurance options. Mental health resources. Generous paid time off with observed holidays (varies per country). Paid parental leave for biological and adoptive parents. Up to $2,500 or local equivalent each year to invest in continued education and personal development. Up to $900 each year or local equivalent for fitness and wellness reimbursement. Company-provided phone (varies by country). For HQ in-office employees, a daily lunch stipend, unlimited snacks, and engaging office space in the Financial District. Pre-tax commuter benefits (US only). Employer 401K match (US only). Benefit offerings vary based on country and are subject to change. Equal Employment Opportunity To build technology and products that are used and loved by people and solve real-world problems, we need to build a team with many different perspectives and experiences. We are an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We encourage candidates from all backgrounds to apply. Applicants in need of special assistance or accommodation during the interview process or in accessing our website may contact us at **************************. Agency Disclaimer DriveWealth does not accept agency resumes. Please do not forward resumes to our jobs alias, employees, or any other organization location. DriveWealth is not responsible for any fees related to unsolicited resumes.
    $70k-120k yearly Auto-Apply 1d ago
