Reliability Engineer jobs at Apixio - 101 jobs

  • Senior SRE - Remote-First, Observability & Reliability

    Captivateiq, Inc. 4.3 company rating

    San Francisco, CA jobs

    A tech company focused on sales performance is seeking a Site Reliability Engineer in San Francisco. This role involves collaborating with development teams, automating infrastructure, and ensuring service reliability. Ideal candidates will have extensive experience in SRE or DevOps, with skills in infrastructure as code and strong communication abilities. The company offers generous benefits including health coverage and a 401k plan, fostering a diverse and inclusive work environment.
    $142k-189k yearly est. 2d ago
  • Remote JavaScript Engineer for AI Training & Code Quality

    Labelbox 4.3 company rating

    San Francisco, CA jobs

    A leading AI solutions firm is looking for a JavaScript Developer to work remotely on AI-related projects. The successful candidate will review AI-generated JavaScript code, develop high-quality solutions, and create explanations for code logic. A Bachelor's degree in Computer Science and 3-5 years of experience with JavaScript frameworks like React or Node.js are required. This hourly contract offers flexible working hours ranging from 10 to 40 hours per week.
    $85k-115k yearly est. 3d ago
  • Principal Site Reliability Engineer

    Expel 4.3 company rating

    Remote

    Your passion for uptime was forged from experience in production and refined through incident response. You're an Expel Principal Site Reliability Engineer - a protector, champion, and leader of Expel's reputation for service reliability. Innovation comes naturally to you, but you're also eager to help others. You understand that operational reliability is a shared mission across all of engineering, and that your role is to make it as easy as possible for Expel to achieve that mission. You spend your mornings collaborating with architects and product stakeholders to outline the next quarter's reliability initiatives, then the afternoon pair-programming with a junior SRE to mentor them in debugging a tricky Kubernetes deployment. You apply your dedication to reliability and collaboration with the broader SRE community to ensure Expel maintains outstanding reliability standards within the cloud native ecosystem. You take pride in all the nines of uptime you've achieved, but you know that an SRE's job is never done!

    What Expel can do for you

    * Provide an opportunity to grow and maintain reliability-focused platform features within a cloud native engineering platform using modern infrastructure and tooling (Kubernetes/GKE/EKS runtime, HashiCorp toolset, etc.)
    * Provide you a mission you can get behind: stopping evil hackers so our customers can focus on their business
    * Be included in a company focused on creating opportunities to do interesting work and creating space for employees to learn and grow
    * An opportunity to contribute to a best-in-class product
    * A leadership team that's embraced modern Site Reliability principles, as outlined by Google and other industry leaders.

    What you can do for Expel

    * Lead project work to build and maintain platform features that cut across the Expel product's reliability, networking, and cloud infrastructure.
    * Contribute by pushing IaC commits daily, with occasional opportunities to write and test application code in Python, Golang, and JavaScript
    * Mentor and motivate service owners on how to use the platform in order to deploy, measure, monitor, and operate their own services at scale.
    * Participate in a weekly support rotation that includes taking the on-call pager and providing nearly on-demand working-hours support to platform users.
    * Lead incident response, triage, and root cause analysis support
    * Poke fun at our leadership team in creative ways.

    What you should bring with you

    * A passion for learning and improving your work product
    * Significant experience operating Kubernetes within highly distributed environments
    * Experience running systems in GCP or AWS
    * Exposure to monitoring and observability infrastructure and standard methodologies
    * An understanding of infrastructure-as-code practices, tools, and patterns
    * Some experience developing software in Linux environments, preferably with Python and/or Golang
    * A customer-minded approach that enables the success of platform users as well as building trust across the organization.
    * A collaborative disposition that allows you to work optimally on and across teams
    * Six years of systems experience either in operations or development

    Missing some items on the list? That's ok! We still want to talk to you!

    How our team works together

    We build and run teams where everyone is pulling in the same direction and is learning from each other:

    * We work out of a shared backlog
    * We pair-program weekly, as it makes sense
    * We peer-review everything
    * We do weekly blame-free retros to reinforce what's going well, so we do more of it, and surface what's not going well, so we can do something about it. Same thing for projects and significant operational problems.

    Our hiring process

    We respect your time. You'll hear from us by the end of the next business day after completing an interview. We also have a goal that all Expletives have a great manager and have a voice in how their team is run and who runs it. It's not the shortest process in the industry, but you'll get to meet nearly everyone you'll work with day-to-day and your Engineering leadership. New Expletives consistently say our interview process gave them an accurate picture of what it's like to work here.

    Here's our 3-stage process for this position (5.5 hours total interviewing time):

    * Chat with a recruiter (30 min)
    * Video interview with hiring manager (Engineering Manager) (60 minutes)
    * Pair programming interview (with two engineers) (60 minutes)
    * “Virtual onsite interview” (can be scheduled contiguous or broken up, 60 minutes each):
        * Engineering leadership (Engineering Director and Manager of Delivery Experience)
        * System design interview (with two engineers)
        * Technology and skills interview (with two engineers)

    Additional details

    The base salary range for this role is between $167,300 USD and $242,600 USD + bonus eligibility and equity. We believe in paying transparently and equitably. Your salary will ultimately be based on factors such as your experience, skills, team equity, and market data. You'll also be eligible for unlimited PTO (which we model and encourage), work location flexibility, up to 24 weeks of parental leave, and really excellent health benefits.

    We're only hiring those authorized to work in the United States. We do not currently sponsor immigration visas.

    We're an Equal Opportunity Employer: You'll receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability. We'll ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please let us know if you need accommodation of any kind.

    Salary Range: $167,300-$242,600 USD
    $167.3k-242.6k yearly Auto-Apply 60d+ ago
  • Staff Site Reliability Engineer

    Alphasense 4.0 company rating

    Remote

    The world's most sophisticated companies rely on AlphaSense to remove uncertainty from decision-making. With market intelligence and search built on proven AI, AlphaSense delivers insights that matter from content you can trust. Our universe of public and private content includes equity research, company filings, event transcripts, expert calls, news, trade journals, and clients' own research content.

    The acquisition of Tegus by AlphaSense in 2024 advances our shared mission to empower professionals to make smarter decisions through AI-driven market intelligence. Together, AlphaSense and Tegus will accelerate growth, innovation, and content expansion, with complementary product and content capabilities that enable users to unearth even more comprehensive insights from thousands of content sets. Our platform is trusted by over 6,000 enterprise customers, including a majority of the S&P 500. Founded in 2011, AlphaSense is headquartered in New York City with more than 2,000 employees across the globe and offices in the U.S., U.K., Finland, India, Singapore, Canada, and Ireland. Come join us!

    About the Role:

    Our Site Reliability Engineering team is growing, and we are looking for a highly experienced Staff Site Reliability Engineer to help shape the future of reliability, scalability, and performance at AlphaSense. This is a hands-on, high-impact role where you will architect core reliability platforms, lead by example in incident response, and drive cultural adoption of SRE best practices across our global engineering organization.

    Our mission is to engineer our platform to the reliability standards of mission-critical systems, targeting 99.99% uptime, while continuously enhancing our systems and processes. This role is key to that mission and goes beyond traditional system maintenance; it's about pioneering the platforms, practices, and culture that enable engineering to scale effectively. You will act as a force multiplier, mentoring fellow engineers, influencing architectural decisions, and setting the technical bar for reliability across the company.

    Who You Are:

    * 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role, with at least 3+ of those years operating in a Senior+ SRE position
    * Strong background in running production SaaS systems at scale.
    * Proficiency in at least one programming/scripting language (Python, Go, or similar).
    * Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes.
    * Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing).
    * Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK).
    * Familiarity with advanced observability (OTEL, continuous profiling).
    * Proven incident management experience, including leading high-severity incidents and postmortems.
    * Strong troubleshooting skills across the full stack.
    * Excellent communication and collaboration skills.

    What You'll Do:

    * Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture.
    * Lead AI-Driven Reliability: Drive our AIOps strategy - automating diagnostics, remediation, and proactive failure prevention.
    * Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards.
    * Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements.
    * Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively.
    * Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing.

    For base compensation, we set standard ranges for all roles based on function and level benchmarked against similar stage growth companies and internal comparables. In order to be compliant with local legislation, as well as to provide greater transparency to candidates, we share salary ranges on all job postings regardless of desired hiring location. Final offer amounts are determined by multiple factors including candidate experience/expertise and may vary from the amounts listed below. You may also be offered equity, and a generous benefits program.

    Compensation Range: $150,000-$225,000 USD

    AlphaSense is an equal-opportunity employer. We are committed to a work environment that supports, inspires, and respects all individuals. All employees share in the responsibility for fulfilling AlphaSense's commitment to equal employment opportunity. AlphaSense does not discriminate against any employee or applicant on the basis of race, color, sex (including pregnancy), national origin, age, religion, marital status, sexual orientation, gender identity, gender expression, military or veteran status, disability, or any other non-merit factor. This policy applies to every aspect of employment at AlphaSense, including recruitment, hiring, training, advancement, and termination.

    In addition, it is the policy of AlphaSense to provide reasonable accommodation to qualified employees who have protected disabilities to the extent required by applicable laws, regulations, and ordinances where a particular employee works.

    Recruiting Scams and Fraud

    We at AlphaSense have been made aware of fraudulent job postings and individuals impersonating AlphaSense recruiters. These scams may involve fake job offers, requests for sensitive personal information, or demands for payment. Please note:

    * AlphaSense never asks candidates to pay for job applications, equipment, or training.
    * All official communications will come from ******************* email address.
    * If you're unsure about a job posting or recruiter, verify it on our Careers page.

    If you believe you've been targeted by a scam or have any doubts regarding the authenticity of any job listing purportedly from or on behalf of AlphaSense please contact us. Your security and trust matter to us.
    $150k-225k yearly Auto-Apply 3d ago
  • Staff Site Reliability Engineer

    Pathai 4.3 company rating

    Remote

    Who We Are

    PathAI is on a mission to improve patient outcomes with AI-powered pathology. We are transforming traditional pathology methods into powerful, new technologies. These innovations in pathology can help accelerate drug development, improve confidence in the accuracy of diagnosis, and get life-saving therapies to patients more quickly. At PathAI, you'll work with a diverse and talented team of people, who are dedicated to solving complex problems and making a huge impact.

    Where You Fit

    We're looking for a skilled staff level Site Reliability Engineer focused on designing, building, and operating our hybrid cloud/on-prem environment.

    What You'll Do

    If you're the right candidate, you'll be exercising all the skills you have and building new ones along the way:

    * Advancing the state of our operations by implementing SRE best practices - focusing on users, monitoring, and automation.
    * Engineering infrastructure patterns for cloud environments in Amazon Web Services - building in security, reliability and scalability.
    * Designing, building, and operating our data center to support our rapidly growing Machine Learning team.
    * Integrating on-premises datacenter environments with existing cloud infrastructure to create a seamless hybrid cloud environment.
    * Improving the reliability and resilience of our infrastructure through root-cause analysis and reviewing gaps in designs, and implementations of our infrastructure.
    * Participating in platform on-call rotations and assisting with urgent incident response.

    What You Bring

    Our employees' skills come in all shapes and sizes, but to be successful in this role with us, you'll at least need:

    * 8+ years of relevant experience.
    * Automation: You work hard to eliminate toil by automating everything through scripting, configuration management tools (Ansible), and code (Python/GoLang).
    * You've built monitoring infrastructure with modern observability tools (Datadog/Grafana/Prometheus).
    * You've worked with infrastructure as code (Terraform/Cloudformation).
    * You've administered physical hardware stacks in production settings (iDRAC/IPMI/Nvidia UFM/Juniper Systems).
    * You're opinionated on storage solutions and how they can be optimized for high performance workloads (Quobyte/S3/FSx/EFS).
    * Familiarity with modern network designs and comfort operating across network layers.
    * Some experience and opinions on virtualization, containerization, or container orchestration platforms (EKS/ClusterAPI/KVM).
    * Operations experience: You've managed critical production infrastructure and are familiar with incident response, scaling, and rapid growth related challenges.
    * A bachelor's degree in Computer Science or equivalent experience.
    * An insatiable intellectual curiosity and the ability to learn quickly in a complex space.
    * Travel: Willingness to travel up to 25% of the time.

    We Want To Hear From You

    At PathAI, we are looking for individuals who are team players, are willing to do the work no matter how big or small it may be, and who are passionate about everything they do. If this sounds like you, even if you may not match the job description to a tee, we encourage you to apply. You could be exactly what we're looking for.

    PathAI is an equal opportunity employer, dedicated to creating a workplace that is free of harassment and discrimination. We base our employment decisions on business needs, job requirements, and qualifications - that's all. We do not discriminate based on race, gender, religion, health, personal beliefs, age, family or parental status, or any other status. We don't tolerate any kind of discrimination or bias, and we are looking for teammates who feel the same way.

    The cash compensation outlined below includes base salary or hourly wage and on-target commission for employees in eligible roles. The summary below indicates if an employee in this position is eligible for annual bonus, overtime pay and equity awards. Individual compensation packages are tailored based on skills, experience, qualifications, and other job-related factors.

    * Annual Pay Range: $165,750 - $224,450
    * Not Overtime Eligible
    * Eligible for Equity
    $165.8k-224.5k yearly Auto-Apply 1d ago
  • Staff Site Reliability Engineer

    Stash 3.9 company rating

    New York, NY jobs

    Want to help everyday Americans invest and build wealth? Financial inequality is increasing, and too many people are getting left behind. At Stash, we are passionate about democratizing wealth creation through education, advice, and products that help customers achieve greater financial freedom.

    Join our Infrastructure team as a Staff Site Reliability Engineer and play a key role in building and scaling Stash's platforms. You'll drive initiatives that strengthen reliability, design secure and resilient systems, and lead automation efforts that make our infrastructure faster and more efficient in a high-growth environment.

    What you'll do:

    * Design, build, and operate AWS networking and infrastructure, including VPCs, Transit Gateway, PrivateLink, routing, and security boundaries.
    * Lead Kubernetes (EKS) platform operations - scaling clusters, optimizing workloads, and ensuring reliability of critical services.
    * Automate infrastructure workflows with Terraform and CI/CD pipelines (GitHub Actions) to increase speed and consistency.
    * Configure and maintain Nginx for high-availability, load balancing, and secure traffic management.
    * Troubleshoot and resolve complex issues across systems, networks, and applications (DNS, routing, TCP, container orchestration).
    * Collaborate with engineering teams to design scalable cloud solutions and embed best practices for reliability.
    * Continuously improve observability using Datadog and related tooling to monitor performance and proactively prevent outages.
    * Drive architectural decisions that strengthen system reliability, security, and scalability in AWS.

    What we're looking for:

    * 8+ years of experience in site reliability engineering or similar roles.
    * Deep expertise in AWS networking (VPC design, Transit Gateway, PrivateLink, routing, security groups, NACLs).
    * Strong experience with Nginx (configuration, tuning, scaling, troubleshooting).
    * Strong expertise in Kubernetes (K8s) and Amazon EKS.
    * Advanced skills in AWS infrastructure setup, management, and optimization.
    * Proficiency in infrastructure as code (Terraform, Terraform Cloud).
    * Strong programming skills in Python and/or Go.
    * Experience with system monitoring (Datadog) and logging/archiving practices.
    * Extensive experience with GitHub Actions for CI/CD pipelines.
    * Proven track record with containerized microservice architectures (Docker).
    * Experience with Kafka.
    * Experience working in PCI or other regulated environments.

    Gold Stars:

    * Advanced network security design - experience with segmentation strategies, zero-trust architectures, and firewall policy management.
    * Performance optimization expertise - analyzing latency and throughput, tuning DNS resolution, load balancing, and packet-level troubleshooting.
    * Observability leadership - hands-on with Datadog dashboards, metrics strategy, log pipelines, and tracing at scale.
    * Resiliency and chaos engineering - designing fault-tolerant architectures and running game days to validate recovery plans.
    * Compliance and governance experience - prior work in regulated industries (e.g., PCI, SOC 2, HIPAA) beyond just technical enforcement.
    * Cross-team leadership - ability to influence architecture decisions across product and platform teams, and mentor engineers on reliability and networking best practices.
    * Startup and scale-up experience - familiarity with rapid growth environments where infrastructure must evolve quickly while staying reliable.

    Our Commitment to Diversity, Equity, and Inclusion

    We proudly celebrate the unique qualities that make you you, 365 days a year, and not just because it's the right thing to do or good for business. We embed the principles and practices of diversity, equity, and inclusion (DEI) into all that we do to prioritize people, a Stash core value, and to ensure Stashers of all backgrounds and experiences can be their authentic selves. We are also proud to be the first and only venture-backed fintech to join the CEO Action for Diversity & Inclusion™, and as an Equal Opportunity Employer, Stash is committed to building an inclusive environment for people of all backgrounds. If you require any reasonable accommodations to make your application process more accessible, please reach out to ********************.

    Helping You Invest in Yourself

    * Comprehensive total rewards package, comprising compensation (salary and equity) and health care benefits
    * Complimentary subscription to Stash+ account
    * Remote-first work policy - Live and work where you feel the most productive, whether that is in your home, in an office.
    * Flexible PTO
    * Work-from-home equipment stipends; home internet subsidy
    * Paid Parental Leave (offerings for birth giving and non-birth giving parents) Primary & Secondary
    * Enhanced health and wellness benefits through One Medical, Gympass, and Maven Health

    External Recognition for Stash

    * Benzinga's 2023 Best Brokerage for Beginners and Best Robo-Advisor Awards
    * Qorus-Accenture's 2023 Banking Innovation Awards
    * USA Today and Statista's 2023 Top 500 Best Financial Advisory Firms
    * Comparably's Best Company Awards: Best Places to Work, Best Company Outlook, and Best Engineering Team for Diversity, Women, Culture, and more! (2023)
    * Fintech Breakthrough Award: Best Personal Finance App (2023)
    * BuiltIn's Best Places to Work (2022, 2021, 2020, 2019)
    * Forbes Fintech 50 (2021, 2020, 2019)
    * Best Digital Bank, Finovate Awards (2020)
    * Tearsheet Challenge Awards, Best Banking Card Product - Stock-Back Card, 2020
    * LendIt Fintech Innovator of the Year (2020, 2019)

    Salary Range: $149,180 - $222,040

    The base salary range represents the reasonably anticipated low and high end of the salary range for this position. Actual salaries will vary and will be based on various factors, such as the candidate's qualifications, skills, experience and competencies, as well as internal equity and alignment with market data for companies of our size and industry.

    **No recruiters, please**
    $149.2k-222k yearly Auto-Apply 39d ago
  • Lead Site Reliability Engineer, Observability (Remote, North America)

    Vivun 4.2 company rating

    Oakland, CA jobs

    Vivun delivers Ava, the AI Sales Teammate for high-velocity sales teams that sells with you and unlocks instant capacity. Powered by a proprietary Sales Reasoning Model, Ava provides real-time guidance before, during, and after calls through text, voice, or avatar. By helping sellers work smarter, faster, and better, Ava saves reps 6-8 hours per week - freeing teams to focus on driving growth. We are building technology that changes how people work, collaborate, and succeed together. Join us in shaping the future of intelligent sales.

    Position Summary

    We're seeking a Lead Site Reliability Engineer to rebuild and own our observability strategy across both agentic systems and SaaS infrastructure, creating the frameworks and tooling that enable teams to ship confidently, measure performance, and maintain reliability as we scale. As the Observability Lead, you'll be responsible for designing and implementing Vivun's observability patterns spanning infrastructure, applications, and agentic workloads. You'll work closely with your teammates across engineering, QA, and product to establish unified visibility across the full stack, from LLM-driven agents to backend services. You won't just monitor systems - you'll define the patterns and tools that are a core part of empowering and driving Vivun's engineering culture.

    Key Responsibilities

    * Own the end-to-end observability strategy for Ava, defining the standards, tools, and patterns that ensure reliable visibility across infrastructure and agentic components.
    * Design and implement correlation models that link agent behavior, LLM interactions, and SaaS telemetry into cohesive, actionable insights.
    * Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Observe, Datadog, or equivalent).
    * Collaborate with engineering and QA to embed observability best practices into development workflows, CI/CD, and quality gates.
    * Establish enablement frameworks - documentation, dashboards, and templates - that make observability self-serve for all engineering teams.
    * Partner with teammates to ensure observability aligns with infrastructure reliability, alerting, and incident response patterns.
    * Contribute to performance and reliability strategy, helping define how we measure agent quality, responsiveness, and system scalability.

    Desired Skills & Experience

    * 6+ years of experience in SRE, DevOps, or Observability Engineering roles, with at least 2+ years leading or designing observability initiatives.
    * Deep knowledge of observability tooling (e.g., OpenTelemetry, Prometheus, Grafana, Datadog, Honeycomb, Observe, etc.) and distributed tracing practices.
    * Experience with Agentic / LLM-based systems, including tools like LangChain, Celery, OpenAI APIs, or similar orchestration frameworks.
    * Strong understanding of how to instrument, trace, and correlate AI/LLM workflows with infrastructure-level telemetry.
    * Proven ability to define cross-team standards, influence engineering culture, and establish scalable monitoring patterns.
    * Strong collaboration and communication skills - you enable, not dictate.

    Nice to Have

    * Experience building observability into hybrid SaaS + agent architectures.
    * Background in data pipelines or analytics observability (e.g., tracing data lineage, monitoring model drift).
    * Familiarity with Python- or Node.js-based observability SDKs.
    * Prior experience scaling observability in a startup or rapid-growth environment.

    You Are

    * A believer in Vivun's core values: Set the Standard. Take Ownership. Stay Curious. Fast & Focused.
    * Builder at Heart: You want to build the observability foundations for a next-generation agentic platform.
    * Innovative Problem Solver: You are eager to take on cutting-edge monitoring challenges at the intersection of SaaS and AI.
    * Collaborative by Nature: You thrive in a high-impact engineering culture that values enablement, empowerment, and shared ownership.
    * Experienced working in high-growth startup environments: You have the ability to move fast, adapt, and thrive in a dynamic startup environment where you derive priorities, requirements, and goals from company context.

    What You Will Have At Vivun

    * Competitive salary and full health benefits
    * Stock Options at a well funded, pre-IPO company on a fast growth track
    * Flexible work schedules and work from anywhere at a fully remote company
    * Unlimited PTO with two weeks designated as "quiet period" each year
    * An experienced team who will fight beside you in the trenches to accomplish your goals
    $124k-174k yearly est. 60d+ ago
  • Lead Site Reliability Engineer, Observability (Remote, North America)

    Vivun 4.2company rating

    Oakland, CA jobs

    Vivun delivers Ava, the AI Sales Teammate for high-velocity sales teams that sells with you and unlocks instant capacity. Powered by a proprietary Sales Reasoning Model, Ava provides real-time guidance before, during, and after calls through text, voice, or avatar. By helping sellers work smarter, faster, and better, Ava saves reps 6-8 hours per week-freeing teams to focus on driving growth. We are building technology that changes how people work, collaborate, and succeed together. Join us in shaping the future of intelligent sales. Position Summary We're seeking a Lead Site Reliability Engineer to rebuild and own our observability strategy across both agentic systems and SaaS infrastructure, creating the frameworks and tooling that enable teams to ship confidently, measure performance, and maintain reliability as we scale. As the Observability Lead, you'll be responsible for designing and implementing Vivun's observability patterns spanning infrastructure, applications, and agentic workloads. You'll work closely with your teammates across engineering, QA, and product to establish unified visibility across the full stack, from LLM-driven agents to backend services. You won't just monitor systems-you'll define the patterns and tools that are a core part of empowering and driving Vivun's engineering culture. Key Responsibilities Own the end-to-end observability strategy for Ava, defining the standards, tools, and patterns that ensure reliable visibility across infrastructure and agentic components. Design and implement correlation models that link agent behavior, LLM interactions, and SaaS telemetry into cohesive, actionable insights. Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Observe, Datadog, or equivalent). Collaborate with engineering and QA to embed observability best practices into development workflows, CI/CD, and quality gates. 
Establish enablement frameworks-documentation, dashboards, and templates-that make observability self-serve for all engineering teams. Partner with teammates to ensure observability aligns with infrastructure reliability, alerting, and incident response patterns. Contribute to performance and reliability strategy, helping define how we measure agent quality, responsiveness, and system scalability. Desired Skills & Experience 6+ years of experience in SRE, DevOps, or Observability Engineering roles, with at least 2+ years leading or designing observability initiatives. Deep knowledge of observability tooling (e.g., OpenTelemetry, Prometheus, Grafana, Datadog, Honeycomb, Observe, etc.) and distributed tracing practices. Experience with Agentic / LLM-based systems, including tools like LangChain, Celery, OpenAI APIs, or similar orchestration frameworks. Strong understanding of how to instrument, trace, and correlate AI/LLM workflows with infrastructure-level telemetry. Proven ability to define cross-team standards, influence engineering culture, and establish scalable monitoring patterns. Strong collaboration and communication skills-you enable, not dictate. Nice to Have Experience building observability into hybrid SaaS + agent architectures. Background in data pipelines or analytics observability (e.g., tracing data lineage, monitoring model drift). Familiarity with Python- or Node.js-based observability SDKs. Prior experience scaling observability in a startup or rapid-growth environment. You Are A believer in Vivun's core values: Set the Standard. Take Ownership. Stay Curious. Fast & Focused. Builder at Heart: You want to build the observability foundations for a next-generation agentic platform. Innovative Problem Solver: You are eager to take on cutting-edge monitoring challenges at the intersection of SaaS and AI. Collaborative by Nature: You thrive in a high-impact engineering culture that values enablement, empowerment, and shared ownership. 
Experienced working in high-growth startup environments: You have the ability to move fast, adapt, and thrive in a dynamic startup environment where you derive priorities, requirements, and goals from company context. What You Will Have At Vivun Competitive salary and full health benefits Stock Options at a well-funded, pre-IPO company on a fast growth track Flexible work schedules and work from anywhere at a fully remote company Unlimited PTO with two weeks designated as a “quiet period” each year An experienced team who will fight beside you in the trenches to accomplish your goals
    $124k-174k yearly est. Auto-Apply 18d ago
  • Lead Site Reliability Engineer

    One Dynamic 3.7 company rating

    Remote

    Quick Details Rate Duration Fully Remote (US) 8+ Years $70-75/hour 6 months+ About One Dynamic One Dynamic is a Service-Disabled Veteran-Owned Small Business (SDVOSB) headquartered in Fairfax, VA. We specialize in digital transformation, cloud infrastructure, quality assurance, and enterprise architecture for federal and healthcare organizations. We are currently seeking a Lead Site Reliability Engineer to support our client ARC, a rapidly growing device management company revolutionizing how frontline workers interact with enterprise mobile devices. About the Role The Lead Site Reliability Engineer is a senior technical leadership role responsible for the reliability, availability, and operational excellence of the cloud infrastructure and kiosks platform. This role owns uptime, SLAs, and incident response while driving long-term improvements to system resilience, observability, and operational maturity. The Lead SRE serves as both a hands-on technical leader and a force multiplier across platform, QA, and development teams. This role is well-suited for an experienced engineer who thrives in high-ownership environments and can balance real-time operational demands with strategic reliability initiatives. Strong communication, sound technical judgment, and a bias toward preventative engineering are critical to success. Key Responsibilities Own uptime, SLAs, and overall reliability of the cloud infrastructure and kiosks platform Lead incident response, root-cause analysis, and drive actionable postmortems Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team Maintain and improve monitoring, alerting, and observability (e.g., Grafana, Prometheus, New Relic). 
Execute and continuously improve disaster recovery and business continuity plans Partner with platform engineering, QA, and development teams to ensure operational readiness Establish and maintain runbooks, operational standards, and reliability best practices Provide leadership, mentorship, and clear communication during both normal operations and incidents Optimize cloud and Kubernetes environments for reliability, performance, and scalability Required Qualifications 8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity Strong experience supporting production environments with strict SLAs and high uptime requirements Deep knowledge of Kubernetes, containers, and cloud-native infrastructure Proficiency in automation and scripting using Bash, Python, or Go Hands-on experience with CI/CD pipelines and release engineering in modern environments Expert-level familiarity with IaC tools (Terraform preferred) Strong understanding of monitoring, alerting, logging, and observability tooling Experience implementing and managing GitOps workflows (ArgoCD or similar) Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders Solid understanding of disaster recovery planning, resilience practices, and system hardening Must be authorized to work in the United States (US-based candidates only) The Ideal Candidate You think several steps ahead. You are relentless, strategic, and a long-term thinker. You believe the details are essential, and so you get them right. You are a fast learner. You take feedback well and implement it. You care about achieving the best outcome and do not focus on being right or wrong. About the Client ARC is a device management solution integrated with smart lockers, designed to store, secure, and charge company-owned handheld devices (e.g., Zebra, Honeywell) used by frontline workers to perform core job functions. 
Launched in late 2021, ARC was spun off from ChargeItSpot, a consumer-facing phone-charging technology company established in 2012. ARC's Mission: Minimize Device Waste. Maximize Worker Productivity. Make Life Easier. How to Apply If you have the unique combination of skills and qualities we are seeking, please submit your resume via One Dynamic's careers portal. We look forward to hearing from you! One Dynamic is an Equal Opportunity employer. Personnel are chosen based on ability without regard to race, color, religion, sex, national origin, disability, marital status, or sexual orientation, in accordance with federal and state law.
    $70-75 hourly Auto-Apply 14d ago
  • Site Reliability Engineer

    Unqork 4.1 company rating

    Remote

    Unqork empowers enterprises to accelerate growth by rapidly building, testing, and running AI-powered applications that embody the future of enterprise development. Trusted by the world's largest organizations in highly regulated industries, these applications become more secure over time while significantly reducing technical debt, allowing businesses to focus on innovation rather than maintenance. Unqork's customers include Goldman Sachs, Marsh, BlackRock, and the U.S. Department of Health and Human Services. At Unqork, we value inclusive and innovative thinkers who boldly challenge the status quo. We encourage you to apply! The Impact U will make: Report to our DevOps Engineering Manager You will combine software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. Your primary goal will be to solve operational problems with a software engineering mindset, treating operations as a software problem and automating away toil. Define, measure, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) in partnership with product and engineering teams. Champion and help manage error budgets to guide the balance between reliability work and new feature velocity. Scale systems sustainably through automation; evolve systems by pushing for changes that improve reliability and velocity. Increase visibility into the health and durability of the Unqork platform. Practice sustainable incident response and blameless postmortems. Maintain existing services and tools, augmenting and replacing What U bring: 3-5 years as a Site Reliability Engineer or Software Developer Advanced experience with programming/scripting languages such as JavaScript/NodeJS, Go, or Python. Knowledge in Linux monitoring, troubleshooting, and administration. Can write and evaluate code for scalability/runtime. Experience with container orchestration platforms such as Kubernetes or Nomad. 
Experience with monitoring, APM, and logging tooling (Eg: ELK, Grafana, Datadog, NewRelic, or Splunk). Experience working with at least one DBMS (Eg: Postgres, MySQL, Oracle, or MongoDB). Experience with configuration management tools (Eg: Ansible, Puppet, Chef, or Salt). Ansible Tower or AWX is a plus. Experience with Infrastructure-as-Code tools such as Terraform, Cloudformation, Google Deployment Manager, or Azure Resource Manager. Experience working with at least one major Cloud Provider (AWS/Azure/GCP). Understanding of cloud native security requirements (Eg: WAF, security groups) Compensation, Benefits, & Perks 💻 Work from home with a remote-first community 🏝 Unlimited PTO (and the encouragement to use it) 📝 Student loan payback program 🏥 100% employer-covered medical, dental, and vision options available to you and your dependents 💸 Flexible Spending Account (FSA) 🏠 Monthly stipend toward your WFH setup, vacation, development and more 💰 Employer-sponsored 401(k) with contribution match 🏋🏻‍♀️ Subsidized ClassPass Membership 🍼 Generous Paid Parental Leave 💲 Hiring Ranges: Tier 1: $117,000 - $155,000 Tier 2: $105,000 - $140,000 Unqork employs a market-driven approach to establish compensation ranges. In addition to a base salary, employees may also be eligible to receive a target incentive and company equity in the form of stock options. An employee's compensation within the range provided above depends on a variety of factors including, but not limited to, their location, role, skillset, level of experience, and similar peer salaries. As a remote-first company, Unqork incorporates a geographic differential into our compensation structure, depending on the candidate's location. We utilize a tiered system (Tier 1 and Tier 2) to accurately reflect local market rates and ensure our compensation packages are both fair and competitive. 
Our geographic tiers are defined as follows: Tier 1: New York Metro, Seattle Metro, San Francisco Bay Area Tier 2: All other US and US territory locations Unqork embraces a culture of security and privacy awareness by consistently safeguarding sensitive information, adhering to company policies, and actively participating in training and initiatives to protect our data and the privacy of our stakeholders. Unqork is an equal opportunity employer. We will consider all qualified applicants without regard to race, color, nationality, gender, gender identity or expression, sexual orientation, religion, disability or age.
    $117k-155k yearly Auto-Apply 60d+ ago
  • Staff Site Reliability Engineer

    Bugcrowd 3.9 company rating

    Remote

    We are Bugcrowd. Since 2012, we've been empowering organizations to take back control and stay ahead of threat actors by uniting the collective ingenuity and expertise of our customers and trusted alliance of elite hackers, with our patented data and AI-powered Security Knowledge Platform™. Our network of hackers brings diverse expertise to uncover hidden weaknesses, adapting swiftly to evolving threats, even against zero-day exploits. With unmatched scalability and adaptability, our data and AI-driven CrowdMatch™ technology in our platform finds the perfect talent for your unique fight. We aim to create a new era of modern crowdsourced security that outpaces threat actors. Unleash the ingenuity of the hacker community with Bugcrowd, visit ***************** Based in San Francisco and New Hampshire, Bugcrowd is supported by General Catalyst, Rally Ventures, Costanoa Ventures, and others. Job Summary We're seeking a Staff Site Reliability Engineer to serve as a technical leader within our infrastructure organization. In this role, you'll help shape the reliability strategy across our engineering teams, drive adoption of best practices, and tackle our most complex infrastructure challenges. You'll be part of an international, highly engaged and technical group that is well-versed in building enterprise-ready and extremely secure software systems. Our core values of “simple is strong, respect is king, build it like you own it and think like a hacker” should resonate with you. 
Essential Duties and Responsibilities Define and drive the technical vision for infrastructure reliability across the organization Architect large-scale, fault-tolerant systems on AWS using Terraform Lead cross-functional initiatives to improve system reliability, scalability, and efficiency Establish standards for infrastructure-as-code, CI/CD, and deployment practices Design and implement solutions for our most complex operational challenges Lead incident response for critical outages and drive systemic improvements Mentor senior engineers and help grow the SRE team's capabilities Evaluate and introduce new technologies that improve operational excellence Influence engineering culture around reliability, observability, and operational maturity Education, Experience, Skills, & Abilities 5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams Deep expertise in AWS architecture and services at scale, with strong focus on ECS Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools Track record of leading complex, cross-team technical initiatives Advanced proficiency in Python, Ruby, Javascript, or similar languages Strong understanding of distributed systems principles Excellent written and verbal communication skills Proven ability to balance long-term technical strategy with immediate operational needs Preferred Experience Experience building internal developer platforms or self-service infrastructure tooling Knowledge of FedRAMP Background in cost optimization and FinOps practices Contributions to open-source infrastructure projects Experience scaling infrastructure organizations and processes Experience defining and 
implementing SLO frameworks Working Conditions The ideal candidate must be able to complete all physical requirements of the job with or without reasonable accommodation. Sitting and/or standing - Must be able to remain in a stationary position 50% of the time Carrying and /or lifting - Must be able to carry / move laptop as needed throughout the work day. Environment - remote, work-from-home 100% of the time. ADA Statement Bugcrowd is committed to the full inclusion of all qualified individuals. In keeping with our commitment, Bugcrowd will take the steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact HR at ****************. Pay Range Disclosure At Bugcrowd, we strive for fairness, equality and to create an environment that allows our people to perform at their very best. Our compensation philosophy is to foster a collaborative community that rewards, attracts and retains the best possible talent. The provided salary details are based on US national averages and we retain the flexibility to tailor to the needs of the business. The national estimate for the current base range for the position of Staff Site Reliability Engineer is: $151,040 -$188,800. This position may also be eligible to participate in a discretionary bonus program or commission plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance. Culture At Bugcrowd, we understand that diversity in the workplace is vital to a company's success and growth. We strive to make sure that people are included and have a sense of being part of making Bugcrowd not only a great product but a great place to work. 
We regularly hear from both customers and researchers that Bugcrowd feels like a family, and we strive to maintain that internally as well. Our team consists of a broad range of people: musicians, adventure sports junkies, nature lovers, parents, cereal enthusiasts, night owls, cyclists, artists; you get the point. At Bugcrowd, we are solving security threats and vulnerabilities that are relevant to everyone, therefore we believe solving these problems takes all kinds of backgrounds. We value the perspectives and experiences people from underrepresented backgrounds bring. Disclaimer This position has access to highly confidential, sensitive information relating to the technologies of Bugcrowd. It is essential that the applicant possess the requisite integrity to maintain the information in the strictest confidence. The company is authorized to obtain background checks for employment purposes under state and federal law. Background checks will be conducted for positions that involve access to confidential or proprietary information (including trade secrets). Background checks may include Social Security verification, prior employment verification, personal and professional references, educational verification, and criminal history. Applicants with conviction histories will not be excluded from consideration to the extent required by law. Equal Employment Opportunity: Bugcrowd is EOE, Disability/Age Employer. Individuals seeking employment at Bugcrowd are considered without regard to race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, veteran status, gender identity, or sexual orientation. Apply at: ***************************************
    $151k-188.8k yearly Auto-Apply 22d ago
  • Site Reliability Engineer (AI Forms Platform)

    Filevine 4.3 company rating

    Remote

    Filevine is forging the future of legal work with cloud-based workflow tools. We have a reputation for intuitive, streamlined technology that helps professionals manage their organization and serve their clients better. We're also known for our team of extraordinary and passionate professionals who love working together to help organizations thrive. Our success has catapulted Filevine to the forefront of our field: we are ranked as one of the most innovative and fastest-growing technology companies in the country by both Deloitte and Inc. Our Mission Filevine is building the seamless intersection between legal and business by creating a world-class platform to help professionals scale. Job Summary We are seeking a Site Reliability Engineer (SRE) to build and maintain the production infrastructure for a new, mission-critical forms engine. You will be responsible for ensuring high availability, implementing SOC-aligned security controls, and managing the CI/CD pipelines that enable rapid iteration. This is a builder role where you will define the architecture for a new product line. We expect you to be an AI-augmented engineer, utilizing modern AI tools to automate infrastructure coding (IaC), troubleshoot incidents faster, and optimize system performance. Responsibilities Infrastructure as Code: Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools to support the new Forms Platform. Availability & Uptime: Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime and "P1 incidents". Observability: Implement comprehensive monitoring, logging, and alerting (Datadog, New Relic, etc.) to provide deep visibility into AI model performance and system health. Security & Compliance: Design architecture that aligns with SOC standards and ensures proper handling of PII/PHI data and audit trails for model outputs. 
Release Engineering: Build and maintain efficient CI/CD pipelines to support the "tapering" of legacy systems and the rapid deployment of new features. Incident Response: Lead incident response efforts for the Forms Platform and conduct post-mortems to drive continuous improvement. Automation: Aggressively automate manual operations tasks using scripting (Python/Go) and AI tools to reduce toil. Qualifications Bachelor's degree in Computer Science, Computer Engineering, or related field. 3+ years of SRE or DevOps experience, specifically in high-availability production environments. Cloud Proficiency: Deep expertise in AWS or Azure ecosystem, including container orchestration (Kubernetes/Docker). Security Mindset: Experience implementing security best practices (SOC2, HIPAA) in a cloud environment. Scripting: Proficiency in Python, Go, or Bash for automation. Agile/Scrum: 1 to 3 years experience with scrum/agile development methodologies. AI Adaptability: Willingness and ability to use AI/LLMs to accelerate infrastructure development and debugging. Communication: Excellent verbal and written communication skills to document architecture and incident reports Filevine is an Equal Opportunity Employer. Qualifications for employment, promotion and other terms and conditions of employment are based upon the ability to perform the job. Equal-employment opportunities are provided to all applicants and employees without regard to race, creed, religion, color, age, national origin, sex, disability, veteran status, or other legally protected class. Filevine is committed to providing reasonable accommodations for qualified individuals with disabilities. 
If you need assistance or accommodation due to disability, or if you have concerns related to Filevine's equal employment opportunities, you may contact us at ****************** Cool Company Benefits: - A dynamic, rapidly growing company, focused on helping organizations thrive - Medical, Dental, & Vision Insurance (for full-time employees) - Competitive & Fair Pay - Maternity & paternity leave (for full-time employees) - Short & long-term disability - Opportunity to learn from a dedicated leadership team - Centrally located open office building in Sugar House (onsite employees) - Top-of-the-line company swag Privacy Policy Notice Filevine will handle your personal information according to what's outlined in our Privacy Policy. Communication about this opportunity, or any open role at Filevine, will only come from representatives with email addresses using "filevine.com". Other addresses reaching out are not affiliated with Filevine and should not be responded to.
    $101k-148k yearly est. Auto-Apply 4d ago
  • Site Reliability Engineer

    Minio 4.1 company rating

    Remote

    MinIO is the industry leader in high-performance object storage and the company behind the world's fastest, most widely deployed object store, powering production infrastructure for more than half of the Fortune 500, including 9 of the 10 largest global automakers and all 10 of the largest U.S. banks. Our enterprise offering, AIStor, is engineered to handle the scale, speed, and pressure of modern AI and analytics, from terabytes to exabytes, all in a single namespace. As a Site Reliability Engineer, you will work closely with customers as well as the engineering team on enhancing, optimizing, validating and automating our cloud-native storage platform. Your role will be a mix of DevOps and software engineering to assure that MinIO is delivering a very high-quality product with high performance, scalability and durability to enable seamless data storage and retrieval for demanding workloads for customers. This role requires deep expertise in DevOps practices, SRE, systems programming, distributed computing, and storage architectures. You will work closely with a world-class team of engineers to push the boundaries of object storage performance and reliability. What You Will Do: Enhance, optimize, validate and automate core MinIO software for performance, scalability, and security. Help build and deliver high-performance distributed storage solutions with a focus on cloud-native architectures. Validate the MinIO Software according to customer environment and requirements, ensuring no surprises are observed at customer deployments. Improve existing features, fix critical issues, and contribute to open-source repositories. Collaborate with other engineers to refine architecture, APIs, and integrations. Write efficient, well-documented, and maintainable code. Conduct performance benchmarking and debugging of complex storage environments. Work closely with customers to address issues, and manage expectations. 
Your Skills and Experience: Bachelor's or Master's degree in Computer Science, Engineering, or a related field. 5+ years of professional experience in software engineering. Desire and ability to directly work with customers to solve their problems with product enhancement and automation. Experience in DevOps, GitOps, Automation and testing frameworks. Expertise in distributed systems, networking, or high-performance computing. Experience with cloud-native technologies (Kubernetes, containers, microservices). Strong proficiency in Go desired (or deep experience in C/C++/Rust with a willingness to learn Go). Deep understanding of storage systems, file systems, or databases. Strong problem-solving skills and experience debugging complex, large-scale applications. Contributions to open-source projects are a plus. Ability to work in a fast-moving, collaborative environment with a strong sense of ownership. Empathy towards the customer and ability to quickly dig in to resolve any customer issue. Passion for innovation and staying current with technology trends. Self-motivation and a commitment to continuous learning and adopting new tools and frameworks. A strong sense of ownership and accountability in delivering high-quality work while directly working with customers. A collaborative and team-oriented mindset, thriving in environments that value open communication and shared goals. Ability to collaborate effectively with cross-functional teams, contributing to a positive and productive work environment. Attention to detail and fine craftsmanship. What We Offer: Health Care Plan (Medical, Dental & Vision) 401K with 3% Contribution Pre-IPO Stock Options At least 12 Public Holidays Flexible Time Off Equal Opportunity Policy (EEO) MinIO is proud to be an equal opportunity workplace and an affirmative action employer. 
We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.
    $94k-136k yearly est. Auto-Apply 18d ago
  • Site Reliability Engineer

    Podium Corporation 4.5 company rating

    Remote

    At Podium, our mission is to arm every local business with a complete platform and outcome-driven AI employees that convert leads into real, paying customers. Every day, millions of workers use our AI lead conversion and communication platform to help them get more leads and make more money. Our work and focus on helping local businesses thrive has been recognized across the industry, including Forbes' Next Billion Dollar Startups, Forbes' Cloud 100, the Inc. 5000, and Fast Company's World's Most Innovative Companies. At Podium, we believe in fostering a culture that thrives on hiring and developing exceptional talent. Our operating principles serve as a compass, guiding daily behavior and decision-making, and ensure we hire people who will thrive at Podium. If you resonate with our operating principles and are energized by our mission, Podium will be a great place for you! Site Reliability Engineer At Podium, our Site Reliability Engineers (SREs) operate at the intersection of software and systems engineering. The SRE team ensures our products are stable, scalable, sustainable, and seamless. We partner closely with product engineering teams to address their operational needs while continuously improving the reliability and performance of Podium's platform. We're looking for an SRE who can make an impact from day one! What You'll Do Work with technologies including Kubernetes, Helm, Docker, AWS, Terraform, Datadog, Honeycomb, Prometheus, Ansible, StrongDM, Python, Go, Ruby, GitLab/GitHub, and CI/CD pipelines. Collaborate across Podium's engineering community to identify areas for improvement, enhance reliability, and create a safer, more efficient system. Participate in an on-call rotation, triaging and resolving production and development issues. Partner with cross-functional teams to minimize downtime and ensure platform resilience. Mentor junior engineers, fostering growth and technical excellence. 
What You'll Bring Bachelor's degree in a technical field or equivalent experience. 4+ years experience supporting production systems in a software or systems engineering role. 3+ years deploying, operating, and debugging server software on Linux. Strong curiosity and a desire to learn continuously. Willingness to participate in on-call rotations. What We Hope You Have Experience with distributed systems and microservices. Knowledge of system design principles. Hands-on experience with cloud computing (AWS, GCP, or Azure). Familiarity with SOC2, HIPAA, PCI, or similar compliance frameworks. Experience building and maintaining CI/CD pipelines. Deep expertise in infrastructure engineering. BENEFITS Open and transparent culture Life insurance, long and short-term disability coverage Paid maternity and paternity leave Fertility Benefits Generous vacation time, plus three 4-day summer holiday weekends Excellent medical, dental, and vision benefits 401k Plan with company matching Bi-annual swag drops with cool Podium gear and apparel A stellar HQ (Utah) gym with local professional coaches and classes offered Onsite HQ (Utah) child care center, subsidized for employees Podium is an equal opportunity employer. Podium provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, gender, national origin, sexual orientation, gender identity or expression, age, disability, genetic information, marital status or veteran status.
    $96k-139k yearly est. Auto-Apply 29d ago
  • Staff Database Reliability Engineer

    Boulevard Ford 4.6 company rating

    Remote

    Who is Boulevard? Boulevard provides the first and only client experience platform for appointment-based, self-care businesses. We empower our customers to give their clients more of the magical moments that matter most. Before launching in 2016, our founders spent months interviewing salon managers and working behind front desks to understand their pain points so we could design a modern, user-friendly platform that meets the unique needs of their business. Our roots may be in hair salons, but we are built for the broader self-care industry, including many types of salons, spas, medspa, barbershops, and more. Our technology not only helps our customers survive but thrive. Take a look at how we (and YOU) can make that happen. We have an insatiable curiosity and embrace experimentation. We believe that simple solutions require the most sophistication, and we design each and every detail to maximize potential, power, and impact. Do our values match? Read through our story and what we value the most. Our team values and celebrates our diverse backgrounds. Being open about who we are and what we do allows us to do the best work of our lives. We believe in equal opportunity for all, and you should too. Come do the best work of your life at Boulevard. We're hiring a Staff Database Reliability Engineer to shape the foundation of Database Reliability Engineering at Boulevard. This role goes beyond optimizing queries or reviewing SQL PRs - you'll own database reliability at scale within our cloud-based infrastructure, influence site reliability practices, and drive our RDBMS scalability strategy. You'll help teams improve how they design, operate, and depend on databases through repeatable, reliable practices. Reporting to the Director of Cloud & Reliability, this is a hands-on technical leadership role focused on elevating reliability practices and building resilient database platforms. 
You'll define what “good” looks like, partner closely with engineering teams, and help Boulevard operate databases that scale with confidence. The Cloud & Reliability group operates on four foundational principles: Reliable Infrastructure - a foundation of stability and security Developer Productivity - empowering builders to do the right things Clear Ownership - accountability aligned with ownership; collaboration over silos Long-Term Focus - we engineer for tomorrow Key Projects & Initiatives Database Reliability & Fault Tolerance: Lead initiatives to make our database platforms more robust, fault-tolerant, and self-healing. Platform Performance Optimization: Drive continuous improvement in performance and cost efficiency, using observability data to identify and resolve bottlenecks with a focus on RDBMS infrastructure. Observability & Operational Insight: Enhance database observability across metrics, logging, and tracing to ensure deep visibility into production health and behavior. What You'll Do Here Develop a deep understanding of how Boulevard's systems behave, scale, interact, and fail, and use that insight to identify risks and improvement opportunities. Own and improve database reliability, performance, and scalability; participate in incident response and drive architectural improvements that reduce incident frequency and impact. Partner with engineering teams to design, build, and operate scalable, fault-tolerant, and secure distributed systems that support Boulevard's growth and customer trust. Build tools, automation, and frameworks that eliminate toil, reduce operational overhead, and establish best practices used across engineering teams. Elevate observability and operational excellence through actionable metrics, alerts, and dashboards that enable faster incident resolution and proactive reliability improvements. Mentor and influence engineers across the organization, helping foster a culture where reliability is a shared responsibility. 
What You'll Need to Thrive Deep Systems Expertise: 8-10+ years of experience in systems, infrastructure, or backend software engineering, with a strong focus on RDBMS and NoSQL systems. Cloud Database Experience: Production experience with managed cloud databases such as AWS Aurora/RDS (PostgreSQL), and deploying/managing infrastructure using infrastructure-as-code tools. Reliability Engineering Mindset: Proven experience delivering reliability outcomes using SLOs, SLIs, error budgets, and mature observability practices. Automation-First Philosophy: Strong background in automation, scripting, and infrastructure-as-code (e.g., Terraform, Python, Go, or similar). Incident Management Mastery: Experience diagnosing and mitigating production incidents in high-availability systems, with a focus on learning and continuous improvement. Collaboration & Influence: Excellent communication skills and the ability to influence without authority across engineering teams. Technical Leadership & Mentorship: Demonstrated ability to set technical standards, mentor engineers, and scale impact through others. Comfort with Ambiguity: Ability to navigate uncertainty, set direction, and iterate toward meaningful outcomes in a fast-moving environment. Bonus Experience with Elixir, Phoenix, Ruby, or Rails Hands-on experience identifying and improving database performance In addition to the wonderful people you'll get to work with and challenging projects that'll push you - Boulevard is here to make sure you're always at the top of your game emotionally, mentally, and physically. ✨ We've got you covered with a 401(k) match plus dental, medical, vision, and life insurance. 🏝 Take a break whenever you need with our flexible vacation day policy. 🖥 Fully remote so you can choose where you want to work. You'll receive a work from home stipend every month. 💚 Family planning resources and specialized support programs. 🔮 Equity: get ahead on the ground floor and grow with Boulevard. 
💅 Boulevard Bucks Learning and Development program allows employees to explore businesses in the market we serve. 📲 We recommend following our official LinkedIn page to stay up to date on all things Boulevard life! Boulevard Labs, Inc. is an Equal Opportunity Employer committed to hiring a diverse workforce and sustaining an inclusive culture. All employment decisions at Boulevard Labs, Inc. are based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, gender, sexual orientation, gender identity or expression, veteran status, or any other status protected under federal, state, or local law.
    $94k-136k yearly est. Auto-Apply 4d ago
  • Site Reliability Engineer | Growth and Transformation

    Red Ventures 4.4 company rating

    Charlotte, NC jobs

    This role is not open to visa sponsorship or transfer of visa sponsorship including those on H1-B, F-1, OPT, STEM-OPT, or TN visa, nor is it available to work corp-to-corp. This is a hybrid opportunity. Candidates are asked to report to our Fort Mill, SC office, 3x per week, Tuesday through Thursday and work remotely on Monday and Friday. The Growth and Transformation team at Red Ventures is seeking a Site Reliability Engineer (SRE) to ensure our platforms and applications are resilient, scalable, and perform at lightning speed. You'll work across AWS, GCP, and Kubernetes environments, helping us meet ambitious 99.99% uptime goals while driving automation, observability, and performance improvements at scale. This isn't a reactive role - you'll be empowered to design reliability into our systems from the start, partner with engineers across the org, and continually improve how we build, monitor, and operate mission-critical services. What You'll Do: Ensure system reliability and performance across multi-cloud, multi-region platforms. Build and maintain observability solutions (OpenTelemetry, New Relic, Grafana) for real-time insights. Automate infrastructure and deployments with Terraform and custom tooling. Lead and participate in incident response, troubleshooting issues, restoring service quickly, and driving root cause analysis. Define and manage SLOs/SLIs that hold us accountable to business-critical SLAs Scale infrastructure capacity to meet growth and traffic demands. Partner with developers to embed reliability best practices into application design and delivery Manage and optimize Kubernetes clusters across AWS and GCP. Contribute to architecture reviews with a focus on reliability and scalability. Foster a culture of continuous improvement, experimentation, and operational excellence. 
What We're Looking For: 3-5 years in SRE, DevOps, or cloud infrastructure engineering Strong experience with AWS, GCP, and Kubernetes orchestration Skilled in infrastructure as code (Terraform) Proficient in observability and monitoring tools (New Relic, Grafana, OpenTelemetry) Familiar with CI/CD pipelines, automated deployments, and scripting (Python, Bash, Go, etc.) Experience maintaining high-availability systems (99.9%+) Strong grasp of distributed systems, microservices, and scalability patterns Incident response and troubleshooting experience with a focus on learning from failures Excellent communication and collaboration skills Bonus Points For: Certifications (AWS Solutions Architect, GCP Professional Cloud Architect) Experience with chaos engineering, resilience testing, or load balancing at scale Familiarity with Salesforce or Adobe ecosystems Database performance tuning expertise Exposure to log aggregation tools (ELK, Splunk) Strong knowledge of cloud security and multi-region networking Why Join Us? At Red Ventures, we know reliability powers growth. As part of our Growth and Transformation team, you'll not only keep systems running - you'll shape how reliability is engineered into every layer of our platforms. You'll work alongside passionate engineers in a culture that values automation over manual toil, learning from failure, and building systems that scale with confidence. Here, your work won't just keep the lights on. It will drive innovation, empower partners, and directly impact the success of our business. Compensation: This range reflects total cash compensation, which may include base salary only or base salary plus target bonus, depending on the role. Where eligible, equity may also be offered separately and not included below. Actual compensation varies based on location, experience, and qualifications. 
Total Cash Compensation Range: $100,000 - $145,000 per year Additionally, the following benefits are provided by Red Ventures, subject to eligibility requirements. Health Insurance Coverage (medical, dental, and vision) Life Insurance Short and Long-Term Disability Insurance Flexible Spending Accounts Holiday Pay 401(k) with match Employee Assistance Program Paid Parental Bonding Benefit Program Flexible Paid Time Off (PTO): We believe time to rest and recharge is essential. That's why we offer a generous and flexible PTO policy. Full-time employees accrue 20 days of PTO for a full calendar year annually, with an increase to 25 days after five years of service. Who We Are: Red Ventures is a global portfolio of high-growth companies - spanning several U.S. businesses, a joint venture in the health services industry, and strategic investments in Europe and Puerto Rico. Their businesses include The Points Guy, Lonely Planet, Bankrate, the Allconnect Platform, RV Home Client Growth, RV Growth & Transformation, Sage Home Loans Corporation, RV Education and more. Across the portfolio, Red Ventures businesses deliver seamless digital experiences for consumers, help Fortune 100 clients solve large-scale digital growth challenges, and create world-class experiences and opportunities for employees. Learn more at redventures.com and follow @RedVentures on LinkedIn and Instagram. At Red Ventures, we believe diverse, inclusive teams are better. To help you better understand our core values and beliefs, we encourage you to watch this brief YouTube video: Our Belief Statements. This will give you insight into the principles that guide our work and our commitment to fostering an inclusive environment. We offer competitive salaries and a comprehensive benefits program for full-time employees, including medical, dental and vision coverage, paid time off, life insurance, disability coverage, employee assistance program, 401(k) plan and a paid parental leave program. 
Red Ventures is an equal opportunity employer that does not discriminate against any employee or applicant because of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or any other basis protected by law. Employment at Red Ventures is based solely on a person's merit and qualifications. We are committed to providing equal employment opportunities to qualified individuals with disabilities. This includes providing reasonable accommodation where appropriate. Should you require a reasonable accommodation to apply or participate in the job application or interview process, please contact accommodation@redventures.com. If you are based in California, we encourage you to read this important information for California residents linked here. Click here for more details regarding the employee privacy policy: ******************************************************* Questions about this Privacy Notice can be directed to ******************************. Alternatively, you may raise any questions or concerns to your manager, HR Business Partner, or through the Privacy Team.
    $100k-145k yearly Auto-Apply 60d+ ago
  • Staff Site Reliability Engineer

    Bugcrowd 3.9 company rating

    Bedford, NH jobs

    We're seeking a Staff Site Reliability Engineer to serve as a technical leader within our infrastructure organization. In this role, you'll help shape the reliability strategy across our engineering teams, drive adoption of best practices, and tackle our most complex infrastructure challenges. You'll be part of an international, highly engaged and technical group that is well-versed in building enterprise-ready and extremely secure software systems. Our core values of "simple is strong, respect is king, build it like you own it and think like a hacker" should resonate with you. Essential Duties and Responsibilities * Define and drive the technical vision for infrastructure reliability across the organization * Architect large-scale, fault-tolerant systems on AWS using Terraform * Lead cross-functional initiatives to improve system reliability, scalability, and efficiency * Establish standards for infrastructure-as-code, CI/CD, and deployment practices * Design and implement solutions for our most complex operational challenges * Lead incident response for critical outages and drive systemic improvements * Mentor senior engineers and help grow the SRE team's capabilities * Evaluate and introduce new technologies that improve operational excellence * Influence engineering culture around reliability, observability, and operational maturity Education, Experience, Skills, & Abilities * 5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership * Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams * Deep expertise in AWS architecture and services at scale, with strong focus on ECS * Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies * Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools * Track record of leading complex, 
cross-team technical initiatives * Advanced proficiency in Python, Ruby, Javascript, or similar languages * Strong understanding of distributed systems principles * Excellent written and verbal communication skills * Proven ability to balance long-term technical strategy with immediate operational needs Preferred Experience * Experience building internal developer platforms or self-service infrastructure tooling * Knowledge of FedRAMP * Background in cost optimization and FinOps practices * Contributions to open-source infrastructure projects * Experience scaling infrastructure organizations and processes * Experience defining and implementing SLO frameworks Working Conditions The ideal candidate must be able to complete all physical requirements of the job with or without reasonable accommodation. Sitting and/or standing - Must be able to remain in a stationary position 50% of the time Carrying and/or lifting - Must be able to carry/move a laptop as needed throughout the work day. Environment - remote, work-from-home 100% of the time. ADA Statement Bugcrowd is committed to the full inclusion of all qualified individuals. In keeping with our commitment, Bugcrowd will take the steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact HR at ****************. Pay Range Disclosure At Bugcrowd, we strive for fairness, equality and to create an environment that allows our people to perform at their very best. Our compensation philosophy is to foster a collaborative community that rewards, attracts and retains the best possible talent. The provided salary details are based on US national averages and we retain the flexibility to tailor to the needs of the business. 
The national estimate for the current base range for the position of Staff Site Reliability Engineer is: $151,040 - $188,800. This position may also be eligible to participate in a discretionary bonus program or commission plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.
    $151k-188.8k yearly Auto-Apply 20d ago
  • Site Reliability Engineer

    Rx Savings Solutions 4.2 company rating

    Columbus, OH jobs

    McKesson is an impact-driven, Fortune 10 company that touches virtually every aspect of healthcare. We are known for delivering insights, products, and services that make quality care more accessible and affordable. Here, we focus on the health, happiness, and well-being of you and those we serve - we care. What you do at McKesson matters. We foster a culture where you can grow, make an impact, and are empowered to bring new ideas. Together, we thrive as we shape the future of health for patients, our communities, and our people. If you want to be part of tomorrow's health today, we want to hear from you. Rx Savings Solutions (RxSS), part of McKesson's CoverMyMeds business segment, is seeking a talented Site Reliability Engineer (SRE) to join our team! In this role, you will be instrumental in ensuring the reliability, scalability, and performance of our critical healthcare technology systems. You will apply software engineering principles to operations, focusing on automation, monitoring, and proactive problem-solving to maintain high availability and deliver exceptional user experiences. * Our preferred candidate will reside in Columbus, OH, or one of our other hub locations of Overland Park, KS; Irving, TX; or Atlanta, GA. Position allows for primarily working from home, with occasional in-office time. We may consider a well-qualified candidate not located in one of the above hub areas. * At this time, we are not able to offer sponsorship for employment visas. We're unable to consider individuals currently on H1B, F-1 OPT, STEM OPT, or any other visa status that would require future sponsorship. Candidates must be authorized to work in the United States on a permanent basis without the need for current or future sponsorship. 
Job Responsibilities: * System Reliability & Performance: Design, implement, and maintain robust and scalable infrastructure and applications to ensure high availability, performance, and disaster recovery capabilities * Automation & Tooling: Develop and implement automation scripts, tools, and processes to streamline operational tasks, reduce manual effort, and improve efficiency across the software development lifecycle * Monitoring & Alerting: Establish and maintain comprehensive monitoring, alerting, and logging systems to proactively identify and diagnose issues, understand system behavior, and track key performance indicators * Incident Response & Post-Mortem: Participate in on-call rotations, respond to and resolve critical incidents, and conduct thorough post-mortems to identify root causes and implement preventative measures * Capacity Planning & Optimization: Collaborate with development teams to analyze system capacity, forecast future needs, and optimize resource utilization to support business growth * Collaboration & Mentorship: Work closely with software engineers, product managers, and other SREs to promote a culture of reliability, share best practices, and contribute to continuous improvement * Documentation: Create and maintain clear and concise documentation for systems, processes, and incident runbooks * Security: Contribute to the implementation and enforcement of security best practices within our infrastructure and applications Job Qualifications: * Education / Experience: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience), and 2+ years of experience in a Site Reliability Engineering, DevOps, or highly related software engineering role * Programming Skills: Strong proficiency in at least one scripting language (e.g., Python, Go, Ruby, Bash) for automation and tool development * Cloud Platforms: Hands-on experience with cloud computing platforms (e.g., AWS, Azure, GCP). 
AWS experience is highly preferred * Containerization & Orchestration: Experience with container technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes) * CI/CD: Familiarity with Continuous Integration and Continuous Delivery (CI/CD) pipelines and tools * Monitoring & Alerting Tools: Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana, Splunk) * Operating Systems: Strong understanding of Linux/Unix operating systems * Networking: Fundamental understanding of networking concepts (TCP/IP, DNS, HTTP, Load Balancing) * Problem-Solving: Excellent analytical and problem-solving skills with a proactive approach to identifying and resolving complex technical issues * Communication: Strong verbal and written communication skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences We are proud to offer a competitive compensation package at McKesson as part of our Total Rewards. This is determined by several factors, including performance, experience and skills, equity, regular job market evaluations, and geographical markets. The pay range shown below is aligned with McKesson's pay philosophy, and pay will always be compliant with any applicable regulations. In addition to base pay, other compensation, such as an annual bonus or long-term incentive opportunities may be offered. For more information regarding benefits at McKesson, please click here. Our Base Pay Range for this position $84,300 - $140,500 McKesson is an Equal Opportunity Employer McKesson provides equal employment opportunities to applicants and employees and is committed to a diverse and inclusive environment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status, disability, age or genetic information. 
For additional information on McKesson's full Equal Employment Opportunity policies, visit our Equal Employment Opportunity page. Join us at McKesson!
    $84.3k-140.5k yearly Auto-Apply 3d ago
  • Reliability Engineer

    Path Robotics 4.2 company rating

    Columbus, OH jobs

    Build the Path Forward At Path Robotics, we're building the future of embodied intelligence. Our AI-driven systems enable robots to adapt, learn, and perform in the real world, closing the skilled labor gap and transforming industries. We go beyond traditional methods, combining perception, reasoning, and control to deliver field-ready AI that is risk-aware, reliable, and continuously improving through real-world use. Big, hard problems are our everyday work, and our team of intelligent, humble, and driven people makes the impossible possible together. The Reliability Engineer will serve as a vital bridge between our internal Mission Control and development teams, ensuring effective resolution of technical issues while driving long-term improvements in support processes, tools, and documentation. This role is pivotal in enhancing operational efficiency, reducing developer context-switching, and building scalable support systems. What You'll Do Support and Resolution Act as the L2 support layer, handling escalated issues from Mission Control (L1) and collaborating with developers (L3) for resolution. Perform root cause analysis to ensure thorough understanding and resolution of recurring issues. Gradually expand expertise to resolve more issues at the L2 level without L3 escalation. Process and Tool Development Develop tooling, processes, documentation, and SOPs to minimize support escalations. Build playbooks and solutions that enable Mission Control to resolve issues independently. Create automated solutions, bug fixes, and workarounds to proactively prevent support issues. Maintain a comprehensive database of documentation and playbooks for common technical issues. Collaboration and Reporting Work with Mission Control to establish and refine SOPs for ticket triage, handling, and communication. Track and analyze support metrics and SLOs for response and resolution times. 
Report frequent and resource-intensive support cases to developers, providing actionable insights. Collaborate with developers and test engineers to ensure adequate test coverage for recurring issues. Who You Are Bachelor's or Master's degree in Computer Science, Software Engineering, Robotics Engineering, or a related field, or equivalent experience. Strong proficiency in Python and C++. Proven ability to develop SOPs and playbooks for operational efficiency. Strong cross-functional collaboration skills with technical and non-technical teams. Expertise in triaging and prioritizing issues based on impact and urgency. Exceptional time management and a self-directed approach to finding high-value work. Strong commitment to maintaining and enhancing documentation. Why You'll Love It Here Free lunch every day Flexible PTO Medical, Dental, and Vision insurance 6 weeks 100% paid parental leave plus an additional 6-8 weeks maternity leave for the birthing parent (12-14 weeks total) 401K through Empower Paid Referral Bonus Who We Are At Path Robotics we love coming to work to solve interesting and tough challenges but also because our ideas are welcomed and valued. We encourage unique thinking and are dedicated to creating a diverse and inclusive environment. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.
    $80k-112k yearly est. Auto-Apply 1d ago
  • Site Reliability Engineer

    Enosix 4.1 company rating

    Cincinnati, OH jobs

    Are you ready to help set a new standard? enosix is the leading provider of real-time integration solutions between SAP ERP and front-end systems of engagement (such as Salesforce). enosix solutions are pre-built and require minimal coding, enabling companies to quickly realize value in days instead of months. Transformative talent is key to the success of enosix, and we are looking for pivotal change makers for our growth mindset organization. If you have a passion for solution-based technology to help customers unlock data, enosix is the forefront employer. enosix is looking for a talented Site Reliability Engineer to join our growing team. In this role, you will be part of the core development team that builds front end components our customers use to interface with our SAP integration framework. Are you an innovative, creative, and driven self-starter experienced in application development who is an analytical problem solver? If you enjoy the culture of a company bringing new and exciting solutions to the market, this could be the perfect position for you. The Responsibilities and Vision: Collaborate with distributed agile product teams to ensure systems are reliable, secure, scalable, and observable. Design and implement monitoring, alerting, and incident response systems to maintain high availability and performance. Automate infrastructure provisioning and deployments using OpenTofu and Infrastructure as Code (IaC) principles. Optimize CI/CD pipelines and champion DevSecOps best practices across development teams. Manage multi-cloud environments (Azure and AWS), ensuring consistent performance and security across platforms. Implement and maintain security controls, including identity and access management, secrets management, creation of SBOMs and vulnerability scanning. Conduct root cause analysis and postmortems to continuously improve system resilience. Mentor developers on reliability engineering principles, cloud security, and operational excellence. 
Requirements and Skills: Bachelor's Degree or equivalent experience required. Strong experience in site reliability engineering, DevSecOps, or cloud infrastructure roles. Hands-on experience with Azure and AWS services (App Services, Blob Storage, IAM, EC2, S3, etc.) and deploying containerized workloads. Proficiency in scripting and automation (PowerShell, Bash). Experience with OpenTofu or Terraform for infrastructure provisioning. Experience with CI/CD tools and version control (GitHub Actions, Git). Knowledge of relational databases (PostgreSQL preferred). Experience with code-scanning/container scanning technology (Dependabot, SonarQube, Snyk). Experience implementing security best practices in cloud environments. Excellent problem-solving skills and the ability to work effectively in a fast-paced, agile environment. Why enosix: Competitive compensation packages. Everyone needs a vacation. Generous and flexible open PTO policy. We trust our employees. Small, scale-up culture but big company benefits: Health, dental, and vision benefits, LTD, STD, 401k eligibility. Growth: Opportunity to get in with a global company from the ground up. Learning: To enable our team to work with the latest technology, we encourage our employees to take the time they need to train and develop their skills. Influence: The ability to make key decisions and see your impact immediately. Remote Work: enosix has been a remote workforce since our inception; we are happy to bring you on wherever your home office is located.
    $78k-110k yearly est. Auto-Apply 57d ago

Learn more about Apixio jobs

Most common jobs at Apixio