Reliability Engineer jobs at Comdata - 54 jobs

  • Site Reliability Engineer Hybrid - San Francisco

    Grammarly, Inc. 4.1 company rating

    San Francisco, CA jobs

    Superhuman offers a dynamic hybrid working model for this role. This flexible approach gives team members the best of both worlds: plenty of focus time along with in-person collaboration that helps foster trust, innovation, and a strong team culture. About Superhuman Grammarly is now part of Superhuman, the AI productivity platform on a mission to unlock the superhuman potential in everyone. The Superhuman suite of apps and agents brings AI wherever people work, integrating with over 1 million applications and websites. The company's products include Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, the proactive AI assistant that understands context and delivers help automatically. Founded in 2009, Superhuman empowers over 40 million people, 50,000 organizations, and 3,000 educational institutions worldwide to eliminate busywork and focus on what matters. Learn more at superhuman.com and about our values here. The Opportunity To achieve our ambitious goals, we're looking for an SRE to join our infrastructure team. This role will be responsible for building software to ensure the reliability of our back-end systems, working with engineers who develop them, and planning for our future growth. You will work with our existing production engineering teams in the EU as we transition away from a “you build it, you own it” model. Superhuman's engineers and researchers have the freedom to innovate and uncover breakthroughs-and, in turn, influence our product roadmap. The complexity of our technical challenges is growing rapidly as we scale our interfaces, algorithms, and infrastructure. You can hear more from our team on our technical blog. As an SRE, you will Scale our Kubernetes-based control plane that processes billions of events per day. Improve our automation mechanisms that react to our workload. Deploy ML systems across the company. Qualifications Has 5+ years of relevant experience as an SRE or DevOps engineer. 
Experience in participating in incident management processes. Familiarity with docker, linux, and terraform. Have used AWS, Azure, or GCP. Java and Kubernetes skills preferred, but not required. Has a demonstrated ability to work independently with minimal guidance, proactively manages tasks and priorities across multiple projects, analyzes and executes work efficiently, collaborates effectively with cross‑functional teams, and thrives in fast‑paced, results‑driven environments. Embodies our EAGER values-is ethical, adaptable, gritty, empathetic, and remarkable. Is inspired by our MOVE principles: move fast and learn faster; obsess about creating customer value; value impact over activity; and embrace healthy disagreement rooted in trust. Compensation and Benefits Superhuman offers all team members competitive pay along with a benefits package encompassing the following and more: Excellent health care (including a wide range of medical, dental, vision, mental health, and fertility benefits) Disability and life insurance options 401(k) and RRSP matching Paid parental leave 20 days of paid time off per year, 12 days of paid holidays per year, two floating holidays per year, and flexible sick time Generous stipends (including those for caregiving, pet care, wellness, your home office, and more) Annual professional development budget and opportunities Superhuman takes a market-based approach to compensation, which means base pay may vary depending on your location. Our US locations are categorized into two compensation zones based on proximity to our hub locations. Base pay may vary considerably depending on job-related knowledge, skills, and experience. The expected salary ranges for this position are outlined below by compensation zone and may be modified in the future. 
United States: Zone 1: $214,000 - $260,000/year (USD) We encourage you to apply At Superhuman, we value our differences, and we encourage all to apply, especially those whose identities are traditionally underrepresented in tech organizations. We do not discriminate on the basis of race, religion, color, gender expression or identity, sexual orientation, ancestry, national origin, citizenship, age, marital status, veteran status, disability status, political belief, or any other characteristic protected by law. Superhuman is an equal opportunity employer and a participant in the US federal E-Verify program (US). We also abide by the Employment Equity Act (Canada).
    $214k-260k yearly 1d ago
  • Senior Site Reliability Engineer - Networking

    Lambda Inc. 4.2 company rating

    San Francisco, CA jobs

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco/San Jose/Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You'll Do Help scale Lambda's high performance multi-tenant cloud network Contribute to the reproducible automation of network configuration and deployments Contribute to the implementation and operations of Software Defined Networks Help to deploy and manage Spine and Leaf networks Ensure high availability of our network through observability, failover, and redundancy Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies Help with deploying and maintaining network monitoring and management tools Participate in on-call You Have 5+ years of experience being a Site Reliability Engineer or Network Reliability Engineering Been part of the implementation of production-scale networking projects Experience being on-call and incident response management Have experience building and maintaining Software Defined Networks (SDN), experience with OpenStack, Neutron, OVN Are comfortable on the Linux command line, and have an understanding of the Linux networking stack Have experience with multi-data center networks and hybrid cloud networks Have Python programming experience and configuration management tools like Ansible Have 
experience with CI/CD tools for deployment and GIT. Operated network environment with GitOps practices in place. Experience with application lifecycle and deployments on Kubernetes Nice To Have Operated production-scale SDNs in a cloud context (e.g. helped implement or operate the infrastructure that powers an AWS VPC-like feature) Have Software development experience with C, GO, Python Experience automating network configuration within public clouds, with tools like Kubernetes, HELM, Terraform, and Ansible Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV and DPDK Understanding of the SDN ecosystem (e.g. OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP) Have experience with Spine and Leaf (Clos) network topology Have experience and understanding of BGP EVPN VXLAN networks Experience with building and maintaining multi-data center networks, SD-WAN, DWDM Experience with Next-Generation Firewalls (NGFW) Salary Range Information The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. 
About Lambda Founded in 2012, with 500+ employees, and growing fast Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG Our values are publicly available: ************************* We offer generous cash & equity compensation Health, dental, and vision coverage for you and your dependents Wellness and commuter stipends for select roles 401k Plan with 2% company match (USA employees) Flexible paid time off plan that we all actually use A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills. Equal Opportunity Employer Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $142k-189k yearly est. 2d ago
  • Site Reliability Engineer II

    Akamai 4.4 company rating

    Remote

    Do you like collaborating across teams to solve complex problems? Do you have a passion for cutting edge technologies and tackling system problems? Join our highly skilled Site Reliability team Our team designs, develops, and manages applications and infrastructure that support Akamai's Compute products and services. We specialize in creating solutions that help improve observability and enforce SLAs across all internal teams. We do all of this while maintaining Akamai's mission to make life better for billions of people, billions of times a day. Partner with the best As a Site Reliability Engineer II - Observability, you will collaborate across operations teams and application development teams. Together, you will be creating tooling and software that monitors and improves the reliability of our systems. You'll work with a diverse range of technologies as we release new applications and modernize existing tooling. As a Site Reliability Engineer II, you will be responsible for: Deploying and maintaining our observability platform and internal tooling Partnering across teams to ensure the reliability, scalability and usability of our products and services Providing guidance to engineers and developers to increase confidence that their services are performing as expected Collaborating with our support, operations, and engineering teams to investigate and troubleshoot complex problems Improving our system monitoring and analysis platforms to ensure rapid error detection and remediation, including developing automated remediations Participating in on-call rotations, guiding restoration and repair of service-impacting issues Do what you love To be successful in this role you will: Have 2 years of relevant experience and a Bachelor's degree in Computer Science or its equivalent Have professional experience in a Site Reliability, Development, or SysAdmin role, working with large-scale distributed systems Have in-depth experience working with modern observability tools 
such as OpenTelemetry, Prometheus, Grafana, Loki, or similar Be familiar with distributed queueing technologies such as Kafka, RedPanda, NATS, or similar Have experience with containerization technologies such as Docker or Podman and container orchestration (Kubernetes) Have experience developing applications and scripts using languages such as Go, Python, Bash, Rust, or similar Have familiarity with infrastructure-as-code tools such as Terraform or Pulumi Have experience with continuous integration / continuous deployment tools such as Jenkins, Github Actions, or similar Work in a way that works for you FlexBase, Akamai's Global Flexible Working Program, is based on the principles that are helping us create the best workplace in the world. When our colleagues said that flexible working was important to them, we listened. We also know flexible working is important to many of the incredible people considering joining Akamai. FlexBase, gives 95% of employees the choice to work from their home, their office, or both (in the country advertised). This permanent workplace flexibility program is consistent and fair globally, to help us find incredible talent, virtually anywhere. We are happy to discuss working options for this role and encourage you to speak with your recruiter in more detail when you apply. Learn what makes Akamai a great place to work Connect with us on social and see what life at Akamai is like! We power and protect life online, by solving the toughest challenges, together. At Akamai, we're curious, innovative, collaborative and tenacious. We celebrate diversity of thought and we hold an unwavering belief that we can make a meaningful difference. Our teams use their global perspectives to put customers at the forefront of everything they do, so if you are people-centric, you'll thrive here. Working for you At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. 
Our benefit options are designed to meet your individual needs for today and in the future. We provide benefits surrounding all aspects of your life: Your health Your finances Your family Your time at work Your time pursuing other endeavors Our benefit plan options are designed to meet your individual needs and budget, both today and in the future. About us Akamai powers and protects life online. Leading companies worldwide choose Akamai to build, deliver, and secure their digital experiences helping billions of people live, work, and play every day. With the world's most distributed compute platform from cloud to edge we make it easy for customers to develop and run applications, while we keep experiences closer to users and threats farther away. Join us Are you seeking an opportunity to make a real difference in a company with a global reach and exciting services and clients? Come join us and grow with a team of people who will energize and inspire you! #LI-Remote Compensation Akamai is committed to fair and equitable compensation practices. For US based candidates only - the base salary for this position ranges from $95,000 - $171,000/year; a candidate's salary is determined by various factors including, but not limited to, relevant work experience, skills, certifications and location. Compensation for candidates outside the US will vary. The compensation package may also include incentive compensation opportunities in the form of annual bonus or incentives, equity awards and an Employee Stock Purchase Plan (ESPP). Akamai provides industry-leading benefits including healthcare, 401K savings plan, company holidays, vacation (in the form of PTO), sick time, family friendly benefits including parental leave and an employee assistance program including a focus on mental and financial wellness; Eligibility requirements apply.
    $95k-171k yearly Auto-Apply 10d ago
  • Senior Lead Site Reliability Engineer- Remote

    Akamai 4.4 company rating

    Remote

    Would you love to deliver huge value to our customers? Would you enjoy contributing to core technology which is serving billions of people? Join our world-class security team Our team is a part of the Cloud Security Intelligence group. We own one of the largest Big Data environments. The group develops Security infrastructures and Security products for our customers. Partner with the best You will be responsible to ensure optimal performance and up-time of Akamai's critical security products. Analyze system performance end-to-end. Working towards monitoring, alerting, log aggregations process and developing tools for it. You will work with the latest technologies such as Azure, Databricks & more. As a Senior Lead Site Reliability Engineer, you will be responsible for: Deploying and maintaining the platform and tools used internally Developing automation pipelines to support development, testing, and deployment workflows Collaborating with our support, operations and engineering teams to investigate and troubleshoot complex problems Improving our system monitoring and analysis platform to speed error detection and remediation, enhancing performance and reliability Working with Dev and Quality Assurance teams to create more robust solutions, code improvement and stability Participating in on-call rotations, guiding restoration and repair of service-impacting issues Do what you love To be successful in this role you will: Have 4 years of experience and a Bachelors Degree in Computer Science or a related field Have professional experience in a DevOps, SRE, or SysAdmin role, working with large scale distributed systems Have experience with any cloud platform (we use Azure heavily) & automation tool such as Jenkins, Terraform Have experience developing software using Python, Golang, and familiarity with scripting programming languages Have exposure to Container technologies like Dockers and Kubernetes Demonstrate communication and presentation skills Have a Secret 
Security Clearance Work in a way that works for you FlexBase, Akamai's Global Flexible Working Program, is based on the principles that are helping us create the best workplace in the world. When our colleagues said that flexible working was important to them, we listened. We also know flexible working is important to many of the incredible people considering joining Akamai. FlexBase, gives 95% of employees the choice to work from their home, their office, or both (in the country advertised). This permanent workplace flexibility program is consistent and fair globally, to help us find incredible talent, virtually anywhere. We are happy to discuss working options for this role and encourage you to speak with your recruiter in more detail when you apply. Learn what makes Akamai a great place to work Connect with us on social and see what life at Akamai is like! We power and protect life online, by solving the toughest challenges, together. At Akamai, we're curious, innovative, collaborative and tenacious. We celebrate diversity of thought and we hold an unwavering belief that we can make a meaningful difference. Our teams use their global perspectives to put customers at the forefront of everything they do, so if you are people-centric, you'll thrive here. Working for you At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. Our benefit options are designed to meet your individual needs for today and in the future. We provide benefits surrounding all aspects of your life: Your health Your finances Your family Your time at work Your time pursuing other endeavors Our benefit plan options are designed to meet your individual needs and budget, both today and in the future. About us Akamai powers and protects life online. Leading companies worldwide choose Akamai to build, deliver, and secure their digital experiences helping billions of people live, work, and play every day. 
With the world's most distributed compute platform from cloud to edge we make it easy for customers to develop and run applications, while we keep experiences closer to users and threats farther away. Join us Are you seeking an opportunity to make a real difference in a company with a global reach and exciting services and clients? Come join us and grow with a team of people who will energize and inspire you! Akamai Technologies is an Affirmative Action, Equal Opportunity Employer that values the strength that diversity brings to the workplace. All qualified applicants will receive consideration for employment and will not be discriminated against on the basis of gender, gender identity, sexual orientation, race/ethnicity, protected veteran status, disability, or other protected group status. If no date is displayed, applications are being accepted on an ongoing basis until the job is filled. Compensation Akamai is committed to fair and equitable compensation practices. For US based candidates only - the base salary for this position ranges from $106,600 - $221,400/year; a candidate's salary is determined by various factors including, but not limited to, relevant work experience, skills, certifications and location. Compensation for candidates outside the US will vary. The compensation package may also include incentive compensation opportunities in the form of annual bonus or incentives, equity awards and an Employee Stock Purchase Plan (ESPP). Akamai provides industry-leading benefits including healthcare, 401K savings plan, company holidays, vacation (in the form of PTO), sick time, family friendly benefits including parental leave and an employee assistance program including a focus on mental and financial wellness; Eligibility requirements apply.
    $106.6k-221.4k yearly Auto-Apply 14d ago
  • Principal Site Reliability Engineer

    Zefr 4.7 company rating

    Marina del Rey, CA jobs

    What we do: Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr's solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe. What you'll do: As a Principal Site Reliability Engineer at Zefr, you'll serve as a technical leader and subject matter expert, helping define the technical vision and shape the direction of our reliability practices across the organization. You'll leverage deep expertise in observability, core SRE principles, cloud infrastructure, CI/CD and DevSecOps to solve our most complex challenges and set the standard for engineering excellence. This role requires a blend of hands-on technical expertise and strategic thinking. You'll drive cross-functional initiatives, mentor engineers across teams, and partner with leadership to ensure our AI-powered platform is robust, efficient, and scalable. We're looking for someone to combine their technical expertise with strong leadership and a passion for continuous improvement and innovation. Zefr wants a candidate that champions reliability as a product feature, and can translate complex technical concepts into strategy. This is a role where you'll shape how we build and operate systems at scale. Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely. Deploy and support a multi-cloud, micro-service architecture, including infrastructure tailored for ML workloads, deployed via Github Actions, ArgoCD & Kubernetes. 
Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP. Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams. Proactively maintain the health of production environments, including monitoring application performance and resource utilization. Participate in 24/7 on-call rotation, respond to system performance issues and outages. Debug code at the application and infrastructure level. Mature our CI/CD workflows and release process. Maintains a forward-thinking approach, actively researching and proposing new solutions. Propose and review Engineering Request for Comments (RFC) to drive Engineering architecture and practices. Technology Stack at Zefr: Core Infrastructure & Cloud Platforms: Cloud Providers: Google Cloud Platform (primary), Amazon Web Services Infrastructure as Code (IaC): Terraform, Terragrunt Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize Service Mesh: Istio CI/CD & Automation: CI/CD Pipelines: GitHub Actions GitOps / Continuous Delivery: Argo CD Primary Scripting/Automation Language: Python Observability & Monitoring: Monitoring & Alerting: Prometheus, Chronosphere, Pagerduty Telemetry Standards: OpenTelemetry Application & Data Ecosystem (Supporting): Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React Data Streaming: Apache Kafka Data Processing/Transformation: Pandas, DBT Workflow Orchestration: Apache Airflow, Ray Data Stores & Databases: Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL) NoSQL Databases: DynamoDB Search Databases: OpenSearch Vector Databases: Qdrant Caching: Redis Data Warehousing: Snowflake What we're looking for: 10+ year job history designing, managing, deploying, and supporting Cloud Infrastructure in a production environment using major public 
cloud providers (GCP experience a huge bonus) Experience in Advertising or AdTech Demonstrated technical leadership experience; including mentoring engineers, driving cross-functional projects, and influencing architectural decisions at an organizational level. Knowledge of GitOps including an understanding of modern CI/CD pipelines, techniques and technologies (Github Actions, GitLab, CircleCI, Argo CD, Flux) Advanced Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi) Deep production experience architecting, managing, deploying, and supporting container based workloads into Kubernetes clusters Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning. Heavy Production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems. Strong knowledge of cloud networking (Mesh, NAT, Load Balancers, API Gateways, proxies, etc), cloud security, and cost optimization strategies. Exceptional written and verbal communication skills; ability to translate complex technical concepts for diverse audiences and build consensus across teams. Experience authoring technical strategy documents, RFCs, and architectural proposals. Benefits (for US based employees): Flexible PTO Medical, dental, and vision insurance with FSA options Company-paid life insurance Paid parental leave 401(k) with company match Professional development opportunities 13 paid holidays off Summer Fridays (we leave early) In-office, hybrid, and fully-remote work options available In-office lunches and lots of free food Optional in-person and virtual events (we like to celebrate!) Compensation (for US based employees): The anticipated salary for this position is between $210,000 and $235,000. 
Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation. Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply even if you do not meet 100% of the qualifications.
    $210k-235k yearly Auto-Apply 20h ago
  • Senior Site Reliability Engineer - Managed Kubernetes

    Lambda 4.2 company rating

    San Francisco, CA jobs

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco/San Jose or Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You'll Do Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes Handle cluster degradation, recovery, resizing, and incident response using fleet management tools Participate in a well-managed on-call rotation for critical incidents Assist customers with Kubernetes questions, workload integration, storage, and authentication Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues Use Python and Golang to create tooling and automate the validation of platform quality. Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion. Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability. 
About You Must-Have 6+ years of experience in a SRE, operations engineer, or similar role, with a deep knowledge of running Linux clusters and systems Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar) Can work either independently with limited direction or as part of a team Can work with customers during incidents either via tickets, live messaging, or as part of a larger call. Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar Nice-to-Have Deep Kubernetes expertise: CRDs, CSI, CNI, Kubernetes Operator Coding experience Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters Hybrid or multi-cloud Kubernetes environment experience Contributions to CNCF projects or Kubernetes SIGs Why Join Us Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads Influence the platform roadmap and help shape operations and reliability best practices Collaborate with a highly skilled engineer Opportunity to mentor and grow within a fast-growing, technology-driven environment Salary Range Information The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. 
About Lambda Founded in 2012, with 500+ employees, and growing fast Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG Our values are publicly available: ************************* We offer generous cash & equity compensation Health, dental, and vision coverage for you and your dependents Wellness and commuter stipends for select roles 401k Plan with 2% company match (USA employees) Flexible paid time off plan that we all actually use A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills. Equal Opportunity Employer Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $142k-189k yearly est. Auto-Apply 60d+ ago
  • Senior Site Reliability Engineer - Networking

    Lambda 4.2 company rating

    San Francisco, CA jobs

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco/San Jose/Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You'll Do Help scale Lambda's high performance multi-tenant cloud network Contribute to the reproducible automation of network configuration and deployments Contribute to the implementation and operations of Software Defined Networks Help to deploy and manage Spine and Leaf networks Ensure high availability of our network through observability, failover, and redundancy Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies Help with deploying and maintaining network monitoring and management tools Participate in on-call You Have 5+ years of experience as a Site Reliability Engineer or Network Reliability Engineer Been part of the implementation of production-scale networking projects Experience with on-call and incident response management Have experience building and maintaining Software Defined Networks (SDN), experience with OpenStack, Neutron, OVN Are comfortable on the Linux command line, and have an understanding of the Linux networking stack Have experience with multi-data center networks and hybrid cloud networks Have Python programming experience and configuration management tools like Ansible Have 
experience with CI/CD tools for deployment and Git. Operated a network environment with GitOps practices in place. Experience with application lifecycle and deployments on Kubernetes Nice To Have Operated production-scale SDNs in a cloud context (e.g. helped implement or operate the infrastructure that powers an AWS VPC-like feature) Have software development experience with C, Go, Python Experience automating network configuration within public clouds, with tools like Kubernetes, Helm, Terraform, and Ansible Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV and DPDK Understanding of the SDN ecosystem (e.g. OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP) Have experience with Spine and Leaf (Clos) network topology Have experience and understanding of BGP EVPN VXLAN networks Experience with building and maintaining multi-data center networks, SD-WAN, DWDM Experience with Next-Generation Firewalls (NGFW) Salary Range Information The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. 
About Lambda Founded in 2012, with 500+ employees, and growing fast Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG Our values are publicly available: ************************* We offer generous cash & equity compensation Health, dental, and vision coverage for you and your dependents Wellness and commuter stipends for select roles 401k Plan with 2% company match (USA employees) Flexible paid time off plan that we all actually use A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills. Equal Opportunity Employer Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $142k-189k yearly est. Auto-Apply 46d ago
  • Staff Site Reliability Engineer, Cloud

    Kentik 4.2 company rating

    Austin, TX jobs

    Who we are Kentik is the network intelligence platform for modern infrastructure teams. Unlike traditional monitoring and observability tools, we demystify complex network operations, enabling organizations to deliver applications and innovation at scale. Built by network experts to make critical insight accessible to every engineer, Kentik is the real-time source of truth that understands every network in context - from data center to cloud to the internet. This single platform unifies and correlates cloud, device, flow, synthetic data to turn telemetry into action. Market leaders like Akamai, Booking.com, Dropbox, and Zoom rely on Kentik to run, manage, and optimize their networks. What we do Our platform ingests trillions of records and serves hundreds of thousands of queries for our users each day. You will gain experience building a production quality, high performance server-and-client SaaS application that handles uniquely high volumes of data. We have built a team of world-class engineers, network experts, and technology thought leaders in a remote-friendly culture from day one. While prior experience in a remote environment is not required, we highly value strong collaboration and communication skills, as well as a high level of independence and autonomy. What you'll do Kentik is looking for a Staff level Site Reliability Engineer (Cloud) to join our Product Engineering team to help build and maintain our Synthetics and Cloud product lines. These products have multiple applications deployed in various cloud providers all over the world. We manage these cloud applications using observability tooling, automated build processes, and adherence to configuration as code best practices. We're looking for an experienced engineer who will work with engineering teams across the company to help grow our hardware and software infrastructure. We operate a well-organized, well-instrumented platform, and offer enormous opportunities for employee growth. 
Make sure our real-time, scalable, infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware, across multiple locations as well as all major cloud vendors Work on tools and processes to better monitor our platform as well as ensuring its stability through our rapid growth Deep-diving into diverse topics, from firewalls and IP routing, to database replication strategies or automating build processes Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective Assist with expanding our cloud deployments across the major cloud providers Contribute code, code reviews and tools or patches to all kinds of existing code Write design documents or collaborate on colleagues' docs to introduce new features or changes into our infrastructure Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team What you'll bring Studies have shown that some candidates tend to apply to jobs only if they meet 100% of the qualifications. We encourage you to apply if you meet most of the criteria - even if you don't match all of the qualifications, your skills and experience could be valuable in this role! 8+ years of experience in cloud-based Systems Administration, IT and/or SRE related projects Expertise in public cloud environments such as AWS, GCP, Azure, or OCI. Strong command of containerization and orchestration using Docker and Kubernetes. Solid programming and automation skills using Bash, Python, or Go. Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, and Puppet. Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk). 
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS) Networking administration experience: concepts such as routing, firewalls (iptables), peering sound familiar A passion for documenting code, processes, and infrastructure in runbooks and wikis Worked with metrics monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry Experience creating and managing tickets with third party vendors and owning cloud vendor partner relationships Nice to haves: Familiarity with Kubernetes automation tools, specifically managing complex deployments with Helm and Helmfile. Knowledge of scaling Kubernetes workloads and compute infrastructure Experience optimizing CI/CD build and deploy pipelines using GitHub Actions and Jenkins. Exposure to PagerDuty Integrations Knowledge of SRE, DevOps and GitOps practices and principles Our tech stack Our core data engine and platform are primarily written in Go We use Node.js + Express for application serving, and React as our primary UI framework We also use some JS and Python for tooling/scripting In addition to our own database, we use Postgres, Kafka, MySQL, and Redis Internal and public APIs expose both REST/JSON and gRPC endpoints HAProxy, Envoy for API traffic routing and balancing GitHub for source control, PRs, issues Jenkins for automated builds What we offer Kentik is a fully remote company that operates globally. We seek professionals that will help us thrive as an organization, and in turn, to broaden and enhance your career. We're very thorough in the interview process to understand your skills and how they will relate to your successful growth here at Kentik. Our compensation philosophy encompasses a fair program for all in order to attract, engage and retain talented individuals who will drive our business and wow our customers. The compensation range for this position is: $165,000 - $200,000. This range reflects the low and high end of the U.S. 
compensation range Kentik reasonably and generally expects to pay the hired candidate in this role. The actual compensation offered may be lower or higher than the stated range depending on various factors, including but not limited to: Experience with the skill sets required for success Demonstrated competencies and potential A geographic market-based approach In addition to a great career opportunity, Kentik offers stellar benefits for our employees, which include: 100% of premiums are paid by company for health, vision and dental coverage for you and your dependents Additionally, an annual Health Reimbursement Account (HRA) of $3,000 for an individual or $4,500 for a family Paid family & medical leave Open PTO, a quarterly Wellness Day, and a minimum of 10 paid holidays 401(k) retirement account Home office reimbursement Stock options Note: Benefits are as listed for all US full-time employees. For compensation, international applicants will be treated equitably in relation to the laws applicable within the countries in which we operate. Come work with us The true meaning of Kentik is visibility. We're committed to making sure everyone feels empowered to use their voice, has a sense of belonging, and is represented at Kentik. We don't look for individuals who fit the culture, but those who will continue to add to the culture. We encourage everyone to apply, especially those individuals who are underrepresented in the industry: people of color, LGBTQI+ community, women, individuals with disabilities (both seen and unseen), veterans, and people of any age or family status. Kentik is committed to creating an inclusive interview process. If you require a reasonable accommodation during the application or interview process, please reach out to *********************. Come as you are! 
You will be working at a fast-growing, well-funded startup alongside industry thought leaders and network aficionados as we build the future of observability and set the high bar for how network operations and digital businesses should run. With a competitive salary and amazing benefits on top of the meaningful and challenging projects you'll take on, we're sure you'll enjoy joining the Kentik team. #li-remote
    $165k-200k yearly Auto-Apply 9d ago
  • Senior Site Reliability Engineer - Observability

    Lambda 4.2 company rating

    Remote

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco, San Jose, or upcoming Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You'll Do Deploy and operate observability platforms for logging, metrics, and distributed tracing. Automate the deployment and operation of these observability systems. Set up monitoring for modern AI/HPC clusters. Develop platform software to make observability adoptable and improve system reliability across Lambda engineering. Lead members of other engineering teams to design and develop solutions for their monitoring challenges. You Have 8+ years of experience in software engineering, with 3+ years in Go Have 5+ years of experience in Site Reliability Engineering practices Possess proven understanding of Observability tools and practices Have experience with application deployment and monitoring using Kubernetes Have experience building CI/CD pipelines Expect quality and reliability from the solutions you build Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs. 
Nice to Have Experience monitoring AI systems or HPC clusters Experience with Prometheus and writing queries in PromQL Experience with messaging systems like NATS Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector Experience with network monitoring, Ethernet and Infiniband Understanding of dashboard design principles Strong understanding of Linux fundamentals and system administration. Experience with infrastructure automation tooling such as Ansible and Terraform Salary Range Information The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. About Lambda Founded in 2012, with 500+ employees, and growing fast Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG Our values are publicly available: ************************* We offer generous cash & equity compensation Health, dental, and vision coverage for you and your dependents Wellness and commuter stipends for select roles 401k Plan with 2% company match (USA employees) Flexible paid time off plan that we all actually use A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills. Equal Opportunity Employer Lambda is an Equal Opportunity employer. 
Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $115k-160k yearly est. Auto-Apply 49d ago
  • Senior Site Reliability Engineer - Fleet Reliability

    Lambda 4.2 company rating

    Remote

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco office location 4 days per week; Lambda's designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What You'll Do Define Fleet Health metrics and indicators to objectively measure and improve system availability Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies Create runbooks and automated remediations for common failure scenarios Build in automation and auditing to ensure compliance and improve efficiency and productivity Participate in on-call rotations and provide support for incident response and resolution Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc You 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization Strong understanding of Linux-based systems in a distributed environment Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling. 
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic) Proficiency in automation and configuration management tools (e.g., Ansible, Terraform) Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure) Excellent problem-solving and troubleshooting skills Strong communication and collaboration skills Passion for continuous improvement and innovation Nice to Have Experience in the machine learning or computer hardware industry Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes) Experience building and/or operating HPC resources. Background in chaos engineering or similar reliability testing methodologies Understanding of compliance frameworks (SOC 2, ISO 27001, etc.) Salary Range Information The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description. About Lambda Founded in 2012, with 500+ employees, and growing fast Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG Our values are publicly available: ************************* We offer generous cash & equity compensation Health, dental, and vision coverage for you and your dependents Wellness and commuter stipends for select roles 401k Plan with 2% company match (USA employees) Flexible paid time off plan that we all actually use A Final Note: You do not need to match all of the listed expectations to apply for this position. 
We are committed to building a team with a variety of backgrounds, experiences, and skills. Equal Opportunity Employer Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
    $115k-160k yearly est. Auto-Apply 60d+ ago
  • Lead Site Reliability Engineer, Observability (Remote, North America)

    Vivun 4.2 company rating

    Oakland, CA jobs

    Vivun delivers Ava, the AI Sales Teammate for high-velocity sales teams that sells with you and unlocks instant capacity. Powered by a proprietary Sales Reasoning Model, Ava provides real-time guidance before, during, and after calls through text, voice, or avatar. By helping sellers work smarter, faster, and better, Ava saves reps 6-8 hours per week-freeing teams to focus on driving growth. We are building technology that changes how people work, collaborate, and succeed together. Join us in shaping the future of intelligent sales. Position Summary We're seeking a Lead Site Reliability Engineer to rebuild and own our observability strategy across both agentic systems and SaaS infrastructure, creating the frameworks and tooling that enable teams to ship confidently, measure performance, and maintain reliability as we scale. As the Observability Lead, you'll be responsible for designing and implementing Vivun's observability patterns spanning infrastructure, applications, and agentic workloads. You'll work closely with your teammates across engineering, QA, and product to establish unified visibility across the full stack, from LLM-driven agents to backend services. You won't just monitor systems-you'll define the patterns and tools that are a core part of empowering and driving Vivun's engineering culture. Key Responsibilities Own the end-to-end observability strategy for Ava, defining the standards, tools, and patterns that ensure reliable visibility across infrastructure and agentic components. Design and implement correlation models that link agent behavior, LLM interactions, and SaaS telemetry into cohesive, actionable insights. Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Observe, Datadog, or equivalent). Collaborate with engineering and QA to embed observability best practices into development workflows, CI/CD, and quality gates. 
Establish enablement frameworks-documentation, dashboards, and templates-that make observability self-serve for all engineering teams. Partner with teammates to ensure observability aligns with infrastructure reliability, alerting, and incident response patterns. Contribute to performance and reliability strategy, helping define how we measure agent quality, responsiveness, and system scalability. Desired Skills & Experience 6+ years of experience in SRE, DevOps, or Observability Engineering roles, with at least 2+ years leading or designing observability initiatives. Deep knowledge of observability tooling (e.g., OpenTelemetry, Prometheus, Grafana, Datadog, Honeycomb, Observe, etc.) and distributed tracing practices. Experience with Agentic / LLM-based systems, including tools like LangChain, Celery, OpenAI APIs, or similar orchestration frameworks. Strong understanding of how to instrument, trace, and correlate AI/LLM workflows with infrastructure-level telemetry. Proven ability to define cross-team standards, influence engineering culture, and establish scalable monitoring patterns. Strong collaboration and communication skills-you enable, not dictate. Nice to Have Experience building observability into hybrid SaaS + agent architectures. Background in data pipelines or analytics observability (e.g., tracing data lineage, monitoring model drift). Familiarity with Python- or Node.js-based observability SDKs. Prior experience scaling observability in a startup or rapid-growth environment. You Are A believer in Vivun's core values: Set the Standard. Take Ownership. Stay Curious. Fast & Focused. Builder at Heart: You want to build the observability foundations for a next-generation agentic platform. Innovative Problem Solver: You are eager to take on cutting-edge monitoring challenges at the intersection of SaaS and AI. Collaborative by Nature: You thrive in a high-impact engineering culture that values enablement, empowerment, and shared ownership. 
Experienced working in high-growth startup environments: You have the ability to move fast, adapt, and thrive in a dynamic startup environment where you derive priorities, requirements, and goals from company context. What You Will Have At Vivun Competitive salary and full health benefits Stock Options at a well funded, pre-IPO company on a fast growth track Flexible work schedules and work from anywhere at a fully remote company Unlimited PTO with two weeks designated as “quiet period” each year An experienced team who will fight beside you in the trenches to accomplish your goals
    $124k-174k yearly est. Auto-Apply 60d+ ago
  • Lead Site Reliability Engineer, Observability (Remote, North America)

    Vivun 4.2 company rating

    Oakland, CA jobs

    Vivun delivers Ava, the AI Sales Teammate for high-velocity sales teams that sells with you and unlocks instant capacity. Powered by a proprietary Sales Reasoning Model, Ava provides real-time guidance before, during, and after calls through text, voice, or avatar. By helping sellers work smarter, faster, and better, Ava saves reps 6-8 hours per week-freeing teams to focus on driving growth. We are building technology that changes how people work, collaborate, and succeed together. Join us in shaping the future of intelligent sales. Position Summary We're seeking a Lead Site Reliability Engineer to rebuild and own our observability strategy across both agentic systems and SaaS infrastructure, creating the frameworks and tooling that enable teams to ship confidently, measure performance, and maintain reliability as we scale. As the Observability Lead, you'll be responsible for designing and implementing Vivun's observability patterns spanning infrastructure, applications, and agentic workloads. You'll work closely with your teammates across engineering, QA, and product to establish unified visibility across the full stack, from LLM-driven agents to backend services. You won't just monitor systems-you'll define the patterns and tools that are a core part of empowering and driving Vivun's engineering culture. Key Responsibilities * Own the end-to-end observability strategy for Ava, defining the standards, tools, and patterns that ensure reliable visibility across infrastructure and agentic components. * Design and implement correlation models that link agent behavior, LLM interactions, and SaaS telemetry into cohesive, actionable insights. * Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Observe, Datadog, or equivalent). * Collaborate with engineering and QA to embed observability best practices into development workflows, CI/CD, and quality gates. 
* Establish enablement frameworks-documentation, dashboards, and templates-that make observability self-serve for all engineering teams. * Partner with teammates to ensure observability aligns with infrastructure reliability, alerting, and incident response patterns. * Contribute to performance and reliability strategy, helping define how we measure agent quality, responsiveness, and system scalability. Desired Skills & Experience * 6+ years of experience in SRE, DevOps, or Observability Engineering roles, with at least 2+ years leading or designing observability initiatives. * Deep knowledge of observability tooling (e.g., OpenTelemetry, Prometheus, Grafana, Datadog, Honeycomb, Observe, etc.) and distributed tracing practices. * Experience with Agentic / LLM-based systems, including tools like LangChain, Celery, OpenAI APIs, or similar orchestration frameworks. * Strong understanding of how to instrument, trace, and correlate AI/LLM workflows with infrastructure-level telemetry. * Proven ability to define cross-team standards, influence engineering culture, and establish scalable monitoring patterns. * Strong collaboration and communication skills-you enable, not dictate. Nice to Have * Experience building observability into hybrid SaaS + agent architectures. * Background in data pipelines or analytics observability (e.g., tracing data lineage, monitoring model drift). * Familiarity with Python- or Node.js-based observability SDKs. * Prior experience scaling observability in a startup or rapid-growth environment. You Are * A believer in Vivun's core values: Set the Standard. Take Ownership. Stay Curious. Fast & Focused. * Builder at Heart: You want to build the observability foundations for a next-generation agentic platform. * Innovative Problem Solver: You are eager to take on cutting-edge monitoring challenges at the intersection of SaaS and AI. 
* Collaborative by Nature: You thrive in a high-impact engineering culture that values enablement, empowerment, and shared ownership. * Experienced working in high-growth startup environments: You have the ability to move fast, adapt, and thrive in a dynamic startup environment where you derive priorities, requirements, and goals from company context. What You Will Have At Vivun * Competitive salary and full health benefits * Stock Options at a well funded, pre-IPO company on a fast growth track * Flexible work schedules and work from anywhere at a fully remote company * Unlimited PTO with two weeks designated as "quiet period" each year * An experienced team who will fight beside you in the trenches to accomplish your goals
    $124k-174k yearly est. 60d+ ago
  • Site Reliability Engineer

    Offchain Labs 4.0 company rating

    Remote

    At Offchain Labs, we aren't just building products: we're leading a movement. As pioneers in blockchain scalability and security, we're at the forefront of transforming how the world interacts with decentralized applications. We're laying the foundation that will define the next generation of digital commerce, governance, and human interaction. This involves tackling real-world challenges that come with scaling blockchain technology, without compromising on its core principles: decentralization, security and transparency. At the center of this vision is our people. Our team is made up of thinkers and doers that embrace new challenges and seek solutions that push existing boundaries. If you're energized by solving unprecedented problems, and believe in the role that decentralized systems will play in creating a more equitable digital future, then we want to hear from you. Why Offchain Labs? Offchain Labs is setting the pace for the entire Ethereum ecosystem. We built the Arbitrum stack that powers Arbitrum One, the most widely adopted Ethereum scaling solution that exists today. Arbitrum's ecosystem is undergoing tremendous growth with hundreds of projects and dApps on Arbitrum One today. Over 100 different teams have used Offchain Labs technology to build their own Arbitrum chains. Major players in the space, Robinhood, BlackRock, Ethena Labs, Securitize, Aave, and Apechain are all using the Arbitrum stack. Arbitrum's thriving ecosystem wouldn't exist without our advanced technology stack. Arbitrum, Prysm, ZeroDev. These aren't just product names. These are tools that are actively reshaping what's possible on Ethereum and advancing its core infrastructure. To top it all off? We're backed by $124 million in funding. 
We've demonstrated consistent execution with billions in secured value, thousands of supported projects, and infrastructure processing millions of transactions seamlessly. Who You Are Eager to dive into blockchain technology, even if it's new territory Enjoy solving infrastructure problems in unconventional ways and thinking beyond standard patterns Use tools like k9s or ArgoCD for speed and abstraction, but comfortable dropping into YAML, logs, or low-level debugging when things go sideways Experienced with GitOps-style systems and treating both infrastructure and application delivery as code Have scaled deployment automation using patterns like ArgoCD ApplicationSets or similar tooling Curious about how things work under the hood and not satisfied with surface-level fixes Comfortable in Linux, fluent in shell scripting, and productive in languages like Python or Go Comfortable operating within a cloud platform (e.g., AWS, GCP, Azure), with a strong understanding of the underlying components making it easy to adapt to or migrate across providers Participated in an on-call rotation, responding to incidents, troubleshooting under pressure, and driving postmortems to improve system reliability over time Design systems with security in mind, applying principles like least privilege and threat modeling Bring a strong technical foundation, excellent problem-solving skills, and a genuine commitment to high-quality work Take ownership, collaborate openly, and contribute to a culture of clarity, curiosity, and continuous improvement What You've Done Operated production Kubernetes clusters and built scalable, declarative infrastructure using Terraform or similar tools Deployed and maintained Kubernetes environments, managed system components, and troubleshot applications running on the platform Designed CI/CD workflows with ArgoCD, GitHub Actions, CodeBuild, or similar tools, covering both infra and app deployments Designed and operated observability systems using 
time-series metrics, logs, and dashboards with tools like Prometheus, Loki, Mimir, Grafana, and CloudWatch Diagnosed tough networking and storage issues across complex, distributed systems Implemented secure-by-default infrastructure and contributed to architecture reviews and threat models Automated operational workflows using scripting or programming in Python, Go, or Bash SREs come from a wide range of backgrounds. If you bring strong problem-solving skills, curiosity, and a drive to build reliable systems, we'd love to hear from you, even if your experience doesn't perfectly match every bullet point. Perks: Remote-first global workforce + NY office Annual company offsite + team onsites Professional reimbursement program (facilitates industry conference attendance, certifications, and more) Medical, dental & vision coverage (US + some other countries) 401k retirement plan + company match (US only) Wellness stipend Home office set up / ergonomic equipment program Attention Offchain Labs Job Seekers: This role cannot be performed in California or Colorado. Please be advised that there has been a rise in fraudulent recruiter activities, particularly within the Web3 space. If you would like to confirm whether someone is an OCL employee or the legitimacy of an offer you received, please email ********************* At Offchain Labs, we are committed to building a welcoming and supportive workplace for all employees, regardless of their background or identity. We strive to create an environment where everyone feels valued and has an equal opportunity to succeed and thrive. We encourage candidates from all walks of life to apply and join our team.
    $94k-136k yearly est. Auto-Apply 60d+ ago
  • Staff Site Reliability Engineer

    Bugcrowd 3.9 company rating

    Bedford, NH jobs

    We're seeking a Staff Site Reliability Engineer to serve as a technical leader within our infrastructure organization. In this role, you'll help shape the reliability strategy across our engineering teams, drive adoption of best practices, and tackle our most complex infrastructure challenges. You'll be part of an international, highly engaged and technical group that is well-versed in building enterprise-ready and extremely secure software systems. Our core values of "simple is strong, respect is king, build it like you own it and think like a hacker" should resonate with you. Essential Duties and Responsibilities * Define and drive the technical vision for infrastructure reliability across the organization * Architect large-scale, fault-tolerant systems on AWS using Terraform * Lead cross-functional initiatives to improve system reliability, scalability, and efficiency * Establish standards for infrastructure-as-code, CI/CD, and deployment practices * Design and implement solutions for our most complex operational challenges * Lead incident response for critical outages and drive systemic improvements * Mentor senior engineers and help grow the SRE team's capabilities * Evaluate and introduce new technologies that improve operational excellence * Influence engineering culture around reliability, observability, and operational maturity Education, Experience, Skills, & Abilities * 5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership * Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams * Deep expertise in AWS architecture and services at scale, with strong focus on ECS * Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies * Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools * Track record of leading complex, 
cross-team technical initiatives * Advanced proficiency in Python, Ruby, JavaScript, or similar languages * Strong understanding of distributed systems principles * Excellent written and verbal communication skills * Proven ability to balance long-term technical strategy with immediate operational needs Preferred Experience * Experience building internal developer platforms or self-service infrastructure tooling * Knowledge of FedRAMP * Background in cost optimization and FinOps practices * Contributions to open-source infrastructure projects * Experience scaling infrastructure organizations and processes * Experience defining and implementing SLO frameworks Working Conditions The ideal candidate must be able to complete all physical requirements of the job with or without reasonable accommodation. Sitting and/or standing - Must be able to remain in a stationary position 50% of the time Carrying and/or lifting - Must be able to carry / move laptop as needed throughout the work day. Environment - remote, work-from-home 100% of the time. ADA Statement Bugcrowd is committed to the full inclusion of all qualified individuals. In keeping with our commitment, Bugcrowd will take the steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact HR at ****************. Pay Range Disclosure At Bugcrowd, we strive for fairness, equality and to create an environment that allows our people to perform at their very best. Our compensation philosophy is to foster a collaborative community that rewards, attracts and retains the best possible talent. The provided salary details are based on US national averages and we retain the flexibility to tailor to the needs of the business. 
The national estimate for the current base range for the position of Staff Site Reliability Engineer is: $151,040 - $188,800. This position may also be eligible to participate in a discretionary bonus program or commission plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.
    $151k-188.8k yearly Auto-Apply 36d ago
  • Senior Site Reliability Engineering

    Bugcrowd 3.9 company rating

    Bedford, NH jobs

    We're looking for a Senior Site Reliability Engineer for our small agile infrastructure team to help us build and maintain highly available, scalable infrastructure. You'll be part of an international, highly engaged and technical group that is well-versed in building enterprise-ready and extremely secure software systems. Our core values of "simple is strong, respect is king, build it like you own it and think like a hacker" should resonate with you. Essential Duties and Responsibilities * Design, build, and maintain infrastructure using Terraform on AWS * Develop and improve CI/CD pipelines and deployment automation * Monitor system health, respond to incidents, and conduct blameless postmortems * Collaborate with development teams to improve service reliability and performance * Automate toil and repetitive operational tasks * Participate in on-call rotations * Document systems, runbooks, and operational procedures * Mentor junior team members Education, Experience, Skills, & Abilities * 3+ years of experience in SRE, DevOps, or systems engineering * Strong proficiency with Terraform and infrastructure-as-code practices * Deep experience with AWS services (ECS, RDS, Lambda, IAM, VPC, CloudWatch, etc.) 
* Hands-on experience with ECS for container orchestration, including task definitions, services, and auto-scaling * Solid understanding of GitHub workflows, branching strategies, and CI/CD tooling * Strong experience with Docker and containerized application deployments * Proficiency in at least one programming/scripting language (Python, Go, Bash, Ruby, JavaScript, Kotlin) * Strong troubleshooting skills across the stack (networking, OS, application) * Familiarity with observability tools (Prometheus, Grafana, Datadog, or similar) * Excellent writing, communication and collaboration skills Preferred Experience * Experience with FedRAMP Working Conditions The ideal candidate must be able to complete all physical requirements of the job with or without reasonable accommodation. Sitting and/or standing - Must be able to remain in a stationary position 50% of the time Carrying and/or lifting - Must be able to carry / move laptop as needed throughout the work day. Environment - remote, work-from-home 100% of the time. ADA Statement Bugcrowd is committed to the full inclusion of all qualified individuals. In keeping with our commitment, Bugcrowd will take the steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact HR at ****************. Pay Range Disclosure At Bugcrowd, we strive for fairness, equality and to create an environment that allows our people to perform at their very best. Our compensation philosophy is to foster a collaborative community that rewards, attracts and retains the best possible talent. The provided salary details are based on US national averages and we retain the flexibility to tailor to the needs of the business. 
The national estimate for the current base range for the position of Senior Site Reliability Engineering is: $129,280 - $161,600. This position may also be eligible to participate in a discretionary bonus program or commission plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.
    $129.3k-161.6k yearly Auto-Apply 36d ago
  • Senior Site Reliability Engineer, Observability

    Chainlink Labs 3.6 company rating

    Remote

    About Chainlink Chainlink is the industry-standard oracle platform bringing the capital markets onchain and powering the majority of decentralized finance (DeFi). The Chainlink stack provides the essential data, interoperability, compliance, and privacy standards needed to power advanced blockchain use cases for institutional tokenized assets, lending, payments, stablecoins, and more. Since inventing decentralized oracle networks, Chainlink has enabled tens of trillions in transaction value and now secures the vast majority of DeFi. Many of the world's largest financial services institutions have also adopted Chainlink's standards and infrastructure, including Swift, Euroclear, Mastercard, Fidelity International, UBS, S&P Dow Jones Indices, FTSE Russell, WisdomTree, ANZ, and top protocols such as Aave, Lido, GMX and many others. Chainlink leverages a novel fee model where offchain and onchain revenue from enterprise adoption is converted to LINK tokens and stored in a strategic Chainlink Reserve. Learn more at chain.link. The Observability Team enables Chainlink development and empowers engineers to continue building and supporting crucial products and services that have a profound impact in the blockchain industry. Reliability is vital to the success of our company. As a Senior SRE, you will help us accelerate and enable other engineering teams by increasing self-service and decreasing cognitive load. This job would be perfect for someone who has a strong DevOps mentality, is passionate about building and maintaining a mature GitOps environment, and has experience focusing on observability. The entire engineering team is expanding, and you would have plenty of opportunities to build, learn, and grow. We all have different backgrounds and are determined to help you succeed no matter where you are or who you are. 
If you think you would do a great job at Chainlink, we are looking forward to speaking with you, even if you don't match 100% of the job requirements: those describe people we've usually had a great time working with, but they're not a tick-box exercise. Your Impact * Build and orchestrate a modern OTEL-based observability platform * Support multiple telemetry types, like metrics, logs and traces. * Define and support modern governance in observability and problems at scale. * Ensure reliability, security, and performance exceed our defined SLAs * Work with engineers from across the company to help troubleshoot issues, deploy new products and services, and increase velocity while decreasing cognitive load * Lead the design and deployment of monitoring/observability services to detect and alert the team of needed action. * Ingest, aggregate, transform, and utilize data from a multitude of sources in our real time data pipeline. * Oversee the availability, performance, and supportability of our observability infrastructure. * Create processes around alert response operations and support the team to ensure the reliable delivery of oracle data. * Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release. * Champion reliability and security by taking the time to do your work right the first time Requirements * 7+ years of relevant professional experience. You probably have worked on a DevOps, infrastructure, SRE, and/or platform team before * Ability to develop software outside of the scope of typical infrastructure requirements and configurations * Experience programming in C, C++, Java, Python, Go, Perl, or Ruby * Expert knowledge in all aspects of designing, developing, and managing large real-time systems * Experience with monitoring and logging. 
You know how to export metrics using Prometheus, have built a Grafana dashboard or two, and have experience with a centralized logging solution like an ELK Stack, Splunk or Grafana Stack. * Experience with distributed systems and container orchestration. You have maintained or even built Kubernetes clusters before and feel comfortable deploying completely new services on them * Strong communication skills. You can give and receive constructive feedback, and you do not shy away from planning meetings and code reviews Desired Qualifications * Excitement for blockchain, Web 3.0, and similar decentralized technologies. * Experience running any infrastructure in the blockchain/web3 space * Ability to scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity * Experience working remotely in a distributed team * A strong desire to grow and challenge yourself. We would expect you to constantly find ways to improve and automate services to reduce toil Some of the tools and services we use daily or almost daily are: * AWS; Terraform/Terragrunt; Kubernetes, Calico and ArgoCD; Prometheus and Grafana; GitHub Actions; Packer * We expect you to be comfortable with most of those tools and very proficient in several of them. All roles with Chainlink Labs are global and remote-based. Unless otherwise stated, we ask that you try to overlap some working hours with Eastern Standard Time (EST). We carefully review all applications and aim to provide a response to every candidate within two weeks after the job posting closes. The closing date is listed on the job advert, so we encourage you to take the time to thoughtfully prepare your application. We want to fully consider your experience and skills, and you will hear from us regarding the status of your application shortly after the closing date. Commitment to Equal Opportunity Chainlink Labs is an equal opportunity employer. 
All qualified applicants will receive equal consideration for employment in compliance with applicable laws, regulations, or ordinances. If you need assistance or accommodation due to a disability or special need when applying for a role or in our recruitment process, please contact us via this form. Global Data Privacy Notice for Job Candidates and Applicants Information collected and processed as part of your Chainlink Labs Careers profile, and any job applications you choose to submit is subject to our Privacy Policy. By submitting your application, you are agreeing to our use and processing of your data as required.
    $117k-161k yearly est. 33d ago
  • Sr. Systems Safety Engineer

    Reliable Robotics 4.0 company rating

    Remote

    We're building safety-enhancing technology for aviation that will save lives. Automated aviation systems will enable a future where air transportation is safer, more convenient and fundamentally transformative to the way goods - and eventually people - move around the planet. We are a team of mission-driven engineers with experience across aerospace, robotics and self-driving cars working to make this future a reality. As a Systems Safety Engineer at Reliable Robotics, you will be a part of the Systems & Safety team and will report to the Systems & Safety Manager. The Systems & Safety Team is responsible for architecting systems which support novel safety-critical functions, establishing development methodologies for new technologies, and complying with federal regulations. As part of an experienced multi-disciplinary team, you will contribute to expanding the aircraft's operational and functional capabilities. Your professional experience in aircraft and system safety will further strengthen the team's ability to meet rigorous expectations of stakeholders and certification authorities. You will collaborate with members from other groups within Reliable Robotics to ensure the safety of new and novel systems necessary for unmanned aircraft operations. Responsibilities In your role as a Systems Safety Engineer you will contribute to the development and architectural design supporting the delivery of prototypes and certifiable safety-critical flight systems in a dynamic start-up work environment. You will have these responsibilities: Apply systems safety methodologies as defined by industry standards Develop new processes / methodologies used to certify autonomous / remote operations Perform safety activities in coordination with design, development and verification of aviation systems including: Generate safety artifacts such as FHAs & Safety Assessments for aircraft, systems and equipment. 
Generate reliability or testability assessments to support safety activities. Validate safety requirements in collaboration with systems teams. Ensure consistency between the safety activities and development activities. Support flight operations with safety of flight assessments for experimental operations You'll also have the opportunity to interact with other domains within Reliable Robotics and become involved in a broad task spectrum not limited to typical system safety engineering activities. Basic Success Criteria Bachelor's Degree of Science or Engineering in Mechanical, Electrical, Aerospace, or related discipline Broad experience from 10+ years of safety engineering on various systems, including complex safety-critical aerospace systems Demonstrated self-starter with the ability to solve problems from first principles Ability to work well independently and cross-functionally across multiple organizations Preferred Criteria Advanced Degree of Science or Engineering in Mechanical, Electrical, Aerospace, or related discipline Creative problem solver that can bring multiple disciplines together with the ability to assess risk and make design and development decisions without all available data Experience certifying flight critical systems, hardware, and software via ARP4754A, ARP4761 Demonstrated ability to obtain regulatory acceptance of safety documents This role is critical to us at Reliable Robotics, because it works to channel the ingenuity of the development engineers, hardware and software, towards a successful, concise and conclusive, certification package for an aircraft that flies itself. This role can be remote, or located at our facility in Mountain View, California. Must be willing to travel up to 10% of the time. The estimated salary range for this position is $190,000 to $300,000/annual salary + cash and stock option awards + benefits. 
At Reliable Robotics, we strive to provide competitive and rewarding compensation based on experience and expertise, as well as market conditions, location, and pay equity. In addition to base compensation, Reliable Robotics offers stock options, employee medical, 401k contribution, great co-workers and a casual work environment. This position requires access to information that is subject to U.S. export controls. An offer of employment will be contingent upon the applicant's capacity to perform in compliance with U.S. export control laws. All applicants are asked to provide documentation that legally establishes status as a U.S. person or non-U.S. person (and nationalities in the case of a non-U.S. person). Where the applicant is not a U.S. person, meaning not a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident, (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, or not otherwise permitted to access the export-controlled technology without U.S. government authorization, the Company reserves the right not to apply for an export license for such applicants whose access to export-controlled technology or software source code requires authorization and may decline to proceed with the application process and any offer of employment on that basis. At Reliable Robotics, our goal is to be a diverse and inclusive workforce. As an Equal Opportunity Employer, we do not discriminate on the basis of race, religion, color, creed, ancestry, sex, gender (including pregnancy, childbirth, breastfeeding, or related medical conditions), gender identity, gender expression, sexual orientation, age, non-disqualifying physical or mental disability or medical conditions, national origin, military or veteran status, genetic information, marital status, or any other basis covered by applicable law. All employment and promotion is decided on the basis of qualifications, merit, and business need. 
If you require reasonable accommodation in completing an application, interviewing, completing any pre-employment testing, or otherwise participating in the employee selection process, please direct your inquiries to ****************
    $60k-102k yearly est. Auto-Apply 60d+ ago
  • Quality Engineer

    Offchain Labs 4.0 company rating

    Remote

    At Offchain Labs, we aren't just building products: we're leading a movement. As pioneers in blockchain scalability and security, we're at the forefront of transforming how the world interacts with decentralized applications. We're laying the foundation that will define the next generation of digital commerce, governance, and human interaction. This involves tackling real-world challenges that come with scaling blockchain technology, without compromising on its core principles: decentralization, security and transparency. At the center of this vision is our people. Our team is made up of thinkers and doers that embrace new challenges and seek solutions that push existing boundaries. If you're energized by solving unprecedented problems, and believe in the role that decentralized systems will play in creating a more equitable digital future, then we want to hear from you. Why Offchain Labs? Offchain Labs is setting the pace for the entire Ethereum ecosystem. We built the Arbitrum stack that powers Arbitrum One, the most widely adopted Ethereum scaling solution that exists today. Arbitrum's ecosystem is undergoing tremendous growth with hundreds of projects and dApps on Arbitrum One today. Over 100 different teams have used Offchain Labs technology to build their own Arbitrum chains. Major players in the space, Robinhood, BlackRock, Ethena Labs, Securitize, Aave, and Apechain are all using the Arbitrum stack. Arbitrum's thriving ecosystem wouldn't exist without our advanced technology stack. Arbitrum, Prysm, ZeroDev. These aren't just product names. These are tools that are actively reshaping what's possible on Ethereum and advancing its core infrastructure. To top it all off? We're backed by $124 million in funding. 
We've demonstrated consistent execution with billions in secured value, thousands of supported projects, and infrastructure processing millions of transactions seamlessly. The Role: As we continue to scale our ecosystem, we are looking for a Senior Quality Engineer to build and scale automated testing frameworks for our Engineering teams. This is a hands-on, high-impact role where you'll implement our QA strategy, develop automation and testing infrastructure, and help us expand our QA team. Ensuring the reliability, security, and performance of a blockchain scaling solution is no trivial task. Bugs and vulnerabilities in smart contract execution, rollups, or state proofs can have massive implications. As part of our QA team, you will establish rigorous, automated testing pipelines that validate the integrity of Arbitrum's protocol, its developer tooling, and its ecosystem. Your work will directly contribute to the security and trustworthiness of one of the most widely adopted Layer 2 solutions in the Ethereum ecosystem. What you'll do: Refine and implement the QA strategy for our Engineering teams. Establish test automation frameworks that validate correctness, security, and performance across Platform Engineering's entire stack. Integrate QA into the CI/CD pipeline to ensure rigorous pre-deployment validation of protocol updates and software releases. Develop fuzzing, property-based testing, and other advanced verification techniques for smart contract security. Execute load and stress testing to measure scalability under high transaction volume. Work Cross-Functionally to Embed Quality in the Development Process Partner with software engineers, smart contract developers, and DevOps to establish a culture of quality across the engineering organization. Collaborate with researchers and external auditors to automate vulnerability detection and regression testing. 
Work closely with developer relations to ensure that SDKs, APIs, and documentation are well-tested and reliable for the growing Arbitrum developer ecosystem. What you'll need: 3+ years in QA engineering, test automation, or software quality roles. Strong programming language skills in at least one language such as Go, Rust, JavaScript, Typescript, Python, Java or similar Strong automation skills, with proficiency in at least one testing framework (e.g., Playwright, Cypress, Pytest, Jest, Mocha, Selenium, etc.) Deep understanding of CI/CD pipelines and integrating automated testing within DevOps workflows. Ability to deploy to and administer virtual machines on a cloud infrastructure provider (e.g. AWS, Google Cloud) Ability to be a team leader & collaborate with various technical and non-technical roles Ability to drive process change and influence engineering teams to adopt best practices. Passionate about open-source, decentralization, and blockchain technology. Nice to have: Experience building QA processes from scratch in a fast-paced, engineering-driven environment. Experience with blockchain-based, Web3 products and crypto wallets Experience testing complex web and mobile applications, either manually or through automation Familiarity with blockchain, Ethereum, and Layer 2 scaling solutions (or strong willingness to learn). Security mindset-familiarity with fuzzing, property-based testing, and formal verification is a plus. Experience with performance testing for distributed systems, including load testing and bottleneck analysis. Prior experience mentoring QA engineers. 
Perks: Remote-first global workforce + NY office Annual company offsite + team onsites Professional reimbursement program (facilitates industry conference attendance, certifications, and more) Medical, dental & vision coverage (US + some other countries) 401k retirement plan + company match (US only) Wellness stipend Home office set up / ergonomic equipment program Attention Offchain Labs Job Seekers: This role cannot be performed in California or Colorado. Please be advised that there has been a rise in fraudulent recruiter activities, particularly within the Web3 space. If you would like to confirm whether someone is an OCL employee or the legitimacy of an offer you received, please email ********************* At Offchain Labs, we are committed to building a welcoming and supportive workplace for all employees, regardless of their background or identity. We strive to create an environment where everyone feels valued and has an equal opportunity to succeed and thrive. We encourage candidates from all walks of life to apply and join our team.
    $67k-91k yearly est. Auto-Apply 48d ago
  • Product Quality Engineer - Quality Dept

    Fuyao Glass America Inc. 4.3 company rating

    Moraine, OH jobs

    Job Title: Product Quality Engineer Job Summary: 1. Advance product quality planning in the New Product Development phase; 2. Control the new product quality before SOP+3 months; 3. New product verification; 4. Solve and improve quality problems. Job Functions: APQP (advanced product quality planning): 1.1 Participate in customer drawing and technical specification review and examination, define special customer requirements; 1.2 According to the internal control quality standards of the customer and the company, establish the new product quality standard and the detection and test specification; 1.3 Prepare the new product quality target according to customer requirements and the company's profit plan; 1.4 Prepare the new product quality plan according to the production management level and quality management level of the company; 1.5 Organize evaluation of the existing test and detection capability for new products and prepare an improvement plan. Control the new product quality before SOP+3 months: 2.1 Participate in the design of the new product test and detection plan, responsible for verifying the detection and test capability for new products; 2.2 Organize the process capability study and verification of the new product technological process and participate in Run-at-Rate verification; 2.3 Organize product review of samples during new product development, quality gate review before mass production and the process review; 2.4 Participate in the product review, revised changes review and trial assembly during the customer's new product development; 2.5 Monitor whether all necessary processes and plans during new product development are effectively implemented. New product verification: 3.1 Work with the Quality Verification Engineer to develop the PV testing plan according to the OEM requirements; 3.2 Participate in and track the implementation of the PV testing plan, ensure timely attainment. 
Solve and improve quality problem: 4.1 Properly handling with quality problems occurred during new product development and SOP+6 months; 4.2 Handling with customer complaint and returns of goods during new product development and SOP+6 months; 4.3 Establishing new product quality resume, summarize quality problems and provide improvement suggestions. Other works relating to quality assurance: 5.1 Participate in internal training specifically required by customer; 5.2 Responsible for training and guiding new product quality standard and inspection specification; 5.3 Assist quality control supervisor in recognizing and managing process change points; 5.4 Responsible for compiling, archiving and transferring materials relating to new product development process Reports to prepare and submit: Quality review report, Summary of customer quality information, Project quality record Other duties as assigned Nothing in the Position Description restricts management's right to assign or re-assign duties and responsibilities to this job at any time Qualifications Languages spoken commonly in the workplace are English and Mandarin. - Ability to read, understand and comprehend documents such as safety rules, operating and maintenance instructions. Ability to interpret a variety of instructions furnished in written, oral, diagram, or schedule form. Ability to speak effectively and interact with other team members engineers, leadership and customers. Experienced in quality engineer area Bachelor degree or above (The mechanical and electrical major is preferred) Bilingual: English - Chinese (Required) The employee is regularly required o stand for long periods. Duties include turning at the waist, reaching, bending, squatting and lifting up to 50 pounds. Ability to pass static strength requirements (grip) Clarity of vision at 20 inches or less. Use this factor when special and minute accuracy is demanded Clarity of vision at 20 feet or more. 
Use this factor when visual efficiency in terms of far acuity is required in day and night/dark conditions Three-dimensional vision. Ability to judge distances and spatial relationships so as to see objects where and as they actually are. Ability to identify and distinguish colors Observing an area that can be seen up and down or to right or left while eyes are fixed on a given point The noise level in the work environment is usually moderate. Safety requirements for this position are safety glasses, hearing protection and steel-toed work boots Ability to add, subtract, multiply, and divide in all units of measure, using whole numbers, common fractions, and decimals Ability to solve practical problems and deal with a variety of variables Knowledge of and familiarity manufacturing software Familiar with manufacturing principle and technological process of automotive glass Familiar with the company's quality policy, objectives and commitments Have a good knowledge of TS16949 quality management system Have a basic knowledge of ISO14001 and OHSAS18001 quality system Familiar with various standards and inspection/test methods for automotive glass Familiar with core knowledge of Five Tools Quality planning of new projects Risk assessment and management of new projects Able to solve problems systematically Relatively better planning, organization, Leadership and control ability Relatively strong sense of responsibility and sensitivity; be able to actively deal with problems Able to work under pressure Systematically trained in manufacturing technique of automotive glass and main equipment Systematically trained in APQP Trained in TS16949 quality system (including six handbooks) Systematically trained in quality tools Proficient in using Microsoft Office software Word, Excel, Power Point Five Tools and quality statistics knowledge Proficient in using 8D method to solve problems This is a 2nd Shift position
    $58k-75k yearly est. Auto-Apply 60d+ ago
  • Product Quality Engineer - Quality Dept

    Fuyao Glass America Inc. 4.3company rating

    Moraine, OH jobs

    Job Title: Product Quality Engineer Job Summary: 1. Advance product quality planning in the New Product Development phase; 2. Control new product quality before SOP+3 months; 3. New product verification; 4. Solve and improve quality problems. Job Functions: APQP (advanced product quality planning): 1.1 Participate in the review and examination of customer drawings and technical specifications, and define special customer requirements; 1.2 Establish the new product quality standard and the detection and test specification according to the internal quality control standards of the customer and the company; 1.3 Prepare the new product quality target according to customer requirements and the company's profit plan; 1.4 Prepare the new product quality plan according to the company's production management level and quality management level; 1.5 Organize evaluation of existing test and detection capability for new products and prepare an improvement plan. Control new product quality before SOP+3 months: 2.1 Participate in the design of the new product test and detection plan, and verify the detection and test capability for new products; 2.2 Organize the process capability study and verification of the new product technological process and participate in Run-at-Rate verification; 2.3 Organize product review of samples during new product development, the quality gate review before mass production, and the process review; 2.4 Participate in the product review, review of revised changes, and trial assembly during the customer's new product development; 2.5 Monitor whether all necessary processes and plans during new product development are effectively implemented. New product verification: 3.1 Work with the Quality Verification Engineer to develop the PV testing plan according to OEM requirements; 3.2 Participate in and track the implementation of the PV testing plan and ensure timely attainment.
Solve and improve quality problems: 4.1 Properly handle quality problems that occur during new product development and SOP+6 months; 4.2 Handle customer complaints and returned goods during new product development and SOP+6 months; 4.3 Establish a new product quality resume, summarize quality problems, and provide improvement suggestions. Other work relating to quality assurance: 5.1 Participate in internal training specifically required by the customer; 5.2 Provide training and guidance on new product quality standards and inspection specifications; 5.3 Assist the quality control supervisor in recognizing and managing process change points; 5.4 Compile, archive, and transfer materials relating to the new product development process. Reports to prepare and submit: Quality review report, Summary of customer quality information, Project quality record. Other duties as assigned. Nothing in the Position Description restricts management's right to assign or re-assign duties and responsibilities to this job at any time. Qualifications: Languages spoken commonly in the workplace are English and Mandarin. Ability to read, understand, and comprehend documents such as safety rules and operating and maintenance instructions. Ability to interpret a variety of instructions furnished in written, oral, diagram, or schedule form. Ability to speak effectively and interact with other team members, engineers, leadership, and customers. Experience in quality engineering. Bachelor's degree or above (a mechanical and electrical major is preferred). The employee is regularly required to stand for long periods. Duties include turning at the waist, reaching, bending, squatting, and lifting up to 50 pounds. Ability to pass static strength requirements (grip). Clarity of vision at 20 inches or less. Use this factor when special and minute accuracy is demanded. Clarity of vision at 20 feet or more.
Use this factor when visual efficiency in terms of far acuity is required in day and night/dark conditions. Three-dimensional vision. Ability to judge distances and spatial relationships so as to see objects where and as they actually are. Ability to identify and distinguish colors. Observing an area that can be seen up and down or to right or left while eyes are fixed on a given point. The noise level in the work environment is usually moderate. Safety requirements for this position are safety glasses, hearing protection, and steel-toed work boots. Ability to add, subtract, multiply, and divide in all units of measure, using whole numbers, common fractions, and decimals. Ability to solve practical problems and deal with a variety of variables. Knowledge of and familiarity with manufacturing software. Familiar with the manufacturing principles and technological process of automotive glass. Familiar with the company's quality policy, objectives, and commitments. Good knowledge of the TS16949 quality management system. Basic knowledge of the ISO14001 and OHSAS18001 quality systems. Familiar with various standards and inspection/test methods for automotive glass. Familiar with core knowledge of the Five Tools. Quality planning of new projects. Risk assessment and management of new projects. Able to solve problems systematically. Relatively strong planning, organization, leadership, and control ability. Relatively strong sense of responsibility and sensitivity; able to actively deal with problems. Able to work under pressure. Systematically trained in manufacturing techniques of automotive glass and main equipment. Systematically trained in APQP. Trained in the TS16949 quality system (including six handbooks). Systematically trained in quality tools. Proficient in using Microsoft Office software (Word, Excel, PowerPoint). Five Tools and quality statistics knowledge. Proficient in using the 8D method to solve problems. This position is for 1st Shift.
    $58k-75k yearly est. Auto-Apply 50d ago
