Site Reliability Engineer

Posted November 18, 2025
Full-time Mid-Senior Level

Position Overview

The Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of complex distributed systems deployed across private, public, and hybrid cloud environments. This is a hands-on technical leadership role that combines deep infrastructure knowledge with software engineering expertise to build systems that are automated, observable, and operationally sustainable.

As a Site Reliability Engineer, you will play a central role in evolving the reliability and sustainability of the company’s core platform. Your work will directly shape the resilience of mission-critical systems deployed at customer premises, influencing not only internal engineering excellence but also the long-term trust and satisfaction of enterprise clients.

The successful candidate will work within a DevIntegration team, integrating multiple layers of the product stack to enable automated, Kubernetes-based GPU workload provisioning using the Cluster API framework. You will contribute both strategically and tactically, shaping architectural direction while leading by example in implementation, troubleshooting, and mentorship.

 

Key Responsibilities

1. Reliability and Infrastructure Engineering

  • Design, deploy, and maintain highly available, fault-tolerant systems running on Kubernetes and bare-metal infrastructure.
  • Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance innovation velocity with operational stability.
  • Lead system reliability initiatives, ensuring that uptime and performance targets are consistently achieved.

2. System Integration and Automation

  • Work within the DevIntegration team to integrate diverse components of the product stack, enabling end-to-end cluster provisioning and management.
  • Build automation pipelines using Infrastructure as Code (IaC) and CI/CD frameworks to ensure consistent, repeatable deployments.
  • Develop scripts, frameworks, and tools to eliminate manual interventions and improve system resilience.

3. Architecture and Design Leadership

  • Participate in and lead architectural discussions, from high-level design to low-level implementation, to ensure alignment with reliability, security, and scalability goals.
  • Collaborate with development and product teams to address functional gaps and propose sustainable technical solutions in a fast-paced environment.

4. Operational Excellence

  • Ensure long-term operational sustainability of the deployed product, including updates, incident management, and integration with third-party enterprise systems such as PKI, IAM, and SIEM.
  • Conduct performance optimization, capacity planning, and root cause analysis to maintain system health.
  • Champion automation of day-2 operations, such as monitoring, scaling, patching, and recovery.

5. Leadership and Mentorship

  • Take ownership beyond the engineering scope when needed, leading planning, coordination, and execution with end-to-end accountability.
  • Mentor and support team members, sharing deep expertise in reliability engineering, infrastructure design, and troubleshooting best practices.
  • Actively contribute to defining and refining SRE standards and processes across the organization.
