Site Reliability Engineer
Full-time, Mid-Senior Level
Job Overview
Position Overview
The Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of complex distributed systems deployed across private, public, and hybrid cloud environments. This is a hands-on technical leadership role that combines deep infrastructure knowledge with software engineering expertise to build systems that are automated, observable, and operationally sustainable.
As a Site Reliability Engineer, you will play a central role in evolving the reliability and sustainability of the company’s core platform. Your work will directly shape the resilience of mission-critical systems deployed at customer premises, influencing not only internal engineering excellence but also the long-term trust and satisfaction of enterprise clients.
The successful candidate will work within a DevIntegration team, integrating multiple layers of the product stack to enable automated, Kubernetes-based GPU workload provisioning using the Cluster API framework. You will contribute both strategically and tactically, shaping architectural direction while also leading by example in implementation, troubleshooting, and mentorship.
Key Responsibilities
1. Reliability and Infrastructure Engineering
- Design, deploy, and maintain highly available, fault-tolerant systems running on Kubernetes and bare metal infrastructure.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance innovation velocity with operational stability.
- Lead system reliability initiatives, ensuring that uptime and performance targets are consistently achieved.
2. System Integration and Automation
- Work within the DevIntegration team to integrate diverse components of the product stack, enabling end-to-end cluster provisioning and management.
- Build automation pipelines using Infrastructure as Code (IaC) and CI/CD frameworks to ensure consistent, repeatable deployments.
- Develop scripts, frameworks, and tools to eliminate manual interventions and improve system resilience.
3. Architecture and Design Leadership
- Participate in and lead architectural discussions, from high-level design to low-level implementation, to ensure alignment with reliability, security, and scalability goals.
- Collaborate with development and product teams to address functional gaps and propose sustainable technical solutions in a fast-paced environment.
4. Operational Excellence
- Ensure long-term operational sustainability of the deployed product, including updates, incident management, and integration with third-party enterprise systems such as PKI, IAM, and SIEM.
- Conduct performance optimization, capacity planning, and root cause analysis to maintain system health.
- Champion automation of day-2 operations, such as monitoring, scaling, patching, and recovery.
5. Leadership and Mentorship
- Take ownership beyond the engineering scope when needed, leading planning, coordination, and execution activities with an end-to-end accountability mindset.
- Mentor and support team members, sharing deep expertise in reliability engineering, infrastructure design, and troubleshooting best practices.
- Actively contribute to defining and refining SRE standards and processes across the organization.