SRE Manager

Posted January 09, 2026
Full-time, Mid-Senior Level

Job Overview

We are seeking a talented and motivated SRE Manager to join our dynamic team. In this role, you will lead a range of site reliability activities to ensure optimal service performance, reliability, and availability. You will collaborate with cross-functional engineering teams to develop scalable, fault-tolerant, and cost-effective cloud services.

If you are passionate about site reliability engineering and ready to make a significant impact, we would love to hear from you!

Key Responsibilities:

A Site Reliability Engineering (SRE) Team Manager plays a crucial role in ensuring system reliability, scalability, and operational efficiency. Responsibilities typically include:

  • Leading & Mentoring – Guiding a team of SREs, fostering a culture of automation, resilience, and continuous improvement.
  • Defining SLOs & SLIs – Establishing service level objectives (SLOs) and indicators (SLIs) to measure and maintain system performance.
  • Incident Management – Overseeing incident response, conducting post-mortems, and implementing preventive measures.
  • Collaboration with Engineering Teams – Working closely with developers to build scalable and resilient systems.
  • Automation & Efficiency – Driving automation initiatives to reduce toil and enhance operational workflows.
  • Risk & Compliance Management – Ensuring adherence to security, compliance, and reliability standards.
  • Optimizing Observability – Implementing monitoring tools and strategies to proactively detect and resolve issues.

  • Implement automation tools, frameworks, and CI/CD pipelines, promoting best practices and code reusability.
  • Enhance site reliability through process automation, reducing mean time to detection, resolution, and repair.
  • Identify and manage risks through regular assessments and proactive mitigation strategies.
  • Develop and troubleshoot large-scale distributed systems in both on-prem and cloud environments.
  • Deliver infrastructure as code to improve service availability, scalability, latency, and efficiency.
  • Monitor support processes to detect issues early, and share knowledge on emerging site reliability trends.
  • Analyze data to identify areas for improvement and optimize system performance through scale testing.

 

 

 
