SRE Manager
Full-time, Mid-Senior Level
Job Overview
We are seeking a talented and motivated SRE Manager to join our dynamic team. In this role, you will execute a range of site reliability activities, ensuring optimal service performance, reliability, and availability. You will collaborate with cross-functional engineering teams to develop scalable, fault-tolerant, and cost-effective cloud services.
If you are passionate about site reliability engineering and ready to make a significant impact, we would love to hear from you!
Key Responsibilities:
As SRE Manager, you will play a crucial role in ensuring system reliability, scalability, and operational efficiency. Your responsibilities will include:
- Leading & Mentoring – Guiding a team of SREs, fostering a culture of automation, resilience, and continuous improvement.
- Defining SLOs & SLIs – Establishing service level objectives (SLOs) and indicators (SLIs) to measure and maintain system performance.
- Incident Management – Overseeing incident response, conducting post-mortems, and implementing preventive measures.
- Collaboration with Engineering Teams – Working closely with developers to build scalable and resilient systems.
- Automation & Efficiency – Driving automation initiatives to reduce toil and enhance operational workflows.
- Risk & Compliance Management – Ensuring adherence to security, compliance, and reliability standards.
- Optimizing Observability – Implementing monitoring tools and strategies to proactively detect and resolve issues.
- Implement automation tools, frameworks, and CI/CD pipelines, promoting best practices and code reusability.
- Enhance site reliability through process automation, reducing mean time to detection, resolution, and repair.
- Identify and manage risks through regular assessments and proactive mitigation strategies.
- Develop and troubleshoot large-scale distributed systems in both on-prem and cloud environments.
- Deliver infrastructure as code to improve service availability, scalability, latency, and efficiency.
- Monitor support processes to detect issues early, and share knowledge of emerging site reliability trends.
- Analyze data to identify improvement areas and optimize system performance through scale testing.