
Site Reliability Engineer (SRE)

Full-time

Job Overview

Our client is an innovative technology company operating large-scale cloud and edge infrastructure supporting AI-driven products and services. As the platform continues to expand, they are looking for a Site Reliability Engineer to help build highly reliable, observable, and secure systems that power mission-critical applications.

This role offers the opportunity to work across cloud infrastructure, Kubernetes, observability, security, automation, and emerging AI operational platforms in a fast-growing environment.

What you will do:

    Reliability & Observability

    • Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
    • Build visibility into system health through metrics, logs, traces, and performance analytics.
    • Define and manage SLIs, SLOs, and service reliability targets.
    • Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

    Cloud Infrastructure & Platform Operations

    • Deploy, manage, and optimize containerized workloads running on Kubernetes.
    • Maintain scalable cloud infrastructure across production environments.
    • Improve system performance, availability, and operational efficiency.
    • Support infrastructure provisioning through Infrastructure-as-Code practices.

    Security & Access Management

    • Implement secure access controls and audit mechanisms across infrastructure environments.
    • Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
    • Develop alerting and response procedures for security-related incidents.
    • Contribute to operational security best practices and governance initiatives.

    Automation & Engineering Excellence

    • Automate repetitive operational tasks to reduce manual effort and improve reliability.
    • Build tooling and scripts to streamline infrastructure operations.
    • Support CI/CD workflows and deployment automation.
    • Promote documentation, operational standards, and continuous improvement.

    Incident Response & Reliability Engineering

    • Participate in on-call rotations and incident management.
    • Lead troubleshooting efforts during production incidents.
    • Conduct root-cause analysis and post-mortem reviews.
    • Drive long-term improvements that enhance system resilience.

    Cross-Functional Collaboration

    • Work closely with software, AI, machine learning, hardware, and product teams.
    • Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
    • Support the operational needs of both cloud-based and distributed edge computing environments.

What you will need:

  • 3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.
  • Hands-on experience with AWS or other major cloud platforms.
  • Strong understanding of observability and monitoring tools such as Grafana, Prometheus, or similar platforms.
  • Solid Linux administration and troubleshooting skills.
  • Experience with Docker, Kubernetes, and containerized workloads.
  • Experience with Infrastructure as Code tools such as Terraform.
  • Proficiency in at least one scripting or programming language (Python, Bash, etc.).
  • Understanding of networking fundamentals and infrastructure security concepts.
  • Experience supporting production systems and participating in incident response.
  • Strong automation mindset and commitment to operational excellence.

Nice-to-haves:

  • Experience operating large-scale edge computing or IoT deployments.
  • Familiarity with zero-trust access management platforms.
  • Experience in security operations, threat detection, or infrastructure security.
  • Exposure to AI infrastructure, LLM-based applications, or workflow automation platforms.
  • Knowledge of AI-Ops, anomaly detection, or intelligent monitoring solutions.
  • Familiarity with compliance and security frameworks such as ISO 27001.
