Site Reliability Engineer

Posted January 21, 2026
Full-time Mid-Senior Level

Job Overview

Who You’ll Work With

SREs at Arista combine strong software and systems engineering with a passion for operating production systems at scale. As an SRE, you’ll be part of the team responsible for our global service fleet.

What You’ll Do:

CloudVision is deployed on Kubernetes across global regions, using Spinnaker for our CI/CD pipeline. Our tech stack runs on GKE and uses HBase/Hadoop as the main distributed database and storage layer, Elasticsearch for search, ClickHouse for fast real-time queries of flow data, our own Kafka-based distributed real-time stream processing layer for analytics, and TensorFlow for ML analysis. Our monitoring system is built on top of Prometheus, Grafana, Loki, and other OSS tools.
As a Senior SRE, you’ll be responsible for our global CloudVision service fleet. This includes:

  • Build, safely and incrementally deploy, and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
  • Monitor, support, and enhance the product deployment experience across services.
  • Build automation to remove toil and efficiently operate production systems.
  • Proactively monitor, respond to, and enhance alerts, and set up automated alert handling.
  • Create and maintain incident response runbooks.
  • Build and deploy new systems with scalability, reliability, and observability as primary requirements.
  • Triage platform and infrastructure issues, help Arista software engineers with their own triage, and engage with third-party vendor support.
  • Deploy new systems in a staged manner.
  • Write postmortem documents and build solutions to prevent incidents from recurring.
  • Plan and communicate maintenance windows on production systems.
  • Work with Arista’s product development teams to identify infrastructure issues that cause bottlenecks and limitations in their workflows, then design and implement solutions to resolve them.
  • Survey and adopt infrastructure and platform best practices to maintain secure, scalable, and fault-tolerant systems.
  • Implement solutions to scale the systems.
  • Implement fault-tolerance and performance improvements to increase the availability of the systems.
  • Study the design of OSS systems, and enough of their implementation details, to triage issues and resolve them more effectively.
