Platform Site Reliability Engineer
Full-time, Mid-Senior Level
Job Overview
Nexthink is looking for a strong Platform Engineer with SRE operations experience to strengthen our infrastructure and accelerate our ability to deploy, monitor, and scale systems effectively. As a SaaS provider, our customers rely on us to deliver a seamless, reliable, and scalable experience 24/7. Candidates for this role must be located in the West or Mountain Time Zone.
Join Nexthink's vibrant team where cutting-edge technology meets innovation. Be a part of Nexthink's Digital Employee Experience technological revolution, ensuring our global customers enjoy a seamless user experience. Embrace the future with Nexthink in the US; apply now and become a key player in our dynamic Platform Engineering/SRE organization.
What You'll Do:
- Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
- Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
- Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery.
- Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
- Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning.
- Write software for platform tooling, such as automation, monitoring, and provisioning.
- Apply a solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, virtual subnets, NACLs, NSG, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.).
- Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, and Grafana.
- Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR). Coordinate teams and individuals to maintain an SLA.
- Troubleshoot, isolate, and fix incidents with minimal intervention from other functions.
- Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
- Work closely with software engineers to embed reliability and observability into every service.
- Develop automated runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
- Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
- Contribute to security best practices, compliance automation, and cost optimization.