Associate Staff Engineer, DevOps
Full-time
Job Overview
Requirements:
- Experience: 5+ years
- Strong experience in DevOps or Site Reliability Engineering (SRE) roles.
- Strong knowledge of Docker, Kubernetes, Terraform, and CI/CD pipelines.
- Hands-on experience with AWS, Azure, or other cloud platforms.
- Familiarity with GPU infrastructure and ML workloads is a plus.
- Good understanding of monitoring and logging systems (Prometheus, Grafana).
- Ability to collaborate with ML teams to optimize inference and deployment.
- Strong troubleshooting and problem-solving skills in high-scale environments.
- Knowledge of infrastructure security best practices, cost optimization, and performance tuning.
- Exposure to vector databases and AI/ML deployment pipelines is highly desirable.
Responsibilities:
- Maintain and manage Kubernetes clusters, AWS/Azure environments, and GPU infrastructure for high-performance workloads.
- Design and implement CI/CD pipelines for seamless deployments and faster release cycles.
- Set up and maintain monitoring and logging systems using Prometheus and Grafana to ensure system health and reliability.
- Support vector database scaling and model deployment for AI/ML workloads.
- Collaborate with ML engineering teams to optimize inference performance and resource utilization.
- Ensure high availability, security, and scalability of infrastructure across multiple environments.
- Automate infrastructure provisioning and configuration using Terraform and other IaC tools.
- Troubleshoot production issues and implement proactive measures to prevent downtime.
- Continuously improve deployment processes and infrastructure reliability through automation and best practices.
- Participate in architecture reviews, capacity planning, and disaster recovery strategies.
- Drive cost optimization initiatives for cloud resources and GPU utilization.
- Stay current with emerging technologies in cloud-native computing, AI infrastructure, and DevOps automation.