Senior MLOps Engineer
Full-time
Job Overview
About the Institute of Foundation Models (IFM)
The Institute of Foundation Models is a dedicated research lab for building, understanding, deploying, and risk-managing large-scale AI systems. We drive innovation in foundation models and their operationalization, empowering research, education, and industry adoption through scalable infrastructure and real-world applications.
As part of our engineering team, you will operate at the intersection of machine learning and systems design — building the cloud, orchestration, and deployment layers that power the next generation of intelligent applications at MBZUAI. You’ll work alongside world-class AI researchers and engineers to productionize LLMs, voice models, and multimodal systems at scale.
The Role
As a Senior MLOps Engineer, you will design, build, and maintain robust machine learning (ML) infrastructure across training, inference, and deployment pipelines. You will own the model lifecycle, from data ingestion to real-time serving, and ensure our LLM and speech models are deployed efficiently, securely, and reproducibly in Kubernetes-based environments.
This position requires deep hands-on experience with Kubernetes (EKS), Helm, AWS cloud infrastructure, and modern MLOps toolchains (e.g., vLLM, SGLang, OpenWebUI, Weights & Biases, MLflow). Familiarity with speech/voice AI frameworks like ElevenLabs, Whisper, and RVC is also valuable.
Key Responsibilities
- Design and manage scalable ML infrastructure on AWS using EKS, EC2, RDS, S3, and IAM-based access control.
- Build and maintain Kubernetes deployments for LLM and TTS inference using Helm, ArgoCD, and Prometheus/Grafana monitoring.
- Implement and optimize model serving pipelines using vLLM, SGLang, TensorRT, or similar frameworks for high-throughput inference.
- Develop CI/CD and MLOps automation for data versioning, model validation, and deployment (GitHub Actions, Jenkins, or AWS CodePipeline).
- Integrate OpenWebUI, Gradio, or similar UIs for user-facing model demos and internal evaluation tools.
- Collaborate with ML researchers to productize models — including TTS (e.g., ElevenLabs API), ASR (Whisper), and LLM-based chat systems.
- Ensure observability, cost optimization, and reliability of cloud resources across multiple environments.
- Contribute to internal tools for dataset curation, model monitoring, and retraining pipelines.
- Maintain infrastructure-as-code using Terraform and Helm charts for reproducibility and governance.
- Support real-time multimodal workloads (voice, text, vision) across inference clusters.
Qualifications – Required
- 4+ years of experience in MLOps, DevOps, or Cloud Infrastructure Engineering for ML systems.
- Strong proficiency in Kubernetes, Helm, and container orchestration.
- Experience deploying ML models via vLLM, SGLang, TensorRT, or Ray Serve.
- Proficiency with AWS services (EKS, EC2, S3, RDS, CloudWatch, IAM).
- Solid experience with Python, Docker, Git, and CI/CD pipelines.
- Strong understanding of model lifecycle management, data pipelines, and observability tools (Grafana, Prometheus, Loki).
- Excellent collaboration skills with ML researchers and software engineers.
Professional Experience – Preferred
- Extensive experience with vLLM, Kubernetes, ElevenLabs, Whisper, Gradio/OpenWebUI, or custom TTS/ASR model hosting.
- Familiarity with multi-GPU scheduling, NCCL optimization, and HPC cluster integration.
- Knowledge of security, cost management, and network policies in multi-tenant Kubernetes clusters and Cloudflare-based systems.
- Prior work in LLM deployment, fine-tuning pipelines, or foundation model research.
- Exposure to data governance and responsible AI operations in research or enterprise settings.