Platform / SRE Engineer
Contract, Mid-Senior Level
Job Overview
Anticipated Contract End Date/Length: November 30, 2026.
Work Setup: Hybrid (3 days per week in office)
Clearance required: BPSS
Our client in the Information Technology and Services industry is looking for a Platform / SRE Engineer to own deployment, observability, reliability, cost control, and production operations for an AI helpdesk platform. This role will support the design, deployment, and operational management of AI services and production environments while ensuring scalability, uptime, performance optimization, and operational resilience across cloud-based infrastructure.
The ideal candidate will bring strong expertise in DevOps and Site Reliability Engineering practices, along with experience managing cloud-native platforms, CI/CD pipelines, observability tooling, and AI/ML production workloads within complex enterprise environments.
What you will do:
- Build and manage CI/CD pipelines, infrastructure, and runtime environments for AI services.
- Deploy and operate model-serving, orchestration, and application workloads.
- Implement monitoring, tracing, alerting, logging, and operational dashboards.
- Manage scaling activities, release processes, rollback mechanisms, and production support operations.
- Optimize inference cost, latency, uptime, and overall system reliability.
- Create runbooks, operational standards, and incident response processes.
- Support infrastructure automation and platform engineering initiatives.
- Maintain observability and monitoring solutions across production environments.
- Support release automation, secrets management, and production operational processes.
- Collaborate with engineering teams to support AI platform reliability and operational readiness.
- Troubleshoot production issues and support system diagnostics and remediation activities.
- Ensure platform stability, scalability, and performance across cloud-native environments.