Senior SRE Engineer
Full-time, Mid-Senior level
Job Overview
As part of our continued growth, Neo Group is recruiting on behalf of one of our local partners, leveraging our network of 1,400 talented professionals across 10+ countries. Together, we are committed to delivering innovative, data-driven solutions that empower our clients and foster professional growth within a dynamic and collaborative workplace.
We are on the lookout for a Senior SRE Engineer to join our Engineering Department.
Responsibilities:
- Design, deploy, and maintain observability platforms including Zabbix, Grafana, and the OpenSearch Stack (OpenSearch, Logstash, Kibana).
- Implement and maintain metrics, logs, traces, and synthetic monitoring across infrastructure and applications.
- Integrate Prometheus, Alertmanager and OpenTelemetry where applicable to achieve unified observability.
- Maintain monitoring coverage for Linux, network devices, applications, and cloud services.
- Maintain and enhance the overall monitoring and logging infrastructure, including capacity, performance, and reliability.
- Develop meaningful dashboards and alerting logic to ensure timely and actionable incident notifications.
- Optimize alerting systems: reduce noise, tune thresholds, and focus on critical business and technical metrics.
- Improve observability processes and implement predictive failure analysis and early-warning signals.
- Analyze incidents, identify patterns, and drive proactive monitoring improvements.
- Define and maintain KPIs, SLIs, SLOs, and SLA measurement processes in coordination with service owners.
- Enhance reliability through structured incident management and post-mortem analysis.
- Automate deployment and configuration of monitoring components using Ansible and Terraform, following IaC principles.
- Manage configuration templates and Zabbix host provisioning through automation tools (Ansible, Terraform).
- Leverage APIs and scripting (e.g., Python, Go) for data collection, integrations, and automation.
- Collaborate closely with Developers, System Engineers, DevOps, and IT Operations teams to improve system reliability and reduce MTTR.
- Establish and evolve the Monitoring & Diagnostics foundation for the in-house 24/7 App Support team, including tooling, processes, knowledge base, training, runbooks, and troubleshooting guides.
- Create intelligent, step-by-step troubleshooting instructions to speed up incident resolution.
Requirements
- 4+ years of experience as an SRE, Monitoring Engineer, or in a similar role in production environments.
- Advanced Linux user with strong command-line and diagnostic skills.
- Strong understanding of monitoring, logging, and observability concepts (metrics, logs, traces, SLIs/SLOs, alerting).
- Hands-on experience with several of the following: Zabbix, Prometheus, Grafana, Elastic Stack (ELK), Alertmanager, OpenTelemetry.
- Experience managing both cloud-based and on-premise environments.
- Automation skills using Python or Go.
- Proficiency with configuration management / IaC tools (Ansible, Terraform or similar).
- Solid grasp of networking principles and protocols (TCP/IP, HTTP, DNS, load balancing, etc.).
- Experience with CI/CD pipelines (GitLab, Jenkins or similar).
- Familiarity with container orchestration (Kubernetes, Rancher).
- Experience documenting workflows and training support teams.
- Proven skills in incident analysis, pattern recognition, and driving preventive improvements.
- Good communication skills and ability to work with cross-functional teams.
Nice to Have:
- Experience with synthetic monitoring tools and user-experience monitoring.
- Background in capacity planning and performance tuning.
- Advanced knowledge of ML-driven monitoring and predictive analysis.
- Experience with automated incident response (self-healing systems).
Soft Skills:
- Responsibility, initiative, and strong analytical thinking.
- Ability to collaborate effectively within a team.
- Focus on automation and process improvement.
- Strong documentation and knowledge-sharing skills.
- Capability to diagnose complex incidents and provide actionable insights.
Benefits
- Enjoy 3 health days to focus on your well-being.
- Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
- Get a sports compensation of $30 net per month to stay active and healthy.
- Benefit from top-notch medical insurance for peace of mind.
- Indulge in a variety of snacks available in the office.
- Join us for exciting corporate events that foster team spirit and fun!