Job Overview
Responsibilities
- Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
- Proactively monitor application health and performance across cloud infrastructure (AWS).
- Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
- Lead and participate in disaster recovery drills and security incident simulations.
- Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
- Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
- Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
- Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
- Champion best practices in security, availability, performance, and incident response.
Required Technologies & Tools
- Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
- Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
- Containerization: Experience with Docker for container-based deployment pipelines.
- Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
- Backend Stack: Understanding of NestJS and scalable Node-based services.
- Databases: Proficiency in MySQL and in performance monitoring of relational databases.
- Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.
Core Competencies
- Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
- System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
- Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
- Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
- Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.
Qualifications
- 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
- Proven track record managing and scaling applications in a production AWS environment.
- Familiarity with full-stack environments, particularly those using Node.js.
- Experience deploying, maintaining, and performance-tuning databases such as MySQL.
- Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
- Commitment to uptime, performance, and security in fast-moving SaaS environments.