Senior Site Reliability Engineer

Full-time, permanent; experienced

Job Overview

What are we building?

Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. We’re building a team that shares a passion for learning, operating, and building new products and technologies for millions of consumers. We care about each customer interaction, experience, behavior, and insight, and we strive to ensure we’re always acting authentically.

Rooted in the kindred spirits of Hard Rock and the Seminole Tribe of Florida, Hard Rock Digital taps a brand known the world over as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space. Ready to join us?

What’s the position?

We are looking for a skilled Sr. Site Reliability Engineer (SRE) to maintain and improve the reliability, scalability, and performance of our Java-based application. You will be responsible for managing and monitoring the applications and infrastructure, using the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability, and implementing robust monitoring, alerting, and logging solutions.

Key Responsibilities:

Application Reliability & Performance:

  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.

  • Troubleshoot and resolve complex issues in production and non-production environments.

  • Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.

  • Optimize Java application performance, ensuring efficient resource utilization and scaling.

Monitoring & Observability:

  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.

  • Implement and refine observability strategies to enhance application and infrastructure visibility.

  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.

Incident Management & Root Cause Analysis:

  • Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes of issues to prevent recurrence.

  • Document and share lessons learned from incidents, contributing to a culture of continuous improvement.

Collaboration & Cross-functional Support:

  • Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.

  • Collaborate closely with DevOps and NOC teams to support the application platform.

  • Communicate SRE practices and principles to technical and non-technical stakeholders.

  • Provide feedback and insights on application performance, potential improvements, and observability metrics.
