Associate Manager - Reliability Operations (SaaS Support) at Zeta
Zeta · Hyderabad, India · On-site
- Optional office in Hyderabad
Role
- The Associate Manager - Reliability Operations leads a team to rigorously uphold service level objectives (SLOs) through expert alert management, SOP-compliant ticket escalations, and coordinated support for SRE-signed deployments across multiple sites.
- This role drives operational accountability, fosters seamless SRE partnerships, and ensures production stability in a high-stakes 24x7 SaaS environment.
Responsibilities
- Drives SLO adherence by implementing advanced metric monitoring, enforcing error budgets, and spearheading proactive initiatives to prevent breaches and elevate system reliability.
- Ensures every alert is acknowledged immediately and escalates tickets to SRE teams for any issue without a defined SOP, while systematically reducing escalations, downtime, and MTTR.
- Coordinates standard deployments across sites following SRE sign-off, overseeing logistics, real-time rollout health monitoring, and rigorous post-deployment SLO validation.
- Collaborates strategically with SRE teams on deployment planning, comprehensive risk assessments, troubleshooting, and post-release optimizations for flawless execution and rapid recovery.
- Oversees and refines team processes for alert triage, SOP documentation/updates, and knowledge sharing, integrating automation to minimize manual toil and enhance operational resilience.
- Mentors staff on SLO-driven decision-making, conducts in-depth audits of alert/ticket workflows, analyzes trends in operational data, and delivers actionable reliability KPI reports to stakeholders.
Skills
- Proven track record in 24x7 SaaS/cloud support operations, handling high-pressure incidents and customer-impacting events.
- Strong proficiency in monitoring/incident tools (Prometheus, Grafana, Splunk, PagerDuty) and ticketing systems.
- Effective leadership and people management, with excellent communication for technical/non-technical collaboration.
- Analytical skills to interpret operational data, identify trends, and drive process recommendations.
Experience and Qualifications
- Familiarity with ITIL frameworks, SRE principles (e.g., error budgets, toil reduction), and cloud platforms (AWS, Azure, GCP).
- Experience with process improvement methodologies and shift handoff protocols.
- Knowledge of basic reliability concepts and observability stacks.
- Education: Bachelor's degree in Information Technology, Business, or related field; relevant IT certifications (e.g., ITIL Foundation) are a plus.
- Experience: 6-8 years in operations support, reliability operations, or IT service management, including 2+ years in supervisory roles managing 24x7 teams.
Shift Information
- 24x7 Operational Oversight: The role carries on-call and shift responsibilities for escalations and provides oversight of 24x7 team operations, including shift scheduling and off-hours incident coordination.