Recovery and Monitoring Specialist at ORNL Federal Credit Union

ORNL Federal Credit Union · Oak Ridge, United States of America · Onsite

The deadline to apply for this opportunity is October 8, 2025. 

Role: The Recovery and Monitoring Specialist plays a vital role in ensuring the stability, recoverability, and operational visibility of ORNL Federal Credit Union’s infrastructure. This position is responsible for managing and validating backup and recovery processes, conducting routine restore testing, and monitoring system health through established alerting platforms. 

Essential Functions and Responsibilities

  • Monitoring & Alert Management: Serves as the primary administrator for system and application monitoring platforms, including alert thresholds, escalation rules, and dashboard visibility. Reviews and responds to alerts regarding infrastructure degradation or system failures; identifies issues and escalates them per documented procedures. Assists with monitoring platform configuration and tuning, reducing false positives and increasing alert relevance. Manages and maintains monitoring escalation call trees, including documented contacts, response timelines, and escalation triggers. Coordinates internal communication during critical alerts, ensuring proper notification and escalation paths are followed.
  • Backup & Recovery Operations: Manages and validates enterprise backup schedules, ensuring successful completion of system, file, and configuration-level backups across environments. Performs and documents regular restore testing procedures to verify integrity and speed of data recovery processes. Maintains accurate records of backup status, restore history, and system availability metrics to support audit and compliance checks.
  • Disaster Recovery & Business Continuity: Collaborates with IT staff to continuously improve business continuity and systems recovery documentation. Supports disaster recovery readiness by ensuring current assets, tested runbooks, and contact procedures are regularly reviewed and updated. Participates in the refinement of response workflows and system documentation based on post-incident reviews and lessons learned. Participates in an on-call rotation and escalation process.
  • Compliance & Availability: Provides ongoing evidence and documentation for compliance with NCUA, FFIEC, GLBA, and NIST cybersecurity controls, especially those related to data protection and recoverability. Supports after-hours availability rotation for escalated alerts or recovery operations as needed.
  • Performs other job-related duties as assigned.

Experience: Four or more years of experience in enterprise backup, recovery operations, and/or systems monitoring within enterprise infrastructure and cloud-hosted environments.

  • Strong knowledge of backup software and frameworks (e.g., Veeam, Commvault, Rubrik, Azure Backup) and disaster recovery testing methodologies
  • Familiarity with enterprise monitoring tools and system health dashboards (e.g., Splunk, PRTG, Zabbix, Dynatrace)
  • Experience in regulated IT environments with awareness of GLBA, FFIEC, NIST, and NCUA requirements related to data protection and resilience

Education:

Bachelor’s degree in Information Systems, Computer Science, or a related field preferred, or an equivalent combination of education, training, experience, or military experience.

Preferred certifications include: 

  • CompTIA Server+ 
  • VMware Certified Professional – Data Center Virtualization (VCP-DCV) 
  • Microsoft Certified: Azure Administrator Associate 
  • Certified in the Fundamentals of Infrastructure Monitoring (e.g., from PRTG, Zabbix, Nagios, Splunk, or similar platforms)

Other Skills Required:

  • Comfortable providing both technical summaries and high-level insights into system availability performance
  • Ability to document recovery and testing processes with clarity, including evidence for audits and leadership reviews
  • Strong analytical skills in recognizing patterns of instability and identifying root causes
  • Proficiency in maintaining uptime and resilience in virtual and physical server environments
  • Willingness to support after-hours availability for incident escalations or planned recovery testing