Senior Site Reliability Engineer at Orion Health
Orion Health · Minot, United States of America · Hybrid
- Professional
- Office in Minot
Innovate With Purpose
Do you want to work for a company that is innovating and making a difference to the health and wellbeing of people all over the world? We’re not about selling meaningless, unnecessary products for corporate profitability. You’ll be working on technology that will revolutionise global health systems so that we can finally get the healthcare we all want - a basic human right.
We like to think of ourselves as a community of start-ups where you can be your true, genuine self. Each of our product teams has the autonomy to decide how they operate and contribute towards our mission of providing each person with the right care at the right time and in the right place.
Orion Health is excited to be expanding our galaxy by recruiting for a number of stellar individuals to join our team to help us deliver to our global customer base. If you want to climb aboard the rocketship and help us revolutionise global health systems, astronomical opportunities await.
Position Purpose:
Collaborate on building the automation for infrastructure and software delivery, act as the primary executor of those processes, and collect feedback from the support of operational sites. Responsible for availability, latency, reliability, performance, efficiency, change management, monitoring, emergency response, capacity planning, and improving system availability.
Success in this Role looks like…
- Through a proactive approach, relentless improvement and constant training, the SREs run the customer environments by monitoring availability and taking a holistic view of system health
- SLAs are always met through automation, with little to no involvement from the team, and the number of customers and provided services can scale independently of the size of the team
- The gap between development and operations is bridged
- Well-built software and systems to manage platform infrastructure and applications
- Measured and optimised system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
Business Unit:
North American Managed Services
- This unit contributes to Orion Health’s purpose of enabling client success by introducing and maintaining managed environments, policies and procedures in line with ITIL-aligned standards, and by maintaining focus on all elements of support for our customers
Key Relationships:
Internal
- Technical Operations Leads, Implementation Consultants, Solution Architects, Service Management Lead, Service Operations, Product Team, and Database Administrators
- SREs must have constant communication with the Development Team and Technical Leads (Software Designers) to understand the concrete requirements of the products and the configuration required to be part of the automation
External
- Clients and third-party vendors
Essential functions:
Operations Support and Issue Resolution
Participate in the daily management of multiple Orion Health solutions hosted in the AWS Cloud, together with their infrastructure and networking, including but not limited to:
- Daily monitoring and alert response: identifying potential problems and implementing alerts to notify relevant parties
- Following a Change Request from creation to completion, providing review, detail validation and execution of all tasks
- Work with other teams to ensure smooth and reliable releases
- Tuning the Application stack to improve stability and resultant uptime metrics
- Automate repetitive tasks, such as deployment, scaling, and patching, to improve efficiency and reduce manual effort
- Acute and Recurring issue investigation and resolution
- Performance Trend Analysis: identifying and addressing performance bottlenecks to ensure systems can handle expected loads and user traffic
- Log Analysis and Error resolution
- Manage and maintain the underlying infrastructure, including servers and networks, to ensure smooth operations
- Handover Testing
- Document procedures and processes to facilitate learning and knowledge transfer within the team
- Root Cause Analysis: investigating and resolving incidents, including outages and performance problems, to minimize disruption
- Plan for future capacity needs to ensure systems can handle increasing demand
- Develop and test disaster recovery plans to guarantee data integrity, system resilience, and swift restoration of services in case of critical incidents
- Coordinate with teams to maintain Service Level Agreements
Internal Development
- Responsible for the Continuous Integration of updates for over 10 products/solutions released by Development teams into Orion Health solutions
- Build secure and scalable infrastructure to manage customer data
Internal Support
- Participate in On-Call Rotation
- Work with Development, Solution Adoption, Managed Services, Professional Services, Support and other teams to provide clients with a world-class, stable solution platform
Behavioural and Technical Capabilities
- Highly proactive and motivated Software/System Engineer, always seeking opportunities for improvement and taking ownership of the challenges
- Strong understanding of software engineering principles, operating systems (Windows and Linux), networking, and cloud technologies
- Experienced in Windows and Linux OS administration, with hands-on exposure to DataCenter operations
- Proficient in Active Directory, Group Policy Object (GPO) management, DNS, and Active Directory service health monitoring
- Demonstrated scripting and automation experience using PowerShell, Python, Bash, and other languages
- Familiarity with infrastructure automation tools such as Puppet and Ansible is a plus
- Capable of communicating ideas and collaborating productively across technical teams
- Committed to continuous learning and knowledge sharing within the team
- Ability to design secure distributed web services and manage network security at scale
- Solid understanding of TCP/IP, DNS, DHCP, VLANs, VPNs, firewall configuration, Load Balancers, and other network appliances
Relevant Experience
- 4–6 years in a Site Reliability Engineering or equivalent role
- 5 years in systems/application support and/or development
- Strong scripting background with experience in object-oriented and structured programming
- Experience with automation, infrastructure as code, and orchestration (e.g., Puppet, Ansible, Kubernetes, CloudFormation, Terraform)
- Exposure to on-prem to AWS cloud migration projects and Red Hat OS upgrades is an asset
- Working knowledge of Splunk monitoring tools and strategies
- Strong foundation in Network Architecture and Security
- Experience with CI/CD pipelines and deployment automation in cloud environments (AWS preferred)
Education & Qualifications:
Essential
- Bachelor’s Degree in a technical discipline or equivalent experience
- Experience in supporting cloud-based production systems
- A technical certification in System Administration, Cloud Engineering, or DevOps
Desirable
- Formal training and certification in *nix scripting, NoSQL/SQL and Oracle databases, Big Data technologies, and AWS Cloud Services
- HIPAA or HITRUST understanding