Home office: Senior HPC Systems Administrator (Remote) at RedLine Performance Solutions
RedLine Performance Solutions · Gaithersburg, United States of America · Remote
- Senior
- This position involves mission-critical monitoring and maintenance and requires off-hours support as part of a team rotation.
- US citizenship and the ability to obtain a Public Trust clearance are required to apply.
- The preference is for the candidate to be in the Phoenix, AZ area; however, the position can be remote with the possibility of some travel.
- This full-time position includes a comprehensive benefits package featuring paid time off, a 401(k) match, health insurance, and a full range of additional benefits.
- Provide HPC cluster administration using technologies such as HPCM, Lustre, Slingshot, Cray OS, and Slurm
- Engage with the customer to identify the needs and user stories to build enhancements and upgrades for the HPC clusters
- Work with configuration management solutions to develop Ansible playbooks to support image generation and server support
- Work with version control systems to create and review Git pull requests from the team so that cluster support follows best practices
- Update and expand existing systems monitoring capabilities
- Develop automation tools for cluster administration
- Participate in resource optimization and in work on job scheduling software and policies
- Support HPE-based Cluster Management solutions
- Provide technical support to researchers using HPC resources, troubleshoot problems, and develop appropriate computational strategies.
- Minimum of 7 years of SLES, Red Hat, and CentOS Linux system administration experience in an HPC environment.
- Experience with schedulers/batch systems (e.g., SLURM, PBS, LSF)
- Experience with managing parallel and cluster file systems (e.g., GPFS, Lustre)
- Network management experience, including in an HPC context (e.g., InfiniBand, OmniPath)
- Demonstrated ability to configure, deploy, and manage a major system area such as batch system, network, data storage, backup system, database system, or distributed computing
- Scripting experience (e.g., bash, Python, Perl).
- Experience supporting HPC cloud environments (e.g., Azure)
- Server provisioning and image management
- Experience with Lmod/Lua
- Experience with MPI technologies
- One of the ISC2 certifications (e.g., CISSP, SSCP) or Security+ certification
- Experience integrating applications with a cloud provider's software stack.