The HPC Specialist is responsible for technical systems management, administration, and support of the high-performance computing (HPC) cluster environments. This includes all configuration, authentication, networking, storage, interconnect, and software installation and usage for the HPC clusters. The position is highly technical and directly impacts the daily operational functions of these environments. The position will take the lead in installing, configuring, patching, and upgrading software, and in tuning, optimizing, proactively monitoring, and securing services. This position will work directly with all levels of academic constituents, including faculty, researchers, and students, supporting and promoting the use of the HPC cluster(s).
Responsibilities include the following:
- Oversight, systems management, and monitoring of the identified HPC Cluster systems. Ensure we maintain safe and secure systems. Alert management to, and respond to, any anomalies within the system.
- Operational support of these systems. Monitor and work the tickets in the support queue. Ensure timely completion and documentation of all work orders.
- Develop and maintain clear, well-written documentation.
- Lead the development of HPC standards, practices, and guidelines.
A Bachelor’s degree is required, along with at least 2 years of experience supporting and administering a large HPC environment (3,000+ cores with an InfiniBand interconnect) and working directly with customers to resolve complex technical issues. This includes a working knowledge of Linux, networking, storage, cluster administration software, and cluster scheduling software.
Required skills, knowledge, abilities:
- Due to the sensitive nature of this position (candidate will have access to government research including access to software and data for contracts for Department of Defense and other federal entities), qualified applicants must have appropriate authorization to work in the U.S. We will not be able to provide sponsorship options.
- Excellent customer service skills, working directly with customers to resolve and troubleshoot technical issues and requests.
- Ability to learn and adopt new technology.
- Ability to apply troubleshooting techniques to resolve complex, cross-functional issues.
- Experience in an academic or research community environment.
- Four years of Linux experience, preferably in Red Hat Linux and HPC cluster administration.
- Installing, testing, configuring, and administering HPC clusters/servers and software.
- Providing operating system support; software installation, patching, and maintenance; file system support and monitoring; and user support.
- Diagnosing and resolving system operational problems quickly and effectively.
- Verifying full performance of system components including network and storage.
- Experience with Linux cluster resource allocation, job scheduling, InfiniBand networks, MPI communications, and cluster monitoring.
- Experience with any or all of the following technologies/products: Slurm, PBS Pro, Moab, Ganglia, Lustre, GPFS, InfiniBand, MPICH, OpenMPI.
- Experience with cloud HPC in Azure or AWS is a plus.
- Coordinating with vendors to resolve hardware and software problems.
- Documenting system administration procedures for routine and complex tasks.
- Maintaining and monitoring the security of HPC systems and servers.
- Extensive knowledge of Red Hat, CentOS, Ubuntu Linux, and Windows.
- Programming/scripting capabilities in languages such as Bash, Perl, PHP, Python, and C/C++ are preferred.
- Ability to support IT Core Values by focusing on improvements, believing in our team and partners throughout the university, learning from mistakes, being accountable for actions, and showing determination, focus, and tenacity.
- Self-directed individual with a strong desire to learn and contribute in a large team of technical peers.
- Strong written and oral communication skills.