Published

February 9, 2024

Location

Stanford, CA

Description

GPU Cluster System Admin

Business Affairs: University IT (UIT), Stanford, California, United States

Job Summary

SCHEDULE

Full-time

JOB CODE

4833

EMPLOYEE STATUS

Regular

GRADE

REQUISITION ID

102065

WORK ARRANGEMENT

Hybrid Eligible

GPU Cluster System Administrator

Stanford Research Computing is looking for a talented system administrator to join our team of collaborative and innovative professionals helping Stanford’s faculty and students use advanced computing and data tools to explore new frontiers in knowledge and solve some of humanity’s most urgent problems. Our staff work directly with some of the world's top researchers in a broad range of disciplines, across all of Stanford’s seven schools — while also supporting and learning from each other in cross-project endeavors. We maintain and steadily improve an advanced research computing facility, and we support a variety of environments for Stanford research. In Stanford Research Computing, you’ll have a rare opportunity to contribute to discoveries and inventions that have global reach and positive impact, and to share in the curiosity and commitment of the scholars and scientists who lead these projects.

This new position will support Stanford’s world-class data science and AI-focused research by managing and administering a substantial GPU-based cluster. You will partner closely with a team of data scientists from Stanford Data Science (https://datascience.stanford.edu/) to ensure that the GPU cluster environment is tuned and used efficiently to maximize research output. We’d love to have you join us on this exciting journey.

Responsibilities

This role is primarily systems-facing, but like all Research Computing positions, there is a significant researcher-facing component. In this position, you will put to use your in-depth knowledge of Slurm and Linux, your HPC cluster administration experience, and your passion for supporting ground-breaking research on a daily basis. You will play a crucial role in optimizing, improving and sustaining our advanced computing infrastructure.

- HPC Infrastructure Maintenance: Manage the day-to-day system administration of an NVIDIA DGX Superpod and associated storage, management and networking infrastructure, in alignment with applicable university, regulatory agency, and/or contractual security and privacy requirements, including HIPAA.

- Slurm: Manage all aspects of Slurm to ensure efficient resource allocation and job scheduling across the cluster, consistent with faculty guidance on system resource usage and utilization.

- GPU Resource Management: Manage GPU resources within the cluster, optimizing utilization for compute-intensive tasks while maintaining a balance between user requirements and system stability. Provide automated, easily accessible resource utilization metrics.

- User Support: Collaborate with Stanford Data Science team members and system users to understand their computing needs, provide technical assistance, and troubleshoot issues related to system performance and job execution. Provide user consultation and training in system use as needed.

- Performance Monitoring: Monitor system performance, diagnose bottlenecks, and take the actions needed to improve it.

- Documentation: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and team collaboration. Develop user-facing documentation in coordination with colleagues from Stanford Data Science.

- Planning: Meet regularly with stakeholders to understand existing challenges, anticipated needs, and opportunities for closer collaboration.

- Vendor engagement: Liaise with system vendors and other external partners as needed to ensure system issues are triaged and resolved expeditiously and correctly.

Minimum Requirements

Education and Experience:

Bachelor's degree and eight years of increasingly technical work experience or a combination of education and relevant experience. In-depth experience managing complex multiuser HPC clusters and storage environments is necessary, as is experience managing GPU-based infrastructure.

Qualifications:

This position requires in-depth knowledge of and substantial hands-on experience with:

- Linux cluster system administration

- GPU technologies and their integration into HPC environments

- Slurm configuration and management

- NFS-based storage management and configuration

- High-performance parallel filesystem (Lustre) management and configuration

- Scripting for system management, monitoring and task automation

- Installing and repairing servers and associated cluster hardware

- Complex technical problem solving and troubleshooting, with a proactive approach to system optimization and issue resolution

- Security practices and compliance standards in a computing environment

- Collaborating effectively across teams and with researchers

Additional desired skills and experience include:

- AI/ML software and frameworks, deep learning, and LLM training

- CUDA

- System benchmarking

The expected pay range for this position is $128,000 – $170,000 per annum.

Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.

At Stanford University, base pay represents only one aspect of the comprehensive rewards package. The Cardinal at Work website (http://cardinalatwork.stanford.edu/benefits-rewards) provides detailed information on Stanford's extensive range of benefits and rewards offered to employees. Specifics about the rewards package for this position may be discussed during the hiring process.

Working Conditions

This is a hybrid position, in which you will work on-site at the Stanford campus for a minimum of 3 days a week through the first 9 months of employment, and at least 2 days a week thereafter.

Our core work hours are 9 am - 5 pm Pacific. This role will occasionally require extended hours and weekend work, and you will participate in a rotation of on- and off-site responsibilities during the annual winter closure. Periodically, the data center is shut down for required maintenance. All team members with system responsibilities are expected to be physically on-site to return services to production status at the end of any planned facility outage.

Why Stanford is for You:

Imagine a world without search engines or social platforms. Consider lives saved through first-ever organ transplants and research to cure illnesses. Stanford University has revolutionized the way we live and enriched the world. Supporting this mission is our diverse and dedicated staff of 17,000. We seek talent driven to impact the future of our legacy. Our perks (https://cardinalatwork.stanford.edu/benefits-rewards/sweeteners) and benefits (https://cardinalatwork.stanford.edu/benefits-rewards) empower you with:

Freedom to grow. We offer career development programs, tuition reimbursement, and course auditing. Join a TED Talk, watch a film screening, or listen to a renowned author or global leader speak.

A caring culture. We provide superb retirement plans, generous time-off, and family care resources.

A healthier you. Choose from hundreds of health or fitness classes at our world-class exercise facilities. We provide excellent health care benefits.

Discovery and fun. Stroll through historic sculptures, trails, and museums.

Enviable resources. Enjoy free commuter programs, ridesharing incentives, discounts and more.

We look forward to receiving your application and cover letter.

*The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.

*Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources at stanfordelr@stanford.edu. For all other inquiries, please submit a contact form (https://docs.google.com/forms/d/e/1FAIpQLScuEhgtOxMfjr4HIAMjs011R9uGoq4jxXuLtp9pY9-pYikgew/viewform).

*Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.

To apply: https://apptrkr.com/5005566


