Considerations for Managing HPC Resources


In this special guest feature, Robert Roe from Scientific Computing World writes that companies faced with increasing complexity are developing new services and tools to help users manage their HPC resources.

As the complexity of both HPC hardware and the challenges that scientists are trying to solve using HPC systems increases, ensuring a system is utilized efficiently is necessary to deliver a computing service with value for money and a high research output.

This requires HPC centers to focus on making the best use of available resources, through software and services designed to get applications running on an HPC system and to increase their performance.

Andrew Jones, vice-president of strategic HPC services and consulting at The Numerical Algorithms Group (NAG), said that one of the most common pitfalls hindering HPC performance is a focus on hardware rather than on the overall service and investment in application development.

“The single biggest thing that people do is focus too much on the hardware, and forget about the rest of the stuff. The procurement process is driven by the compute system. It doesn’t necessarily take into account that you are going to need people and software to deliver value,” said Jones. “It doesn’t necessarily look at total cost of ownership (TCO). Hardware is generally less than half of the TCO over the service life of an HPC system. The same is true in operational practice: people tend to start thinking around the hardware and then add things on top of that, in terms of people, support services or software and so on. Now that’s not to say they are an afterthought, but there is a hardware-centric focus around HPC,” he added.

Andrew Jones from NAG

Jones also noted the second most common mistake was the assumption that buying and setting up a system is easy. “In reality it is a complex research facility that requires expertise in the facility, just as much as it requires expertise in the science or other aspects. It’s not an IT facility, it’s a research facility that happens to have been built with IT components,” said Jones. “It’s the software, people and applications, the thought about the process; how are you going to actually use this HPC system?”

Encouraging HPC performance

NAG helps customers evaluate and procure the correct system for their workloads. Once a system is up and running, it also helps users with porting applications and technology benchmarking.

These services all help users get better application performance from their existing hardware by managing their available HPC resources more effectively.

Jones notes that it does not require a massive change in computing hardware to require code tuning. “Even just moving to the next generation of CPU – your code will probably run but you need to put effort into tuning it to get best performance. Let’s say you have just upgraded from the previous generation to the new Xeon processor, or to the latest EPYC processor on the AMD side, for example,” said Jones. “You are still going to have to do some tuning to get effective performance. Clearly, that step is bigger if you are moving to a GPU, there is potentially more code work to be done.”

Jones stressed that the trigger is not always the adoption of a new technology; it can also be the move to scale up an existing workflow. “They have an application that runs at a certain resolution to simulate the properties of an aircraft wing, running on a hundred cores, for example. Now, they want to scale that up to model the whole aircraft at a higher resolution, running on a hundred thousand cores,” said Jones.
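One way to see why such a scale-up takes tuning work, rather than simply more cores, is Amdahl's law, which caps speedup by the fraction of a program that cannot run in parallel. The sketch below is illustrative only: the 99.9 percent parallel fraction is a hypothetical figure, not one taken from Jones or NAG.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over a single core under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical code in which 99.9% of the runtime parallelizes perfectly.
parallel_fraction = 0.999

for cores in (100, 100_000):
    speedup = amdahl_speedup(parallel_fraction, cores)
    print(f"{cores:>7} cores: ideal speedup of about {speedup:.0f}x")
```

Under this assumption the ideal speedup is roughly 91x on 100 cores but only about 990x on 100,000 cores, so the remaining serial work has to be reduced before the larger machine pays off.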

The final example is based on scientists or engineers who are moving to HPC for the first time. “This could be just getting the code up and running, or it could be parallelization work,” noted Jones. An example of this kind of work would be someone with a serial code, or with something that is not written in a traditional HPC programming language, such as a Python or MATLAB program, which they would like to develop into an HPC application.

Making HPC easier for scientists

Alongside the hardware and software improvements that can help to increase real application performance, HPC providers are also developing tools which make it easier for non-experts to access and run their applications without the need for teams of administrators to get their code up and running.

One company working in this area is Advanced Clustering Technologies (ACT). ACT has developed several tools to help HPC users, including their web-based job submission tool – eQUEUE.

Kyle Sheumaker, Advanced Clustering Technologies president and CTO, explains the idea behind this software and its development. “It came from a set of customers that were fairly novice when it came to HPC. The rest of the users were somewhat unfamiliar with Linux, but had these science jobs that they needed to run,” said Sheumaker.

“The idea was to be able to build a tool around those jobs to eliminate some of that learning curve. It sits on top of an existing scheduler and allows you to develop a web interface with these forms based on your jobs. It’s completely customizable, so the person that understands the job can ask the pertinent questions, whether it’s uploading a dataset or pointing to it on the file system, or answering a couple of questions and it will build the job submission script and build the job on the user’s behalf,” said Sheumaker. “It automates a lot of the intricacies of working with an HPC system if you are not familiar with it.”

Sheumaker explained that the software was spurred on by two different customers. One was a government agency that needed to do a large number of model runs which, at that time, were being run on workstations. The second was a group of financial analysts with similarly limited Linux experience. “The idea was to make this as easy as possible for our end-user scientists that know how to run this application, but not how to SSH into a Linux,” said Sheumaker.

The software works based on the creation of job forms, which are developed by someone in the organisation who understands the application and has some understanding of the underlying system. The benefit here, however, is that once the forms are created, non-expert users can submit jobs without needing help from expert staff.

Sheumaker said: “When a user logs in to the eQUEUE web interface, there is a series of menus that they can go through to pick the application they want to run and it ends in a couple of questions. The idea is that the person that knows the application will ask pertinent questions, build a form around it and then it will automatically generate the stuff that you would need in the background.

“The administrator has the choice of interfacing to Slurm, PBS or Torque and Grid Engine. Almost all of our customers these days are on Slurm, so that is the primary interface used. The beauty of this system is that a job submitted through the web interface is just a job like anybody else’s,” added Sheumaker.

One benefit of this interface is that it works with the existing job scheduler, so more technical staff can use the existing scheduling system if they wish to do so. “It’s really just an abstraction layer to the scheduler, to make it a lot easier for those people who have never written a job script before,” said Sheumaker.
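To make the idea of a form-driven abstraction layer concrete, here is a minimal sketch of how answers from a job form could be turned into a Slurm batch script. It is not eQUEUE's actual implementation, which the article does not describe; the `JobForm` class, its field names, the `build_slurm_script` function and the example application and dataset path are all hypothetical. Only the `#SBATCH` options and the `srun` launcher are standard Slurm.

```python
import shlex
from dataclasses import dataclass


@dataclass
class JobForm:
    """Answers a non-expert user might supply in a web form (hypothetical fields)."""
    job_name: str          # assumed to contain no whitespace
    application: str       # executable chosen by the form's author
    dataset_path: str      # uploaded file or existing path on the file system
    nodes: int = 1
    tasks_per_node: int = 1
    walltime: str = "01:00:00"  # HH:MM:SS


def build_slurm_script(form: JobForm) -> str:
    """Render the form answers as a Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={form.job_name}",
        f"#SBATCH --nodes={form.nodes}",
        f"#SBATCH --ntasks-per-node={form.tasks_per_node}",
        f"#SBATCH --time={form.walltime}",
        "",
        # Quote user-supplied values so spaces or shell metacharacters
        # in a path cannot change the command that is run.
        f"srun {shlex.quote(form.application)} {shlex.quote(form.dataset_path)}",
    ]
    return "\n".join(lines) + "\n"


# Example: a user answers three questions and never sees the script.
script = build_slurm_script(
    JobForm(job_name="model_run", application="./run_model",
            dataset_path="/data/input set.nc", nodes=2, tasks_per_node=16)
)
print(script)
```

Because the result is an ordinary batch script, it could be handed to the scheduler in the same way as one written by hand, which matches Sheumaker's point that a job submitted through the web interface is just a job like anybody else's.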

Jones feels that with the increasing diversity of technology available to HPC users, particularly in the processor market, it is imperative to make the right choice of resources to underpin a new system or upgrade.

Job scheduling and system management are important tasks that must not be overlooked, but an HPC system ultimately cannot be managed efficiently if the wrong technologies were chosen when it was procured or upgraded.

“You must have a proper technology evaluation program that generally includes something like benchmarking the new technologies, understanding the requirements from your users and their application performance and their skillset to be able to use whatever new technology that it is you decide to deploy,” said Jones. “A substantial fraction of our work is helping people move to the next processor family because there are performance opportunities, or because there are comparative studies to be made to help evaluate technologies. Even if you are saying that you just want to stay on the Intel line, figuring out which bit of the Intel line or how to architect a system and at what scale and so on, is an issue in itself, and then moving the code to that new system is plenty of work for HPC specialists,” concluded Jones.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
