Challenges on the Road to HPC Virtualization

This article is the second in an editorial series that explores the benefits the HPC community can achieve by adopting HPC virtualization and cloud technologies. 

Although both the enterprise and HPC can benefit from virtualization, the two have had dissimilar requirements. Until recently, typical enterprise applications were not resource intensive; it was the HPC crowd running complex modeling and simulation applications that required ever larger supercomputers and more powerful clusters.

For HPC environments – unlike the enterprise – massive consolidation of resources is not an option. In fact, the opposite is true: HPC users are always looking to add more hardware so they can solve bigger problems more quickly as they push the envelope of scientific and engineering knowledge.

Traditional HPC clusters run a single, standard OS (often a flavor of Linux) and software stack across all nodes. Because the cluster environment is uniform and homogeneous, the job scheduler is free to place jobs anywhere on the cluster as long as the target node is not overloaded with other jobs. This uniformity allows data center managers to maintain one image and distribute it to all the nodes at system setup and then occasionally for maintenance and updates. However, this approach limits the flexibility of their computational resources, especially when trying to accommodate multiple user populations.
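
The placement rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of any real scheduler: the `Node` class, the `place_job` function, the first-fit policy, and the core counts are all hypothetical choices made for the example. Because every node runs the same image, free capacity is the only thing the sketch has to check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A cluster node (hypothetical model: capacity counted in cores only)."""
    name: str
    total_cores: int
    used_cores: int = 0

    @property
    def free_cores(self) -> int:
        return self.total_cores - self.used_cores


def place_job(nodes, cores_needed):
    """Return the first node with enough free cores, or None.

    On a homogeneous cluster every node runs the same OS image, so
    capacity is the only constraint (assumed first-fit policy).
    """
    for node in nodes:
        if node.free_cores >= cores_needed:
            node.used_cores += cores_needed  # reserve the cores for this job
            return node
    return None


# Illustrative values only: n1 has 2 free cores, n2 has 12.
cluster = [Node("n1", 16, used_cores=14), Node("n2", 16, used_cores=4)]
chosen = place_job(cluster, 8)
print(chosen.name if chosen else "job must wait")
```

If a job also required a software stack that differs from the cluster's single image, no node would qualify regardless of free capacity, which is the inflexibility the paragraph above describes.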

For example, individual researchers or engineers may require specific software stacks to run their applications. If such a stack is not compatible with the HPC cluster’s standard OS, too often the result is an unhappy user and a beleaguered IT organization. One side effect of this kind of IT inflexibility is that users create separate “islands of compute” scattered across the organization.

This is an inefficient and expensive solution. It also adds to the complexity of cluster management, especially for users new to the world of HPC and those considering migrating from desktop systems to a lower-end cluster priced below $250,000. Typically these users do not have in-house HPC experts to turn to when problems inevitably arise.

Many Jobs, One OS

In bare metal environments, running multiple user jobs within the same OS can cause problems beyond data loss or leakage. If a job disrupts the OS by crashing a daemon or other component, filling the hard drive with excessive files, or malfunctioning in some other way, unrelated jobs can be affected and schedules disrupted.

Bare metal environments have additional inefficiencies. Consider job placement. Several jobs are running on the cluster when a higher priority job is scheduled, but no appropriate resources are available. The IT admin can either make the new job wait in the queue, which, given its priority, is not a viable solution, or kill other jobs to run the newcomer – also not a very satisfactory state of affairs. Either choice reduces the cluster’s throughput. Killing jobs can also be quite expensive if they are running costly ISV applications – for example, EDA application licenses can cost hundreds of thousands of dollars.
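
The wait-or-kill dilemma can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the `Job` class, the `preemption_cost` function, the priority numbers, and the rule of killing the lowest priority jobs first. It also assumes killed jobs have no checkpoint, so all hours already computed are lost.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """A running or incoming job (hypothetical model)."""
    name: str
    priority: int          # larger number = more important
    cores: int
    hours_done: float = 0.0


def preemption_cost(new_job, running, total_cores):
    """Return (victim_names, hours_lost) for the 'kill' option.

    Returns None when killing lower priority jobs cannot free enough
    cores, in which case waiting in the queue is the only choice.
    Assumes killed jobs lose all completed work (no checkpointing).
    """
    free = total_cores - sum(job.cores for job in running)
    victims = []
    # Kill the least important jobs first until the new job fits.
    for job in sorted(running, key=lambda j: j.priority):
        if free >= new_job.cores or job.priority >= new_job.priority:
            break
        victims.append(job)
        free += job.cores
    if free < new_job.cores:
        return None
    return [v.name for v in victims], sum(v.hours_done for v in victims)


# Illustrative values only: a fully loaded 64-core cluster.
running = [
    Job("cfd", priority=1, cores=32, hours_done=10.0),
    Job("eda", priority=2, cores=16, hours_done=6.0),
    Job("md", priority=5, cores=16, hours_done=3.0),
]
urgent = Job("urgent", priority=4, cores=40)
result = preemption_cost(urgent, running, total_cores=64)
print(result)
```

In this example the urgent job can start only by discarding the completed work of two other jobs, which is the throughput loss the scenario describes. The alternative is to leave the urgent job waiting in the queue.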

Life sciences is another sector where a lack of virtualization can cause problems. In fact, virtualized environments can be created that include HIPAA and FISMA (Federal Information Security Management Act) compliance. The virtualized environment provides the requisite compliance along with the benefits of cost savings, agility, self-provisioning, and fault and security separation discussed above. And, because virtualizing the compute, network, and storage infrastructure unlocks the power of automation, there is less likelihood that a manual error or neglect will lead to security breaches or compliance lapses.

Next week’s article will look at Virtualization and the Secure Private Cloud. If you prefer, the complete insideHPC Guide to Virtualization, the Cloud and HPC is available for download in PDF from the insideHPC White Paper Library, courtesy of VMware.