HPC Virtualization and Workload Agility

This article is the fourth in an editorial series that explores the benefits the HPC community can achieve by adopting HPC virtualization and secure private cloud technologies.

Separate Workloads Means Security

Virtualization allows each workload to be compartmentalized in its own VM, taking full advantage of the underlying parallelism of today’s multicore, heterogeneous HPC systems without compromising security. This approach is particularly beneficial for organizations consolidating multiple groups onto a shared cluster, or for teams with strict security requirements: for example, a life sciences environment where access to genomic data needs to be restricted to specific researchers.

VM abstraction provides a security separation between workloads that is not available in traditional HPC environments.

Some government mandates require research organizations or companies (e.g. pharmaceutical companies) to store test results for years.  Archiving a VM is an easy way to record and save the precise software environment used for the trials.  The same holds true for academic and research organizations that are concerned about reproducing their scientific efforts or responding to any subsequent audits.

Workload Agility

HPC virtualization permits the live migration of workloads when there is contention in the cluster.  For example, it is often the case when several jobs are running that another job will be introduced that gobbles up more memory than anticipated.  This situation crops up frequently in EDA (electronic design automation) work.  In a bare metal environment, when the new job starts to consume all available memory there are only two somewhat unsatisfactory alternatives – either let the jobs continue to run very slowly, or manually intervene to kill and restart jobs to untangle the mess.

In a virtual environment, when a newly introduced job starts on a memory rampage, VMware's Distributed Resource Scheduler (DRS) can be used to move the offending workload to another, less loaded physical host, allowing the environment as a whole to keep working as efficiently as before.
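The idea of moving a memory-hungry workload to a less loaded host can be sketched in a few lines of Python. This is a toy model, not DRS's actual algorithm: the `Host` class, the `rebalance` function, and the host and job names are all hypothetical, and the policy (move the largest VM off an overcommitted host to the host with the most free memory) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical physical host with a fixed amount of memory."""
    name: str
    capacity_gb: int
    vms: dict = field(default_factory=dict)  # VM name -> memory demand in GB

    def used_gb(self):
        return sum(self.vms.values())

    def free_gb(self):
        return self.capacity_gb - self.used_gb()


def rebalance(hosts):
    """Move VMs off overcommitted hosts; return the list of moves made.

    Assumed policy: while a host's demand exceeds its capacity, take its
    largest VM and place it on the host with the most free memory that
    can hold it. Stop if no such host exists.
    """
    moves = []
    for src in hosts:
        while src.free_gb() < 0:
            vm, demand = max(src.vms.items(), key=lambda kv: kv[1])
            targets = [h for h in hosts
                       if h is not src and h.free_gb() >= demand]
            if not targets:
                break  # nowhere to migrate; contention remains
            dst = max(targets, key=lambda h: h.free_gb())
            dst.vms[vm] = src.vms.pop(vm)
            moves.append((vm, src.name, dst.name))
    return moves


# Illustrative data: a third EDA job grows to 160 GB and overcommits host-a.
hosts = [
    Host("host-a", 256, {"eda-job-1": 96, "eda-job-2": 64, "eda-job-3": 160}),
    Host("host-b", 256, {"eda-job-4": 32}),
]
moves = rebalance(hosts)
print(moves)
```

With this input the sketch migrates `eda-job-3` from `host-a` to `host-b`, after which neither host is overcommitted. A real scheduler weighs far more than memory (CPU demand, migration cost, affinity rules), so treat this only as an illustration of the principle.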

Multi-tenancy with Resource Guarantees

DRS, mentioned above, has other benefits as well. In a virtualized cluster, it can be used to enforce the fair sharing of resources between the workloads of multiple groups.  This takes place below the OS layer within the virtual platform. It allows HPC environments to move beyond the less structured, trust-based approach typically found in traditional HPC environments where multiple user jobs may be scheduled on the same OS instance.

Instead, the VM approach provides guaranteed resources to specific groups or departments as needed.
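One way to picture guaranteed resources combined with fair sharing is a reservation-plus-shares model: each group first receives its guaranteed minimum, and any remaining capacity is divided in proportion to each group's shares. The sketch below is a simplified illustration under those assumptions, not VMware's implementation; the `allocate` function, the group names, and the numbers are hypothetical.

```python
def allocate(capacity, groups):
    """Split `capacity` among groups using reservations and shares.

    groups maps a group name to a dict with:
      'reservation' - guaranteed minimum,
      'shares'      - positive relative weight for leftover capacity,
      'demand'      - what the group currently wants.
    All quantities use the same unit as `capacity` (e.g. cores).
    """
    # Step 1: honor guarantees (never more than the group actually wants).
    alloc = {g: float(min(s["reservation"], s["demand"]))
             for g, s in groups.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed cluster capacity")

    # Step 2: hand out the remainder in proportion to shares, repeating
    # until capacity is used up or every group's demand is satisfied.
    eps = 1e-9
    active = {g for g, s in groups.items() if alloc[g] < s["demand"]}
    while remaining > eps and active:
        total_shares = sum(groups[g]["shares"] for g in active)
        handed_out = 0.0
        for g in list(active):
            offer = remaining * groups[g]["shares"] / total_shares
            take = min(offer, groups[g]["demand"] - alloc[g])
            alloc[g] += take
            handed_out += take
            if groups[g]["demand"] - alloc[g] <= eps:
                active.discard(g)  # satisfied; its unused offer is recycled
        remaining -= handed_out
    return alloc


# Illustrative data: three departments sharing a 100-core cluster.
groups = {
    "chemistry": {"reservation": 40, "shares": 2, "demand": 80},
    "biology":   {"reservation": 20, "shares": 1, "demand": 25},
    "physics":   {"reservation": 0,  "shares": 1, "demand": 50},
}
alloc = allocate(100, groups)
for name, cores in alloc.items():
    print(f"{name}: {cores:.1f} cores")
```

In this example every group gets at least its reservation, biology is capped at its demand of 25 cores, and the capacity it does not need is redistributed to chemistry and physics in a 2:1 ratio. The point of the design is that a guarantee holds regardless of what other groups request, which is what makes pooling hardware acceptable to the groups that paid for it.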

Guaranteed resources also address another problem: when users with tight budgets and aggressive deadlines are able to add more computing power to the cluster, they are typically reluctant to share it. Other users in need of more computing power must then add their own resources to the cluster, and CAPEX starts to escalate. Cloud computing, which enables automated self-provisioning and policy-based resource sharing, can help. When users are given guaranteed access to their share of compute resources, they are more likely to contribute their physical resources to a common pool. For IT this has CAPEX implications: the need to add more computing power to the cluster can be postponed or eliminated.

Exploring Accelerators

Early VMware testing of GPUs for computation (GPGPUs) in a virtual environment indicates that performance is generally within 2% of that achievable in a native, bare metal environment. Tests using the Intel Xeon Phi coprocessor show the same close alignment with native performance. Several VMware customers are exploring the use of GPUs and coprocessors in a virtualized environment for HPC.

Next week’s article will look at the Software Defined Data Center. If you prefer, the complete insideHPC Guide to Virtualization, the Cloud and HPC is available for download in PDF from the insideHPC White Paper Library, courtesy of VMware.