Practical Hardware Design Strategies for Modern HPC Workloads – Part 2


This special research report sponsored by Tyan discusses practical hardware design strategies for modern HPC workloads. As hardware has continued to develop, technologies like multi-core processors, GPUs, NVMe storage, and others have made new application areas possible. These areas include accelerator-assisted HPC, GPU-based Deep Learning, and Big Data Analytics systems. Unfortunately, a single general-purpose, balanced system cannot serve all of these applications well. To achieve the best price-to-performance in each of these application verticals, careful attention to hardware features and design is essential.

Many new technologies used in High Performance Computing (HPC) have made new application areas possible. Advances like multi-core, GPU, NVMe, and others have created application verticals that include accelerator-assisted HPC, GPU-based Deep Learning, fast storage and parallel file systems, and Big Data Analytics systems.

This technology guide, insideHPC Special Research Report: Practical Hardware Design Strategies for Modern HPC Workloads, shows how to get your results faster by partnering with Tyan.

Differentiation in Modern HPC Workloads

Most traditional HPC workloads consist of “number crunching,” where large numbers of floating-point calculations are run to simulate or model complicated processes. These include materials and molecular systems, weather forecasting and astronomy, fluid dynamics, financial markets, oil and gas, physics, bioscience, and many others. All of these share the need to perform large amounts of calculation in a reasonable amount of time (i.e., there is no use in trying to predict tomorrow’s weather if it takes two days for the model to run).

Many HPC applications can be considered “compute bound,” where the limiting factor is how much computation the system can deliver over time. Other applications are “IO bound,” where the rate of disk IO is the limiting factor.

Other types of applications overlap with the HPC market, including Big Data Analytics and Deep Learning. It can be argued that these two areas are not strictly “HPC,” but since the definition of what constitutes an “HPC” problem is rather vague and both application areas seek to increase performance by adding more hardware, it seems reasonable to include them as part of the high performance ecosystem. Indeed, many traditional HPC practitioners are turning to Big Data Analytics and Deep Learning to further their understanding of natural phenomena.

Accelerated HPC Computation

Historically, many HPC simulations and models have been distributed across clustered servers (also called “nodes”) that work in concert to produce a solution. These applications are often considered “compute bound,” because the amount of computation is the limiting factor in application progress. In many instances, adding more servers (CPU/memory resources) allows the problem to be scaled as more computation is required. The ability to scale a problem size, however, often comes with limitations (due to the need to move data and the nature of the problem at hand), and at some point performance levels off (i.e., applications do not get any faster when more servers are added).
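A standard way to quantify this leveling-off is Amdahl’s law, which bounds the achievable speedup by the fraction of work that cannot be parallelized. The short Python sketch below assumes a hypothetical workload whose serial fraction is 5%; both the law and the numbers are illustrative, not taken from the report.

```python
# Amdahl's law: speedup on N nodes is limited by the serial fraction s
# of the work that cannot be distributed across nodes.
def amdahl_speedup(n_nodes: int, serial_fraction: float) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_nodes)

# Assume (hypothetically) that 5% of the application is serial.
s = 0.05
for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:4d} nodes -> {amdahl_speedup(n, s):5.1f}x speedup")
# Speedup approaches 1/s = 20x no matter how many nodes are added.
```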

One way around this limitation is to increase the computation rate on the server. While 2nd Gen Intel® Xeon® Scalable Processors have shown a steady increase in performance and core counts, specialized accelerator processors have gained favor in recent years. Most notably, the use of GPU-based accelerators has become a popular way of increasing performance on computational nodes.

The GPU’s main advantage is the ability to perform a single instruction across large amounts of data at the same time. This type of operation is common in graphics applications and occurs in many HPC applications as well, particularly array operations (linear algebra). Typically, the host Intel Xeon Scalable Processor provides the basic computing platform (large memory, computational cores, operating system, IO, networking, etc.) and uses one or more GPUs as accelerators for certain parallel operations.
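A minimal sketch of this host-plus-accelerator pattern, assuming the CuPy library and a CUDA-capable GPU are available (both are assumptions for illustration): the CPU prepares the arrays, and the GPU carries out the matrix multiplication.

```python
import numpy as np
import cupy as cp  # assumes CuPy and a CUDA-capable GPU are installed

# Host (CPU) prepares the data in ordinary NumPy arrays.
a_host = np.random.random((4096, 4096)).astype(np.float32)
b_host = np.random.random((4096, 4096)).astype(np.float32)

# Copy the arrays to the GPU and run the linear-algebra kernel there.
a_gpu = cp.asarray(a_host)
b_gpu = cp.asarray(b_host)
c_gpu = a_gpu @ b_gpu            # matrix multiply executes on the GPU
cp.cuda.Device(0).synchronize()  # wait for the kernel to finish

# Copy the result back to host memory for further CPU-side processing.
c_host = cp.asnumpy(c_gpu)
print(c_host.shape, c_host.dtype)
```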

IO-Heavy HPC Computing

This type of computing usually requires large amounts of data to be read from and written to disk as part of the computation. These applications are often considered “IO bound” because the speed at which data can be read from and written to disk determines performance.

For example, an application may need to write temporary intermediate files because the data they contain are too large to keep in memory. Note that even “compute bound” applications can become IO bound at times, for instance when writing checkpoint/restart files during a run.

These types of applications can benefit from local NVMe storage, where storage is directly accessible to the processor and does not have to traverse a network.
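As a simple illustration of checkpointing to fast local storage, the sketch below writes and reloads an intermediate array; the /mnt/nvme/scratch mount point is a hypothetical path chosen for the example, not one specified in the report.

```python
import os
import numpy as np

# Hypothetical local NVMe scratch mount; adjust to your system's layout.
SCRATCH_DIR = "/mnt/nvme/scratch"
os.makedirs(SCRATCH_DIR, exist_ok=True)

# Pretend this array is intermediate state from a long-running simulation
# that is too large to keep in memory for the whole run.
state = np.random.random((2048, 2048))

# Checkpoint: write the state to fast local storage...
ckpt = os.path.join(SCRATCH_DIR, "checkpoint_step_0001.npy")
np.save(ckpt, state)

# ...and restart: read it back later instead of recomputing it.
restored = np.load(ckpt)
assert np.array_equal(state, restored)
print(f"Checkpointed {os.path.getsize(ckpt) / 1e6:.1f} MB to {ckpt}")
```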

Fast node-level IO also comes into play when building out fast parallel file systems. In this case, the speed of storage IO on the servers that make up the file system has a large influence on how well the file system can perform. Examples of these file systems include Lustre, Gluster, and Ceph.
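One quick way to gauge what a storage server can contribute to such a file system is to measure its sustained sequential write bandwidth. The sketch below does this with plain buffered writes plus fsync; the target path and sizes are hypothetical, and a production benchmark would use a dedicated tool instead.

```python
import os
import time

# Hypothetical path on the local device that will back a parallel
# file system server; adjust for your system.
TARGET = "/mnt/nvme/bench.dat"
BLOCK = b"\0" * (8 * 1024 * 1024)   # 8 MiB per write
N_BLOCKS = 256                       # 2 GiB total

start = time.perf_counter()
with open(TARGET, "wb") as f:
    for _ in range(N_BLOCKS):
        f.write(BLOCK)
    f.flush()
    os.fsync(f.fileno())             # ensure data reaches the device
elapsed = time.perf_counter() - start

total_mb = len(BLOCK) * N_BLOCKS / 1e6
print(f"Sequential write: {total_mb / elapsed:.0f} MB/s")
os.remove(TARGET)
```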

Big Data Computing

Big Data computing is similar to IO-heavy computing; however, the goal is both speed and bulk storage. With Big Data computing, the need is more focused on distributed bulk storage than on single-node performance.

For instance, applications like Hadoop-Hive, Spark, and the Hadoop Distributed File System (HDFS) are popular platforms on which to build Big Data solutions. Typically, HDFS is used to manage the large amounts of data used in these analytics operations. Due to the sheer size of the data, backing it up is nearly impossible. Instead, HDFS has built-in redundancy so that if one or two servers fail (at the same time), the file system continues to operate. In addition, HDFS is designed to easily scale up (adding more storage to a storage server) and scale out (adding more storage servers).
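A minimal PySpark sketch of this pattern, assuming a running Spark cluster with access to HDFS; the HDFS path and column name below are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Assumes a running Spark cluster configured to reach HDFS.
spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# HDFS spreads the blocks of this dataset (with replication) across the
# cluster's storage servers; Spark reads them in parallel.
events = spark.read.parquet("hdfs:///data/events/2020/")

# A simple distributed aggregation over the bulk data.
counts = events.groupBy("event_type").count()
counts.show()

spark.stop()
```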

In addition to Hadoop, NoSQL databases rely on dense storage nodes. Similar to HDFS, these databases are designed to use multiple servers, offer redundancy, and provide a more flexible column-based storage mechanism compared with a traditional row-based transactional SQL database.
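As one concrete example of this replicated, column-oriented approach, the sketch below uses Apache Cassandra through the Python cassandra-driver package; the contact point, keyspace, and table are hypothetical, and other NoSQL stores follow similar ideas.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical cluster contact point, keyspace, and table.
cluster = Cluster(["storage-node-01"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Rows for one sensor are stored together as a wide partition, and the
# data is replicated across three nodes for redundancy.
session.execute(
    "INSERT INTO metrics.sensor_readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("rack42-temp", 21.5),
)
cluster.shutdown()
```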

In both of these cases, storage nodes need large amounts of bulk storage and typically employ 3.5-inch (spinning) disks due to their greater capacity and lower cost per byte than solid state drives.

Deep Learning

The final type of HPC workload is very similar to compute-bound HPC applications; however, the computation, which is largely linear algebra, is focused on learning problems. These problems can require massive amounts of computation to produce usable results (i.e., a learning model that can successfully predict a certain percentage of outcomes from new data).

The amount of computation for Deep Learning is quite large, and multiple GPUs are often dedicated to a single problem. When running Deep Learning applications, data are normally loaded onto the GPU so that there is less dependence on CPU memory transfers; however, for large models, Deep Learning can benefit from local caching of model epochs (learning steps).
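A minimal PyTorch sketch of this pattern, assuming a CUDA-capable GPU; the toy model, synthetic data, and the /mnt/nvme/checkpoints scratch path are hypothetical choices for illustration.

```python
import os
import torch
import torch.nn as nn

device = torch.device("cuda")        # assumes a CUDA-capable GPU
scratch = "/mnt/nvme/checkpoints"    # hypothetical local scratch path
os.makedirs(scratch, exist_ok=True)

# A toy model and synthetic data, both resident in GPU memory.
model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
x = torch.randn(8192, 1024, device=device)
y = torch.randint(0, 10, (8192,), device=device)

opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    opt.zero_grad()
    loss = loss_fn(model(x), y)      # forward/backward pass runs on the GPU
    loss.backward()
    opt.step()
    # Cache the model state locally at each epoch (learning step).
    torch.save(model.state_dict(), os.path.join(scratch, f"epoch_{epoch}.pt"))
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```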

Servers designed for Deep Learning have a definite overlap with the accelerated HPC computing mentioned above. Quite often the limiting step in GPU-accelerated HPC codes is data movement to and from the GPU (over the PCIe bus). For this reason, there can be a point where adding GPUs does not increase the performance of HPC applications.
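To see why bus transfers can dominate, the sketch below (again assuming CuPy and a CUDA GPU, with illustrative array sizes) times the host-to-device copy against the GPU computation it feeds; a careful measurement would also include warm-up runs.

```python
import time
import numpy as np
import cupy as cp  # assumes CuPy and a CUDA-capable GPU

a_host = np.random.random((8192, 8192)).astype(np.float32)

# Time the host-to-device transfer over the PCIe bus.
start = time.perf_counter()
a_gpu = cp.asarray(a_host)
cp.cuda.Device(0).synchronize()
transfer_s = time.perf_counter() - start

# Time a representative GPU computation on the same data.
start = time.perf_counter()
b_gpu = a_gpu @ a_gpu
cp.cuda.Device(0).synchronize()
compute_s = time.perf_counter() - start

print(f"transfer: {transfer_s:.3f} s, compute: {compute_s:.3f} s")
# If the transfer time rivals the compute time, adding more GPUs mostly
# adds more waiting on the bus rather than more useful work.
```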

Over the next few weeks we will explore the topics surrounding practical hardware design strategies for modern HPC workloads and how you can get your results faster by partnering with Tyan.

Download the complete insideHPC Special Research Report: Practical Hardware Design Strategies for Modern HPC Workloads, courtesy of Tyan.