Scaling Hardware for In-Memory Computing


This is the second entry in an insideHPC series that delves into in-memory computing and the designs, hardware, and software behind it. This series, compiled in a complete Guide, also covers five ways in-memory computing can save money and improve TCO. In this feature, we’ll focus on scaling hardware. 

Scaling Hardware

Processor scaling follows one of two approaches, scale-up or scale-out, which are distinguished by how the memory architecture is scaled. Beyond the basic processor/memory architecture, accelerators and parallel file systems are also used to provide scalable performance.

Scale-Up with Shared Memory

The scale-up design uses one large pool of memory for multiple processors. This approach keeps the established processor-memory relationship intact: all memory, and thus all data, is visible to all processors, so there is no need to distribute data across multiple machines. To perform well, scale-up designs require that programs have concurrent sections that can be distributed over multiple processors. Unlike the distributed memory systems described below, there is no need to copy data from system to system, because all the memory is globally usable by all processors, and the full application can run in-memory. It is possible, however, for a processor to lock memory that it is using and thus cause other processors to wait until the memory is “unlocked” before they can use it.
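The shared-memory model described above can be sketched in a few lines of Python. This is a minimal illustration, not a performance benchmark: the function name `shared_histogram` and its parameters are hypothetical, threads stand in for processors, and in standard CPython the global interpreter lock prevents the threads from running Python code truly in parallel. The point is the memory model: every worker reads the same input in place and updates the same result, and a lock makes the other workers wait while one of them holds it.

```python
import threading


def shared_histogram(values, n_bins, n_workers=4):
    """Count values into bins using workers that all share one memory pool."""
    counts = [0] * n_bins      # a single result array, visible to every worker
    lock = threading.Lock()    # guards updates to the shared counts

    def worker(start, stop):
        for i in range(start, stop):
            bin_index = values[i] % n_bins   # read shared input in place, no copy
            with lock:                       # other workers wait while this is held
                counts[bin_index] += 1

    # Ceiling division so every element is assigned to exactly one worker.
    step = max(1, -(-len(values) // n_workers))
    threads = [
        threading.Thread(target=worker, args=(s, min(s + step, len(values))))
        for s in range(0, len(values), step)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling `shared_histogram(list(range(100)), 10)` splits the index range among four workers, none of which receives its own copy of the data.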

The main advantage of a scale-up system is that no data copying is needed. In general, many applications can run faster on scale-up systems because all data are located directly in-memory and available to all processors. In addition, applications can use extra memory or processors without the need to distribute data across multiple machines. Perhaps the biggest advantage of scale-up systems is that the decision to add extra resources (memory or processors) to an application is always optional, never required. Finally, scale-up systems are easily upgradable with little or no impact on users.

Compared to scale-out systems, scale-up systems have limits on scalability: the maximum number of processors and the total memory size are much lower than can be achieved with scale-out systems. These limits are normally quite large, however, and do not represent an issue for many applications. Scale-up systems do present a single point of failure, but they are usually well engineered and designed with redundant features to reduce the impact of any hardware issues. Finally, because memory is shared, there can be cases where locked portions of memory cause processors to stall while waiting for access.

Scale-Out with Distributed Memory

In contrast with the scale-up design, a scale-out or distributed memory system uses multiple independent computers together to solve big problems. Often referred to as a “computing cluster,” these designs require a network that connects the component machines. All shared data must traverse this network (or interconnect) as the application progresses. Thus, if there is too much data to fit in one computer, the data are broken up and distributed across multiple machines. During the computation, if one machine needs data that resides on another machine, that data must be copied between machines. This copied data is usually in the form of a message from one machine to another.
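The partition-and-message pattern described above can be sketched as follows. This is a single-process simulation under stated assumptions, not a real cluster program: threads stand in for independent machines, a `queue.Queue` stands in for the interconnect, and `pickle` stands in for serializing a message onto the network. The names `partition`, `node`, and `distributed_sum` are hypothetical. Each node works only on its own chunk, and the only thing that crosses the simulated network is a small message carrying its partial result.

```python
import pickle
import queue
import threading


def partition(data, n_nodes):
    """Split data into n_nodes contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for rank in range(n_nodes):
        end = start + size + (1 if rank < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def node(rank, local_data, outbox):
    """A simulated node: it sees only its own chunk and reports by message."""
    partial = sum(local_data)
    # Serializing makes the copy explicit: the receiver gets bytes, not a reference.
    outbox.put(pickle.dumps((rank, partial)))


def distributed_sum(data, n_nodes=4):
    """Sum data by distributing chunks and collecting one message per node."""
    inbox = queue.Queue()  # assumption: stands in for the network link to the root
    workers = [
        threading.Thread(target=node, args=(rank, chunk, inbox))
        for rank, chunk in enumerate(partition(data, n_nodes))
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total = 0
    for _ in range(n_nodes):
        _rank, partial = pickle.loads(inbox.get())  # receive and decode a message
        total += partial
    return total
```

On a real cluster the same structure is typically expressed with a message-passing library such as MPI, where the send and receive steps cross an actual network.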

Scale-out systems have the advantage that they can grow to very large sizes and use moderately sized systems as building blocks (often referred to as compute nodes or “nodes”). If an application can distribute independent or semi-independent data across a scale-out system, then computing can scale to a very large size, often numbering in the hundreds or thousands of nodes. Scale-out systems also have a certain level of redundancy, in that the stoppage of a single node will not stop the entire system. It will, however, in almost all cases crash the applications that were running on the failed node.

The disadvantages of scale-out systems include the need to copy data from node to node. If the amount of data copying is large, it can cause processors to wait and reduce application performance. Cluster designers therefore pay close attention to network performance and use technologies like InfiniBand for best performance. Application scalability depends upon both the system interconnect performance and the concurrency in the application.
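The dependence on interconnect performance and application concurrency can be made concrete with a toy runtime model. This is a rough sketch under simplifying assumptions, not a model taken from the original report: it assumes the work splits into a serial part and a perfectly divisible parallel part, that each node pays a fixed latency per message plus a bandwidth cost for the bytes it exchanges, and that communication does not overlap with computation. All function and parameter names are hypothetical.

```python
def estimated_runtime(work_s, n_nodes, serial_fraction=0.0, bytes_per_node=0.0,
                      bandwidth_bytes_per_s=1.0, messages_per_node=0,
                      latency_s=0.0):
    """Estimate wall-clock seconds for a job spread over n_nodes (toy model)."""
    serial = work_s * serial_fraction                     # cannot be distributed
    parallel = work_s * (1.0 - serial_fraction) / n_nodes
    if n_nodes == 1:
        return serial + parallel                          # one node: nothing to copy
    # Time each node spends waiting on the interconnect.
    comm = messages_per_node * latency_s + bytes_per_node / bandwidth_bytes_per_s
    return serial + parallel + comm


def speedup(work_s, n_nodes, **model_params):
    """Ratio of single-node runtime to n_nodes runtime under the same model."""
    return (estimated_runtime(work_s, 1, **model_params)
            / estimated_runtime(work_s, n_nodes, **model_params))
```

In this model, raising `bytes_per_node` or `latency_s` lowers the speedup at a fixed node count, and raising `serial_fraction` caps the speedup no matter how fast the network is, which mirrors the two factors named above.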

Both scale-out and scale-up systems use multi-core processors. Thus, a scale-out “cluster” is really a distributed collection of scale-up systems (islands of multi-core machines).

Multi-core Processors

All modern servers and desktop machines use multi-core processors. Currently the most popular are the x86_64 Xeon® processors from Intel®. As described above, the first versions of these machines had a basic single processor and memory architecture. Additional processors (called cores) were added to increase performance while keeping systems within an acceptable power envelope. Thus all modern multi-core machines use a form of scale-up design to achieve more performance.

A true scale-up machine essentially continues the multi-core scaling found in all modern processors.

Accelerators

One of the current trends in HPC is the use of accelerators or co-processors to improve the performance of mathematical operations. These devices are usually GPUs or purpose-built accelerators like the Intel® Xeon Phi™ that assist the processors. In terms of architecture, accelerators are primarily used to off-load repetitive operations from the processor. Both scale-out and scale-up systems can use accelerators.

File Systems

Another aspect of computing is persistent data storage. For high performance systems, a parallel (multi-port) file system is often used. A parallel file system allows simultaneous reading and writing by many different applications or by one large parallel application. An example of a parallel file system is the Lustre® file system. Both scale-out and scale-up systems can use these parallel file systems.
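The access pattern that a parallel file system is built to serve, many writers working on one file at the same time, can be sketched on an ordinary local file. This is only an illustration of the pattern, under the assumption of a POSIX system where `os.pwrite` is available. It does not use or emulate Lustre, and a local disk provides none of the multi-server bandwidth of a real parallel file system. The function name `parallel_write` is hypothetical.

```python
import os
import threading


def parallel_write(path, blocks):
    """Write equal-sized byte blocks to disjoint offsets of one file concurrently."""
    block_size = len(blocks[0])
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    try:
        def writer(index, payload):
            # pwrite takes an explicit offset, so writers never share a file cursor.
            os.pwrite(fd, payload, index * block_size)

        threads = [
            threading.Thread(target=writer, args=(i, block))
            for i, block in enumerate(blocks)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
```

Because each writer owns a disjoint region of the file, no locking is needed between them. In a large parallel application, each process typically writes its own slice of a shared output file in this way.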

Over the next few weeks this series on in-memory computing will cover the following additional topics:

You can also download the complete report, “insideHPC Research Report on In-Memory Computing: Five Ways In-Memory Computing Saves Money and Improves TCO,” courtesy of SGI and Intel.