High performance computing (HPC) has always pushed the limits and designs of computing technology. The unique ability of HPC to deliver new scientific discoveries and insights drives the market forward. Interestingly, many of the breakthroughs in the HPC sector translate into improvements in the mainstream computing sector.
When the history of HPC is viewed in terms of technological approaches, three epochs emerge. The most recent epoch, that of co-design systems, is new and somewhat unfamiliar to many HPC practitioners. Each epoch is defined by a fundamental shift in design, new technologies, and the economics of the day.
The Evolution of HPC is the second article in a series from the insideHPC Guide to Co-Design Architectures.
Epoch 1: Breaking Supercomputers into Clusters
In the beginning, computers of any kind were expensive devices. The ability to perform numerical calculations at high speed made them attractive to the scientific community. Indeed, one of the first computer languages, Fortran (short for Formula Translation), was developed so that scientific and engineering applications could be moved to different computers, saving the cost of re-programming.
Recognizing the need for specialized architectures, companies like Control Data and then Cray Research developed machines called supercomputers that specialized in floating point operations. These machines were rated by how many FLOPs (floating point operations per second) they could deliver. Many other large computers of the day were more focused on the needs of business computing. Developed in the 1960s, these specialized and expensive supercomputers found many uses. In particular, the Cold War made supercomputing a strategic necessity and helped push the technology forward.
By 1980, the number of processors used in a supercomputer began to increase. To overcome the limits of a single processor, multiple processors (in some cases hundreds of processors) were connected through various means and used in concert to achieve better performance. At the same time, the cost to fabricate faster and faster processors began to increase. At one point, the projected cost of fabricating a leading edge processor exceeded the entire market size of the supercomputer sector. These economies of scale forced one of the biggest changes in the early supercomputing market.
In order to justify development and fabrication costs, supercomputer vendors began turning to the commodity markets where processors were sold in large quantities (and therefore justified the cost of fabrication). This change had two very dramatic effects on the market:
- First, monolithic supercomputer systems began to splinter as many of the commodity components could be purchased from competing vendors.
- And second, the shift to commodity hardware lowered the economic barrier of entry by at least a factor of ten (i.e. more organizations could afford supercomputing systems).
As the number of processor families continued to shrink due to economies of scale, the x86 family of processors grew in popularity, largely due to the desktop-PC revolution. In 1995, Intel began shipping processors designed for servers (Pentium Pro) that further lowered the cost of delivering FLOPs. Like early supercomputers, servers often had two or more processors that worked together using shared memory (these systems are often called Symmetric MultiProcessing or simply SMP systems). At the time, there were many SMP server options (each with its own home-spun processor) including those from Sun, IBM, HP, DEC, and others. These big players also offered large-scale SMP systems that could be leveraged for many high performance tasks.
The operating system of choice for supercomputing and SMP servers was UNIX. Each processor vendor offered its own version of SMP UNIX (often derived from a common source), but just as with the commoditization of processors, maintaining these independent versions was expensive. In 1991, Linus Torvalds released a freely available version of Linux that was essentially a plug-and-play alternative to UNIX. The ability to standardize on an open and freely available UNIX clone allowed low cost clusters of x86 workstations and servers to be built. By using the Message Passing Interface (MPI) library, clusters could run many of the supercomputing applications that had been written for “bigger machines.” The performance of these first systems, often called “Beowulf Clusters” (named for the NASA project that developed and used these systems), came close to that of large UNIX SMP systems and even many supercomputers of the day.
The Beowulf systems were initially connected with Fast Ethernet, which turned out to be, in some cases, a bottleneck to performance. Other high performance interconnects were developed and eventually the market settled on InfiniBand as the de facto HPC interconnect.
By the late 1990s the Beowulf approach to supercomputing had taken hold. The name supercomputer fell out of favor and the term “HPC systems” was used to describe high performance clusters. Individual servers were often referred to as cluster nodes and, due to the commodity nature of most components (i.e. users could choose individual components from a number of sources), HPC clusters provided a price-to-performance ratio at least ten times better than existing SMP and supercomputer systems.
Even in these early cluster systems there were some elements of co-design. Depending on the application set, some machines were designed around problems (or a single problem). Within a budget limit, the designer had to strike a balance between the number of nodes, amounts of memory, storage, and the type of interconnect. Using a slower, less expensive interconnect (e.g. Gigabit Ethernet) allowed more compute nodes but, if a faster interconnect was needed (e.g. InfiniBand), then the number of nodes was reduced. Amounts of memory and storage also had to fit into a tunable budget-performance equation.
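The budget trade-off described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and every price below are hypothetical placeholders, not figures from any vendor or from this guide.

```python
def nodes_within_budget(budget, node_cost, nic_cost, switch_port_cost):
    """Return how many complete nodes fit in the budget.

    Each node needs the server itself, one network adapter, and one
    switch port. All costs are in the same (arbitrary) currency unit.
    """
    cost_per_node = node_cost + nic_cost + switch_port_cost
    return budget // cost_per_node


# Hypothetical prices, chosen only to make the trade-off visible.
BUDGET = 500_000
NODE_COST = 4_000

# Slower, cheaper interconnect (e.g. Gigabit Ethernet, adapter on board).
gige_nodes = nodes_within_budget(BUDGET, NODE_COST, nic_cost=0, switch_port_cost=50)

# Faster, more expensive interconnect (e.g. InfiniBand).
ib_nodes = nodes_within_budget(BUDGET, NODE_COST, nic_cost=600, switch_port_cost=400)

print(f"Gigabit Ethernet: {gige_nodes} nodes, InfiniBand: {ib_nodes} nodes")
```

With these made-up numbers the faster interconnect costs roughly a fifth of the node count. Whether that is a good trade depends on how sensitive the application set is to communication speed.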
Epoch 2: The Multi and Many-core Explosion
By November 2005, the Top500 computer list was dominated (70%) by clusters of x86 servers. The move away from SMPs and dedicated supercomputers was complete and continues today: 86.2% of the top systems are clusters (as of June 2016). These initial cluster systems used single core processors, often with two processor sockets per cluster node. There had been a steady increase in processor clock speed for each new generation; however, processors were about to undergo a huge change.
The steady increase in clock speed (and hence processor speed) was limited by three issues:
- Memory Speed: The gap between processor and memory speed continued to grow. In order to keep the processors busy, more on-chip cache was needed to aid with repeated memory access. However, many HPC applications are sensitive to memory bandwidth and the additional cache did not help with performance.
- Instruction Level Parallelism: The increasing difficulty of finding enough parallelism in a single instruction stream to keep a high performance single-core processor busy.
- Power Wall: Increased processor frequency causes an increase in operating temperature. Shrinking the processor die can offset this increase; however, a smaller die also increases the amount of leakage current (i.e. current that travels through transistors that are in the “off” state), which creates more heat.
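The power wall can be made concrete with the standard first-order model of CMOS power, which is textbook material rather than something stated in this guide. Here α is the fraction of transistors switching per cycle, C the switched capacitance, V the supply voltage, f the clock frequency, and I_leak the leakage current:

```latex
P_{\text{total}} \approx \underbrace{\alpha \, C \, V^{2} f}_{\text{dynamic}} + \underbrace{V \, I_{\text{leak}}}_{\text{static}}
```

Because a higher f usually also requires a higher V, dynamic power grows faster than linearly with clock speed, while the static term grows with leakage as transistors shrink.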
Given these challenges, chip designers turned in another direction. Instead of making processors faster, they put more of them on a single processor substrate. The era of multi-core began with dual CPU modules (cores) sharing the resources of a single processor socket (i.e. they both “see” the same memory and act identically to a two processor socket server). The new dual-core chips essentially doubled the number of CPU core elements in a cluster. Taking advantage of Moore’s law, more cores were added to each new generation of processors. As of June 2016, eighty percent of the systems on the Top500 list have between 6 and 12 cores per processor socket. The increase in computing density allowed both the number of HPC jobs (capacity) and the size of HPC jobs (capability) to increase.
As the number of cores increased, another technology began to migrate into HPC that accelerated performance for many applications. These systems were based on commodity Graphics Processing Units (GPUs) that contain large numbers (hundreds to thousands) of small efficient cores that work in unison. This new many-core approach essentially “offloaded” certain types of operations from the processor onto a GPU. These types of operations, which are similar to those needed to render graphics, allowed processors to perform what is known as Single Instruction Multiple Data (SIMD) parallel processing. As with the switch to multi-core, not all applications could take advantage of the new resources, but those that could showed remarkable speed-up, often 20-30 times faster.
As in the previous epoch, elements of co-design were needed to stand up an efficient clustered system. Depending on the application set, there were decisions as to the number of cores and memory per node and the need for GPU accelerators. Indeed, due to the large amount of computing power available on a single node, certain applications that previously would span multiple nodes could now be run efficiently on one fat node. Thus, HPC cluster design would often take on a heterogeneous architecture and allow the workflow scheduler to direct applications to the appropriate resources within a cluster.
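To illustrate what Single Instruction Multiple Data means, here is a small Python sketch that emulates the idea in software. It is not GPU code: the function names and the `lanes` parameter are invented for this example, and the "SIMD" version only mimics the way one instruction is applied to a group of independent data elements per step.

```python
def saxpy_scalar(a, x, y):
    """Compute a*x + y one element per step, as a single scalar core would."""
    out = []
    steps = 0
    for xi, yi in zip(x, y):
        out.append(a * xi + yi)
        steps += 1
    return out, steps


def saxpy_simd(a, x, y, lanes=4):
    """Emulate SIMD: the same multiply-add is applied to `lanes` elements per step."""
    out = []
    steps = 0
    for start in range(0, len(x), lanes):
        xs = x[start:start + lanes]
        ys = y[start:start + lanes]
        # Conceptually one instruction, applied to every lane at once.
        out.extend(a * xi + yi for xi, yi in zip(xs, ys))
        steps += 1
    return out, steps


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 10, 10, 10, 10, 10, 10, 10]

scalar_result, scalar_steps = saxpy_scalar(2, x, y)
simd_result, simd_steps = saxpy_simd(2, x, y, lanes=4)
print(scalar_steps, simd_steps)
```

Both versions produce the same result, but the emulated SIMD version needs a quarter of the steps. The approach only works because no element depends on another, which is also why not every application benefits from an accelerator.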
A snapshot of the state-of-the-art HPC system can be found by inspecting the June 2016 Top500 list. In this tally of the world’s fastest computers (as determined by running a single benchmark), 90% of all systems use x86_64 multi-core processors, 19% use GPUs or other accelerators, and 40% are connected by InfiniBand. Thirty-five percent of the systems use 10 GbE; however, many of these clusters are not strictly HPC systems and are used for other purposes. Unlike many HPC applications, the Top500 benchmark is not very sensitive to interconnect speed.
While these systems have reached the petaFLOPs (10^15 floating point operations per second) level of computing and are responsible for producing many new discoveries, the current design may have difficulty pushing HPC into the Exascale (10^18) regime.
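The size of the jump from petaFLOPs to Exascale is easier to appreciate with some arithmetic. The sketch below uses the usual theoretical-peak estimate (sockets × cores × clock × floating point operations per cycle). The node description is hypothetical: two sockets and 12 cores per socket echo the figures mentioned earlier, while the 2.5 GHz clock and 16 operations per cycle are assumed values chosen only for illustration.

```python
PETA = 10**15  # floating point operations per second
EXA = 10**18


def peak_flops_per_node(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    """Theoretical peak of one node; real applications achieve less."""
    return sockets * cores_per_socket * clock_hz * flops_per_cycle


def nodes_needed(target_flops, node_flops):
    """Smallest whole number of nodes whose combined peak reaches the target."""
    return -(-target_flops // node_flops)  # integer ceiling division


# Hypothetical node: 2 sockets x 12 cores, 2.5 GHz, 16 FLOPs per cycle per core.
node = peak_flops_per_node(
    sockets=2, cores_per_socket=12, clock_hz=2_500_000_000, flops_per_cycle=16
)

print(f"peak per node: {node:,} FLOPs")
print(f"nodes for 1 petaFLOPs: {nodes_needed(PETA, node):,}")
print(f"nodes for 1 exaFLOPs:  {nodes_needed(EXA, node):,}")
```

Under these assumptions a petaFLOPs machine needs about a thousand nodes, while an Exascale machine built from the same nodes would need about a million, which hints at why simply adding more of the same hardware becomes impractical in terms of power, cost, and interconnect.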
Epoch 3: Melting the Edges of Hardware and Software Through Co-Design
Various predictions expect Exascale computing to arrive in the early 2020s. Simply extending conventional approaches is not expected to reach this goal. Though not a new concept, the co-design of hardware and software to meet a performance goal is the most encouraging approach. A good overview of HPC co-design can be found in On the Role of Co-design in High Performance Computing by R. F. Barrett et al. In particular, the Barrett paper defines the co-design approach as:
The co-design strategy is based on developing partnerships with computer vendors and application scientists and engaging them in a highly collaborative and iterative design process well before a given system is available for commercial use. The process is built around identifying leading edge, high-impact scientific applications and providing concrete optimization targets rather than focusing on speeds and feeds (FLOPs and bandwidth) and percent of peak. Rather than asking “what kind of scientific applications can run on an Exascale system” after it arrives, this application-driven design process instead asks “what kind of system should be built to meet the needs of the most important science problems.” This leverages deep understanding of specific application requirements and a broad-based computational science portfolio.
Thus Exascale machines are likely to become more purpose-built and to be rated by the performance of specific applications rather than by the general Top500 High Performance Linpack (HPL) benchmark (unless the machine was built to run the HPL benchmark as fast as possible). Extending today’s multi-core/many-core clusters to the Exascale range is hampered by the disconnect between hardware and software. Past levels of co-design often stop at a software boundary. That is, application designers are given a new fixed hardware design and must optimize software to use this hardware (e.g. GPU accelerators). A hardware/software separation does, however, have portability advantages (recall standard Fortran) and affords easier hardware designs (i.e. underlying details are “abstracted” away from the user). The drawback is that, because hardware and software are developed in isolation, performance-enhancing optimizations that span the boundary between them are simply not possible.
As mentioned, some level of co-design already exists in many areas of HPC. The rise of GPU-derived co-processors or accelerators (with subsequent software modification) has allowed commodity-based systems to push past previous computational limits. In effect, the heavy lifting is now done by SIMD co-processors on GPU-derived accelerators (essentially array processors). For instance, the popular AMBER molecular dynamics application is now run almost exclusively on GPU-derived accelerators. While not strictly co-designed (NVIDIA has, however, contributed heavily to the GPU version of AMBER), AMBER/GPU systems have become a purpose-built product line for many integrators.
Another example of HPC co-design is the purpose-built storage system. Offloading the entire storage responsibility to a parallel file system has provided better performance for many HPC clusters. Parallel file systems running separately from the computation nodes are often designed with a specific application set in mind. File system designers prefer to know the application IO rates and patterns in order to design an optimal system. It is not uncommon for designers to meet with users to determine these parameters.
Network offloading is another technique, one that reduces the number of CPU cycles needed for network communication. In the case of fast networks, those that move data at tens of gigabits per second, the processing overhead of the network stack can become significant and limit performance. Stateless offloads, such as checksum offload, segmentation offload, and large receive offload, are technologies used in many network interface cards to reduce the amount of work required by the processor. InfiniBand also provides a full transport offload that completely circumvents the operating system (often called “user space” or “zero copy” protocols). Another form of offloading is Remote Direct Memory Access (RDMA), where memory on remote systems can be accessed directly over the interconnect. RDMA transfers require no work to be done by the sending or receiving processor, and transfers continue in parallel with other system operations.
Network hardware for HPC can now offload MPI operations or provide new levels of network programmability. This design invites new architectures and ushers in a network co-design for HPC. Instead of a monolithic CPU that manages MPI or SHMEM communication, a programmable co-design presents a new model that blurs the lines between discrete cluster components (i.e. the server, accelerators, and the network). A network co-design model allows data algorithms to be executed more efficiently using smart interface cards and switches. As co-design approaches become more mainstream, design resources will begin to focus on specific issues and move away from optimizing general performance. Co-design will allow designers and users to co-build high performance machines around important problems.
Over the next several weeks we will explore each of these topics in detail.
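To give a sense of the per-packet work that checksum offload removes from the processor, here is a minimal software sketch of the 16-bit ones' complement Internet checksum (described in RFC 1071) used by IP, TCP, and UDP. The function name is illustrative; a real network stack does this in optimized kernel code or, with offload enabled, in the network interface card itself.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        # Combine two bytes into one big-endian 16-bit word and accumulate.
        total += (data[i] << 8) | data[i + 1]

    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    return ~total & 0xFFFF


# Example bytes taken from the worked example in RFC 1071.
payload = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(payload)
print(hex(checksum))

# A receiver verifies by checksumming the data plus the transmitted checksum.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
print(verified)
```

Every byte of every packet passes through a loop like this when the work is done in software, so at tens of gigabits per second the cost adds up quickly. Moving it into the network card frees those cycles for the application.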
- Designing Machines Around Problems: The Co-Design Push to Exascale
- The Evolution of HPC (this article)
- The First Step in Network Co-design: Offloading
- Network Co-design as a Gateway to Exascale
- Co-design for Data Analytics And Machine Learning
If you prefer you can download the insideHPC Guide to Co-Design Architectures from the insideHPC White Paper Library.