When I was just a young lad, my father handed me a canvas sack full of old nuts, bolts, half-eaten carburetors and a two-stroke Briggs and Stratton engine. After several failed attempts at reassembly, I learned that scavenging the best parts from different sources doesn’t necessarily result in a quality or functional product. Such is the case with accelerated computing. In order to gain modest, kernel-specific single-threaded speedup, we utilize parts such as field programmable gate arrays (FPGAs) and graphics processors (GPUs) that don’t typically live inside the traditional HPC universe. Integrating these items into traditional HPC codes can be cumbersome and time-consuming.
What if all the integration headaches disappeared?
The answer to this question is what spurred the collective minds of Bruce Toal, Tony Brewer and Steve Wallach to found Convey Computer. The idea the ex-heads of Convex Computer had was to take a holistic approach to develop an integrated system of hardware, software and execution models that would radically accelerate single threads. The secret sauce behind Convey’s technology is the ability to deliver on these goals completely abstracted from the user with no inherent code changes required: compile, run.
I recently had a chance to sit down with the three founders at their Richardson, Texas mother ship to find out exactly how they plan for all of this to work. By now, you’ve probably seen the various stories that were published when Convey de-cloaked last November, but very few people have taken a peek under the hood of Convey’s technology.
From the outside, their flagship HC-1 compute node looks like any other Intel-based x86_64 server: 2U in form factor with a dual-socket motherboard. One socket is populated with a single, dual-core Intel Xeon 5400 series processor. The other socket is populated with what looks like a silicon stovepipe but is actually the socket interface to a daughter board mounted in the top 1U of the chassis. This daughter board is the first piece of the Convey solution.
The daughter board is an interesting piece of technology. It houses four Xilinx Virtex-5 FPGAs, or Application Engines (AEs), eight memory controllers with sixteen DDR2 DIMM slots, and the host computer interface, or Application Engine Hub (AEH). We’ve seen similar socket-native designs based on the AMD HyperTransport interface. However, the AEH goes several steps further, providing cache-coherent access between the host processor, acceleration units (FPGAs), and the Intel I/O hub to all memory on the system.
Users aren’t required to explicitly transfer memory blocks from host processor to acceleration processors within a compute kernel; the acceleration processors can reference all the memory on the host processor. Likewise, the host processor can access all memory physically located within the sixteen DDR2 DIMMs on the acceleration board. Practically minded readers will realize that this requires significant memory bandwidth. The eight memory controllers on the Convey board support up to 80GB/s of bandwidth, complete with integrated snoop filters.
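To make the difference in programming model concrete, here is a toy Python sketch of my own. It is not Convey code: the class SeparateMemoryAccelerator, the function scale_in_shared_memory and the 8-byte word size are all hypothetical, chosen only to contrast an accelerator that needs explicit copies with one that shares the host's address space.

```python
WORD_BYTES = 8  # assumption: 64-bit words


class SeparateMemoryAccelerator:
    """Hypothetical accelerator with private memory; the user must copy data."""

    def __init__(self):
        self.device_buffer = []
        self.bytes_transferred = 0

    def copy_in(self, host_data):
        # Explicit host-to-device transfer written by the programmer.
        self.device_buffer = list(host_data)
        self.bytes_transferred += len(host_data) * WORD_BYTES

    def scale(self, factor):
        # The kernel only sees the device-side copy.
        self.device_buffer = [x * factor for x in self.device_buffer]

    def copy_out(self):
        # Explicit device-to-host transfer written by the programmer.
        self.bytes_transferred += len(self.device_buffer) * WORD_BYTES
        return list(self.device_buffer)


def scale_in_shared_memory(data, factor):
    """Kernel that updates, in place, memory both sides can address."""
    for i in range(len(data)):
        data[i] *= factor


host_array = [1.0, 2.0, 3.0, 4.0]

# Model 1: copy in, compute, copy out.
accel = SeparateMemoryAccelerator()
accel.copy_in(host_array)
accel.scale(2.0)
result_with_copies = accel.copy_out()

# Model 2: one shared address space, no staging code in the application.
shared_array = list(host_array)
scale_in_shared_memory(shared_array, 2.0)
```

Both paths produce the same numbers. What changes is how much staging code the programmer has to write and maintain. In real hardware the data still has to travel between DIMMs and processing elements, which is why the bandwidth figure matters.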
For applications whose memory accesses are heavily strided, such as multidimensional fluid dynamics problems, Convey can provide a special Scatter-Gather DIMM (SG-DIMM). These applications may need only 8 bytes from an entire 64-byte cache line, so rather than wasting bandwidth on the rest of the line, the Convey memory system has the ability to operate on single words. When coupled with SG-DIMMs, this approach delivers higher effective memory bandwidth.
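The cost of strided access on a conventional cache-line memory system is easy to estimate. The sketch below is my own back-of-the-envelope model, not a Convey tool: the function name cache_line_traffic is hypothetical, and it assumes aligned 8-byte elements, 64-byte lines, and no reuse of a line once fetched.

```python
def cache_line_traffic(num_elements, stride_elements,
                       element_bytes=8, line_bytes=64):
    """Return (bytes_used, bytes_fetched) for a strided walk over memory
    when every access pulls in a whole cache line."""
    lines_touched = set()
    for i in range(num_elements):
        byte_offset = i * stride_elements * element_bytes
        lines_touched.add(byte_offset // line_bytes)
    bytes_used = num_elements * element_bytes
    bytes_fetched = len(lines_touched) * line_bytes
    return bytes_used, bytes_fetched


# Unit stride: every byte of every fetched line is used.
used, fetched = cache_line_traffic(1024, stride_elements=1)
unit_stride_efficiency = used / fetched

# Stride of 8 elements (64 bytes): one useful word per fetched line.
used, fetched = cache_line_traffic(1024, stride_elements=8)
strided_efficiency = used / fetched
```

Under these assumptions the unit-stride walk uses every byte it fetches, while the stride-8 walk uses only one eighth of the traffic it generates. A memory system that can fetch single words avoids paying for the other seven eighths. How close real hardware gets to that ideal is something only measurement can show.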
Expanding access by reducing barriers
The hardware just provides the capability to do work; actually getting anything done requires software. Those familiar with FPGAs know that they typically require a deep understanding of electrical engineering and logic circuits in order to make good use of them. One of Convey’s primary goals was to reduce the level of expertise required, even to the point of making these capabilities available for users running ISV codes. To make this work, Convey depends on a runtime system.
Convey started with the Open64 compiler and built a fully compliant C/C++/Fortran compiler stack that generates GNU-compliant x86_64 code for both the host processor and the acceleration processor. What exactly does this mean? Code compiled with the Convey compilers will run on non-Convey and Convey-accelerated x86_64 machines. Binary compatibility. This also means that tools such as the GNU debugger system can be natively utilized to debug code on a Convey machine. It will even understand the accelerated instruction extensions that are present. All of which is a remarkable achievement.
The last portion of Convey’s solution is their ability to change the configuration of the accelerated computing units. Keeping in mind that the core of each accelerated unit is a Virtex-5 FPGA, Convey has developed the ability to swap what they call hardware personalities. For example, suppose a user has a code that processes large arrays that can be easily vectorized, followed by a series of long state machines. When the application lands on the first vector instruction, the AEH signals the Application Engines to reconfigure themselves into a wide vector processor. At the point where the vector instructions are complete, the AEH signals again to switch to a specialized state machine. This hardware context switching occurs in real time with very little latency to the application. The AEH also caches these personalities in the event that they must be reused.
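The switch-and-cache behavior can be pictured with a small simulation. This is a toy model of my own, not a description of how the AEH is implemented: the PersonalityCache class, its capacity of two, the least-recently-used eviction policy and the personality names are all assumptions made for illustration.

```python
from collections import OrderedDict


class PersonalityCache:
    """Toy model of an accelerator that reconfigures on demand (hypothetical)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cached = OrderedDict()  # personality name -> placeholder image
        self.active = None
        self.full_loads = 0          # personalities loaded from scratch
        self.cache_hits = 0          # switches served from the cache

    def activate(self, name):
        """Make `name` the active personality and report what it cost."""
        if name == self.active:
            return "already-active"
        if name in self.cached:
            self.cached.move_to_end(name)  # mark as most recently used
            self.cache_hits += 1
            outcome = "cache-hit"
        else:
            if len(self.cached) >= self.capacity:
                self.cached.popitem(last=False)  # evict least recently used
            self.cached[name] = f"<image for {name}>"
            self.full_loads += 1
            outcome = "full-load"
        self.active = name
        return outcome


def run(program, cache):
    """Feed the personality each program phase needs to the cache."""
    return [cache.activate(required) for required in program]


cache = PersonalityCache(capacity=2)
phases = ["vector", "vector", "state_machine", "vector", "state_machine"]
outcomes = run(phases, cache)
```

In this model the first vector phase and the first state machine phase each pay for a full load. Every later switch between the two is a cache hit. That is the benefit a cache would offer an application that alternates between a handful of personalities.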
The system comes bundled with a series of common hardware personalities that fit a variety of applications. However, Convey also bundles a Personality Development Kit, or PDK, so that users can mold the hardware into whatever form they need. If your application dictates that you need a specialized vector unit that receives thousands of instructions or a wide state machine that only requires a pair of entry/exit instructions, the Convey development platform will accommodate it.
The Convey Computer HC-1 platform is by far one of the most exciting developments for the high performance computing segment that I’ve personally witnessed in a very long time. This platform has the potential to reset the bar for the flagship HPC vendors. It delivers high single-threaded performance without sacrificing compatibility, compliance or usability. Ultimately, only time will tell whether the advanced capabilities hiding behind the faceplate will achieve widespread user acceptance. However, thanks to Convey, there is no doubt that the HPC universe has gotten a little bit bigger with a level of innovation that we haven’t seen tried in a while.