The GPUltima: Up to a Petaflop of Networked GPUs in a Single Rack


In this week’s Sponsored Post, Katie (Garrison) Rivera of One Stop Systems explains how GPUltima allows HPC professionals to create a highly dense compute platform that delivers a petaflop of performance at greatly reduced cost and space requirements.

As developers have begun requiring more GPUs to process complex applications, manufacturers have looked for new ways to add GPUs to systems. Most solutions involve adding a number of high-bandwidth slots to a system, typically accommodating up to 8 GPUs in one system. System constraints such as power, cooling and BIOS issues make it difficult to support large numbers of these cards. One Stop Systems (OSS) instead put the GPUs in a separate enclosure with sufficient power and cooling and connected the enclosure to one or more dual- or quad-socket servers through PCIe, the servers’ native bus. The resulting High Density Compute Accelerator (HDCA) adds up to 16 double-wide GPUs in 3U of rack space. The HDCA has been successful in a wide variety of applications, such as defense, oil and gas, and research, because it provides the compute power needed to quickly process the large amounts of data these intensive applications generate.

OSS then connected the GPUs through InfiniBand, allowing each GPU in the system to communicate with any other GPU. To make the system more accessible to our customers, we added cluster distribution software so that all customers have to do is add their own application software. But we didn’t stop there. If 16 networked GPUs are great, 128 networked GPUs are even better. We call it the GPUltima.


OSS introduced the GPUltima at SC15 in November. The GPUltima consists of eight nodes, with each node containing up to 16 accelerators and a dual- or quad-socket server. Each GPU is connected to the others through a 100Gb InfiniBand switch, and the entire system is connected to the outside world through Ethernet. Customers can build up to the full rack one node at a time, depending on their application requirements. A single node provides up to 139 Tflops, which adds up to roughly one petaflop with all eight nodes.
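
The rack-level figures above follow from simple multiplication. As a quick check, the short Python sketch below works through that arithmetic for a partially or fully populated rack. The constant and function names are illustrative only, and the numbers are the peak figures quoted in this article, not measured results.

```python
# Figures quoted in the article; all names here are illustrative.
NODES_PER_RACK = 8        # maximum nodes in a full GPUltima rack
GPUS_PER_NODE = 16        # maximum accelerators per node
TFLOPS_PER_NODE = 139.0   # quoted peak for a fully populated node


def rack_totals(nodes: int = NODES_PER_RACK) -> dict:
    """Return GPU count and peak throughput for a rack with `nodes` nodes installed."""
    if not 1 <= nodes <= NODES_PER_RACK:
        raise ValueError(f"nodes must be between 1 and {NODES_PER_RACK}")
    peak_tflops = nodes * TFLOPS_PER_NODE
    return {
        "gpus": nodes * GPUS_PER_NODE,
        "peak_tflops": peak_tflops,
        "peak_pflops": peak_tflops / 1000.0,
    }


if __name__ == "__main__":
    # Show how capacity grows as nodes are added one at a time.
    for n in range(1, NODES_PER_RACK + 1):
        t = rack_totals(n)
        print(f"{n} node(s): {t['gpus']} GPUs, {t['peak_tflops']:.0f} Tflops peak")
```

With all eight nodes installed this gives 128 GPUs and 1,112 Tflops of peak throughput, or roughly 1.1 petaflops.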

Deep learning is one application that can benefit from either a single node or multiple nodes of the GPUltima, depending on the complexity of the specific application. Deep learning is a branch of machine learning that attempts to train computers to identify patterns and objects in the same way humans do. For example, Google Brain, a cluster of 16,000 computers, successfully trained itself to recognize a cat based on images taken from YouTube videos. This technology is already used in speech recognition, photo searches on Google+ and video recommendations on YouTube. Training the neural networks used in deep learning is an ideal task for GPUs because GPUs can perform many calculations at once (in parallel), so training takes much less time than it otherwise would. For an application such as deep learning, multiple networked GPUs such as those in the GPUltima can achieve further speedups, especially with convolutional neural networks or applications that try more than one algorithm at the same time.
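
One common way multiple GPUs speed up training is data parallelism: each device computes gradients on its own slice of a batch, and the results are averaged before the weights are updated. The sketch below illustrates that idea in plain Python on a toy one-parameter linear model. It is an assumption about how a workload might be split, not a description of the GPUltima’s cluster software. The function names and the four-way split are hypothetical, and the "workers" are simulated one after another rather than run on real GPUs.

```python
def gradient(w, shard):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(batch, workers):
    """Split a batch into `workers` equal shards (batch size must divide evenly)."""
    if len(batch) % workers:
        raise ValueError("batch size must be divisible by the number of workers")
    size = len(batch) // workers
    return [batch[i * size:(i + 1) * size] for i in range(workers)]


def data_parallel_gradient(w, batch, workers):
    """Average the per-shard gradients, as an all-reduce step would across devices."""
    shards = split(batch, workers)
    # On real hardware each shard's gradient would be computed on a different GPU.
    return sum(gradient(w, shard) for shard in shards) / workers


if __name__ == "__main__":
    # Toy data following y = 3x, so training should drive w toward 3.
    batch = [(float(x), 3.0 * x) for x in range(1, 9)]
    w = 0.0
    for _ in range(200):
        w -= 0.01 * data_parallel_gradient(w, batch, workers=4)
    print(f"learned w = {w:.4f}")
```

Because the shards are equal in size, the averaged result matches the gradient computed over the whole batch, so the split changes where the work happens but not the outcome of each training step.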


The GPUltima is application-ready: it contains the hardware and software elements that allow users to add their own application and begin processing. The GPUltima is therefore a complete solution rather than just hardware in a rack.

This guest article was submitted by Katie (Garrison) Rivera, Marketing Communications at One Stop Systems