
Alternatives to x86: Future Processing Technologies

In the second article in his series on future hardware for HPC, Robert Roe from Scientific Computing World looks at alternatives to the x86, including server on a chip and OpenPower.

Robert Roe


Pushing the limits of processor technology is necessary if the industry wishes to achieve exascale computing. This has led people to explore alternative processing technologies that could rival the x86 – the technology that has dominated HPC over the past five years.

The Japanese company NEC has a mainframe that can support large single-core bandwidth; ARM, based in the UK and with a history in the mobile market, has been developing its technology and software ecosystem to the point where it is becoming a real alternative to more conventional HPC technology. In addition to these technologies, the IBM OpenPower group has gained arguably the most momentum of any new architecture as it was recently awarded two of the three main contracts in the US Coral program.

In early 2014, the US Department of Energy (DOE) brought together several of its national laboratories in the joint Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (Coral) to coordinate investments in supercomputing technologies, streamline procurement and reduce costs, with the aim of developing supercomputers that will be five to seven times more powerful when fully deployed than today’s fastest systems. The computers are expected to be online by 2018 at a cost in excess of $600 million. The two IBM OpenPower machines will cost around $325 million and will make use of IBM Power Architecture, Nvidia’s Volta GPU, and Mellanox’s interconnect technologies.

But from ARM to IBM, all these technologies, however different on the surface, represent the efforts of the HPC industry to modify the current generation of HPC technology to better adapt to the demands of data-centric workloads that have increased dramatically over the last few years.

IBM, for example, worked with Nvidia to develop the NVLink interconnect technology, which will enable CPUs and GPUs to exchange data five to 12 times faster than they can today. NVLink will be integrated into IBM Power CPUs and next-generation Nvidia GPUs based on the Volta architecture, allowing the new national lab systems to achieve performance levels far exceeding today’s machines. With Mellanox, IBM is implementing a state-of-the-art interconnect that incorporates built-in intelligence, to improve data handling.
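As a rough illustration of what a five to 12 times faster CPU-GPU link means in practice, the sketch below compares transfer times for a fixed amount of data. The baseline bandwidth of 16 GB/s and the 32 GB data size are hypothetical values chosen only for the arithmetic; they are not figures from IBM or Nvidia, and only the speed-up range comes from the text above.

```python
# Illustrative arithmetic only: how a 5x-12x faster link changes transfer time.
# BASELINE_GB_PER_S and DATA_SIZE_GB are assumptions, not vendor figures.

BASELINE_GB_PER_S = 16.0   # hypothetical bandwidth of today's CPU-GPU link
DATA_SIZE_GB = 32.0        # hypothetical amount of data to move
SPEEDUP_RANGE = (5, 12)    # "five to 12 times faster", as quoted in the article


def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb over a link, ignoring latency and protocol overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


baseline_time = transfer_seconds(DATA_SIZE_GB, BASELINE_GB_PER_S)

# Transfer time at each end of the quoted speed-up range.
faster_times = {
    factor: transfer_seconds(DATA_SIZE_GB, BASELINE_GB_PER_S * factor)
    for factor in SPEEDUP_RANGE
}

print(f"baseline link: {baseline_time:.3f} s")
for factor, seconds in faster_times.items():
    print(f"{factor}x faster link: {seconds:.3f} s")
```

For data-centric workloads that shuttle large data sets between CPU and GPU memory, the time saved on each transfer is repeated many times over, which is why the interconnect matters as much as raw compute.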

All of these technologies are aimed at increasing data throughput across the computational cores; moreover, the incorporation of OpenPower technologies into a modular integrated system will enable Lawrence Livermore and Oak Ridge to customize the Sierra and Summit system configurations for their specific needs.

OpenPower

The first OpenPower Summit, held earlier this year in San Jose, California, in tandem with Nvidia’s GPU Technology Conference (GTC), demonstrated the prototype of Firestone, IBM’s first OpenPower server oriented towards HPC. The IBM Power8 server is due out later this year. It will be manufactured by Taiwan’s Wistron, sold by IBM, and will combine the technologies of Nvidia and Mellanox.

At the OpenPower event, Sumit Gupta, general manager of accelerated computing at Nvidia, highlighted that this new system is the first in a new series of high-density GPU-accelerated servers for OpenPower high-performance computing and data analytics.

The introduction of NVLink and Mellanox’s constant efforts to deliver more bandwidth mean this architecture is already adapting to the new data-centric workloads that were cited by the DOE as defining a paradigm shift in computational workloads over the coming years.

These first prototype boards will be used by the DOE to support the development of the Coral machines as they represent the closest thing available to the Power9 processor that will be installed in the DOE’s Oak Ridge and Lawrence Livermore national laboratories.

With the Pascal architecture not even released yet, Volta is set to be the next-generation accelerator architecture and will feature NVLink, so it is likely that the IBM system will depend heavily on the performance of these GPUs.

It may be that IBM is relying on this, by planning a small, very energy-efficient processor that essentially acts as a conduit for the accelerators to do the majority of the heavy lifting.

However, it is not just IBM and the national laboratories that are developing processing technologies to cope with the demands of tomorrow’s data-centric applications. Several companies have used their own experience and IP to develop their own server on a chip (SOC) or even whole mainframe systems for the HPC industry.

ARM processors and SOCs

The investment from the US national labs is also a clear sign that energy efficiency as well as the ability to handle large amounts of data are primary concerns of the largest supercomputers. This need for energy efficiency and a drive towards data intensive compute operations has encouraged people to look even further outside the traditional HPC environment to find a solution to increasing power needs.

Jeff Underhill, director of server programmes at ARM, cited ‘the need to process ever-increasing amounts of data and gain insight.’ The need for new processor technologies was not confined to HPC, he continued: ‘Look at more traditional server environments and a lot of the focus on big data analytics today where people are looking at increasingly scaled out architectures.’

Darren Cepulis, data centre evangelist at ARM, said: ‘For the larger supercomputer sites, there are major challenges around compute density and just how much compute one can fit within a power envelope. Early on, I think, that is what drove some of these investigations into leveraging ARM – the fact that we can operate an SOC purpose-built for servers and for the HPC space.’

Applied Micro

Applied Micro have developed SOCs based on ARM architecture infused with their own IP, which comes from the company’s traditions in the telecoms industry. This has led them to build a system from the ground up that focuses on increasing I/O, as Applied Micro can integrate networking components directly onto the chip.

Kumar Sankaran, senior director of software and platform engineering at Applied Micro, said: “Historically if you look at any server that was based on Intel, you would typically have a bunch of cores, of CPUs without any surrounding I/O. Or there would be very little I/O that is provided by the Xeon processor solutions in general.”

Sankaran continued: “So what we mean by server-on-chip is that we integrate all of these I/Os within one single chip, so we integrate things like 10G Ethernet, PCIE gen 3 interfaces; we integrate SATA gen3 and we also integrate the whole memory controller subsystem within one SOC.”

Sankaran explained that a design built on the X-Gene family of products would be much simpler, while still including the ARM cores and networking elements that differentiate it from some of the processing technologies already available.

Mike Major, vice president of marketing, said: ‘We looked at everything and made a very conscious decision around the ARM architecture. In fact, when we decided that there was a place for a 64-bit ARM architecture, we approached ARM and worked together with them to develop that in the first place.’

Major continued: “There is a sense within the EU in general that there is probably little love lost between users and the current incumbent. I think not only is ARM a terrific platform, but I think it is a very good thing to have a viable alternative to the x86 processor.”

Major highlighted the potential for the HPC industry in pairing ARM 64-based servers with high-power GPUs, as is available in X-Gene. Sankaran echoed the point: ‘If you have a very low latency fabric across multiple nodes, that allows you to share data very effectively, which is a very important point for HPC.’ In Sankaran’s view, there is even a chance that ‘the entire data centre could be replaced by ARM.’

One point that Major stressed was that the ecosystem for ARM designs has a massive installed user base, and he cited the collaboration and sharing of information as key drivers behind Applied Micro’s decision to get involved with ARM at an early stage.

Major said: “One of the things that really makes ARM very interesting, even by the time that we got involved in 2010, was that ARM already had a significant installed user base for mobile devices and we could see the growth that was happening there and we could also see the development of an enormous developer community.”

However, ARM and IBM are not the only games in town. There are a number of other companies with equally bold solutions that would certainly be applicable to a number of situations within HPC but that do not necessarily have the same widespread adoption as IBM or ARM technologies.

Texas Instruments (TI), for instance, has been using digital signal processors (DSPs) to accelerate data-hungry applications, and NEC has designed its own mainframe system with impressive single-core data throughput and memory bandwidth. NEC has already sold several of these systems to research institutions in Japan, including Tohoku University.

The NEC mainframe, known as the SX-ACE, is the successor to the SX-9 system and delivers an order of magnitude better power efficiency than the older system. The main focus of the developers at NEC is clearly single-threaded performance, as the SX-ACE boasts what NEC claims is the world’s fastest single-core performance, with a processing speed of 64 GFlops. It also delivers a single-core memory bandwidth of 64 GBps.
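One way to read the two single-core figures quoted above together is as a ratio of memory bandwidth to peak compute, often called machine balance and expressed in bytes per flop. The short sketch below works through that arithmetic using only the 64 GFlops and 64 GBps figures from the text; the function and variable names are illustrative, and the result is a peak-rate ratio rather than a measured benchmark.

```python
# Machine balance (bytes per flop) from the single-core figures quoted above.
# This is peak-rate arithmetic, not a measurement.

PEAK_GFLOPS_PER_CORE = 64.0      # GFlops, as quoted in the article
MEM_BANDWIDTH_GB_PER_S = 64.0    # GBps (gigabytes per second), as quoted


def bytes_per_flop(bandwidth_gb_per_s, peak_gflops):
    """Bytes of memory traffic available per floating-point operation at peak."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    # The giga prefixes cancel, leaving bytes per floating-point operation.
    return bandwidth_gb_per_s / peak_gflops


balance = bytes_per_flop(MEM_BANDWIDTH_GB_PER_S, PEAK_GFLOPS_PER_CORE)
print(f"machine balance: {balance:.2f} bytes per flop")
```

A higher ratio means each core can be fed with more data per operation, which is the property that matters for the memory-bound, data-centric workloads discussed throughout this article.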

Texas Instruments

Texas Instruments has developed servers that integrate its home-grown DSP technology and IP with (or without) ARM cores in a system that can be implemented as a single rack or scaled up into a fully integrated rack.

Arnon Friedmann, business manager at Texas Instruments, said: ‘Our systems use our C66x multicore DSPs with and without integrated ARM Cortex-A15 cores.’ Examples of these systems can be found in the Proliant M800 cartridge, part of HP’s Moonshot programme, and through Prodrive, which offers processing blades based on TI’s SOCs and ARM processors.

Each blade contains six TI processing units based on the KeyStone architecture. Each processing unit is composed of four ARM Cortex-A15 cores and 24 TI C66x digital signal processor cores, amounting to almost three trillion floating point operations per second (TFlops) of compute performance per blade.

Friedmann commented that these could be used to accelerate scientific computing by offloading computational code to the DSP using OpenCL and OpenMP, much in the same way as FPGA manufacturers can use their technology. (See ‘Will OpenCL open the gates for FPGAs?’, SCW Feb/Mar 2015)

Friedmann commented that this combination of technology – specifically the HP Proliant M800 – provides a solution that is comparable to today’s HPC technology with the added benefits of faster data processing. Friedmann said: ‘This yields a system with compelling computational density for the cost and power. PayPal has made several presentations about the benefits they see from using this system.’

Friedmann concluded that the move towards data-centric computing could benefit those with technology that is already designed from its inception to deal with large volumes of data. He said: ‘There is certainly an increase in data-centric computing and we do believe this is advantageous for us, based on the results PayPal has shown for their data-centric applications.’

Licensing model aids innovation

One thing that many of these solutions have in common is the potential to work with or integrate ARM processors. The wide selection of processors provided through ARM’s partnerships offers a broad array of solutions that can be adapted for the server market and equally for the HPC industry. However, it is ARM’s traditions in mobile processing that have delivered a technology that is intrinsically linked with energy efficiency, since it was first designed for battery-powered mobile devices.

Underhill said: “Historically, the strength with ARM is in portable mobile devices that are very power conscious; that said, the ARM architecture and road map has evolved considerably over the last couple of years and we have a broad spectrum of solutions from very small, very efficient cores all the way up to multi-issue, fully out of order superscalar architectures, all built around the same compatible instruction set. So there you have that spectrum of solutions.”

He went on to explain that a lot of what makes these processors an attractive option to the mobile and embedded computing industries is now also very attractive to the server market and HPC industry. Underhill said: ‘They are looking to build higher performance machines, improve efficiency, and improve the cost of ownership. And a lot of those same variables that have made the ARM architecture attractive for portable and mobile devices are now very attractive and compelling for server and HPC communities.’

Underhill concluded: “So we’ve even seen some people explore the use of what you would consider mobile technologies for deploying them to HPC and exploring the bounds of what might be possible – leveraging technologies from a totally different domain that you may not immediately categorise as a HPC technology right now.”

One thing that has benefitted ARM, in the eyes of Jeff Underhill, is that the licensing model of the ARM architecture fosters collaboration as ARM is most successful when its silicon partners are also successful. This leads to a more open ecosystem, which has been a clear driver of ARM technology within the HPC industry.

Underhill said: “It’s really a result of keen interest from the HPC space in the ARM architecture, the progression that we’ve made, those lower power efficiency characteristics, and then fundamentally the ARM business model – allowing them to use as much or as little of the ARM IP and augment that with their own intellectual property whether that be networking, accelerators, GPGPUs, bulk encryption, bulk compression engines, storage accelerators.”

It is just this collaboration and freedom to adapt the ARM microarchitecture within the boundaries of the ARM instruction set that have attracted a number of partners who wish to develop ARM solutions. As a consequence of the work done by ARM, many of these companies have roots outside of conventional computing technologies. They bring new IP to the table, such as networking features that deliver the data throughput that data intensive applications of today require.

Applied Micro, for instance, has its roots in the telecoms industry as Major explained. He said: “The company has been around for thirty some years now. We went public in 1999. Our legacy and roots are in building high-speed, high-quality transport components for the telecoms industry and what’s happened over the years is telecoms has sort of morphed into datacom which has been morphing into datacenter which has meant that our product line has been adapting accordingly.”

Sankaran said: “The inherent advantage that we have in the compute space is because of our legacy. From the legacy we have had all of these IPs that are mainly borrowed from the communications space, we reused that in the computing area.”

It is unlikely that any one of these technologies will be able to dominate the HPC industry over the next few years. The market is generally uncertain about the future and braced for change. No one wants to be the first to stick their neck out or open their wallet in an uncertain time, but those who choose to adopt some of these technologies are helping to shape a future where energy efficiency and data intensive computing applications will be as important as floating-point performance.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
