In this special guest feature, Robert Roe from Scientific Computing World investigates the use of technologies in HPC that could help shape the design of future supercomputers.
In addition to the normal annual progression of HPC technology, 2017 will see the introduction of many technologies that will help shape the future of HPC systems. Production-scale ARM supercomputers, advancements in memory and storage technology such as DDN’s Infinite Memory Engine (IME), and much wider adoption of accelerator technologies from Nvidia, Intel and FPGA manufacturers such as Xilinx and Altera are all helping to define the supercomputers of tomorrow.
While the future of HPC will involve myriad different technologies, one thing is clear – the future of HPC involves a considerably greater degree of parallelism than we are seeing today. A large driver for this increased parallelism is the use of increasingly parallel processors and accelerator technologies such as Intel’s Knights Landing and Nvidia’s GPUs.
However, in the opinion of OCF’s HPC and storage business development manager David Yip, the lines between accelerator and CPU are blurring. This requires HPC developers to change their mindset when evaluating the kind of technology they want in their HPC systems.
Yip highlighted that, in the past, there have been other accelerator technologies. However, they were not realised at the correct time, or they failed to reach the critical mass needed for widespread adoption – so the technology was slowly surpassed by rivals. “What hampered their take-up 5 to 10 years ago was the software, and how hard it was to program these very complicated systems. Our brains do not naturally think about the kind of extreme parallelism we are now seeing in HPC,” stated Yip.
This increasing number of computing cores, either through CPU or GPU, also impacts the way that HPC programmers must manage and develop their applications. As larger and larger numbers of small processing elements become commonplace, HPC developers need to address this in the way that they program algorithms and map them onto the hardware available.
“Over the last 10 years we have had to get used to MPI programming on compute cores, and then we have seen the gradual adoption of GPUs,” said Yip. “If you think about it, we have had to do a phase shift in our thinking because our compute cores have 128GB or 256GB of memory in the compute node, whereas the graphics card itself has only got 6GB or 12GB.”
“This throws up a new challenge for HPC developers: how can they get all of this data across into the GPU or accelerator without causing major bottlenecks for the system? It is a paradigm shift in terms of programming because we are talking about going from coarse grain parallelism into ultra-fine grain within the GPU on the same node,” explained Yip.
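The memory gap Yip describes can be made concrete with a little arithmetic: a dataset that fills a 128GB compute node has to be staged through a much smaller accelerator memory in chunks, with part of the device memory kept free for working buffers. A minimal sketch in plain Python – the sizes are those quoted in the article, but the chunking scheme itself is an illustrative assumption, not any vendor’s API:

```python
def plan_chunks(host_bytes, device_bytes, reserve_fraction=0.5):
    """Plan how many staging transfers are needed to stream a host-resident
    dataset through a smaller accelerator memory.

    reserve_fraction: share of device memory kept free for working buffers
    (e.g. for double-buffering, so the next chunk can be copied in while
    the current one is still being processed).
    """
    chunk_bytes = int(device_bytes * reserve_fraction)
    # Ceiling division: the last chunk may be only partially filled.
    n_chunks = -(-host_bytes // chunk_bytes)
    return chunk_bytes, n_chunks

GB = 2**30
# Sizes quoted in the article: 128GB of node memory versus a 6GB GPU.
chunk, n = plan_chunks(128 * GB, 6 * GB)
```

Even in this simplified model the dataset has to cross the host-to-device link in dozens of pieces, which is why overlapping transfers with compute matters so much for avoiding the bottlenecks Yip mentions.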
New processing technologies for HPC
In 2016 it was announced that the follow-up to RIKEN’s flagship K supercomputer will feature ARM processors that will be delivered by Fujitsu, which is managing the development of the ‘Post-K’ supercomputer.
Toshiyuki Shimizu – vice president of the System Development Division, Next Generation Technical Computing Unit – explained that, in order to reach exascale systems early, the HPC industry needs to address certain challenges in addition to developing new processing technologies. “Generally speaking, budget limitations and power consumption are serious issues to reach exascale,” stated Shimizu. “Advantages found in the latest silicon technologies and accelerators for domain specific computing will ease these difficulties, somewhat.”
In regards to the Post-K system, Shimizu commented: “One major enhancement on the Post-K that is currently published is a wider SIMD, at 512 bits. With the Post-K, Fujitsu chose to stay with a general-purpose architecture for compatibility reasons and to support a wide variety of application areas.” Shimizu stressed that these are important features for this flagship Japanese supercomputer. “In terms of reaching exascale, we need to conduct research projects to discover new architectures, and research and development, to support specific applications.”
In addition to development of the ‘Post-K’ computer, Shimizu commented that “interconnect technology and system software will become more important, as well as the design of CPUs.” He also mentioned that, for many applications, node scalability and efficiency will also be critical to system performance.
The future is parallel
Today the clearest path to exascale is through the use of increasingly parallel computing architectures. Partly this is due to the energy savings of using large numbers of low-power, energy-efficient processors – but also to the performance offered by accelerators such as GPUs and Intel’s ‘Knights Landing’.
Accelerators have continued to grow in popularity in recent years – but one company, Nvidia, has made significant progress when compared to its rivals.
According to Yip this is due to the promotional efforts of Nvidia, as they have not only raised awareness of GPU technology, but also spent considerable resources ensuring that as many people as possible have access to this technology through education, training and partnerships with research institutes and universities.
“Nvidia has done the most fantastic job over the last 10 years,” said Yip. “What Nvidia has done is get cards out there to people; they have given training, running workshops and education on all of their platforms. They have blanketed the market with their technology and that is what it takes – because, if you think about it, we are only just brushing the surface of what is possible with GPU technology.”
Towards the end of 2016 Nvidia announced that the second generation of its proprietary interconnect technology, NVLink 2.0, would be available in 2017. NVLink provides a 160 GB/s link between GPU and Power 9 CPU. For the second iteration this will be increased to 200 GB/s, drastically increasing the potential for data movement inside an HPC node.
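A rough idealised calculation shows why that link bandwidth matters: filling a 12GB GPU memory at the quoted NVLink rates takes a small fraction of the time it would over a conventional PCIe 3.0 x16 link (the ~16 GB/s PCIe baseline is an assumption for comparison; it does not appear in the article, and real transfers add latency and protocol overhead on top of these peak figures):

```python
def transfer_seconds(payload_gb, link_gb_per_s):
    """Idealised time to move a payload over a link at its peak rate.
    Real transfers are slower due to latency and protocol overhead."""
    return payload_gb / link_gb_per_s

payload = 12  # GB: the larger GPU memory size quoted earlier in the article

t_pcie = transfer_seconds(payload, 16)      # assumed PCIe 3.0 x16 peak (~16 GB/s)
t_nvlink = transfer_seconds(payload, 160)   # NVLink figure quoted in the article
t_nvlink2 = transfer_seconds(payload, 200)  # NVLink 2.0 figure quoted in the article
```

Under these ideal assumptions the same payload moves in tens of milliseconds rather than most of a second, which is the sense in which faster links change what is practical to offload.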
However, Yip was quick to point out that it is not just the highest-performing technology that will see widespread adoption by the HPC industry. “It all takes time; there is a lot of research that needs to be done to extract the most performance out of these systems,” said Yip. “It’s not just hardware, it’s the software – and we also need education for all the people that want to take advantage of these systems.”
Securing the future of storage
One challenge created by the increasingly parallel nature of processor architectures is the growing number of threads, or sets of instructions, that need to be carried out by a cluster. As the number of processors and threads increases, so must the performance of the data storage system, as it must feed data into each processing element. In the past it was flops or memory bandwidth that would limit HPC applications – but, as parallelism increases, input/output (I/O) performance becomes increasingly important to sustained application performance.
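The scale of this concurrency problem is easy to sketch: aggregate I/O demand grows multiplicatively with the number of nodes and the threads per node. A back-of-the-envelope model (all figures here are illustrative assumptions, not measurements from the article):

```python
def aggregate_io_demand(nodes, threads_per_node, mb_per_s_per_thread):
    """Worst-case storage load if every thread streams data concurrently:
    returns (number of concurrent streams, required bandwidth in GB/s)."""
    streams = nodes * threads_per_node
    return streams, streams * mb_per_s_per_thread / 1000  # MB/s -> GB/s

# Illustrative cluster: 1,000 nodes with 64 hardware threads each,
# every thread streaming a modest 10MB/s.
streams, gb_per_s = aggregate_io_demand(1000, 64, 10)
```

Even with a modest per-thread rate, the file system faces tens of thousands of concurrent streams and hundreds of GB/s of aggregate demand – a pattern that traditional parallel file systems, designed around far fewer, larger streams, struggle to absorb.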
To solve this issue, storage companies are developing several new methods and technologies for data storage. These range from the use of flash memory for caching and in-data processing, to improvements in the way that parallel file systems handle complex I/O.
Storage provider DDN has been working on its own solution to tricky I/O challenges with its IME appliance – which, at its most basic level, is a scale-out native high performance cache.
Robert Triendl, DDN senior vice president of global sales, marketing and field services explained that IME is taking advantage of the convergence of important factors: flash memory technology such as 3D NAND and 3D XPoint, the decreasing cost of flash, and “a strong demand from the supercomputing community around doing something different for exascale.” “We see huge technological advances, which we need to take advantage of. The economics are right in terms of implementing flash at large scale, and there is obviously significant demand for doing I/O in a different way,” he said.
One of the biggest challenges facing the development of storage technology in HPC is that increasing parallelisation has introduced large numbers of processors with growing numbers of cores. This creates a problem of concurrency for I/O storage systems, as each processor or thread needs its own stream of data. If data does not reach the processing elements quickly enough, application performance will deteriorate.
Triendl went on to explain the main challenges facing HPC storage, and how DDN is developing in-house technologies in order to overcome these challenges. “One is concurrency, the number of threads in a supercomputer today, and that is driven not just by compute nodes but GPUs, Xeon Phi and Knights Landing. File systems – regardless of whether they are built on top of flash or hard drives – have limitations that are in-built into developers’ thinking so application developers think, for example, that they have certain limitations for certain types of I/O.”
IME is designed to fundamentally change the way that I/O is handled by the storage system. “In addition to providing a flash cache to speed up data movement and storage, it can also provide a fairly large-scale, high-performance flash cache, so you can have an efficient way of implementing flash at scale in front of a parallel file system.”
The system aims to reduce some of the impact of I/O, particularly for application developers who would sometimes need to limit the number of I/O threads to avoid bottlenecking performance of the application. “With IME developers don’t have to worry about having a single shared file, random or strided I/O,” said Triendl. “Many of the patterns that were effectively banned from implementation are not things that developers need to worry about when using IME.” However, Triendl was keen to stress that it is not just applications bound by complex I/O that can benefit from the use of IME. He explained that, if a developer does have an application that is running fine on a parallel file system, it can usually be accelerated to some extent purely because IME can handle concurrency better than a traditional file system.
“Essentially IME does not care; it does not object to this very high concurrency, partly because it is based on flash and partly because we are managing the I/O in a very different way from a file system. We do not deal with the same problems that a file system would need to deal with,” concluded Triendl.
ARM arrives as a technology for HPC
In addition to the announcement regarding the RIKEN system, another ARM-based supercomputer has recently been announced through a collaboration between the UK universities of Bath, Bristol, Exeter and Cardiff – collectively known as the Great Western Four (GW4) – in partnership with the UK Met Office and supercomputer manufacturer Cray.
This system, which will be called Isambard (after the legendary engineer Isambard Kingdom Brunel) and is funded through a £3 million grant from the EPSRC, will act as a tier-2 system within the UK’s HPC infrastructure. A tier-2 system is designed to test and develop applications on new or emerging hardware, helping to inform the UK and the wider HPC industry about the potential of these new technologies.
Simon McIntosh-Smith, head of the Microelectronics Group, who is heading up the development of Isambard for GW4, explained that this project is focused on establishing just how comparable this ARM system will be to an Intel- or IBM-based supercomputer.
“What I wanted to do was move this on to the next phase,” said McIntosh-Smith. “We now need to try this for real, in a real production environment, so we can get some data that is not just created in the lab but derived from real applications as part of a genuine HPC service.”
“We are very interested in key metrics. How does it perform like-for-like against all the other alternatives, Intel CPUs, Knights Landing, the Xeon Phi architectures, or the latest GPUs from Nvidia?”
In order to collect these key statistics, it is important for the GW4 team to try and limit the variation introduced when normally comparing different computing architectures. The problem here is that each architecture has its own dedicated software tools and compilers, so any comparison would need to take this into account.
To limit this, McIntosh-Smith and his colleagues plan to use Cray’s software tools – as, once they have been optimised for ARM, they will provide the closest thing to a vendor-agnostic platform. They also hope to increase the accuracy of the comparison by including small partitions of other computing architectures.
“They will all be running the Cray software stack and they will all be connected to the same InfiniBand interconnect,” stated McIntosh-Smith. “This is important because it allows us to do an apples-to-apples comparison. We can use the same compilers that compile for ARM or Intel, run the codes with the same interconnect and hopefully we can just compare the different performance figures.”
While ARM is an exciting prospect in HPC today, it has taken a lot of work to adapt the technology originally developed for embedded electronics and mobile phones for full-scale supercomputing applications.
McIntosh-Smith explained that, in his opinion, the ARM HPC ecosystem has only just reached the maturity required for full-scale production testing: “The one important part that was missing was ARM-based CPUs that were in the same performance ballpark as state-of-the-art CPUs such as the mainstream x86. This year I think it is ready. The one thing that is missing then is people who have real data – that is exactly what Isambard is about. We are not just evaluating it, we are helping to make it happen. Projects like Isambard act as a catalyst for future development,” McIntosh-Smith concluded.