In this special guest feature from Scientific Computing World, Robert Roe explains how OpenCL may make FPGAs an attractive option.
In January 2015, FPGAs took a step closer to the mainstream of high-performance computing with the announcement that Xilinx’s development environment for systems and software engineers, SDAccel, had been certified as conforming to the OpenCL standard for parallel programming of heterogeneous systems.
Over the past few years, high-performance computing (HPC) has become used to heterogeneous hardware, principally mixing GPUs and CPUs, but now, with both major FPGA manufacturers in conformance with the OpenCL standard, the door is effectively open for the wider use of FPGAs in high-performance computing.
The changing landscape of HPC, with the move towards data-centric computing, could favour FPGAs with very high I/O throughput. However, it remains to be seen if FPGAs will be used as accelerators or if supercomputers might be built using FPGAs as the main processor technology.
One of the attractions of FPGAs is that they consume very little power but, as with GPUs initially, the barrier to adoption has been the difficulty of programming them. Manufacturers and vendors are now releasing compilers that will optimise code written in C and C++ to make use of the flexible nature of FPGA architecture.
Easier to program
Mike Strickland, director of the computer and storage business unit at Altera said: “The problem was that we did not have the ease of use, we did not have a software-friendly interface back in 2008. The huge enabler here has been OpenCL.”
Larry Getman, VP of strategic marketing and planning at Xilinx said: ‘When FPGAs first started they could do very basic things such as Boolean algebra and it was really used for glue logic. Over the years, FPGAs have really advanced and evolved with more hardened structures which are much more specialised.’
Getman continued: ‘Over the years FPGAs have gone from being glue logic to harder things like radio head systems, that do a lot of DSP processing; very high-performance vision applications; wireless radio; medical equipment; and radar systems. So they are used in high-performance computing, but for applications that use very specialised algorithms.’
Getman concluded: ‘The reason people use FPGAs for these applications is simple: they offer a much higher level of performance per watt than trying to run the same application in pure software code.’
FPGAs are programmable semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected through programmable interconnects. This is where the FPGA gets the term ‘field programmable’ as an FPGA can be programmed and optimised for a specific application. Efficient programming can take advantage of the inherent parallelism of the FPGA architecture delivering a higher level of performance than accelerators that have a less flexible architecture.
Millions of threads running at the same time
Devadas Varma, senior director of software Research and Development at Xilinx said: ‘A CPU, if it is single core CPU, executes one instruction at a time and if you have four cores, eight cores, that are multithreaded then you can do eight or sixteen sets of instructions, for example. If you compare this to an FPGA, which is a blank set of millions of components that you decide to interconnect, theoretically speaking you could have thousands or even millions of threads running at the same time.’
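The style of programming Varma describes can be sketched in a few lines. The example below is plain Python that imitates the OpenCL 'work-item' idea, in which the same small function is applied to every index of the data. It is not OpenCL itself, and the names `vadd_work_item` and `run_kernel` are invented for illustration.

```python
# Illustrative only: plain Python imitating the OpenCL "work-item" model.
# The names vadd_work_item and run_kernel are hypothetical, not an OpenCL API.

def vadd_work_item(gid, a, b, out):
    """Body executed by one work-item; gid plays the role of a global ID."""
    out[gid] = a[gid] + b[gid]


def run_kernel(work_item, global_size, *args):
    """Run every work-item once. Here they run one after another; because no
    work-item reads another's output, hardware is free to run them together."""
    for gid in range(global_size):
        work_item(gid, *args)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
run_kernel(vadd_work_item, len(a), a, b, out)
print(out)  # [11, 22, 33, 44]
```

The loop in `run_kernel` is only a stand-in. On a parallel device, independent work-items like these are what could be spread across many hardware units at once.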
Reuven Weintraub, founder and chief technology officer at Gidel, highlighted the differences between FPGAs and the processors used in CPUs today. He said: ‘They are the same and they are different. They are the same from the perspective that both of them are programmable. The difference is coming from the fact that in the FPGA all the instructions would run in parallel. Actually the FPGA is not a processor; it is compiled to be a dedicated set of hardware components according to the requirements of the algorithm – that is what gives it the efficiency, power savings and so on.’
Traditionally this power efficiency, scalability, and flexible architecture came at the price of more complex programming: code needed to address the hardware and the flow of data across the various components, in addition to providing the basic instruction set to be computed in the logic blocks. However, major FPGA manufacturers Altera and Xilinx have both been working on their own OpenCL based solutions which have the potential to make FPGA acceleration a real possibility for more general HPC workloads.
Development toolkits
Xilinx has recently released SDAccel, a development environment that comprises its own compiler and tools for code development, profiling, and debugging, and that provides a GPU-like work environment. Getman said: ‘Our goal is to make an FPGA as easy to program as a GPU. SDAccel, which is OpenCL based, does allow people to program in OpenCL and C or C++ and they can now target the FPGA at a very high level.’
In addition, SDAccel provides functionality to swap multiple kernels in and out of the FPGA without disrupting the interface between the server CPU and the FPGA. This could be a key enabler of FPGAs in real-world data centres where turning off some of your resources while you re-optimise them for the next application is not an economically viable strategy at present.
Altera has been working closely with the Khronos group, which oversees a number of open computing standards including OpenCL, OpenGL, and WebGL. Altera released a development toolkit, Altera’s SDK for OpenCL, in May 2013. Strickland said: ‘In May 2013 we achieved a very important conformance test with the standards body – the Khronos group – that manages OpenCL. We had to pass 8,000 tests and that really strengthened the credibility of what we are doing with the FPGA.’
Strickland continued: ‘In the past, there were a lot of FPGA compiler tools that took care of the logic but not the data management. They could take lines of C and automatically generate lines of RTL but they did not take care of how that data would come from the CPU, the optimisation of external memory bandwidth off the FPGA, and that is a large amount of the work.’
Traditionally, optimising algorithms to make full use of the parallel architecture of FPGAs required significant experience with hardware description languages (HDLs), because these allowed programmers to write code that addressed the FPGA at register-transfer level (RTL).
RTL enables programmers to describe the flow of data between hardware registers, and the logical operations performed on that data. This is typically what creates the difference in performance between more general processors and FPGAs, which can be optimised much more efficiently for a specific algorithm.
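As a rough picture of what 'register-transfer' means, the sketch below models two registers in software that are clocked together. It is written in Python rather than an HDL, and the multiply-then-accumulate pipeline is a made-up example, not taken from any vendor's tools.

```python
# Illustrative only: a software model of two registers clocked together.
# The pipeline (multiply, then accumulate) is a made-up example.

def simulate_mac(pairs):
    """Feed (a, b) pairs into a two-stage multiply-accumulate pipeline.

    On every clock tick both registers load their next values at once,
    each computed from the values held before the tick."""
    product_reg = 0   # stage 1 register: holds a * b
    acc_reg = 0       # stage 2 register: running sum of products
    # One extra tick with a zero input drains the last product into acc_reg.
    for a, b in list(pairs) + [(0, 0)]:
        next_product = a * b                # combinational logic, stage 1
        next_acc = acc_reg + product_reg    # combinational logic, stage 2
        product_reg, acc_reg = next_product, next_acc  # clock edge
    return acc_reg


print(simulate_mac([(1, 2), (3, 4), (5, 6)]))  # 2 + 12 + 30 = 44
```

While stage 2 is adding one product, stage 1 is already multiplying the next pair. An RTL description spells out this kind of overlap explicitly.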
The difficulty is that this kind of coding requires expertise and can be very time consuming. Hand-coded RTL may go through several iterations as programmers test the most efficient ways to parallelise the instruction set to take advantage of the programmable hardware on the FPGA.
Strickland said: “With OpenCL or the OpenCL compiler, you still write something that is like C code that targets the FPGA. The big difference I would say is the instruction set. The big innovation has been the back end of our compiler which can now take that C code and efficiently use the FPGA.”
Strickland noted that Altera’s compiler ‘does more than 200 optimisations when you write some C code. It is doing things like seeing the order in which you access memory so that it can group memory addresses together, improving the efficiency of that memory interface.’
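To see why the order of memory accesses matters, consider a simplified model in which memory is fetched in fixed-size bursts. The Python sketch below counts how many bursts a given access pattern touches. The burst size of eight words is an arbitrary assumption, and this is not a description of how Altera's compiler works internally.

```python
# Illustrative only: counts how many fixed-size memory bursts an access
# pattern touches. burst_words=8 is an arbitrary assumption, and this is
# not a description of Altera's compiler.

def bursts_needed(addresses, burst_words=8):
    """Number of distinct aligned bursts covering the given word addresses."""
    return len({addr // burst_words for addr in addresses})


sequential = list(range(16))            # words 0..15, neighbours grouped
strided = [i * 8 for i in range(16)]    # one word from each burst

print(bursts_needed(sequential))  # 2
print(bursts_needed(strided))     # 16
```

Both patterns read 16 words, but in this model the strided one needs eight times as many bursts, which is the kind of waste that grouping addresses together is meant to avoid.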
Converting code from different languages into an RTL description has been possible for some time, but these developments in OpenCL make it much easier for programmers without extensive knowledge of HDLs, such as VHDL and Verilog, to make use of FPGAs.
However, OpenCL is not the final piece of the puzzle for FPGA programming. Strickland said: ‘Over time you may want to have other high-level interfaces. There is a standard called SPIR (Standard Portable Intermediate Representation). The idea is that this allows you to kind of split up your compiler between the front end and the back end, enabling people to use different high-level language interfaces on the front end.’
Strickland continued: ‘In universities now there is research into domain-specific languages, so people who are trying to accomplish a certain class of algorithms may benefit from having a higher level interface than even C. The idea behind exposing this intermediate compiler interface is you can now start working with the ecosystem to have front ends with higher-level interfaces.’
Over the past few years, there have been two competing approaches to programming FPGAs: high-level synthesis (HLS) and OpenCL. As OpenCL has matured, Xilinx decided to adopt the standard but to keep the work it had done developing HLS technology and integrate that into the development environment conforming to the OpenCL standard.
Getman said: “The main problem is that C is very much designed to go cycle to cycle, step by step. Unfortunately hardware doesn’t. Hardware has a lot of things running at the same time.” This aspect was what made HLS attractive as a compiler that can take OpenCL, C or C++ and architecturally optimise it for the FPGA hardware.
Xilinx acquired AutoESL and its HLS tool AutoPilot in 2011 and began integrating it into its own development tools for FPGAs. Getman said: ‘That was really the big switching point. For many years, people had been promising really great results with HLS but in reality the results were a lot bigger and a lot slower than what could have been done by hand.’
Getman continued: ‘We have integrated this technology into our tools and added a lot to it. This is really one of the big differentiators from our competition, even though we both have OpenCL support. This technology allows our users the opportunity to create their own libraries in real-time using C, C++ or OpenCL, rather than have to wait for the vendor to create specific libraries or specific algorithms for them.’
Varma said: “The silver bullet in HLS is the ability to take a sequential description that has been written in C and then find this parallelism, the concurrencies, without the user having to think. That was a necessary technology before we could do anything. It has been adopted by thousands of users already as a standalone technology, but what we do is embed that technology inside OpenCL compilers so that now it can be utilised in full software mode and it is fully compatible with OpenCL.”
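A small example gives a feel for the kind of concurrency that can hide inside a sequential description. The Python sketch below compares a running total, where every addition waits on the previous one, with a pairwise 'tree' sum in which all additions at the same level are independent. It is an illustration of the general idea only and is not the algorithm used by Xilinx's HLS tools.

```python
# Illustrative only: compares the dependent steps in a sequential sum with
# the levels of a pairwise (tree) reduction. Not Xilinx's algorithm.

def sequential_steps(n):
    """A running total needs one dependent addition per extra element."""
    return max(n - 1, 0)


def tree_sum(values):
    """Sum by pairing neighbours; return (total, number of levels).

    All additions within one level are independent of each other."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element carries over unchanged
            paired.append(level[-1])
        level = paired
        levels += 1
    return (level[0] if level else 0), levels


total, levels = tree_sum(range(1, 9))   # 1 + 2 + ... + 8
print(total, levels, sequential_steps(8))  # 36 3 7
```

For eight values the sequential version has seven dependent steps, while the tree has three levels. Hardware that can perform every addition in a level at the same time would, in principle, finish in three steps.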
Getman said: “We consciously made a switch over the last few years to expand our customer base by both continuing technology development for our traditional users as well as expand our tool flow to cater to software coders.”
A key facet of this technology is that Xilinx is letting programmers take the work they have done in C and port it over to OpenCL using the technology from HLS that is now integrated into its compilers. Varma said: ‘One thing that changes when you go from software to hardware programming is that C programmers, OpenCL programmers, are used to dealing with a lot of libraries. They do not have to write matrix multiplications or filters or those kinds of things, because they are always available as library elements. Now hardware languages often have libraries, but they are very specific implementations that you cannot just change for your use.’
Varma concluded: “By writing in C, our HLS technology can re-compile that very efficiently and immediately. This gives you a tremendous capability.”
Coprocessor or something bigger?
FPGA manufacturers like Altera and Xilinx have been focusing their attention on using FPGAs in HPC as coprocessors or accelerators that would be used in much the same way as GPUs.
Getman said: “The biggest use model is really processor plus FPGA. The reason for that is there are still things that you want to run on a processor. You really want a processor to do what it is good at. Typically an FPGA will be used through something like a PCIe slot and it will be used as an acceleration engine for the things that are really difficult for the processor.”
This view was shared by Devadas Varma who highlighted some of the functionality in an earlier release of OpenCL that increased the potential for CPU/GPU/FPGA synergy.
Varma said: ‘The tool we have developed supports OpenCL 1.2 and importantly it can co-exist with CPUs and GPUs. In fact in our upcoming release we will support partitioning workloads into GPUs, we already support this feature regarding CPUs. That is definitely where we are heading.’
However, this was not a view shared by Reuven Weintraub, at Gidel, who felt that to regard an FPGA simply as a coprocessor was to miss much of the point and many of the advantages that FPGAs could offer to computing. Weintraub said: “For me a coprocessor is like the H87 was, you make certain lines of code in the processor and then you say ‘there’s a line of code for you’ and it returns and this goes back and forth. The big advantage of running with the FPGA is that the FPGA can have a lot of pipelining inside of it, solve a lot of things and have a lot of memory.”
He explained that an FPGA contains a ‘huge array of registers that are immediately available’. By taking advantage of the on-board memory and the high throughput that FPGAs can handle, he said, ‘you do not necessarily have to use the cache because the data is being moved in and out in the correct order.’
Weintraub concluded: “Therefore it is better to give a task to the FPGA rather than giving just a few opcodes and then going back and forth. It is more task oriented. Computing is a balance between the processing, memory access, networking and storage, but everything has to be balanced. If you want to utilise a good FPGA then you need to give it a task that makes use of its internal memory so that it can move things from one job to another.”
Gidel has considerable experience in this field. Gidel provided the FPGAs for the Novo-G supercomputer, housed at the University of Florida, the largest re-configurable supercomputer available for research.
The university is a lead partner in the ‘Center for High-Performance Reconfigurable Computing’ (CHREC), a US national research centre funded by the National Science Foundation.
In development at the UF site since 2009, Novo-G features 192, 40nm FPGAs (Altera Stratix-IV E530) and 192, 65nm FPGAs (Stratix-III E260).
These 384 FPGAs are housed in 96 quad-FPGA boards (Gidel ProcStar-IV and ProcStar-III) and supported by quad-core Nehalem Xeon processors, GTX-480 GPUs, 20Gb/s non-blocking InfiniBand, GigE, and approximately 3TB of total RAM, most of it directly attached to the FPGAs. An upgrade is underway to add 32 top-end, 28nm FPGAs (Stratix-V GSD8) to the system.
According to the article ‘Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing’ written by Alan George, Herman Lam, and Greg Stitt, three researchers from the university, Novo-G achieved speeds rivaling the largest conventional supercomputers in existence – yet at a fraction of their size, energy, and cost.
But although processing speed and energy efficiency were important, they concluded that the principal impact of a reconfigurable supercomputer like Novo-G was the freedom that its innovative design can give to scientists to conduct more types of analysis, and examine larger datasets.
The potential is there.
This story appears here as part of a cross-publishing agreement with Scientific Computing World.