“In order for developers to be able to focus on their application, a Vision Algorithm Designer application is included in the Intel Computer Vision SDK. This gives users a drag and drop interface that allows them to create new applications on the fly. Large and complex workflows can be modelled visually which takes the guesswork out of bringing together many different functions. In addition, customized code can be added to the workflows.”
Intel-owned Movidius has introduced a fascinating new device called the Fathom Neural Compute Stick, a modular deep learning accelerator in the form of a standard USB stick. “The Fathom Neural Compute Stick is the first of its kind: A powerful, yet surprisingly efficient Deep Learning processor embedded into a standard USB stick. The Fathom Neural Compute Stick acts as a discrete neural compute accelerator, allowing devices with a USB port run neural networks at high speed, while sipping under a single Watt of power.”
With modern processors that contain a large number of cores, to get maximum performance it is necessary to structure an application to use as many cores as possible. Explicitly developing a program to do this can take a significant amount of effort. It is important to understand the science and algorithms behind the application, and then use whatever programming techniques that are available. “Intel Threaded Building Blocks (TBB) can help tremendously in the effort to achieve very high performance for the application.”
“When designing an application that contains many threads and less cores than threads, it is important to understand what is the optimal number of threads that should be assigned to a core. This value should be parameterized, in order to easily run tests to determine which is the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”
Accelerated computing continues to gain momentum as the HPC community moves towards Exascale. Our recent Tesla P100 GPU review shows how these accelerators are opening up new worlds of performance vs. traditional CPU-based systems and even vs. NVIDIA’s previous K80 GPU product. We’ve got benchmarks, case studies, and more in the insideHPC Research Report on GPU Accelerators.
As data center sprawl is now understood to be expensive and may not deliver performance increases for all types of applications, new technologies are coming to the rescue. A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence “field-programmable”. While the use of GPUs and HPC accelerators are generally understood today, there are a number of misconceptions about FPGAs that need to be understood.
The Intel Xeon Phi processor supports different types of memory, and can organize this into three types of memory mode. The new processor from Intel contains two type of memory, MCDRAM and DDR memory. These different memory subsystems are complimentary but can be used in different ways, depending on the application that is being executed. “By using these two types of memory in the same system gives flexibility to the overall system and will show an increase in performance for almost any application.”
FPGAs will become increasing important for organizations that have a wide range of applications that can benefit from performance increases. Rather than a brute force method to increasing performance in a data center by purchasing and maintaining racks of hardware and associated costs, FPGAs may be able to equal and exceed the performance of additional servers, while reducing costs as well.
To get maximum parallelization for an application, not only must the application be developed to take advantage of multiple cores, but should also have the code in place to keep a number of threads working on each core. A modern processor architecture, such as the Intel Xeon Phi processor, can accommodate at least 4 threads for each core. “On the Intel Xeon Phi processor, each of the threads per core is known as a hyper-thread. In this architecture, all of the threads on a core progress through the pipeline simultaneously, producing results much more quickly than if just one thread was used. The processor decides which thread should progress, based on a number of factors, such as waiting for data from memory, instruction availability, and stalls.”
“Designing a new generation of hardware with such high performance needs to make sure that developers understand the basics, and are familiar with the architecture of a new system. Single thread performance with the Intel Xeon Phi processor is significantly better than previous designs. In addition, in order to speed up performance even more, vector processing, where applicable is critical in application performance. With two vector processing units (VPUs) per core, applications can execute two 512-bit vector multiply-add instructions per cycle. Each of these cores can deliver 32 double precision operations per clock cycle. The VPU executes all of the floating point operations as well as legacy instructions from SSE to AVX to the new AVX-512 instructions.”