“In the past, developers would get best results if a loop was unrolled, that is, duplicating the body as many times as needed to that the operations could be operated on using full vectors. The number of iterations would reflect the hardware that the code was targeted towards. Since the application may have to run on different hardware in the future, results for todays generation of hardware may be compromised in the future. In fact, it is better to let modern compilers to the unrolling.”
The European PRACE initiative has published a Best Practices Guide for GPU Computing. “This Best Practice Guide describes GPUs: it includes information on how to get started with programming GPUs, which cannot be used in isolation but as “accelerators” in conjunction with CPUs, and how to get good performance. Focus is given to NVIDIA GPUs, which are most widespread today.”
Today the Google Cloud Platform announced that it is the first cloud provider to offer the next generation Intel Xeon processor, codenamed Skylake. “Skylake includes Intel Advanced Vector Extensions (AVX-512), which make it ideal for scientific modeling, genomic research, 3D rendering, data analytics and engineering simulations. When compared to previous generations, Skylake’s AVX-512 doubles the floating-point performance for the heaviest calculations. In our own internal tests, it improved application performance by up to 30%.”
Intel DAAL is a high-performance library specifically optimized for big data analysis on the latest Intel platforms, including Intel Xeon®, and Intel Xeon Phi™. It provides the algorithmic building blocks for all stages in data analysis in offline, batch, streaming, and distributed processing environments. It was designed for efficient use over all the popular data platforms and APIs in use today, including MPI, Hadoop, Spark, R, MATLAB, Python, C++, and Java.
Today Allinea Software launched the first update to its well-established toolset for debugging, profiling and optimizing high performance code since being acquired by ARM in December 2016. “The V7.0 release provides new integrations for the Allinea Forge debugger and profiler and Allinea Performance Reports and will mean more efficient code development and optimization for users, especially those wishing to take software performance to new levels across Xeon Phi, CUDA and IBM Power platforms,” said Mark O’Connor, ARM Director, Product Management HPC tools.
“By implementing popular Python packages such as NumPy, SciPy, scikit-learn, to call the Intel Math Kernel Library (Intel MKL) and the Intel Data Analytics Acceleration Library (Intel DAAL), Python applications are automatically optimized to take advantage of the latest architectures. These libraries have also been optimized for multithreading through calls to the Intel Threading Building Blocks (Intel TBB) library. This means that existing Python applications will perform significantly better merely by switching to the Intel distribution.”
Applications that can take advantage of the new vectorization capabilities of the Intel Xeon Phi processor will show tremendous performance gains. “When considering vectorization, there are different tools that can assist the developer in determining where to look further. The first is to look at the optimization reports that are generated by the Intel compiler and then to also use the Vector Analyzer that can give specific advice on what to do to get more vectorization from the code.”
In-Memory Computing can accelerate traditional applications by using a memory first design. Applicable to a wide range of domains, In-Memory Computing and In-Memory Data Grids take advantage of the latest trends in computer systems technology. “In-memory computing is designed to address some of the most critical and real-time task requirements today. This include real-time fraud detection, biometrics and border security and financial risk analytics. All of these use cases require very low latency access to data from very large amounts of data, which results in faster and more accurate decisions.”
“CUDA C++ is just one of the ways you can create massively parallel applications with CUDA. It lets you use the powerful C++ programming language to develop high performance algorithms accelerated by thousands of parallel threads running on GPUs. Many developers have accelerated their computation- and bandwidth-hungry applications this way, including the libraries and frameworks that underpin the ongoing revolution in artificial intelligence known as Deep Learning.”
“Many of the libraries developed in the 70s and 80s for core linear algebra and scientific math computation, such as BLAS, LAPACK, FFT, are still in use today with C, C++, Fortran, and even Python programs. With MKL, Intel has engineered a ready-to-use, royalty-free library that implements these numerical algorithms optimized specifically to take advantage of the latest features of Intel chip architectures. Even the best compiler can’t compete with the level of performance possible from a hand-optimized library. Any application that already relies on the BLAS or LAPACK functionality will achieve better performance on Intel and compatible architectures just by downloading and re-linking with Intel MKL.”