Quantum mechanical methods have long been used to solve many scientific problems, and they have advanced as computing power has increased. Molecular processes can now be studied with higher-order many-body methods. New and considerably more accurate algorithms for studying chemical reactivity, molecular properties, and the interaction of light with matter have been developed to take advantage of the increasing power of computer systems.
Hybrid architectures that pair the main CPUs with coprocessors have reduced the time to solution. Applications can be tuned to use the Intel Xeon processor and the Intel Xeon Phi coprocessor simultaneously, rather than being modified to run only on the coprocessor. With the help of several software tools from Intel, a coupled cluster method can be shown to achieve a large performance gain with excellent scaling.
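To make the idea of using both processor types at once more concrete, the following is a minimal sketch and not NWChem's actual scheduling. It assumes a pool of independent tasks and a hypothetical throughput for each device. The function names `split_tasks` and `estimated_time` and all numbers are illustrative assumptions, not measurements from the source. Tasks are divided in proportion to throughput so that both devices finish at about the same time.

```python
def split_tasks(num_tasks, host_rate, coproc_rate):
    """Divide num_tasks between host and coprocessor in proportion to
    their (hypothetical) throughputs, given in tasks per second."""
    if num_tasks < 0 or host_rate <= 0 or coproc_rate <= 0:
        raise ValueError("task count must be non-negative and rates positive")
    host_share = round(num_tasks * host_rate / (host_rate + coproc_rate))
    return host_share, num_tasks - host_share


def estimated_time(host_tasks, coproc_tasks, host_rate, coproc_rate):
    """Both devices run concurrently, so the slower side sets the total."""
    return max(host_tasks / host_rate, coproc_tasks / coproc_rate)


if __name__ == "__main__":
    # Illustrative numbers only: the coprocessor is assumed 1.5x faster.
    host, coproc = split_tasks(1000, host_rate=2.0, coproc_rate=3.0)
    print(host, coproc)                            # 400 600
    print(estimated_time(host, coproc, 2.0, 3.0))  # 200.0
```

Under these assumed rates, running all 1000 tasks on the host alone would take 500 seconds, so the balanced split is what lets the combined run beat either device on its own.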
The NWChem software was developed about 20 years ago. It was designed to scale to a large number of processing units and has a modular framework so that new methods can be incorporated. NWChem uses the Global Arrays (GA) toolkit for the bulk of its communication. The GA toolkit is a library designed for applications that work with large, dense arrays. Its performance depends on how well the Aggregate Remote Memory Copy Interface (ARMCI) has been implemented.
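The idea behind a toolkit for large, dense distributed arrays can be illustrated with a toy model. The sketch below is not the GA API; it is a single-process stand-in with a hypothetical class name, `ToyGlobalArray`, that assumes a one-dimensional array split into contiguous blocks. It shows how a global index can be mapped to an owning process and a local offset, which is the bookkeeping that one-sided get and put operations rely on.

```python
class ToyGlobalArray:
    """Single-process stand-in for a block-distributed 1-D dense array.

    Each simulated process owns one contiguous block; get/put locate the
    owner from the global index. This is NOT the Global Arrays API.
    """

    def __init__(self, length, num_procs):
        if length <= 0 or num_procs <= 0:
            raise ValueError("length and num_procs must be positive")
        self.length = length
        # Ceiling division so that every element has an owner.
        self.block = -(-length // num_procs)
        self.local = []
        for p in range(num_procs):
            start = min(p * self.block, length)
            stop = min(start + self.block, length)
            self.local.append([0.0] * (stop - start))

    def owner(self, i):
        """Return (owning process, offset inside that process's block)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.block)

    def put(self, i, value):
        p, off = self.owner(i)
        self.local[p][off] = value

    def get(self, i):
        p, off = self.owner(i)
        return self.local[p][off]
```

For example, with `ga = ToyGlobalArray(10, 4)` the block size is 3, so `ga.owner(7)` returns `(2, 1)`. In a real distributed setting the lookup would be followed by a remote memory copy, which is the step that ARMCI provides and the reason its implementation quality matters.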
Offloading to the Intel Xeon Phi was considered for one of the main parts of the application, because it contains several highly parallel kernels that are dominated by floating-point calculations. To determine which parts of the application are candidates for offload to the coprocessor, the hotspots must be identified by running a representative benchmark. The benchmark data set needs to be small enough to be manageable, yet still give a good overall signature of longer runs. Intel VTune Amplifier XE was used for this purpose. Once the target portion of the application has been identified as a candidate for offload, the next step is to determine how best to use the Intel Xeon Phi coprocessor.
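The hotspot step described above can be sketched in miniature. The example below does not use VTune Amplifier XE; as an analogy, it uses the profiler from Python's standard library on a small stand-in benchmark. The functions `dense_kernel`, `bookkeeping`, `benchmark`, and `find_hotspot` are hypothetical names introduced only for illustration.

```python
import cProfile
import pstats


def dense_kernel(n):
    """Stand-in for a floating-point-heavy kernel (a naive triple loop)."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += (i * j + k) * 1e-6
    return total


def bookkeeping(n):
    """Stand-in for cheap setup work."""
    return sum(range(n))


def benchmark():
    """Small but representative run: a little setup, then the kernel."""
    bookkeeping(100)
    return dense_kernel(50)


def find_hotspot(func):
    """Profile func and return the name of the function that spent the
    most time in its own body (callees excluded)."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    key = max(stats.stats, key=lambda k: stats.stats[k][2])
    return key[2]


if __name__ == "__main__":
    print(find_hotspot(benchmark))
```

Here the triple loop dominates the run time, so it is the function reported as the hotspot, and in the terms of the text above it would be the candidate for offload.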
After the overall structure of the application was modified and the kernels were tuned, the benchmarks showed a large increase in performance. On a large cluster with 7,360 Intel Xeon cores, and a total of over 62,500 cores in the heterogeneous runs, performance scaled very well. Using well-known programming environments such as Fortran and OpenMP with an application such as NWChem has been shown to reduce total application time, especially when the Intel Xeon cores are used together with offloading certain parts of the application to the Intel Xeon Phi coprocessor cores.
Source: Pacific Northwest National Laboratory USA, Intel USA and Intel Germany.