Better Concurrency with HBM

Ocean modeling is algorithmically complex and can require significant computing power to reach a simulation endpoint. However, many of the calculations in a simulation can be distributed across a cluster of systems and can also run on shared memory systems. The HIROMB-BOOS-Model (HBM) is an example of an application that has been tuned for better concurrency on modern computer architectures.

The HBM code can be run in serial mode, with MPI, with OpenMP, or with a combination of MPI and OpenMP. The mode can be selected at run time, depending on the systems used. The Danish Meteorological Institute (DMI) uses HBM and relies on a short time to solution for its forecasts. The DMI provides a number of meteorological services, including forecasts and warnings for an area around Denmark, covering both weather and water levels.

Analyzing where the application spent its time showed that data proximity was of critical importance. Location, location, location. Because the grid contains both active and inactive compute points, it became important to focus on just the points in the grid where computation takes place.
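The idea of working only on active grid points can be sketched in C as follows. This is a minimal illustration, not HBM's actual data structure (HBM itself is not shown in this article): the flat `mask` array and the function names `build_active_index`, `gather_active`, and `scale_packed` are all hypothetical. It assumes the grid is stored as a flattened array in which a nonzero mask value marks an active point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: mask[i] != 0 marks an active point of a
   flattened grid with n points. Not HBM's real data structure. */

/* Record the positions of all active points in active_idx, which must
   have room for n entries. Returns the number of active points. */
size_t build_active_index(const unsigned char *mask, size_t n,
                          size_t *active_idx)
{
    size_t n_active = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            active_idx[n_active++] = i;
        }
    }
    assert(n_active <= n);
    return n_active;
}

/* Copy the active values into a dense array so that later compute
   loops read contiguous memory and skip inactive points entirely. */
void gather_active(const double *field, const size_t *active_idx,
                   size_t n_active, double *packed)
{
    for (size_t k = 0; k < n_active; ++k) {
        packed[k] = field[active_idx[k]];
    }
}

/* Example kernel on the packed data: scale every active value. */
void scale_packed(double *packed, size_t n_active, double factor)
{
    for (size_t k = 0; k < n_active; ++k) {
        packed[k] *= factor;
    }
}
```

With this kind of layout, the index is built once and every subsequent kernel iterates over `n_active` contiguous values instead of testing the mask at each grid point.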

The next optimization investigated was to make sure that the outmost loop could be vectorized so that it could run on a SIMD architecture. Some transformations were needed before the compiler could generate SIMD code. It was also important to make sure that the code had enough computational intensity to make good use of a SIMD-based processor.
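One common kind of transformation that helps a compiler generate SIMD code is removing a data-dependent branch from a loop body. The C sketch below is an assumption about what such a transformation can look like, not the specific change made in HBM: the function names and the `wet_factor` array (1.0 for active points, 0.0 otherwise) are hypothetical. The `restrict` qualifiers tell the compiler the arrays do not overlap, and the OpenMP `simd` pragma is only applied when the code is compiled with OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the branch on wet[i] can keep a compiler from vectorizing
   the loop. */
void update_branchy(double *u, const double *flux,
                    const unsigned char *wet, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i) {
        if (wet[i]) {
            u[i] = u[i] + dt * flux[i];
        }
    }
}

/* After: the mask is stored as a multiplier (hypothetical wet_factor,
   1.0 or 0.0), so the loop body is branch-free, unit-stride, and
   free of aliasing, which makes it a straightforward SIMD candidate.
   Assumes flux values are finite, so that 0.0 * flux[i] is 0.0. */
void update_simd(double *restrict u, const double *restrict flux,
                 const double *restrict wet_factor, size_t n, double dt)
{
    assert(u != NULL && flux != NULL && wet_factor != NULL);
#if defined(_OPENMP)
#pragma omp simd
#endif
    for (size_t i = 0; i < n; ++i) {
        u[i] += wet_factor[i] * dt * flux[i];
    }
}
```

The branch-free version does extra arithmetic on inactive points, so it pays off mainly when most points in the loop range are active or when the data has already been restricted to active points.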

After the code was optimized to use more of the Intel Xeon Phi capabilities, actual data sets were used to measure the performance increase. Besides application performance, other measurements were taken over two different data sets. The application reached about 90 percent of the practically achievable bandwidth on the Intel Xeon Phi. The HBM application showed excellent, almost linear scaling to 60 Intel Xeon Phi cores, whereas runs on the Intel Xeon (E5-2697v2) alone did not show the same level of scaling.

In summary, the process is to focus on data locality and apply it to threading and vectorization techniques. This leads to improved application performance on both Intel Xeon processors and Intel Xeon Phi coprocessors. Good SIMD and thread performance are obtained when the developer focuses on data proximity and takes advantage of the superior memory bandwidth of the coprocessor. In addition, the power efficiency of the Intel Xeon Phi was higher than that of the host processor alone. However, the limited memory on the Intel Xeon Phi is the limiting factor in gaining even more performance.


Source: Danish Meteorological Institute, Denmark; Intel Corporation, USA