Heterogeneous MPI Application Optimization

The Message Passing Interface (MPI) is the backbone of many types of applications that run on collections of independent servers. These servers can be architecturally identical, or they can contain different components. As long as the application is aware of the different capabilities and can exploit them, significant speedups can be achieved. One example of using MPI on a combination of Intel Xeon processors and Intel Xeon Phi coprocessors is options pricing in the financial services market.

The options market consists of buyers and sellers who sign contracts agreeing to purchase or sell a stock market asset at a future date. A Monte Carlo simulation is run to perform a risk analysis for a given price of these assets. Monte Carlo simulations are considered “embarrassingly parallel”, in that many samples can be computed independently of one another. These algorithms can be parallelized at several levels, which makes them a natural fit for the Intel Xeon Phi coprocessor.
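
As a concrete illustration (not code from the article), the minimal C sketch below prices a single strike of a European call option with a Monte Carlo estimator; the parameters in main() are hypothetical. Every path, and every strike, is independent, so the loop parallelizes trivially across MPI ranks:

```c
/*
 * A minimal sketch (not taken from the article) of a Monte Carlo
 * estimator for a European call option under geometric Brownian motion.
 * Build: cc -O2 mc.c -o mc -lm
 */
#define _POSIX_C_SOURCE 200112L   /* for rand_r() */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Price one strike K using nPaths independent price paths. */
static double mc_call_price(double S0, double K, double r, double sigma,
                            double T, long nPaths, unsigned int seed)
{
    const double TWO_PI = 6.283185307179586;
    double payoffSum = 0.0;

    for (long i = 0; i < nPaths; i++) {
        /* Box-Muller transform: two uniforms in (0,1) -> one standard normal. */
        double u1 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
        double z  = sqrt(-2.0 * log(u1)) * cos(TWO_PI * u2);

        /* Terminal asset price, then the call payoff max(ST - K, 0). */
        double ST = S0 * exp((r - 0.5 * sigma * sigma) * T
                             + sigma * sqrt(T) * z);
        payoffSum += (ST > K) ? (ST - K) : 0.0;
    }
    /* Discount the average payoff back to today. */
    return exp(-r * T) * payoffSum / (double)nPaths;
}

int main(void)
{
    /* Hypothetical parameters: spot 100, strike 105, 5% rate, 20% vol, 1 year. */
    printf("price = %f\n",
           mc_call_price(100.0, 105.0, 0.05, 0.2, 1.0, 1000000L, 42u));
    return 0;
}
```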

Two components of ITAC, the Intel Trace Collector and the Intel Trace Analyzer, can be used to understand the performance and bottlenecks of a Monte Carlo simulation. When the strike prices were distributed evenly across both the Intel Xeon cores and the Intel Xeon Phi coprocessors, the efficiency was only about 79%: the coprocessors calculate their results much faster than the main CPU cores and then stall, waiting for the CPUs to finish.
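
For illustration, such a naive distribution might look like the hypothetical MPI sketch below, where every rank receives the same number of strikes regardless of whether it runs on a host CPU or on a coprocessor; price_strike() stands in for the Monte Carlo kernel:

```c
/* Hypothetical sketch of the naive, even distribution: every MPI rank
 * receives the same share of strikes, whether it runs on a host CPU or
 * on a coprocessor, so the faster coprocessor ranks finish early. */
#include <mpi.h>

#define NUM_STRIKES 1024

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Round-robin assignment of strike indices to ranks. */
    for (int k = rank; k < NUM_STRIKES; k += size) {
        /* price_strike(k);  -- the Monte Carlo kernel sketched above */
    }

    /* Fast (coprocessor) ranks idle here until the slow ranks arrive. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```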

By manually balancing the workload across the main CPUs and the coprocessors, better utilization can be obtained. The idea is to give more work to the faster devices. If all iterations take the same amount of execution time, this division is easy to compute. However, if the time per iteration varies, spreading the tasks across the different devices takes more effort. To accomplish this, timing data from previous runs can be used to determine how much work an iteration takes, and that information then sets the amount of work sent to each device: fewer iterations are sent to the Intel Xeon cores, and larger blocks of work are sent to the faster Intel Xeon Phi coprocessors.
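
One possible implementation of this static, speed-weighted split is sketched below; measure_or_load_speed() and the 3x speed ratio are assumptions for illustration, standing in for timing data recorded on a previous run:

```c
/* Hypothetical sketch: each rank's contiguous block of strikes is
 * sized in proportion to a throughput estimate from a previous run. */
#include <mpi.h>
#include <stdlib.h>

#define NUM_STRIKES 1024

/* Stand-in for loading per-rank timing data recorded on a previous run;
 * here we simply assume odd-numbered (coprocessor) ranks are 3x faster. */
static double measure_or_load_speed(int rank)
{
    return (rank % 2) ? 3.0 : 1.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Share every rank's speed estimate with every other rank. */
    double mySpeed = measure_or_load_speed(rank);
    double *speeds = malloc((size_t)size * sizeof(double));
    MPI_Allgather(&mySpeed, 1, MPI_DOUBLE, speeds, 1, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    double total = 0.0, prefix = 0.0;
    for (int i = 0; i < size; i++) total += speeds[i];
    for (int i = 0; i < rank; i++) prefix += speeds[i];

    /* Map this rank's share of the total throughput to a strike range
     * [first, last); the last rank absorbs any rounding remainder. */
    int first = (int)(NUM_STRIKES * prefix / total);
    int last  = (rank == size - 1)
                ? NUM_STRIKES
                : (int)(NUM_STRIKES * (prefix + mySpeed) / total);

    for (int k = first; k < last; k++) {
        /* price_strike(k); */
    }

    free(speeds);
    MPI_Finalize();
    return 0;
}
```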

Another method of distributing the iterations is the Boss-Worker model. The boss, running on the CPUs, sends work to the coprocessors; when a coprocessor completes its work, it asks for more. If a bottleneck appears because there is too much communication per unit of work, larger chunks of work can be sent to the coprocessors.
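
A minimal Boss-Worker loop in MPI might look like the following sketch; the tags, the chunk size, and the price_strike() placeholder are illustrative assumptions, not code from the article. Because each worker requests its next chunk as soon as it finishes, faster coprocessor ranks naturally receive more work:

```c
/* A minimal Boss-Worker sketch: rank 0 hands out CHUNK strikes per
 * request, so faster ranks naturally ask for, and receive, more work. */
#include <mpi.h>

#define NUM_STRIKES 1024
#define CHUNK 16     /* enlarge if communication per unit of work dominates */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* boss, running on the host CPU */
        int next = 0, active = size - 1;
        while (active > 0) {
            MPI_Status st;
            int request;
            /* Wait for any worker to ask for work. */
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_STRIKES) {     /* hand out the next chunk */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next += CHUNK;
            } else {                      /* no work left: retire the worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                              /* worker: host core or coprocessor */
        for (;;) {
            MPI_Status st;
            int first, request = 0;
            MPI_Send(&request, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&first, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            for (int k = first; k < first + CHUNK && k < NUM_STRIKES; k++) {
                /* price_strike(k); */
            }
        }
    }
    MPI_Finalize();
    return 0;
}
```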

Using the Intel ITAC tools, better load balancing can be achieved. Since native code can be executed on the Intel Xeon Phi coprocessors, they can be treated like any other compute nodes. However, workload balancing is required to use these compute units efficiently. Manual load balancing improves the utilization of all of the system components, and a Boss-Worker algorithm works as well. A benefit of the Boss-Worker approach is that workloads can scale without any additional programming.

Source: Colfax International, USA
