Heterogeneous computing systems designed for High Performance Computing (HPC) applications raise several high-level concepts that should be investigated to get maximum performance from an application. Legacy applications may run faster simply because each new CPU generation brings new technology, but there are a number of ways in which an application can be changed to take fuller advantage of the latest generation of CPUs and associated processors.
This challenge of modifying (or developing) an application for high performance can effectively be broken down into the following areas:
- Managing the work on each node
- Managing the work for each core (and thus threads)
- Vectorizing the loops
- Minimizing communication (and data movement)
Each of these areas has its own characteristics, as well as its own tools and methods that can help to achieve better performance.
Managing the work on each node can be referred to as domain parallelism. During a run of the application, the work assigned to each node can generally be isolated from the work on other nodes: each node works on its own part and needs little communication with the others. The main tool for the developer here is MPI, although applications can also take advantage of frameworks such as Hadoop and Spark (for big data analytics).
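As a rough illustration of domain parallelism, the sketch below computes which contiguous slice of a problem a given node would own. It is plain C with no MPI calls so that it stays self-contained; the names `block_range` and `block_partition` are hypothetical, and in an MPI program the `rank` and `n_ranks` arguments would typically come from `MPI_Comm_rank` and `MPI_Comm_size`.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) owned by one rank (hypothetical type). */
typedef struct {
    size_t begin;
    size_t end;
} block_range;

/*
 * Split n_items as evenly as possible across n_ranks.
 * The first (n_items % n_ranks) ranks each receive one extra item,
 * so no rank holds more than one item above any other.
 */
block_range block_partition(size_t n_items, size_t n_ranks, size_t rank)
{
    assert(n_ranks > 0 && rank < n_ranks);

    size_t base  = n_items / n_ranks;   /* items every rank gets */
    size_t extra = n_items % n_ranks;   /* leftover items to spread out */

    block_range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end   = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

Each node would then loop only over its own `[begin, end)` range and exchange data with other nodes only where the algorithm requires it.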
Managing the work for each core or thread requires control one level further down. This type of work typically involves a large number of independent tasks that must then share data with one another. The tools that help the programmer here are typically OpenMP and PGAS programming languages. These techniques can be combined with managing the work on each node, so that MPI and OpenMP work together.
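A minimal sketch of core-level parallelism with OpenMP might look like the following. The function name `sum_of_squares` is made up for illustration. The pragma only takes effect when the compiler is run with OpenMP enabled (for example `-fopenmp` with GCC or Clang); otherwise it is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum of squares over x[0..n), with loop iterations shared among threads.
 * The reduction clause gives each thread a private partial sum and
 * combines the partial sums at the end, so no explicit locking is needed.
 */
double sum_of_squares(const double *x, size_t n)
{
    assert(x != NULL || n == 0);

    double total = 0.0;
    /* A signed loop index keeps the loop valid for older OpenMP versions. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i) {
        total += x[i] * x[i];
    }
    return total;
}
```

In a hybrid MPI and OpenMP code, each node would call a routine like this on the part of the data it owns, and the per-node results would then be combined across nodes.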
Vectorizing the lowest level of loops means applying the same algorithm or mathematical operation to a significant number of data items. By thinking of the data as a vector and using directives to vectorize these loops, an application can take advantage of the latest innovations in CPUs and associated processors. To aid the developer, there are libraries that have already been vectorized and can simply be called, and modern compilers help as well. In addition, a skilled developer can recognize which parts of the application can be vectorized and tell the compiler to vectorize them.
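One way a developer can tell the compiler to vectorize a loop is sketched below, assuming a C99 compiler with OpenMP SIMD support. The function name `saxpy` follows the common BLAS naming but is written here from scratch as an illustration. The `#pragma omp simd` directive requires a flag such as `-fopenmp` or `-fopenmp-simd` with GCC or Clang; without it the directive is ignored and the compiler decides on its own whether to vectorize.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute y[i] = a * x[i] + y[i] for i in [0, n).
 * 'restrict' promises the compiler that x and y do not overlap, which
 * removes a common obstacle to vectorization, and '#pragma omp simd'
 * asks it to process several elements per instruction.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is the same operation applied to every element with no dependence between iterations, which is the pattern that vectorizes well.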
Minimizing communication between the CPUs and memory is easily overlooked when modernizing an application. With a good understanding of the algorithm, data that is already in the CPU's caches can be reused, minimizing the transfer of data back and forth to memory. This can be thought of as either arranging the algorithm to access memory in a cache-friendly way, or arranging the data to fit the science behind the algorithm. Using memory efficiently is perhaps the hardest of this set of techniques.
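As one example of arranging an algorithm to access memory in a cache-friendly way, the sketch below transposes a matrix in small tiles rather than row by row. Loop tiling is a standard technique, but its use here is only an illustration: the function name `transpose_tiled` and the tile size of 32 are assumptions, and the best tile size depends on the cache sizes of the target processor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge length; tune for the target cache. */
static const size_t TILE = 32;

/*
 * Write the transpose of the row-major n_rows x n_cols matrix 'src'
 * into 'dst' (n_cols x n_rows, row-major). Working on one TILE x TILE
 * block at a time keeps the cache lines touched in both arrays
 * resident while they are still being reused.
 */
void transpose_tiled(size_t n_rows, size_t n_cols,
                     const double *src, double *dst)
{
    assert(src != NULL && dst != NULL && src != dst);

    for (size_t ii = 0; ii < n_rows; ii += TILE) {
        size_t i_end = (ii + TILE < n_rows) ? ii + TILE : n_rows;
        for (size_t jj = 0; jj < n_cols; jj += TILE) {
            size_t j_end = (jj + TILE < n_cols) ? jj + TILE : n_cols;
            /* Transpose a single tile. */
            for (size_t i = ii; i < i_end; ++i) {
                for (size_t j = jj; j < j_end; ++j) {
                    dst[j * n_rows + i] = src[i * n_cols + j];
                }
            }
        }
    }
}
```

A straightforward double loop produces the same result, but for large matrices it strides through one of the two arrays and tends to evict cache lines before their remaining elements are used.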
Legacy and even modern codes may already use some of the techniques needed to take advantage of state-of-the-art processors. Even if one or two of the techniques described above have been used, there may be other options available to obtain higher performance. It is important to analyze an application and determine how work can be spread out among nodes and available cores. Then vectorization can be investigated, as well as how the CPU instructions interact with the data.