An expanding area of work, on both the hardware and software fronts, is modifying and optimizing applications to run on both the host processor and a coprocessor. Many techniques for transforming applications to reduce runtime have been discussed and implemented across a wide variety of applications, much of it for applications in the science and research domains. However, there has been less emphasis on applications in industrial domains. One example of recent work is the optimization of the Black-Scholes algorithm for pricing European options.
The Black-Scholes algorithm is well described in the financial mathematics literature, and the code is quite small, which allows for a lot of experimentation. Baseline measurements were taken for various problem sizes using the Intel C++ compiler with only the -O2 flag set. A common mistake programmers make is to mix data types, typically using both float and double. Modifying the code to use logf(), sqrtf(), and expf() consistently yielded a small, but not significant, increase in performance. The second step was to let the compiler vectorize the loops; using both compiler options and #pragma directives, performance for the four test-case sizes increased on the order of 10% over the baseline (with the data types no longer mixed). The next step was to use more optimized math functions where appropriate, which improved performance by almost a factor of 30. Yes, 30 times better performance. The arrays were then aligned and the tests re-run, but performance remained approximately the same.
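For illustration, a minimal sketch of such a single-precision kernel is shown below. The identifiers, constants, and the particular vectorization pragma are illustrative choices rather than the authors' actual code; the normal CDF is built here from the standard erff(), whereas the "more optimized math functions" step would substitute faster library equivalents.

#include <math.h>

const float RISKFREE   = 0.02f;   // hypothetical risk-free rate
const float VOLATILITY = 0.30f;   // hypothetical volatility

// Standard normal CDF in single precision, built from erff().
static inline float cnd(float x)
{
    return 0.5f + 0.5f * erff(x * 0.70710678f);   // 1/sqrt(2)
}

void blackScholes(const float *S,   // spot prices
                  const float *K,   // strike prices
                  const float *T,   // times to expiry
                  float *call,      // output: call option prices
                  int n)
{
    // Ask the compiler to vectorize the loop; #pragma omp simd is the
    // portable form (the Intel compiler also accepts #pragma simd).
#pragma omp simd
    for (int i = 0; i < n; ++i) {
        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i])
                    + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T[i])
                   / (VOLATILITY * sqrtT);
        float d2 = d1 - VOLATILITY * sqrtT;
        // Only the float variants logf(), sqrtf(), expf(), erff() are used,
        // so nothing is silently promoted to double.
        call[i] = S[i] * cnd(d1) - K[i] * expf(-RISKFREE * T[i]) * cnd(d2);
    }
}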
Further, compiler directives can be used to control the floating-point precision of the calculations. Using only 11 bits of mantissa instead of the normal 24 increased the performance of the tests by about 23%. The next optimization was to use OpenMP directives to run the application across all the cores in a socket; the small to large tests showed performance improvements of 7.6x to 11.3x. At this point the execution times were becoming so short that there was concern the overhead of creating threads exceeded the computational section of the application. For this benchmark testing, a "warm-up" loop was used to minimize the thread-creation time; further improvement was noted, especially for the smaller test cases.
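A sketch of the threading and warm-up pattern follows, assuming an OpenMP build (e.g., icpc -qopenmp). The problem size, input data, and identifiers are illustrative, and the reduced-precision setting is not reproduced here (with the Intel compiler it is typically requested through options such as -fimf-precision=low).

#include <math.h>
#include <stdio.h>
#include <vector>
#include <omp.h>

static const float RISKFREE   = 0.02f;   // hypothetical risk-free rate
static const float VOLATILITY = 0.30f;   // hypothetical volatility

static inline float cnd(float x)         // single-precision normal CDF
{
    return 0.5f + 0.5f * erff(x * 0.70710678f);
}

// Same per-element pricing as the earlier sketch, now spread across all
// cores of the socket with OpenMP worker threads.
void blackScholesOMP(const float *S, const float *K, const float *T,
                     float *call, int n)
{
#pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i])
                    + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T[i])
                   / (VOLATILITY * sqrtT);
        float d2 = d1 - VOLATILITY * sqrtT;
        call[i] = S[i] * cnd(d1) - K[i] * expf(-RISKFREE * T[i]) * cnd(d2);
    }
}

int main()
{
    const int n = 1 << 20;                        // illustrative problem size
    std::vector<float> S(n, 100.0f), K(n, 100.0f), T(n, 1.0f), call(n);

    // Warm-up call: forces the OpenMP runtime to create its thread pool so
    // that thread start-up cost is not charged to the timed run.
    blackScholesOMP(S.data(), K.data(), T.data(), call.data(), n);

    double t0 = omp_get_wtime();
    blackScholesOMP(S.data(), K.data(), T.data(), call.data(), n);
    printf("timed run: %.3f ms\n", 1.0e3 * (omp_get_wtime() - t0));
    return 0;
}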
By assigning portions of the code to the Intel Xeon Phi coprocessor, using from 60 to 240 threads, performance improved significantly. With these fairly simple techniques, the optimized version running on the Intel Xeon Phi coprocessor was about 2600 times faster than the baseline (CPU-based) version. The experiments also show that even when the code is fully optimized on the coprocessor, the memory bandwidth wall takes effect.
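A sketch of the offload step follows, assuming the Intel compiler's offload pragmas for the Xeon Phi (target(mic)); the function name, constants, and clause details are illustrative and may differ from the authors' code.

#include <math.h>

// Everything referenced inside the offload region must also be compiled
// for the coprocessor, hence the offload_attribute bracket.
#pragma offload_attribute(push, target(mic))
static const float RISKFREE   = 0.02f;   // hypothetical risk-free rate
static const float VOLATILITY = 0.30f;   // hypothetical volatility

static inline float cnd(float x)         // single-precision normal CDF
{
    return 0.5f + 0.5f * erff(x * 0.70710678f);
}
#pragma offload_attribute(pop)

void blackScholesPhi(const float *S, const float *K, const float *T,
                     float *call, int n)
{
    // Copy the inputs to the coprocessor, run the threaded, vectorized loop
    // there (the OpenMP runtime on the card supplies the 60 to 240 threads),
    // and copy the computed call prices back to the host.
#pragma offload target(mic:0) in(S, K, T : length(n)) out(call : length(n))
#pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i])
                    + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T[i])
                   / (VOLATILITY * sqrtT);
        float d2 = d1 - VOLATILITY * sqrtT;
        call[i] = S[i] * cnd(d1) - K[i] * expf(-RISKFREE * T[i]) * cnd(d2);
    }
}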
Source: Lobachevsky State University of Nizhni Novgorod, Russia, and Intel, Russia