Many scientific and technical applications entail matrix multiplies somewhere in the algorithm, and thus in the computer code. With today's multicore CPUs, proper use of compiler directives can speed up matrix multiplies significantly.
OpenMP is an API that supports multi-platform shared-memory multiprocessing. Originally for Fortran, the API is now available in compilers for C and C++ as well as Fortran, and the most popular operating systems are supported. The OpenMP API consists of compiler directives, library routines, and environment variables. The API is managed by a non-profit consortium through its Architecture Review Board (ARB). The first version of the OpenMP API specification was released in 1997, for Fortran.
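As a minimal sketch of those three components working together (the program name and message text here are illustrative, not part of the specification):

PROGRAM HELLO_OMP
   USE OMP_LIB                          ! library routines
   IMPLICIT NONE
!$OMP PARALLEL                          ! compiler directive
   PRINT *, 'Hello from thread', OMP_GET_THREAD_NUM(), &
            'of', OMP_GET_NUM_THREADS()
!$OMP END PARALLEL
END PROGRAM HELLO_OMP

Compiled with, for example, gfortran -fopenmp hello_omp.f90, the program honors the OMP_NUM_THREADS environment variable, so running it with OMP_NUM_THREADS=4 starts four threads.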
The goals of OpenMP can be summarized as follows:
- Portability – for languages and operating systems
- Ease of Use – provide for various levels of parallelism and do so incrementally
- Standardization – jointly created and maintained by a number of software and hardware vendors.
From the OpenMP.org web site: The OpenMP ARB now has 14 permanent members and 13 auxiliary members. Permanent members are vendors who have a long-term interest in creating products for OpenMP, and include AMD, ARM, Convey Computer, Cray, Fujitsu, HP, IBM, Intel, NEC, NVIDIA, Oracle Corporation, Red Hat, ST Microelectronics and Texas Instruments. Auxiliary members are organizations with an interest in the standard but that do not create or sell OpenMP products, and include the Argonne, Lawrence Livermore, Lawrence Berkeley, Oak Ridge and Sandia National Laboratories, Barcelona Supercomputing Centre, cOMPunity, EPCC, NASA, RWTH Aachen University, TACC, and the University of Houston.
In many application areas, certain parts of an algorithm, once implemented in code, can be run in parallel. Individual sections of the application can be sent to different processors or cores, and the results combined later.
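One way to express that pattern directly is the OpenMP SECTIONS construct. Here is a minimal sketch, in which two trivial assignments stand in for independent pieces of work:

PROGRAM SECTIONS_DEMO
   IMPLICIT NONE
   REAL :: X, Y
!$OMP PARALLEL SECTIONS
!$OMP SECTION
   X = 1.0          ! this section may run on one core...
!$OMP SECTION
   Y = 2.0          ! ...while this one runs on another
!$OMP END PARALLEL SECTIONS
   PRINT *, X + Y   ! results are used after the parallel region ends
END PROGRAM SECTIONS_DEMO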
Multiplying two matrices is common in many applications and is easily parallelized. First, let's look at a simple example of parallelizing a loop in Fortran.
!$OMP PARALLEL DO        ! I is private by default
DO I = 2, N
   B(I) = (A(I) + A(I-1)) / 2.0
END DO
!$OMP END PARALLEL DO
The iterations of the loop are then divided among the available cores on the system.
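For context, a complete program built around this loop might look like the following sketch (the array size and random initialization are illustrative); with gfortran, for example, it would be compiled with the -fopenmp flag:

PROGRAM AVG_DEMO
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 1000000
   REAL, ALLOCATABLE :: A(:), B(:)
   INTEGER :: I

   ALLOCATE(A(N), B(N))
   CALL RANDOM_NUMBER(A)     ! fill A with sample data
   B(1) = A(1)
!$OMP PARALLEL DO            ! I is private by default
   DO I = 2, N
      B(I) = (A(I) + A(I-1)) / 2.0
   END DO
!$OMP END PARALLEL DO
   PRINT *, 'B(N) =', B(N)
END PROGRAM AVG_DEMO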
A slightly more complex example uses OpenMP for a matrix multiply. The fragment below runs inside a parallel region in which the thread number TID and the scheduling chunk size CHUNK have already been set.
!$OMP DO SCHEDULE(STATIC, CHUNK)
DO I = 1, NRA
   PRINT *, 'Thread', TID, 'did row', I
   DO J = 1, NCB
      DO K = 1, NCA
         C(I,J) = C(I,J) + A(I,K) * B(K,J)
      END DO
   END DO
END DO
!$OMP END DO
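For completeness, here is a hedged sketch of a full program around that fragment (the matrix sizes and chunk size are illustrative, and the per-row PRINT is omitted to keep the output short):

PROGRAM MATMUL_OMP
   USE OMP_LIB
   IMPLICIT NONE
   INTEGER, PARAMETER :: NRA = 500, NCA = 400, NCB = 300
   INTEGER, PARAMETER :: CHUNK = 10
   REAL, ALLOCATABLE :: A(:,:), B(:,:), C(:,:)
   INTEGER :: I, J, K, TID

   ALLOCATE(A(NRA,NCA), B(NCA,NCB), C(NRA,NCB))
   CALL RANDOM_NUMBER(A)
   CALL RANDOM_NUMBER(B)
   C = 0.0

!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I,J,K,TID)
   TID = OMP_GET_THREAD_NUM()
   IF (TID == 0) PRINT *, 'Running with', OMP_GET_NUM_THREADS(), 'threads'
!$OMP DO SCHEDULE(STATIC, CHUNK)
   DO I = 1, NRA
      DO J = 1, NCB
         DO K = 1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
         END DO
      END DO
   END DO
!$OMP END DO
!$OMP END PARALLEL

   PRINT *, 'C(1,1) =', C(1,1)
END PROGRAM MATMUL_OMP

The SCHEDULE(STATIC, CHUNK) clause hands each thread contiguous blocks of CHUNK rows at a time; because only the outer loop is divided among threads, each thread writes to its own rows of C and no synchronization is needed inside the loops.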
Various tests with different hardware configurations consistently show an almost linear speedup from these techniques. Although the examples here are quite simple, they illustrate how to use OpenMP on a single system that contains more than one processing unit.
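To reproduce such a measurement, the OMP_GET_WTIME library routine returns wall-clock time in seconds. The sketch below (sizes again illustrative) times the earlier averaging loop; running it once with OMP_NUM_THREADS=1 and once with all cores enabled gives the speedup directly:

PROGRAM TIME_DEMO
   USE OMP_LIB
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 50000000
   REAL, ALLOCATABLE :: A(:), B(:)
   DOUBLE PRECISION :: T0, T1
   INTEGER :: I

   ALLOCATE(A(N), B(N))
   CALL RANDOM_NUMBER(A)
   T0 = OMP_GET_WTIME()                 ! wall-clock start
!$OMP PARALLEL DO
   DO I = 2, N
      B(I) = (A(I) + A(I-1)) / 2.0
   END DO
!$OMP END PARALLEL DO
   T1 = OMP_GET_WTIME()                 ! wall-clock end
   PRINT *, 'Elapsed seconds:', T1 - T0
END PROGRAM TIME_DEMO

Note that a memory-bound loop like this one can stop scaling once memory bandwidth saturates; the compute-heavy matrix multiply is where near-linear scaling is most often observed.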