Fast Matrix Multiply with OpenMP


Many scientific and technical applications use matrix multiplication somewhere in the algorithm, and therefore in the code. On today’s multicore CPUs, proper use of compiler directives can speed up matrix multiplies significantly.

OpenMP is an API that supports multi-platform shared memory multiprocessing. Originally for Fortran, the API is now available in compilers for C and C++ as well as Fortran. The most popular operating systems are supported. The OpenMP API consists of compiler directives, library routines and environment variables. The API is managed by a non-profit consortium with an Architecture Review Board (ARB). The first version of the OpenMP API specification was released in 1997 for Fortran.
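As a rough illustration of those three pieces working together, here is a minimal sketch. The program name OMP_PARTS and the variable NTHREADS are made up for this example. It uses a directive to open a parallel region, a library routine to ask how many threads are running, and the OMP_NUM_THREADS environment variable to set that number at run time.

```fortran
PROGRAM OMP_PARTS
  USE OMP_LIB                        ! library routines come from this module
  IMPLICIT NONE
  INTEGER :: NTHREADS                ! hypothetical name, for illustration only

  NTHREADS = 1
!$OMP PARALLEL
!$OMP SINGLE
  ! Exactly one thread records the size of the team
  NTHREADS = OMP_GET_NUM_THREADS()
!$OMP END SINGLE
!$OMP END PARALLEL

  PRINT *, 'Threads in the parallel region:', NTHREADS
END PROGRAM OMP_PARTS
```

Assuming the GNU compiler, this could be built with `gfortran -fopenmp omp_parts.f90` and run as `OMP_NUM_THREADS=4 ./a.out`. Other compilers use a different flag to enable OpenMP.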

The goals of OpenMP can be summarized as follows:

  • Portability – for languages and operating systems
  • Ease of Use – provide for various levels of parallelism and do so incrementally
  • Standardization – jointly created and maintained by a number of software and hardware vendors.

From the web site: The OpenMP ARB now has 14 permanent members and 13 auxiliary members. Permanent members are vendors who have a long-term interest in creating products for OpenMP, and include AMD, ARM, Convey Computer, Cray, Fujitsu, HP, IBM, Intel, NEC, NVIDIA, Oracle Corporation, Red Hat, ST Microelectronics and Texas Instruments. Auxiliary members are organizations with an interest in the standard but that do not create or sell OpenMP products, and include the Argonne, Lawrence Livermore, Lawrence Berkeley, Oak Ridge and Sandia National Laboratories, Barcelona Supercomputing Centre, cOMPunity, EPCC, NASA, RWTH Aachen University, TACC, and the University of Houston.

In many applications, certain parts of the algorithm, and therefore of the program that implements it, can run in parallel. Individual sections of the application can be sent to different processors or cores, and the results combined later.
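One way to express independent pieces of work is the SECTIONS construct. The sketch below is only illustrative: the array X and the two results SUMX and MAXX are invented for the example. Each section may run on a different thread, and both results are ready once the construct ends.

```fortran
PROGRAM OMP_SECTIONS_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000     ! array size is an assumption
  REAL :: X(N), SUMX, MAXX
  INTEGER :: I

  DO I = 1, N
    X(I) = REAL(I)
  END DO

!$OMP PARALLEL SECTIONS SHARED(X, SUMX, MAXX)
!$OMP SECTION
  SUMX = SUM(X)                      ! one thread computes the sum
!$OMP SECTION
  MAXX = MAXVAL(X)                   ! another thread may compute the maximum
!$OMP END PARALLEL SECTIONS

  ! Both results are available here, after the implicit barrier
  PRINT *, 'Mean =', SUMX / REAL(N), ' Max =', MAXX
END PROGRAM OMP_SECTIONS_DEMO
```

The two sections only read X and write different variables, so they do not interfere with each other.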

Multiplying two matrices is common in many applications and can be easily parallelized. First, let’s look at a simple example of parallelizing a loop, in Fortran.

!$OMP PARALLEL DO !I is private by default
 DO I=2,N
   B(I) = (A(I) + A(I-1)) / 2.0
 END DO
!$OMP END PARALLEL DO


The iterations of the loop are then divided among the threads, which run on the available cores of the system.
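For readers who want to try this, here is a small self-contained version of the same loop. The array size, the input values, and the handling of B(1) are assumptions added to make it runnable. They are not part of the original fragment.

```fortran
PROGRAM AVG_LOOP
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8        ! small size chosen for illustration
  REAL :: A(N), B(N)
  INTEGER :: I

  DO I = 1, N
    A(I) = REAL(I)
  END DO
  B(1) = A(1)                        ! loop starts at I=2, so B(1) is set here (assumption)

!$OMP PARALLEL DO SHARED(A, B) PRIVATE(I)
  DO I = 2, N
    ! Each iteration writes only B(I) and only reads A, so iterations are independent
    B(I) = (A(I) + A(I-1)) / 2.0
  END DO
!$OMP END PARALLEL DO

  PRINT *, B
END PROGRAM AVG_LOOP
```

Because no iteration depends on a value written by another iteration, the result is the same whatever number of threads is used.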

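A slightly more complex example is to use OpenMP for a matrix multiply. Each thread stores its thread number in TID and works on its own share of the rows of C, which must be set to zero beforehand.

!$OMP PARALLEL SHARED(A,B,C) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
!$OMP DO
      DO 60 I=1, NRA
      PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
  60  CONTINUE
!$OMP END PARALLEL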

Tests on a variety of hardware configurations consistently show nearly linear speedup with these techniques. Although these examples are quite simple, it is important to understand how to use OpenMP when running on a single system that contains more than one processing unit.
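To check the speedup on your own hardware, one option is to time the loop nest and compare runs with different thread counts. The sketch below is not the code used for the tests mentioned above. The matrix sizes and fill values are assumptions, and the loops are reordered so that J is outermost and I is innermost, which suits Fortran's column-major storage. The result is checked against the MATMUL intrinsic.

```fortran
PROGRAM MATMUL_OMP
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: NRA = 300, NCA = 200, NCB = 100   ! sizes are assumptions
  DOUBLE PRECISION :: A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)
  DOUBLE PRECISION :: T0, T1, MAXERR
  INTEGER :: I, J, K

  ! Fill the inputs with simple integer values and zero the result
  DO K = 1, NCA
    DO I = 1, NRA
      A(I,K) = DBLE(I + K)
    END DO
  END DO
  DO J = 1, NCB
    DO K = 1, NCA
      B(K,J) = DBLE(K - J)
    END DO
  END DO
  C = 0.0D0

  T0 = OMP_GET_WTIME()
!$OMP PARALLEL DO SHARED(A, B, C) PRIVATE(I, J, K)
  DO J = 1, NCB                      ! each thread gets its own columns of C
    DO K = 1, NCA
      DO I = 1, NRA
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
      END DO
    END DO
  END DO
!$OMP END PARALLEL DO
  T1 = OMP_GET_WTIME()

  ! Compare with the intrinsic to confirm the parallel result
  MAXERR = MAXVAL(ABS(C - MATMUL(A, B)))
  PRINT *, 'Threads available:', OMP_GET_MAX_THREADS()
  PRINT *, 'Elapsed seconds:  ', T1 - T0
  PRINT *, 'Max abs difference from MATMUL:', MAXERR
END PROGRAM MATMUL_OMP
```

Running the same executable with OMP_NUM_THREADS set to 1, 2, 4 and so on gives a rough picture of how the elapsed time scales. Matrices this small finish very quickly, so larger sizes may be needed for stable timings.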





  1. Is this a day late? This OpenMP Fortran example is not even wrong.