The Intel Parallel Computing Center at King Abdullah University of Science and Technology (KAUST) aims to provide scalable software kernels common to scientific simulation codes that will adapt well to future architectures, including a scheduled upgrade of KAUST's globally top-10 Intel-based Cray XC40 system. In the spirit of co-design, the Intel PCC at KAUST will also provide feedback that could influence architectural design trade-offs.
The Intel PCC at KAUST is hosted in KAUST's Extreme Computing Research Center (ECRC). Directed by co-investigator David Keyes, the center aims to smooth the architectural transition of KAUST's simulation-intensive science and engineering code base. Rather than taking a specific application code and optimizing it, the ECRC adopts the strategy of optimizing algorithmic kernels that are shared among many application codes, and of providing the results in open-source libraries. Chief among such kernels are Poisson solvers and dense symmetric generalized eigensolvers.
We focus on optimizing two types of scalable hierarchical algorithms -- fast multipole methods (FMM) and hierarchical matrices -- on next-generation Intel Xeon processors and Intel Xeon Phi coprocessors. These algorithms have the potential to replace workhorse kernels of molecular dynamics codes (drug/material design), sparse matrix preconditioners (structural/fluid dynamics), and covariance matrix calculations (statistics/big data). Co-PI Yokota is the architect of the open-source fast multipole library ExaFMM, which attempts to integrate the best solutions offered by FMM algorithms, including the ability to control expansion order and octree decomposition strategy independently, to create the fastest inverter that meets a given accuracy requirement for a solver or preconditioner on manycore and heterogeneous architectures. Co-PI Ltaief is the architect of the KBLAS library, which promotes the directed acyclic graph-based dataflow execution model to create NUMA-aware, work-stealing tile algorithms of high concurrency, with an innermost SIMD structure well suited to floating-point accelerators.
The overall software framework of this Intel PCC at KAUST, Hierarchical Computations on Manycore Architectures (HiCMA), is built upon these linear solvers and the philosophy that dense matrix blocks of numerically low rank should be replaced with compressed hierarchical representations wherever they arise. Hierarchical matrices are natural algebraic generalizations of fast multipole methods, and are implementable in data structures similar to those that have made FMM successful on distributed-memory nodes of shared-memory cores.
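The low-rank compression that underlies hierarchical matrices can be illustrated with a few lines of NumPy. This is a sketch only, not HiCMA code: the kernel, cluster geometry, and tolerance below are illustrative assumptions, chosen to show why a dense block coupling two well-separated point clusters admits a compact low-rank factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 64)    # source cluster of points
y = rng.uniform(10.0, 11.0, 64)  # well-separated target cluster

# Dense 64x64 block of a smooth kernel K(x, y) = 1 / |x - y|.
# Because the clusters are well separated, the kernel varies smoothly
# over the block, so its singular values decay rapidly.
A = 1.0 / np.abs(x[:, None] - y[None, :])

# Compress with a truncated SVD to a relative tolerance.
U, s, Vt = np.linalg.svd(A)
k = int(np.sum(s > 1e-8 * s[0]))          # numerical rank at tol 1e-8
A_k = (U[:, :k] * s[:k]) @ Vt[:k]         # rank-k approximation

# Storage drops from 64*64 entries to k*(64 + 64), with tiny error.
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

In a hierarchical matrix, this replacement is applied recursively to all admissible (well-separated) off-diagonal blocks, which is what reduces both memory and arithmetic from quadratic to near-linear.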
FMM and hierarchical matrix algorithms share a rare combination of O(N) arithmetic complexity and high arithmetic intensity (flops/byte). This is in contrast to traditional algorithms that have either low arithmetic complexity with low arithmetic intensity (FFT, sparse linear algebra, and stencil computations), or high arithmetic intensity with high arithmetic complexity (dense linear algebra, direct N-body summation). In short, FMM and hierarchical matrices are efficient algorithms that will remain compute-bound on future architectures. Furthermore, these methods have a communication complexity of O(log P) for P processors, and permit a high degree of asynchronicity in their communication. They are therefore amenable to the asynchronous programming models that are gaining popularity as architectures approach exascale.
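The direct N-body summation mentioned above as the high-intensity, high-complexity extreme can be sketched as follows (an illustrative 1D toy, not production code). It performs Theta(N^2) flops over Theta(N) data, so its arithmetic intensity grows with N and it is compute-bound; FMM retains dense, high-intensity cluster-to-cluster interactions like these at the leaves while reducing the total work to O(N) through hierarchical approximation of far-field interactions.

```python
import numpy as np

def direct_potential(x, q):
    """phi_i = sum_{j != i} q_j / |x_i - x_j|, evaluated densely.

    Builds the full N x N distance matrix: Theta(N^2) flops on
    Theta(N) input data, i.e. high arithmetic intensity but
    quadratic cost -- exactly what FMM is designed to avoid.
    """
    r = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(r, np.inf)   # exclude the self-interaction term
    return (q[None, :] / r).sum(axis=1)

rng = np.random.default_rng(1)
N = 200
x, q = rng.random(N), rng.random(N)
phi = direct_potential(x, q)
```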