“The move away from the traditional single processor/memory design has fostered new programming paradigms that address multiple processors (cores). Existing single core applications need to be modified to use extra processors (and accelerators). Unfortunately there is no single portable and efficient programming solution that addresses both scale-up and scale-out systems.”
Articles and news on parallel programming and code modernization
“Intel recently announced the first product release of its High Performance Python distribution powered by Anaconda. The product provides a prebuilt easy-to-install Intel Architecture (IA) optimized Python for numerical and scientific computing, data analytics, HPC and more. It’s a free, drop in replacement for existing Python distributions that requires no changes to Python code. Yet benchmarks show big Intel Xeon processor performance improvements and even bigger Intel Xeon Phi processor performance improvements.”
While HPC developers worry about squeezing out the ultimate performance while running an application on dedicated cores, Intel TBB tackles a problem that HPC users never worry about: How can you make parallelism work well when you share the cores that you run upon?” This is more of a concern if you’re running that application on a many-core laptop or workstation than a dedicated supercomputer because who knows what will also be running on those shared cores. Intel Threaded Building Blocks reduce the delays from other applications by utilizing a revolutionary task-stealing scheduler. This is the real magic of TBB.
“Managing the work on each node can be referred to as Domain parallelism. During the run of the application, the work assigned to each node can be generally isolated from other nodes. The node can work on its own and needs little communication with other nodes to perform the work. The tools that are needed for this are MPI for the developer, but can take advantage of frameworks such as Hadoop and Spark (for big data analytics). Managing the work for each core or thread will need one level down of control. This type of work will typically invoke a large number of independent tasks that must then share data between the tasks.”
Thomas Sterling presented this Invited Talk at SC16. “Increasing sophistication of application program domains combined with expanding scale and complexity of HPC system structures is driving innovation in computing to address sources of performance degradation. This presentation will provide a comprehensive review of driving challenges, strategies, examples of existing runtime systems, and experiences. One important consideration is the possible future role of advances in computer architecture to accelerate the likely mechanisms embodied within typical runtimes. The talk will conclude with suggestions of future paths and work to advance this possible strategy.”
To achieve high performance, modern computer systems rely on two basic methodologies to scale resources. A scale-up design that allows multiple cores to share a large global pool of memory and a scale-out design design that distributes data sets across the memory on separate host systems in a computing cluster. To learn more about In-Memory computing download this guide from IHPC and SGI.
With modern processors that contain a large number of cores, to get maximum performance it is necessary to structure an application to use as many cores as possible. Explicitly developing a program to do this can take a significant amount of effort. It is important to understand the science and algorithms behind the application, and then use whatever programming techniques that are available. “Intel Threaded Building Blocks (TBB) can help tremendously in the effort to achieve very high performance for the application.”
Today ORNL announced the full schedule of 2017 GPU Hackathons at multiple locations around the world. “The goal of each hackathon is for current or prospective user groups of large hybrid CPU-GPU systems to send teams of at least 3 developers along with either (1) a (potentially) scalable application that could benefit from GPU accelerators, or (2) an application running on accelerators that need optimization. There will be intensive mentoring during this 5-day hands-on workshop, with the goal that the teams leave with applications running on GPUs, or at least with a clear roadmap of how to get there.”
Manuel Arenaz from Appentra presented this talk at the OpenMP booth at SC16. “Parallware is a new technology for static analysis of programs based on the production-grade LLVM compiler infrastructure. Using a fast, extensible hierarchical classification scheme to address dependence analysis, it discovers parallelism and annotates the source code with the most appropriate OpenMP & OpenACC directives.”
The Seventh International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) has issued its Call for Papers. The event takes place May 29 in Orlando, Florida in conjunction with the IEEE International Parallel and Distributed Processing Symposium.