The Exascale Report Asks: How do you define “multicore optimization?”


As the global HPC community forms circles of opinion on the challenges of making exascale a reality, it seems ‘multicore optimization’ — at some level — will have to be a key ingredient. How do you define ‘multicore optimization’, and what role do you see this technology playing in the development of production exascale systems?

Everyone knows Moore’s Law, and multi-core processor advances will play an important role in exascale evolution. But a much less discussed principle, Amdahl’s Law, will become equally or more prominent. As we deploy servers with 64, 128, or 256 cores (and beyond), we need to address how applications can take advantage of massively parallel processing capacity, given that most applications today are serial or only lightly parallel designs. Advances in tools, libraries, and education are taking place that will, over time, help developers parallelize applications to a greater degree. Ultimately, though, most problems are not massively parallel by nature, and their achievable speedup is bounded by Amdahl’s Law. So we need to parallelize applications when and where possible, and also recognize that effectively utilizing many system cores will require many concurrent tasks to run safely and predictably within a single system.
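To see why Amdahl’s Law looms so large at high core counts, the short sketch below works through the bound on speedup for a few serial fractions. The numbers are purely illustrative and not taken from the article: even a 5% serial portion caps a 256-core run at roughly 19x.

```python
# Amdahl's Law: speedup is capped by the serial fraction of a program.
# Illustrative sketch only; the serial fractions and core counts are hypothetical.

def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup for a program with the given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for serial in (0.10, 0.05, 0.01):
    for cores in (64, 128, 256):
        print(f"serial={serial:.0%} cores={cores:4d} "
              f"max speedup={amdahl_speedup(serial, cores):6.1f}x")
    # No matter how many cores are added, speedup never exceeds 1/serial_fraction.
    print(f"  asymptotic limit: {1.0 / serial:.0f}x")
```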

As we attempt to optimize multi-core system performance given Amdahl’s Law, we must consider that when running many tasks under a time-share based operating system, contention for shared resources, which degrades performance, becomes more likely as the number of tasks increases. Think about traffic through Manhattan. There are plenty of shared resources available (e.g., streets, traffic lanes) and lots of jobs (cars) to process. Without traffic lights? Chaos, even if the drivers try to apply relatively fair time-sliced access to a resource. With traffic cops or traffic lights, things get better. With well-synchronized traffic lights, you can actually experience good throughput.

For maximum effectiveness, exascale deployments will require new resource allocation intelligence that improves how multi-core system resources are allocated to applications. Standard time-slice operating system behavior is insufficient: it fails to prevent contention for shared resources, inflates job runtimes, and has limited ability to recognize and resolve resource imbalance and overload. System-level ‘intelligent resource synchronization’ needs to recognize the value (priority) of a task and its resource needs, expedite resource allocation to higher-priority tasks when necessary, proactively resolve resource imbalance or overload, and dynamically assign system-level resources in a manner that optimizes throughput.
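Mainstream operating systems already expose a few primitives that hint at this kind of control. The sketch below is my own illustration, not a description of any product discussed here: it statically pins a high-priority worker to a dedicated set of cores and lowers the priority of background work on Linux, using Python’s standard os interfaces. A real ‘intelligent resource synchronization’ layer would make these decisions dynamically as load and priorities change.

```python
# Hypothetical sketch: static priority- and affinity-aware task placement on Linux.
# Assumes a machine with at least 8 cores; sched_setaffinity is Linux-only.
import os
import multiprocessing

def run_pinned(task, cores, niceness):
    """Run `task` in a child process restricted to `cores` with the given niceness."""
    def wrapper():
        os.sched_setaffinity(0, cores)   # 0 = the calling (child) process
        os.nice(niceness)                # higher niceness = lower CPU priority
        task()
    p = multiprocessing.Process(target=wrapper)
    p.start()
    return p

def critical_task():
    sum(i * i for i in range(10_000_000))   # stand-in for a high-priority job

def background_task():
    sum(i * i for i in range(10_000_000))   # stand-in for a low-priority job

if __name__ == "__main__":
    # Reserve cores 0-3 for the critical job; push background work elsewhere.
    jobs = [run_pinned(critical_task, {0, 1, 2, 3}, 0),
            run_pinned(background_task, {4, 5, 6, 7}, 10)]
    for p in jobs:
        p.join()
```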

The move by mainstream processor vendors in the past decade to improve total performance with multiple cores, instead of focusing on improving single-core performance, means applications need to take advantage of multicore parallelism or suffer degraded performance. This problem is not unique to exascale; it affects applications from cell phones to business data processing. Unless there is a major shift in the way processors are designed, exascale systems will be built largely from commodity processors and memories, and will use slightly modified commodity operating systems and program development tools.

To reach exascale performance, applications need to be able to exploit all the levels of parallelism available in the system. Total performance is roughly the product of the parallelism exploited at each level, so leaving any level untapped results in a serious performance loss. As the core count grows, effective use of the many cores available will necessitate changes in how these chips are designed and used. For processor architectures, this will require designing multicore processors that can be programmed coherently, instead of as a collection of independent cores that just happen to share a memory controller. Operating systems must efficiently manage collections of threads and cores as a unit, instead of scheduling each process independently onto some core. Program development tools must expose enough of the performance-critical multicore features so programs can be tuned to take advantage of them, automatically using both static and dynamic program behavior to optimize total application performance. And, yes, applications will have to be tuned (or re-tuned) for scalable multicore performance.
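To make the “product of parallelism” point concrete, here is a back-of-the-envelope sketch. The machine parameters are hypothetical and chosen only for illustration: peak rate is the product of node count, cores per node, SIMD lanes, and operations issued per cycle, so ignoring any one level divides the achievable performance by that entire factor.

```python
# Back-of-the-envelope model of multiplicative parallelism levels.
# All numbers below are hypothetical, chosen only to illustrate the point.

nodes           = 100_000   # system-level (inter-node) parallelism
cores_per_node  = 128       # on-chip multicore parallelism
simd_lanes      = 16        # vector (SIMD) lanes per core
flops_per_cycle = 2         # e.g. one fused multiply-add per lane per cycle
clock_ghz       = 2.5

# GFLOP/s -> PFLOP/s conversion is the final division by 1e6.
peak_pflops = nodes * cores_per_node * simd_lanes * flops_per_cycle * clock_ghz / 1e6
print(f"peak with every level exploited:   {peak_pflops:,.0f} PFLOP/s")

# Leave one level untapped (scalar code instead of SIMD) and that factor is lost.
scalar_pflops = peak_pflops / simd_lanes
print(f"peak with SIMD left on the table:  {scalar_pflops:,.0f} PFLOP/s")
```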

This is not the right question. It makes the assumption that “multicore” just needs a little tweak to get the next 1000X in supercomputing performance.

It’s like asking in the 1990s how to tweak vector supercomputers to get to petascale, when the true petascale systems that emerged were built from entirely different technology.

To get 1000 times more real computing power, we need to look at every assumption. Some scaling factors, like voltage scaling and transistor switching performance, are now fully tapped, so we need to look somewhere other than simple silicon scaling of serial processors and piling tens of thousands of sequential processor cores into a chassis. These are pretty much the same processors that are in your PC.

Designing a true parallel processor is more than printing additional cores on a die because you have room. The design of the parallel processor directly impacts everything else about the computer, most importantly how you get work done through the programming model. If the processor itself doesn’t scale, then there’s no path to exascale. The true path to exascale, where thousands of processors have to work in concert to solve a single problem, requires a fundamental rethinking of how processors work together. Both the processor itself and the current programming models need to evolve, since they emerged with teraflop computers, machines one millionth the speed of an exascale system.

Performance is one goal. Building a system with reasonable power consumption is an equally important one, and it was described very well by Bill Dally in a recent article:

“To continue scaling computer performance, it is essential that we build parallel machines using cores optimized for energy efficiency, not serial performance. Building a parallel computer by connecting two to 12 conventional CPUs optimized for serial performance, an approach often called multi-core, will not work. This approach is analogous to trying to build an airplane by putting wings on a train. Conventional serial CPUs are simply too heavy (consume too much energy per instruction) to fly on parallel programs and to continue historic scaling of performance.”

There are a great number of decisions to make and challenges to address on the way to exascale. How to tweak current technology is the wrong starting point.

For related stories, visit The Exascale Report Archives.
