PNNL and Micron Partner to Push Memory Boundaries for HPC and AI

Researchers at Pacific Northwest National Laboratory (PNNL) and Micron are developing an advanced memory system to support AI for scientific computing. The work is designed to address AI’s insatiable demand for live data and to push the boundaries of memory-bound AI applications by connecting memory across processors using the Compute Express Link (CXL) data interface, according to a recent edition of the ASCR Discovery publication.

“Most of the performance improvements have been on the processor side,” said James Ang, PNNL’s chief scientist for computing and the lab’s project leader. “But recently, we’ve been falling short on performance improvements, and it’s because we’re actually more memory-bound. That bottleneck increases the urgency and priority in memory resource research.”

Micron, a memory and storage semiconductor company based in Boise, Idaho, is collaborating with PNNL, based in Richland, Washington, on this effort, which is sponsored by the Advanced Scientific Computing Research (ASCR) program in the Department of Energy (DOE). The goal is to help assess emerging memory technologies for DOE Office of Science projects that employ artificial intelligence. The partners say they will apply CXL to join memory from various processing units deployed for scientific simulations.

Tony Brewer, Micron’s chief architect of near-data computing, says the collaboration aims to blend old and new memory technologies to boost high-performance computing (HPC) workloads. “We have efforts that look at how we could improve the memory devices themselves and efforts that look at how we can take traditional high-performance memory devices and run applications more efficiently.”

Part of the strategy is to implement a centralized memory pool, which would help mitigate memory over-provisioning.

“In HPC systems that deploy AI, high performance but low-capacity memory (typically gigabytes in capacity) is typically coupled to the GPUs, whereas a conventional system with low-performance but high capacity memory (terabytes) is loosely coupled via the traditional HPC workhorses, central processing units (CPUs),” PNNL said. “With PNNL, Micron will create proof-of-concept shared GPU and CPU systems and combine them with additional external storage devices in the hundreds of terabytes range. Future systems will need rapid access to petabytes of memory – a thousand times more capacity than on a single GPU or CPU.”

The intent is to create a third level of memory hierarchy, Brewer explains. “The host would have some local memory, the GPU would have some local memory, but the main capacity memory is accessible to all compute resources across a switch, which would allow scaling of much larger systems.” This unified memory would let researchers using deep-learning algorithms run a simulation while its results simultaneously feed back to the algorithm.
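As a rough illustration of the three-level arrangement Brewer describes, the sketch below models host-local memory, GPU-local memory, and a shared pool, and places each allocation in the requester's local tier when it fits, spilling to the pool otherwise. The `Tier` class, the `place` function, and all capacities are hypothetical and chosen only for illustration; the code does not use any CXL API and is not the project's actual design.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """A memory tier with a fixed capacity (all sizes in GB, hypothetical)."""
    name: str
    capacity_gb: int
    used_gb: int = 0

    def try_alloc(self, size_gb: int) -> bool:
        """Reserve size_gb if it fits; report whether the reservation succeeded."""
        if self.used_gb + size_gb <= self.capacity_gb:
            self.used_gb += size_gb
            return True
        return False


def place(size_gb: int, local: Tier, pool: Tier) -> str:
    """Prefer the requester's local tier, then fall back to the shared pool."""
    for tier in (local, pool):
        if tier.try_alloc(size_gb):
            return tier.name
    raise MemoryError(f"no tier can hold {size_gb} GB")


# Assumed capacities: small fast local tiers, one large shared pool.
host_local = Tier("host-local", 512)
gpu_local = Tier("gpu-local", 80)
shared_pool = Tier("shared-pool", 100_000)

placements = [
    place(64, gpu_local, shared_pool),    # fits in GPU-local memory
    place(64, gpu_local, shared_pool),    # GPU-local is full, spills to the pool
    place(256, host_local, shared_pool),  # fits in host-local memory
]
print(placements)
```

A real system would also weigh latency and bandwidth differences between tiers, which this sketch ignores.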

A centralized memory system could also benefit operations because an algorithm or scientific simulation can share data with, say, another program that’s tasked with analyzing those data. These converged application workflows are typical in DOE’s scientific discovery challenges. Sharing memory and moving it around involves other technical resources, says Andrés Márquez, a PNNL senior computer scientist. This centralized memory pool, on the other hand, would help mitigate the issue of over-provisioning the memory.

Because AI-aided, data-driven science drives up demand for memory, an application can’t afford to partition and strand it. When memory is stranded, it keeps “piling up underutilized at various processing units. Having the capability of reducing that over-provisioning and getting more bang out of your buck by sharing that data across all those devices and different stages of workflow cannot be overemphasized,” Márquez explained.
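The over-provisioning argument can be made concrete with a small calculation. The sketch below compares two sizing policies: giving every device enough private memory for its own peak demand, versus sizing one shared pool for the largest combined demand at any workflow stage. The device names and demand figures are invented for illustration and do not come from PNNL or Micron.

```python
# Hypothetical memory demand (GB) per device across four workflow stages.
demand_gb = {
    "cpu":  [400, 100, 100, 300],
    "gpu0": [50, 600, 100, 50],
    "gpu1": [50, 100, 600, 50],
}


def private_provisioning(demand):
    """Total memory when each device is sized for its own peak demand."""
    return sum(max(series) for series in demand.values())


def pooled_provisioning(demand):
    """Size of one shared pool covering the largest combined demand at any stage."""
    return max(sum(stage) for stage in zip(*demand.values()))


private_gb = private_provisioning(demand_gb)  # 400 + 600 + 600
pooled_gb = pooled_provisioning(demand_gb)    # stage totals: 500, 800, 800, 400
stranded_gb = private_gb - pooled_gb

print(f"private: {private_gb} GB, pooled: {pooled_gb} GB, stranded: {stranded_gb} GB")
```

With these assumed numbers, private provisioning needs 1600 GB while a pool needs 800 GB, because the devices do not all peak at the same stage. The gap shrinks if peaks coincide, and the sketch ignores the local memory each device would still keep.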

Some of PNNL’s AI algorithms can underperform when memory is slow to access, Márquez says. In PNNL’s computational chemistry group, for instance, researchers use AI to study water’s molecular dynamics to see how it aggregates and interacts with other compounds. Water is a common solvent for commercial processes, so running simulations to understand how it acts with a molecule of interest is important. A separate research team at Richland is using AI and neural networks to modernize the power grid’s transmission lines.

Micron’s Brewer said he looks forward not only to developing tools with PNNL but also to their commercial use by any company working on large-scale data analysis. “We are looking at algorithms,” he said, “and understanding how we can advance these memory technologies to better meet the needs of those applications.”

PNNL’s computational science problems give Micron a way to observe the applications that will most stress the memory. Those findings will help Brewer and colleagues develop products that help industry meet its memory requirements.

Ang, too, said he expects the project to help AI at large, pointing out that the Micron partnership isn’t “just a specialized one-off for DOE or scientific computing. The hope is that we’re going to break new ground and understand how we can support applications with pooled memory in a way that can be communicated to the community through enhancements to the CXL standard.”