Storage Advances for HPC and AI

In this special guest feature from Scientific Computing World, Robert Roe looks at storage technologies being developed to suit both AI and HPC workloads.

The storage market is in a unique position, in that there is demand from both traditional HPC and enterprise markets, such as media and entertainment, alongside fast-growing markets for AI and machine learning. This creates a huge opportunity for storage vendors to increase market share, as long as they can deliver the performance that HPC and AI users require.

While storage volumes continue to increase dramatically, storage providers are trying to meet demand by raising performance and introducing more efficient methods of managing data across large, multi-petabyte storage platforms.

Choosing the right system for a particular workflow is critical to getting the most out of storage technology. The traditional products associated with parallel file systems persist, but they now face growing competition from cloud and all-flash storage arrays, which appeal to users at opposite ends of the hardware spectrum.

Many HPC users have complex requirements, which means that no single storage technology is perfect for every situation or workflow. As the price of storage hardware drops and new technologies, such as 3D NAND, become commoditized, HPC users are readying themselves for the next generation of HPC storage technology.

While it is not completely clear just how these technologies will be used in large-scale file systems, the consensus among vendors is that SSD and flash technology will be key to developing large-scale storage architectures, particularly as we approach initial targets for exascale computing.

The classical techniques for increasing computing performance typically revolve around raising clock frequencies and increasing parallelism, but this has created a disparity between storage, memory and compute, as huge amounts of data must be fed to a growing number of processing elements.

To combat this, storage vendors are moving towards much faster, lower-latency storage architectures, with many opting to move data as close as possible to the processing elements. This reduces the penalty of moving data into and out of processors or accelerators for computation.
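To see the scale of that disparity, consider a minimal sketch (not from the article) that times an in-memory reduction against reading the same bytes back from disk; faster, closer storage tiers exist to shrink exactly this gap.

```python
import os
import time
import numpy as np

# Minimal sketch of the compute/storage disparity: compare summing an
# array already resident in memory with reading the same bytes from disk.
N = 100_000_000  # roughly 800 MB of float64 values
data = np.random.rand(N)

# Operate on data already in memory.
t0 = time.perf_counter()
total = data.sum()
compute_s = time.perf_counter() - t0

# Write the array out, then read it back as a stand-in for a storage fetch.
# (Note: the OS page cache may flatter the read time on a warm run.)
data.tofile("scratch.bin")
t0 = time.perf_counter()
loaded = np.fromfile("scratch.bin", dtype=np.float64)
io_s = time.perf_counter() - t0
os.remove("scratch.bin")

print(f"in-memory sum: {compute_s:.3f} s, disk read: {io_s:.3f} s")
```

On spinning disk the read typically dominates by an order of magnitude or more; NVMe flash narrows the gap, which is the motivation for pulling data closer to the compute.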

AI generates new storage technology

In March, Pure Storage announced that it had partnered with Nvidia to release a portfolio of storage products for AI initiatives, from early inception to large-scale production.

The company announced a new hyperscale configuration of its AI-Ready Infrastructure (AIRI), which has been designed to deliver supercomputing capabilities for enterprise users. The technology is aimed at AI users who demand the highest performance or have grown beyond the capabilities of AI-ready solutions already on the market.

Built jointly with Nvidia and Mellanox, hyperscale AIRI provides multiple racks of Nvidia DGX-1 and DGX-2 systems with both InfiniBand and Ethernet fabrics as interconnect options. In addition, Pure Storage announced FlashStack for AI, a product built jointly with Cisco and Nvidia to deliver storage performance that meets the data demands of the DGX-2.

The partnership with Nvidia has also enabled Pure to deliver software advancements that fit with the existing tools provided by Nvidia. Using the Nvidia NGC software container registry and the AIRI scaling toolkit, data scientists can begin building applications with containerized AI frameworks and rededicate their time to deriving valuable insights from data. In addition, integration with Kubernetes and Pure Service Orchestrator means IT teams can deliver an AI infrastructure with cloud-like elasticity, and hyperscale AIRI allows enterprises to scale beyond the Nvidia DGX-1 to the DGX-2.
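As a rough illustration of that Kubernetes integration (a hypothetical sketch, not taken from the announcement), a team might request a shared volume through a Pure Service Orchestrator storage class using the official Kubernetes Python client. The storage class name pure-file and the namespace below are assumptions; check what is actually installed in a given cluster.

```python
from kubernetes import client, config

# Hypothetical sketch: provision a shared volume for containerized AI
# frameworks via a Pure Service Orchestrator storage class.
config.load_kube_config()  # use load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],   # shared across multi-node training jobs
        storage_class_name="pure-file",   # assumed PSO file storage class
        resources=client.V1ResourceRequirements(
            requests={"storage": "1Ti"}
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

The claim can then be mounted into training pods like any other Kubernetes volume, which is the cloud-like elasticity the integration is aimed at.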

“Businesses from a variety of industries are discovering that AI is necessary to tackle existing problems and create new opportunities,” said Matt Burr, general manager of FlashBlade, Pure Storage. “For example, healthcare organizations from around the world are using AI to bring advancements to treatments and quality of care. AI is complex and in its early stages, which means the solutions built to enable AI must be straightforward and user-friendly. Hyperscale AIRI is designed to bring supercomputing capabilities to pioneers of real-world AI, without the complexities that often occur when scaling across multiple racks.”

AIRI is built to enable data architects and scientists to operationalize AI at scale. In a blog post from Cisco in January, senior product marketing manager Maggie Smith stated: “FlashStack for AI workloads provides a proven, validated design for these highly data-intensive workloads. With FlashStack for AI workloads, organizations can deploy a validated architecture for AI workloads that reduces design risks in building a data, compute and storage infrastructure for the AI data pipeline, and helps achieve better business outcomes.”

“FlashStack is a converged infrastructure designed to deliver outstanding performance and reliability. With the addition of the UCS C480 ML, FlashStack for AI extends the existing infrastructure to support AI/ML workloads without adding new infrastructure silos,” said Todd Brannon, senior director of product marketing for the UCS portfolio at Cisco.

Cloud technology

While AI may steal the headlines, HPC is still generating advances in storage technology. Cloud environments have made significant progress over the last five to ten years, going from what was seen as fringe or niche technology to a much more ubiquitous platform, particularly with the advances made by cloud giants such as Google and Amazon.

However, while cloud bursting can be effective, in-house cloud environments remain popular for large-scale HPC systems, as staff often have experience dealing with parallel file systems.

In March, hybrid cloud storage provider Qumulo announced that its technology had been chosen for the National Renewable Energy Laboratory’s (NREL) Computational Science Center.

NREL is the US Department of Energy’s primary national laboratory for renewable energy and energy efficiency research and development. The organization focuses on research into renewable power technologies, sustainable transportation, energy efficiency and energy systems integration.

NREL required a high-capacity storage solution for its researchers and engineers to store, access and manage their unstructured data. It has deployed three clusters of Qumulo storage to date: a four-node cluster on the Qumulo Capacity Series QC40, a four-node cluster on the Qumulo Capacity Series QC208, and a four-node cluster on the HPE90.

The lab is using Qumulo storage for researcher home directories on Eagle, the newest HPC system at NREL, as well as for administrative and researcher NAS requirements and for replication of all HPC NAS data.

“With the impact of climate change, it’s crucial for organizations such as NREL to adopt technologies that accelerate their research and outcomes. The lab’s HPC environment has a broad range of demands encompassing multiple use cases, a large number of users, and long-term data retention,” said Molly Presley, Qumulo’s director of global product marketing. “With Qumulo’s incredibly scalable architecture and real-time file analytics, the NREL team can gain better insight into their data, and in turn focus more effectively on their mission to drive innovation in energy efficiency and renewable energy technologies.”

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
