No Hoof, No Horse: The Long, Winding Trail to Exascale Storage

There’s a saying in equestrian circles, “No hoof, no horse.” In the world of HPC, this translates as “No storage, no system.”

In the not-too-distant past, storage was often an afterthought when configuring a new supercomputer: an uninteresting but useful appendage that performed quietly in the background while rows of fancifully painted compute cabinets hogged the limelight. In one apocryphal industry story, a university alumni fund presented its alma mater with a leading-edge HPC system so it could join the hallowed ranks of the TOP500. The machine screamed through the LINPACK tests. The only problem was that after the handshaking and backslapping were over, the university’s researchers found they couldn’t use the system; no one had made any provision for adequate, high-speed storage.

Jason Hick, NERSC

Jason Hick, who heads the Storage Systems Group at DOE’s National Energy Research Scientific Computing Center (NERSC), points out, “Storage and I/O really have design methodologies similar to those used in architecting compute systems, but for the longest time we’ve focused purely on optimizing the speed of the compute systems rather than storage. As we prepare for the exascale era, storage must not be an afterthought.”

Storage, he implies, could become a serious bottleneck in the drive to design efficient, affordable exascale computers.

Hick adds that storage designers are feeling a sense of urgency about getting programs underway to correct the situation. New file and archive systems usually take five to eight years, sometimes longer, to move from conception to widespread adoption. “So, new designs for file or archival storage systems need to be underway in the next several years in order to be production-ready for the dawn of the exascale era,” he says.

Last year, at a NERSC-hosted workshop on the dawn of exascale storage, Hick commented, “Disk performance is increasing at about five percent a year; this is impacting the performance and size of file systems available to science researchers. It further impacts the feasibility of managing the data, whether for analysis or archiving. This trend is one example of a problem that storage has to deal with in order to realize science researchers’ desire to improve on their science productivity in the extreme-scale era.”

Or as Ken Batcher, professor of computer science at Kent State University, so famously said, “A supercomputer is a device for turning compute-bound problems into I/O problems.”
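
To see why Batcher’s quip stings, consider a rough back-of-the-envelope comparison of Hick’s disk-growth figure with the thousand-fold compute jump discussed later in this article. The eight-year horizon in the sketch below is an assumption chosen for illustration, not a figure quoted by anyone here.

```python
# Rough arithmetic only: compare ~5% annual disk-performance growth with the
# ~1,000x jump from petaFLOPS to exaFLOPS. The eight-year horizon is an
# illustrative assumption, not a number from the article.
years = 8
disk_growth = 1.05 ** years        # ~1.48x per-drive improvement
compute_growth = 1000              # petaFLOPS -> exaFLOPS
shortfall = compute_growth / disk_growth

print(f"Per-drive disk speedup over {years} years: ~{disk_growth:.2f}x")
print(f"Targeted compute speedup: {compute_growth}x")
print(f"Gap left for more drives, new tiers, and new designs: ~{shortfall:.0f}x")
```

However the horizon is chosen, the gap cannot be closed by faster individual disks; it has to come from parallelism and new storage designs.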

Putting storage in the picture

Dave Fellinger, DDN

As supercomputers move into the petaFLOPS range, storage and file systems are no longer the compute infrastructure’s poor relations. Dave Fellinger, CTO at storage vendor DataDirect Networks, notes, “Storage was taken into account from the very beginning in the design of Jaguar, the 1.8 petaFLOPS Cray system at Oak Ridge National Laboratory. Sequoia, a 20 petaFLOPS IBM Blue Gene/Q supercomputer scheduled to go online at Lawrence Livermore in 2011, is another good example of system design that recognizes the importance of I/O, file systems, and the key role that storage hierarchies play in the overall infrastructure.”

“Now the big question is: what are the implications for storage design as we move toward exaFLOPS systems in the 2015 to 2018 time frame?” he adds. “A thousand-fold increase in speed is going to mean a huge increase in storage capabilities. Scaling to these levels is mind-boggling. ORNL’s Spider file system, a 13 petabyte, 240 GB/s file storage environment, has close to 15,000 drives. Now ramp that up by 1,000 and add to it the energy requirements of the compute portion of the architecture, and we could wind up with something like the cartoon that was shown at the recent SciDAC conference in Chattanooga: a data center with its own nuclear power plant next door.”

Energy aside, just what are some of the pressing problems facing storage designers in the quest for exascale? Here are a few of the more interesting challenges being addressed by technologists at storage companies, government labs, and academia.

  • New data models. For decades now, storage system software has used the same kinds of abstractions to determine how data is stored. But as we move into petabytes, and ultimately exabytes, new data models are needed that better take into account what scientists and engineers need to accomplish. A major disruption is required (think of how the use of tabular data was transformed by the introduction of database management systems).

    Rob Ross, ANL

    Rob Ross, a computer scientist with the Mathematics and Computer Science Division of Argonne National Laboratory, points out that traditional file system models come with a lot of baggage that will slow us down on the road to exascale. Current data models do not provide adequate support for irregular data structures and graphs, or for the adaptive formats needed to manage memory footprints and computational loads in data-intensive scientific problems.

    He points to several current efforts that use new data models to address multidimensional data storage issues, including the HDF5 (Hierarchical Data Format) library and the NetCDF (Network Common Data Form) library. HDF5 includes abstract data and storage models, with libraries to map the storage model to different storage mechanisms. NetCDF is a set of software libraries and self-describing formats that allow users to work with array-oriented scientific data (a brief sketch of how such formats are used appears after this list). Also in the works is SciDB, an open source database software project designed to handle data-intensive scientific problems. But these systems may not scale to exascale levels, and new, as-yet-undefined data and metadata models must be developed.

  • Locality. When moving petabytes of data from compute nodes to attached devices for storage, subsequent analysis, and archiving in case of failure or for future reference, closer is better. Expected to produce 15 million gigabytes of data annually when fully operational, the Large Hadron Collider is the poster child for locality on a grand scale. Says Ross, “We need to allow applications to talk to I/O systems in terms of the data models they actually use. What’s required is high bandwidth, low capacity storage near the HPC system clients to capture and cache data, as well as the means to move these data out to more persistent devices over time. We need to think how and where that information is stored so when scientists want to retrieve data for analysis, they can exploit temporal and spatial locality for rapid and accurate data transfer.”

  • Checkpointing. Quips Fellinger, “The more stuff you floor, the more problems you’re likely to encounter.” Because exascale systems involve vast numbers of components, mean time between failures (MTBF) and recovery are key issues. Checkpoints, which save the current state of a program and its data to non-volatile storage in case of failure, become even more crucial. Checkpoints also represent a potential bottleneck of truly monumental proportions.

    Today’s checkpoint scenarios are too cumbersome to work with tomorrow’s data scales. “Without a new solution to this problem, you may wind up using your compute resources primarily to do checkpoints, with very little left over for actual computation,” Fellinger predicts. The gold standard in the exascale world is resiliency, and the current state of the art in checkpointing just does not measure up.

  • Archive. Jason Hick and his colleagues at NERSC are part of the HPSS collaboration that is working on a significant redesign of that venerable storage management software, which runs the archival storage so essential to scientific endeavors. For example, as scientific knowledge and data analysis techniques progress, researchers like to revisit old data and examine it with a fresh eye. To meet the challenge of exascale, the NERSC team is concentrating on designing multiple metadata servers, a promising approach for archival storage at extreme scale. Although HPSS version 8 could be considered a revolutionary step forward, it is still powered by IBM’s tried-and-true relational database technology, DB2. IBM plans to run a demo at SC10 in New Orleans this fall.
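
Returning to the data-model discussion at the top of this list, here is a minimal sketch of how self-describing, array-oriented formats such as HDF5 and NetCDF are typically used. It assumes the h5py and netCDF4 Python bindings; the file, group, and variable names are purely illustrative and are not drawn from any of the projects mentioned.

```python
# Minimal sketch of self-describing, array-oriented storage with HDF5 and
# NetCDF. Assumes the h5py and netCDF4 Python bindings are installed; all
# file, group, and variable names here are illustrative.
import numpy as np
import h5py
from netCDF4 import Dataset

temperature = np.random.rand(64, 64)              # stand-in 2-D field

# HDF5: hierarchical groups and datasets, with attributes as metadata.
with h5py.File("simulation.h5", "w") as f:
    step = f.create_group("timestep_0000")
    dset = step.create_dataset("temperature", data=temperature)
    dset.attrs["units"] = "K"                     # metadata travels with the data

# NetCDF: named dimensions and variables, also self-describing.
with Dataset("simulation.nc", "w") as nc:
    nc.createDimension("x", 64)
    nc.createDimension("y", 64)
    var = nc.createVariable("temperature", "f8", ("x", "y"))
    var.units = "K"
    var[:, :] = temperature
```

The caveat Ross raises still applies: abstractions like these, while a step beyond flat files, may not scale to exascale levels on their own.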

There are many other challenges in store for researchers working on exascale storage, such as eliminating bottlenecks, providing consistent service, improving rebuild rates, and guaranteeing data integrity. Ross explains that these efforts are driven by the two fundamental roles of storage in exascale systems. Role number one is “Defensive I/O,” in which storage systems are used to tolerate failures (i.e., checkpointing). The essential requirement here is a high (perceived) I/O rate that allows the application to quickly return to computation. The second role is “Analysis I/O,” where the storage system captures computational results for further study. This often involves reducing the application’s raw data output to speed up analysis. The I/O system must also support timely interactive analysis, which is not an easy task when petabytes of data are coming down the pipe.
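
For readers unfamiliar with the pattern, the toy sketch below shows the basic shape of defensive I/O: periodically persist application state so a failed run can resume from the last checkpoint instead of starting over. The file names, state layout, and checkpoint interval are illustrative only and do not reflect any production checkpointing system.

```python
# Toy sketch of defensive I/O: periodically persist application state so a
# failed run can restart from the last checkpoint. File names, state layout,
# and the checkpoint interval are illustrative only.
import os
import numpy as np

CHECKPOINT = "state.npz"
TMP = "state_tmp.npz"
INTERVAL = 100                                   # iterations between checkpoints

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        saved = np.load(CHECKPOINT)
        return int(saved["step"]), saved["field"]
    return 0, np.zeros(1_000_000)

def save_checkpoint(step, field):
    """Write to a temporary file, then rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    np.savez(TMP, step=step, field=field)
    os.replace(TMP, CHECKPOINT)

step, field = load_checkpoint()
while step < 10_000:
    field = field + 0.001                        # stand-in for real computation
    step += 1
    if step % INTERVAL == 0:
        save_checkpoint(step, field)             # the "defensive" write
```

At exascale, the worry Fellinger and Ross describe is that writes like these, multiplied across millions of processes, can consume the machine unless the perceived I/O rate is very high.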

Seeking solutions

The goal of achieving a 1,000-fold increase in computational performance in the very near future is understandably causing some consternation in the HPC community. But Ross likes to quote Niccolò Machiavelli: “Never waste the opportunities offered by a good crisis.”

He advocates looking at potential exascale storage solutions as a unified memory hierarchy, almost like a single fabric. Data is moved and cached in very fast, low-capacity resources, such as flash storage or more exotic solid-state devices like memristors or carbon nanotubes. Then, in a more leisurely fashion, the data is asynchronously staged to slower but more capacious devices such as disks and tape. Resources could be managed either by the storage system itself or by slightly higher software layers that handle resources directly on the compute node in order to reduce checkpoint overhead and latency. Merging analysis and storage resources has the potential to reduce costs and improve post-processing rates. The illustration below shows Ross’s concept of an alternative storage model.

Alternative storage model as proposed by Ross.
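
The sketch below is one schematic way to read that idea, assuming nothing beyond what Ross describes: the application dumps data to a small, fast tier and goes straight back to computing, while a background thread drains completed files to a larger, slower tier. The directory names and the single drainer thread are illustrative choices, not part of Ross’s proposal.

```python
# Schematic sketch of asynchronous staging between storage tiers. The
# application writes to a small, fast tier and resumes computing; a
# background thread later moves the data to a larger, slower tier.
# Directory names and the single drainer thread are illustrative only.
import queue
import shutil
import threading
from pathlib import Path

FAST_TIER = Path("fast_tier")       # stands in for node-local flash/NVM
SLOW_TIER = Path("capacity_tier")   # stands in for disk or tape capacity
FAST_TIER.mkdir(exist_ok=True)
SLOW_TIER.mkdir(exist_ok=True)

pending = queue.Queue()

def drainer():
    """Drain files from the fast tier to the capacity tier in the background."""
    while True:
        name = pending.get()
        if name is None:            # shutdown sentinel
            break
        shutil.move(str(FAST_TIER / name), str(SLOW_TIER / name))
        pending.task_done()

threading.Thread(target=drainer, daemon=True).start()

def dump(name, data):
    """Fast-path write: land bytes on the fast tier, stage out asynchronously."""
    (FAST_TIER / name).write_bytes(data)
    pending.put(name)               # application returns to computation here

# Example: checkpoint-style dumps that do not wait on the slow tier.
for i in range(3):
    dump(f"checkpoint_{i:04d}.bin", b"\0" * 1024)
pending.join()                      # in a real system this drain is continuous
```

The key property is that the expensive transfer happens off the application’s critical path, which is exactly the latency problem the fast, low-capacity tiers are meant to hide.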

At DataDirect, Fellinger and his colleagues are also looking at non-traditional storage solutions. Finding ways to reduce or eradicate bottlenecks related to shortcomings in MPI, getting rid of SCSI and external bus layers, and using hypervisors and service nodes in virtual space to eliminate the concept of a back-end socket on the server are just a few of the tactics they are exploring. They are also attempting to mitigate the agonizing slowness of rotating disk drives by embedding a layer of solid-state technology within the system for super-fast checkpointing and highly efficient DMA transfers to non-volatile memory.

“The concept of efficiency is key,” he says. “We have to gnaw at each aspect of the system, study the bottlenecks, and then ask, at the end of the day, how much power and how many machine cycles it takes to move the data produced by the supercomputer out to the disk drives. We don’t want to have to build that nuclear reactor next to our exascale data centers.”

Evolution or revolution?

Although he is optimistic about meeting exascale storage goals by 2018, Hick cautions that the number of companies in the HPC storage business has been drastically reduced over the past decade or so, and that the survivors have a road map that extrapolates from what they are offering now to future systems that build on current solutions. Budget is a factor: it costs a minimum of $30-50 million to bring a new system to production. Under these circumstances, evolution appears to be a favored approach for the storage vendors.

DataDirect’s Fellinger is not only looking for ways to evolve existing technology; he is also open to exploring new and potentially disruptive technologies such as phase-change memory and molecular memory devices built with semiconducting single-walled carbon nanotubes.

As for Ross, he is of the opinion that “Over the next five years we are likely to see competition between research groups that are taking a revolutionary approach to building and prototyping exascale storage systems, while at the same time there will be continued efforts, especially from the commercial side, to push for more evolutionary solutions. Personally I’m more interested in the revolutionary approaches. By 2015, I think we’ll be in for some interesting times as those non-traditional designs mature and we are confronted with the possibility of real change. It’s quite likely that both the HPC community and the vendors will be providing us with highly revolutionary solutions to the challenge of exascale storage.”