A new paper outlining NERSC’s Burst Buffer Early User Program and the center’s pioneering efforts in recent months to test drive the technology using real science applications on Cori Phase 1 has won the Best Paper award at this year’s Cray User Group (CUG) meeting.
Based on the Cray DataWarp I/O accelerator, the burst buffer on Cori is designed to move data in and out (I/O) of the processor cores more quickly, which improves the overall performance of the system. This helps researchers make more effective use of the resource as they address key scientific challenges.
“The burst buffer is providing a great new capability for data-intensive computing at NERSC,” said Sudip Dosanjh, director of NERSC. “The NERSC-Cray partnership has led to a working product that is having real-world impact on the science done by NERSC users. I am very proud of the NERSC/Cray team that has been collaborating with users to make it a success.”
The Best Paper award was presented today, May 11, during the CUG meeting in London. Accepting the award was Wahid Bhimji, a data architect at NERSC who co-leads the Early User Program and is one of the CUG paper’s authors.
“Burst buffers have the potential to transform science on supercomputers: removing I/O bottlenecks, enabling new workflows and bringing together data analysis and simulations,” Bhimji said. “However, this was a brand new technology, offering a new way of working and bringing challenges from software to scheduling. So it is really exciting to finally be able to demonstrate its use with these important science projects. Doing so has required considerable effort from NERSC staff, across different groups, as well as the science project teams, so it’s great to have that recognized by the CUG award.”
In August 2015, NERSC put out a Burst Buffer Early User Program call for proposals, asking NERSC’s nearly 6,000 users to describe use cases and workflows that could benefit from accelerated I/O. NERSC received over 30 responses from the user community and ultimately chose to support 13 application teams and to give an additional 16 teams early access to the burst buffer hardware without dedicated support from a NERSC staff member.
Initially, burst buffers were conceived primarily for use in checkpoint/restart situations; however, with NERSC’s broad user base and workload, the use cases turned out to be much more diverse, ranging from coupling together complex workflows and staging intermediate files to database applications requiring a large number of I/O operations.
“People have been talking about burst buffers for five years, but no one had any hands-on experience with it until this past year. This is the first time it’s been put to the test,” said Deborah Bard, a data architect at NERSC, co-lead on the Early User Program and co-author of the CUG paper. “To the best of our knowledge, this is the first time a burst buffer has been stressed at scale by diverse, real user workloads.”
The DataWarp architecture is fairly complex with a lot of “moving pieces,” added co-author David Paul, NERSC computer systems engineer.
“But Cray and SchedMD (SLURM) have done a good job implementing and integrating DataWarp so that most of this complexity is hidden from users,” Paul said. “Failure analysis, however, has been difficult from the aspect of system administrators because of the myriad of pieces and the uniqueness of the technology. But while the technology is young, it has been proven to work and will continue to improve rapidly as new functionality is provided and knowledge gained.”
Five Use Cases Highlighted
The CUG paper describes the experiences of five use cases in terms of performance measurements and lessons learned during their first few months of working with the burst buffer on Cori. The use cases in the paper represent a broad range of science and applications:
- Nyx/BoxLib: cosmology simulation code
- Chombo-Crunch + VisIt: simulation and visualization of carbon sequestration processes
- VPIC-IO: simulation and analysis of plasma physics simulations
- TomoPy and SPOT: real-time image reconstruction of Advanced Light Source (ALS) and Advanced Photon Source (APS) data
- ATLAS/Yoda: simulation and data analysis for the LHC ATLAS detector
“Partnerships between Cray and our customers have been central to Cray’s success,” said Barry Bolding, Cray’s chief strategy officer. “In partnership with NERSC, we identified an opportunity to accelerate I/O and implemented it with DataWarp. It’s great to see this system in use and accelerating production applications.”
Allowing NERSC users to test drive the burst buffer on Cori has been a real plus, and Cray’s ongoing involvement and support have been invaluable, Bard emphasized.
“Without having this great variety of users we would never have uncovered all the bugs we did, which is great because the software needs testing out like that,” she said. “It is really valuable for road testing the new software and for pushing the performance of the system in ways that hadn’t been anticipated in the original design.”
Additional co-authors on the CUG paper were Doga Gursoy of Argonne National Laboratory and Melissa Romanus, Andrey Ovsyannikov, Brian Friesen, Matt Bryson, Joaquin Correa, Glenn Lockwood, Vakho Tsulaia, Suren Byna, Steve Farrell, Chris Daley, Vince Beckner, Brian Van Straalen, Nick Wright, Katie Antypas and Prabhat, all of Berkeley Lab.