In this podcast, Franck Cappello from Argonne describes EZ, an effort to compress and reduce the enormous scientific data sets that some ECP applications are producing.
Current simulations and instruments generate more data than can be properly stored, analyzed, or transferred. There are different approaches to solving the problem. One is lossless compression, a data-reduction technique that loses no information and introduces no noise. The drawback of lossless compression, however, is that floating-point values are very difficult to compress: the best efforts reduce such data by only about a factor of two. In contrast, ECP applications seek a data reduction factor of 10, 30, or even more.
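To make the limits of lossless compression concrete, here is a minimal sketch (not part of the EZ/SZ software) that applies a generic lossless compressor, Python’s zlib, to an illustrative floating-point array. The data and the exact ratio are hypothetical, but the near-random low-order mantissa bits of floating-point values typically keep lossless ratios far below what exascale applications need.

```python
import zlib
import numpy as np

# Hypothetical stand-in for a scientific field: a smooth signal with a
# small amount of noise, stored as 64-bit floating-point values.
x = np.linspace(0.0, 10.0, 1_000_000)
field = np.sin(x) + 1e-3 * np.random.default_rng(0).standard_normal(x.size)

raw = field.tobytes()
packed = zlib.compress(raw, 9)  # generic lossless compression

ratio = len(raw) / len(packed)
print(f"lossless compression ratio: {ratio:.2f}x")  # typically far below the 10x+ target
```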
Among the lossy compressors in the literature, one of the difficulties is that they cannot effectively compress one- or two-dimensional data sets. The EZ project targets that type of data set as well as data sets of higher dimensions. In some cases, SZ, the compressor developed under EZ, provides better compression for very large, higher-dimensional data sets than for smaller, lower-dimensional ones. For exascale applications, a reduction of at least one order of magnitude is required. Loss of information is acceptable, but the user must be able to set limits on the accuracy that is needed. The SZ compressor gives the user that control.
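The key contract of an error-bounded lossy compressor is that every reconstructed value stays within a user-specified error bound of the original. The toy sketch below illustrates that contract with simple uniform quantization; it is not SZ’s actual algorithm, which combines data prediction, error-bounded quantization, and lossless encoding of the resulting codes. The signal and the chosen bound are hypothetical.

```python
import numpy as np

def quantize(data: np.ndarray, abs_err: float) -> np.ndarray:
    """Map each value to an integer bin of width 2*abs_err, so the
    reconstruction error can never exceed abs_err."""
    return np.round(data / (2.0 * abs_err)).astype(np.int64)

def dequantize(codes: np.ndarray, abs_err: float) -> np.ndarray:
    """Reconstruct approximate values from the integer bin indices."""
    return codes * (2.0 * abs_err)

rng = np.random.default_rng(1)
field = np.cumsum(rng.standard_normal(1_000_000))  # hypothetical correlated 1-D signal

bound = 1e-2                    # user-chosen absolute error bound
codes = quantize(field, bound)  # integer codes, suitable for later lossless encoding
recon = dequantize(codes, bound)

max_err = np.max(np.abs(recon - field))
print(f"max reconstruction error: {max_err:.2e}  (requested bound: {bound:.2e})")
```

The point of the sketch is that the user-chosen bound is a hard guarantee on every value, not a statistical average; a production compressor would additionally predict values from their neighbors and entropy-code the quantized residuals to reach the large reduction factors discussed above.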
The EZ project aims to provide a production-quality lossy compressor for scientific data sets, while VeloC focuses on supplying an optimized checkpoint/restart library for applications and workflows. In addition to his work on EZ and VeloC, Cappello is the data reduction lead for ECP’s CODAR Co-Design Center.
Video Chat Notes:
- The motivation for the EZ project [1:24]
- Strict accuracy control is needed for lossy compression [3:22]
- Compression for more than images [4:20]
- Saving on storage footprint or bandwidth? [5:07]
- Four use cases in addition to visualization: (1) reduction of the footprint on the storage system [5:20], (2) the NYX application and reduction of I/O time [6:10], (3) the NWChem application and lossy checkpointing [6:58], and (4) acceleration of the GAMESS application with lossy compression [8:24]
- A multi-algorithm compressor [10:25]
- Automatic configuration by analyzing the data set while compressing [11:30]
- Deep learning is too slow to be integrated into fast lossy compressors [12:39]
- Different uses, different needs [13:46]
- Co-design, working with applications developers [14:31]
- How the work of the EZ project is benefiting ECP [14:56]
- Metrics for lossy compression quality and tools for assessing errors [16:09]
- Why compressing floating-point data multiple times would be undesirable [19:49]
- The results have been encouraging [21:00]
- Testing on Theta and other systems [21:31]
Franck Cappello is a senior computer scientist at Argonne National Laboratory and an adjunct research professor at the University of Illinois at Urbana-Champaign. He is also the principal investigator for the Exascale Computing Project (ECP) efforts EZ and VeloC.