In this podcast, the Radio Free HPC team discuss Henry Newman’s recent editorial calling for a self-descriptive data format that will stand the test of time. Henry contends that we seem headed for massive data loss unless we act.
In 20 years, much less thousands of years, how is anyone going to figure out what data is stored in each of these file formats? Of course, some of them are open source, but many are not. And even for open source, who is going to save the formats and information for a decade or more? I cannot even open some MS Office documents from the early 2000s, and that is less than two decades ago. The same can be said for many other data formats. There are self-describing data formats such as HDF (Hierarchical Data Format), which is about 30 years old, but outside of the HPC community it is not widely used. There are other self-describing technologies in other communities, and perhaps, like HDF, they could be used for virtually any data type. However, everyone wants what they already have, not something new or different, and NIH (not invented here) is what usually happens in our industry.
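"Self-describing" here means the file carries its own structure, field definitions, and metadata, so a future reader does not need an external specification to interpret it. As a rough stdlib-only sketch of that principle (illustrative only, not the HDF format itself, and using hypothetical field names and data):

```python
import json

# A minimal sketch of the self-describing principle: the payload
# bundles its own schema, units, and description alongside the data,
# so the file alone is enough to interpret it decades later.
record = {
    "format_version": "1.0",
    "description": "Daily rooftop temperature readings (hypothetical data)",
    "fields": {
        "date": {"type": "string", "encoding": "ISO 8601"},
        "temperature": {"type": "number", "units": "degrees Celsius"},
    },
    "data": [
        {"date": "2016-09-26", "temperature": 21.5},
        {"date": "2016-09-27", "temperature": 22.1},
    ],
}

serialized = json.dumps(record, indent=2)

# A reader who knows nothing but JSON can recover the field
# definitions and units directly from the file itself.
recovered = json.loads(serialized)
print(recovered["fields"]["temperature"]["units"])  # degrees Celsius
```

HDF takes the same idea much further, storing typed, hierarchical datasets with attached attributes in a binary container, which is why Henry holds it up as a model for long-lived data.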
Already we are seeing data formats that rely on antiquated hardware. Rich notes that data translation sites like Zamzar can help, and Shahin notes that the Living Computer Museum in Seattle has a mission to keep legacy computer systems running and available for people to see in action.
Rich points out that this is not just a problem for future scientific data. A recent article in the Economist describes how the number of genomics papers packaged with error-ridden spreadsheets is increasing by 15% a year, far above the 4% annual growth rate in the number of genomics papers published.
To wrap things up in our Catch of the Week, Rich points to a talk by Larry Smarr on 50 Years of Supercomputing. And Henry can’t help but ring the security klaxon now that Yahoo has disclosed a breach of half a billion user accounts.