Improving access to archived data sets

John’s comment about “write-once/read-never” data on the Astoria Data Services post reminded me of some work being done at the University of Maryland’s HPSL lab to improve access to scientific data archives:

Active Data Repository (ADR) is a kind of database system for large multidimensional datasets. It uses indexes, together with query planning and optimization, to let you efficiently access spatial and temporal sub-ranges of your data.
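To make the sub-range idea concrete, here is a minimal sketch of a chunk index over a three-dimensional (x, y, t) dataset. It is not ADR's actual API: the names `Chunk`, `build_index` and `query` are hypothetical, and the linear scan stands in for whatever spatial index a real system would use (an R-tree, for example). The point is only that a range query can be answered by touching the chunks whose bounding boxes overlap the query box, rather than the whole dataset.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Chunk:
    """Hypothetical index entry: one chunk's bounding box in (x, y, t)."""
    chunk_id: int
    lo: tuple  # inclusive lower corner
    hi: tuple  # exclusive upper corner


def build_index(extent, chunk_shape):
    """Split a dataset of size `extent` into a regular grid of chunks."""
    axes = [range(0, e, c) for e, c in zip(extent, chunk_shape)]
    index = []
    for chunk_id, lo in enumerate(product(*axes)):
        # Clip the last chunk along each axis to the dataset extent.
        hi = tuple(min(l + c, e) for l, c, e in zip(lo, chunk_shape, extent))
        index.append(Chunk(chunk_id, lo, hi))
    return index


def overlaps(chunk, q_lo, q_hi):
    """True if the chunk's box intersects the half-open query box."""
    return all(
        c_lo < q_h and q_l < c_hi
        for c_lo, c_hi, q_l, q_h in zip(chunk.lo, chunk.hi, q_lo, q_hi)
    )


def query(index, q_lo, q_hi):
    """Return the ids of the chunks that must be read for this sub-range.

    A linear scan for brevity; an assumption of this sketch, not a claim
    about how ADR does it.
    """
    return [c.chunk_id for c in index if overlaps(c, q_lo, q_hi)]


# A 100 x 100 grid with 10 time steps, stored as 50 x 50 x 5 chunks.
index = build_index(extent=(100, 100, 10), chunk_shape=(50, 50, 5))
needed = query(index, q_lo=(40, 10, 0), q_hi=(60, 40, 3))
```

With these made-up sizes the dataset has eight chunks, and the query above needs only two of them.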

DataCutter extends the ADR concept to shared, distributed systems where datasets might be spread across different locations and data-processing resources may also be distributed.

Existing self-describing file formats already have some features of database systems. What’s generally missing is the ability to grab arbitrary subsets of data from an archived file without suffering the latency of transferring the entire file from tape to a compute server. Having that would help ease the pain of actually doing something with all this data we’re piling up.
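As a rough illustration of what "grab a subset without moving the whole file" means at the byte level, the sketch below reads a rectangular window out of a flat, row-major grid of float64 values by seeking to each row segment. Everything here is an assumption made for the example: the file layout, the helper names `write_grid` and `read_window`, and the use of a local disk file in place of a tape archive. Real self-describing formats add headers and metadata on top of this, and tape adds positioning costs this sketch ignores.

```python
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian float64 per grid cell


def write_grid(path, rows, cols):
    """Write a rows x cols grid where cell (r, c) holds r * cols + c."""
    with open(path, "wb") as f:
        for r in range(rows):
            for c in range(cols):
                f.write(ITEM.pack(r * cols + c))


def read_window(path, cols, row_lo, row_hi, col_lo, col_hi):
    """Read rows [row_lo, row_hi) and columns [col_lo, col_hi) only.

    Each row segment is fetched with one seek and one read, so the bytes
    transferred scale with the window size, not the file size.
    """
    row_fmt = struct.Struct(f"<{col_hi - col_lo}d")
    window = []
    with open(path, "rb") as f:
        for r in range(row_lo, row_hi):
            f.seek((r * cols + col_lo) * ITEM.size)
            window.append(list(row_fmt.unpack(f.read(row_fmt.size))))
    return window


def demo():
    """Build a small 4 x 5 grid in a temp file and read a 2 x 2 window."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "grid.bin")
        write_grid(path, rows=4, cols=5)
        return read_window(path, cols=5, row_lo=1, row_hi=3, col_lo=2, col_hi=4)
```

Calling `demo()` reads 32 bytes of a 160-byte file. The same arithmetic is what a byte-range request against a remote archive would need, assuming the archive can serve partial reads at all.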