Inside HPC & AI News | High-Performance Computing & Artificial Intelligence
At the Convergence of HPC, AI at Scale, Quantum

Keeping Up with the Growth of Scientific Data

April 9, 2015 by Doug Black

In this special guest feature from Scientific Computing World, Bob Murphy from General Atomics writes that metadata is key to mastering the volumes of data in science and engineering.

Bob Murphy, General Atomics

Bob Murphy, General Atomics

It’s no surprise to readers of Scientific Computing World that scientific data is increasing exponentially. And ever-advancing storage technology is making it easier and cheaper than ever to store all this data (vendors will soon be shipping 840TB in a single 4U enclosure). So what’s missing? For a start: how do you keep track of all that data? How do you find what you are looking for in these multi-petabyte ‘haystacks’? How do you share selected data with colleagues for collaborative research, and then make it available to support the mandate that published results be reproducible? How do you ensure the consistency and trustworthiness of scientific data, along with selective access, provenance, curation and future availability? How do you find data that was created years or decades ago but is needed now? And how do you identify and remove data that is no longer needed, to avoid accumulating useless ‘data junkyards’?

Metadata is the key

The solution has been around for decades: it’s metadata. Metadata, or data about data, lets scientists find the valuable data they are looking for. Metadata especially helps find value in data that’s been created by others, no matter when or where. Without rich metadata, scientists increasingly risk spending their time just looking for data, or worse, losing it – instead of exploiting that data for analysis and discovery.
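To make the idea concrete, a ‘rich metadata’ record attaches enough descriptive context to a dataset that a scientist who did not create it can still find, trust, and reuse it years later. The sketch below is purely illustrative — the field names and values are hypothetical, not drawn from any particular system:

```python
# A hypothetical rich-metadata record for one dataset. Every field name
# here is illustrative; real systems define their own attribute schemas.
record = {
    "logical_name": "/experiments/plasma/shot_1138/diagnostics.h5",
    "created":      "2013-06-02T14:07:00Z",
    "author":       "j.doe",
    "instrument":   "Thomson scattering diagnostic",
    "project":      "plasma-confinement",
    "keywords":     ["electron temperature", "shot 1138"],
    "provenance":   "raw detector output, calibration v2 applied",
    "size_bytes":   48_000_000_000,
}

# With such records, finding data becomes a query over attributes
# rather than a crawl through directory trees.
def matches(rec, **criteria):
    """Return True if the record satisfies every attribute criterion."""
    return all(rec.get(k) == v for k, v in criteria.items())

print(matches(record, project="plasma-confinement"))  # True
print(matches(record, author="someone-else"))         # False
```

The point is not the data structure itself but the shift it enables: discovery driven by what the data *is* (project, instrument, provenance) rather than where it happens to be stored.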

Physicists are the high priests of metadata, and astronomers their first disciples

In addition to inventing the World Wide Web to support its amazing work, Big Science physics pioneered the use of metadata to manage the moving, processing, sharing, tracking and storing of massive amounts of data among global collaborators. Physicists have been using metadata to manage really big data for decades, developing their own bespoke metadata and data management tools with each new project. Cern actually developed three separate metadata systems to manage the two storage systems used in its ground-breaking LHC work, which famously captured 1PB of detector data per second in search of the elusive Higgs boson.

So when NASA needed to keep track of all the data coming from the Hubble Space Telescope, it consulted the physicists at the Stanford Linear Accelerator Center (SLAC) BaBar experiment, and applied their metadata-based techniques to astronomy. Data collected from Hubble over the decades is meticulously annotated with rich metadata so future generations of scientists, armed with more powerful tools, can discover things we can’t today. In fact, because of rich metadata, more research papers are being published on decades-old archived Hubble data than on current observations.

General solutions to managing metadata

So what if your organization isn’t part of a multi-billion dollar, multinational Big Science project with the resources to build a custom system for managing metadata? Good news: there are a couple of broadly available and generally applicable metadata-oriented data management systems already used by hundreds of scientific organizations: iRODS and Nirvana. These ‘twin brothers from different mothers’ were both invented by Dr Reagan Moore (a physicist, of course!), formerly with General Atomics and the San Diego Supercomputer Center, and now with the Data Intensive Cyber Environments (DICE) research group at the University of North Carolina. iRODS is the Integrated Rule-Oriented Data System, an open source project developed by DICE. Reagan Moore discussed the system in his article ‘How can we manage exabytes of distributed data?’ on the Scientific Computing World website in March 2014.

Nirvana is a commercial product developed by the General Atomics Energy and Advanced Concepts group in San Diego, which grew out of a joint effort with the San Diego Supercomputer Center’s Storage Resource Broker (SRB).

(‘Taking action on big data’ is a recurrent theme for North Carolina, as Stan Ahalt, director of the Renaissance Computing Institute (RENCI), professor of computer science at UNC-Chapel Hill, and chair of the steering committee for the National Consortium for Data Science (NCDS), discussed in his article published on the Scientific Computing World website at the end of March 2015.)

How they work

These systems have agents that can mount pretty much any number and kind of file or object-based storage system, and then ‘decorate’ their files with rich metadata that is entered into a catalogue sitting on a standard relational database such as Postgres or Oracle. GUI or command-line interfaces are used for querying and accessing the data. Data can then be discovered and accessed through an object’s detailed attributes such as creation date, size, frequency of access, author, keywords, project, study, data source, and more. All this data can reside on very different, incompatible platforms crossing multiple administrative domains, yet be tied together under a single searchable Global Name Space. Several processes run in the background of this federation, moving data from one location to another based on policies or events, to coordinate scientific workflows and data protection much like the systems at Cern. These systems can also generate audit trails, track and ensure data provenance and data reproducibility, and control data access – exactly what’s needed to manage and protect scientific data.
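The catalogue idea described above can be sketched in a few lines: files living on any storage system are registered in a relational database, ‘decorated’ with attribute–value metadata, and then found by attribute queries rather than by path. The table layout, attribute names, and storage URLs below are illustrative assumptions, not the actual iRODS or Nirvana schema:

```python
# Minimal sketch of a metadata catalogue on a relational database.
# Everything here (table name, columns, paths) is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE catalogue (
    logical_path TEXT,   -- location-independent name in the global namespace
    storage_url  TEXT,   -- where the bytes actually live
    attr_name    TEXT,   -- metadata attribute, e.g. 'project'
    attr_value   TEXT)""")

# 'Decorate' files held on two different storage platforms with metadata.
rows = [
    ("/lab/run42/detector.dat", "s3://archive/run42.dat",  "project", "fusion"),
    ("/lab/run42/detector.dat", "s3://archive/run42.dat",  "author",  "rmoore"),
    ("/lab/run07/scan.tif",     "posix:///tape7/scan.tif", "project", "fusion"),
]
db.executemany("INSERT INTO catalogue VALUES (?, ?, ?, ?)", rows)

# Discovery: find everything in the 'fusion' project, regardless of
# which administrative domain or storage system holds the bytes.
hits = db.execute(
    "SELECT DISTINCT logical_path, storage_url FROM catalogue "
    "WHERE attr_name = 'project' AND attr_value = 'fusion'").fetchall()
for path, url in hits:
    print(path, "->", url)
```

Production systems layer much more on top of this core — policy engines, federation across sites, audit trails — but the separation of a searchable catalogue from the physical storage is the common foundation.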

Metadata is the future of scientific data management

Scientific Big Data, and the metadata-based techniques that manage it, are no longer the reserve of Big Science. Increased sensor resolution from more and more sequencers, cameras, microscopes, scanners and instruments of all types are driving a deluge in data across all science. Fortunately, robust tools are readily available for effectively managing all this data. Now it’s up to you to use them!

To learn more, download the white paper, Tackling the Big Data Deluge in Science with Metadata.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
