Keeping Up with the Growth of Scientific Data

April 9, 2015 by Doug Black

In this special guest feature from Scientific Computing World, Bob Murphy from General Atomics writes that metadata is key to mastering the volumes of data in science and engineering.

Bob Murphy, General Atomics

It’s no surprise to readers of Scientific Computing World that scientific data is increasing exponentially. And ever-advancing storage technology is making it easier and cheaper than ever to store all this data (vendors will soon be shipping 840TB in a single 4U enclosure). So what’s missing? How to keep track of all that data? How to find what you are looking for in these multi-petabyte ‘haystacks’? How to share selected data with colleagues for collaborative research, and then make it available to support the mandate that published results be reproducible? How to ensure the consistency and trustworthiness of scientific data, along with selective access, provenance, curation and future availability? How to find data that was created years or decades ago but is needed now? And how to identify and remove data that is no longer needed, to avoid accumulating useless ‘data junkyards’?

Metadata is the key

The solution has been around for decades: it’s metadata. Metadata, or data about data, lets scientists find the valuable data they are looking for. Metadata especially helps find value in data that’s been created by others, no matter when or where. Without rich metadata, scientists increasingly risk spending their time just looking for data, or worse, losing it – instead of exploiting that data for analysis and discovery.
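To make the idea concrete, here is a minimal sketch in Python of the difference metadata makes: instead of remembering where a file lives, you search on what it describes. The paths, attribute names and values are invented for illustration.

    # Minimal sketch of a metadata catalogue: descriptive attributes attached
    # to files, plus a search over those attributes. All paths, attribute
    # names and values are invented for illustration.
    catalogue = {
        "/archive/2009/run_0042.h5":   {"instrument": "cryo-EM",   "project": "ribosome",  "pi": "chen"},
        "/archive/2014/img_8812.fits": {"instrument": "telescope", "project": "sn-search", "pi": "okafor"},
    }

    def find(**attrs):
        """Return every file whose metadata matches all of the given attributes."""
        return [path for path, meta in catalogue.items()
                if all(meta.get(k) == v for k, v in attrs.items())]

    print(find(project="sn-search"))   # -> ['/archive/2014/img_8812.fits']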

Physicists are the high priests of metadata, and astronomers their first disciples

In addition to inventing the World Wide Web to support its amazing work, Big Science physics pioneered the use of metadata to manage the moving, processing, sharing, tracking and storing of massive amounts of data among global collaborators. Physicists have been using metadata to manage really big data for decades, developing their own bespoke metadata and data management tools with each new project. CERN actually developed three separate metadata systems to manage the two storage systems used in its ground-breaking LHC work, which famously captured 1PB of detector data per second in the search for the elusive Higgs boson.

So when NASA needed to keep track of all the data coming from the Hubble Space Telescope, it consulted the physicists at the Stanford Linear Accelerator Center (SLAC) BaBar experiment and applied their metadata-based techniques to astronomy. Data collected from Hubble over the decades is meticulously annotated with rich metadata so that future generations of scientists, armed with more powerful tools, can discover things we can’t today. In fact, because of rich metadata, more research papers are being published on decades-old archived Hubble data than on current observations.

General solutions to managing metadata

So what if your organization isn’t part of a multi-billion dollar, multinational Big Science project with the resources to build a custom system for managing metadata? Good news: there are a couple of broadly available and generally applicable metadata-oriented data management systems already used by hundreds of scientific organizations: iRODS and Nirvana. These ‘twin brothers from different mothers’ were both invented by Dr Reagan Moore (a physicist, of course!), formerly with General Atomics and the San Diego Supercomputer Center, and now with the Data Intensive Cyber Environments (DICE) research group at the University of North Carolina. iRODS is the Integrated Rule-Oriented Data System, an open source project developed by DICE. Reagan Moore discussed the system in his article ‘How can we manage exabytes of distributed data?’ on the Scientific Computing World website in March 2014.

Nirvana is a commercial product developed by the General Atomics Energy and Advanced Concepts group in San Diego, which grew out of a joint effort with the San Diego Supercomputer Center on the Storage Resource Broker (SRB).

(‘Taking action on big data’ is a recurrent theme for North Carolina, as Stan Ahalt, director of the Renaissance Computing Institute (RENCI), professor of computer science at UNC-Chapel Hill, and chair of the steering committee for the National Consortium for Data Science (NCDS), discussed in his article published on the Scientific Computing World website at the end of March 2015.)

How they work

These systems have agents that can mount virtually any number and kind of file- or object-based storage system and then ‘decorate’ their files with rich metadata, which is entered into a catalogue that sits on a standard relational database such as Postgres or Oracle. GUI or command-line interfaces are used for querying and accessing the data. Data can then be discovered and accessed through an object’s detailed attributes such as creation date, size, frequency of access, author, keywords, project, study, data source, and more. All this data can reside on very different, incompatible platforms crossing multiple administrative domains, yet be tied together under a single searchable Global Name Space. Processes running in the background of this federation move data from one location to another, based on policies or events, to coordinate scientific workflows and data protection, much like the systems at CERN. These systems can also generate audit trails, track and ensure data provenance and reproducibility, and control data access – exactly what’s needed to manage and protect scientific data.
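As a sketch of what this looks like in practice with iRODS, the open-source python-irodsclient package can attach attributes to a stored data object and then query on them. The host, zone, credentials and paths below are placeholders for illustration, not details from the article.

    # Hedged sketch using the open-source python-irodsclient package
    # (pip install python-irodsclient). Connection details and paths are
    # placeholders; adapt them to your own iRODS zone.
    from irods.session import iRODSSession
    from irods.models import Collection, DataObject, DataObjectMeta
    from irods.column import Criterion

    with iRODSSession(host='irods.example.org', port=1247,
                      user='alice', password='secret', zone='tempZone') as session:
        # 'Decorate' a stored file (data object) with descriptive metadata.
        obj = session.data_objects.get('/tempZone/home/alice/run042/detector.dat')
        obj.metadata.add('project', 'beamline-7')
        obj.metadata.add('instrument', 'ccd-camera-3')
        obj.metadata.add('acquired', '2015-03-14')

        # Later, discover data by attribute instead of by remembering paths.
        query = (session.query(Collection.name, DataObject.name)
                 .filter(Criterion('=', DataObjectMeta.name, 'project'))
                 .filter(Criterion('=', DataObjectMeta.value, 'beamline-7')))
        for row in query:
            print(row[Collection.name] + '/' + row[DataObject.name])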

Metadata is the future of scientific data management

Scientific Big Data, and the metadata-based techniques that manage it, are no longer the reserve of Big Science. Increasing sensor resolution from more and more sequencers, cameras, microscopes, scanners and instruments of all types is driving a deluge of data across all of science. Fortunately, robust tools are readily available for effectively managing all this data. Now it’s up to you to use them!

To learn more, download the white paper, Tackling the Big Data Deluge in Science with Metadata.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.

