In this video from SC14, Bob Murphy from General Atomics describes how the company’s Nirvana software provides sophisticated metadata management for unstructured data.
“By presenting a single global namespace across any storage device, anywhere in the world, Nirvana allows data to be easily and securely shared among globally distributed teams. Nirvana also automatically moves data to various workflow resources based on policies, so data is always available at the right place, at the right time, and at the right cost, while keeping an audit trail as data is ingested, transformed, accessed, and archived through its complete lifecycle.”
Full Transcript:
insideHPC: Why don’t we start with the beginning because some people might not know. Who is General Atomics and who do you help?
Bob Murphy: As many people know, General Atomics is a huge conglomerate company, making everything from– perhaps our most famous product today is the Predator drone. We also do a significant amount of fusion research, energy research in La Jolla. And that’s the group that I’m a part of, the energy and advanced concepts group of General Atomics.
insideHPC: So what are you guys showing off here at SC14?
Bob Murphy: We have a product called Nirvana. It’s a sophisticated data management product that takes metadata out of existing storage systems and populates it into a relational database so we can do a bunch of analysis and queries on the data that people are storing over time. After that, people add very workflow-specific metadata to better track that data through its lifecycle, analyzing things like how it’s processed, who accesses it, and what software is used to process various versions of the data. That is becoming more and more important as open science in medical and translational medicine requires people to basically hit the record button as all these transforms happen to data. And then the results data has to be verified through all its steps to make sure that the data has gone through the right procedures and that it’s accessed by the right people, et cetera.
insideHPC: I’m curious as to the origins of this. You guys developed this for your own purposes with the fusion research or– where’d it come from?
Bob Murphy: We do use it internally for a variety of reasons: fusion research, the Predator drones, the video signals coming down from those, that data is stored within Nirvana. But where it came from, and a lot of people ask how General Atomics got into this end of the business, is that General Atomics used to run the San Diego Supercomputer Center. The San Diego Supercomputer Center is the origin of this product, SRB, which is well known to a lot of people in the field here.
Reagan Moore was an employee of General Atomics who developed SRB, and that product was distributed throughout research in universities, and General Atomics distributes it on a commercial basis as licensed software. And then, as you know, Reagan left the San Diego Supercomputer Center about five or six years ago and started an open source product very similar to this called iRODS. iRODS and SRB are sort of the “Lustre and GPFS” of metadata management software.
insideHPC: It would seem that the need for this has never been greater, with all this explosion of data, right? Everything is about big data these days. You guys seem to have something that helps people track it. That sounds like a unique solution to me.
Bob Murphy: It is. iRODS does it too, but we do it on the commercial side. Just as you said, the time for this product has come. Because, as you know, the industry has given customers the ability to pack-rat petabytes and petabytes of data now. Even though the industry has created these systems that can scale out, and that can scale to these massive amounts of data, they just become what we call data junkyards. No one knows what’s inside these data silos anymore. And as people build up these data junkyards over time, [investigators?] come in and out, projects come in and out, and they’re afraid to go in and start removing data because they don’t know what’s inside of it anymore. So they go off and buy more storage.
What this product is used for is to go in and scan all the files in the file system. We can scan at 2,000 files per second, and scan up to 500 million files in these systems. And we provide all the information and analysis so the people that run these systems, the administrators and the owners, can go in and see who’s storing what data. What’s clogging up the system? What hasn’t been touched in several years? And they use that to better manage their data: stop buying expensive storage, move data off to slower-tier storage, move things off to tape. It just gives them a better handle on what they are storing now.
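To put the two figures quoted above side by side, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that a single scan runs at a sustained 2,000 files per second over 500 million files; the interview does not say whether scans run in parallel or incrementally, and the function name is hypothetical.

```python
def full_scan_seconds(total_files: int, files_per_second: float) -> float:
    """Time for one full pass at a constant scan rate (an idealized assumption)."""
    return total_files / files_per_second


seconds = full_scan_seconds(500_000_000, 2_000)
hours = seconds / 3600
days = hours / 24
print(f"{seconds:,.0f} s = {hours:.1f} h = {days:.1f} days")
# prints "250,000 s = 69.4 h = 2.9 days"
```

Under those assumptions a full pass takes just under three days.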
insideHPC: That’s my last question for you. To use this, do I have to buy the storage from you – the whole storage system? I already have a storage system.
Bob Murphy: Use your existing storage system. We latch on to your existing storage system, and we scan the files in it. We pull out the metadata and put it into a relational database, Postgres or Oracle. From there we do all the analysis. So that does a couple of things. It leaves your storage system all by itself, so it’s not doing any of these scanning functions. It just keeps running along, and we do all the analysis off-line, outside the machine. And it’s just a relational database, so all the features and functionality that people associate with a relational database, we have. That’s how we are able to pull this off.
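The general approach described here (walk a file system, copy per-file metadata into a relational database, then answer questions with SQL) can be sketched in a few lines. This is a minimal illustration, not Nirvana's implementation: it uses Python's built-in `sqlite3` as a stand-in for Postgres or Oracle, and the table layout and the function names `scan_tree`, `usage_by_owner`, and `stale_files` are assumptions made for the example.

```python
import os
import sqlite3
import time


def scan_tree(root, conn):
    """Walk `root` and store basic per-file metadata in a hypothetical `files` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, owner_uid INTEGER, size_bytes INTEGER, "
        "mtime REAL, atime REAL)"
    )
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # metadata only; file contents are never read
            except OSError:
                continue  # file vanished or is unreadable; skip it
            rows.append((path, st.st_uid, st.st_size, st.st_mtime, st.st_atime))
    conn.executemany("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def usage_by_owner(conn):
    """Who is storing what: file count and total bytes per owner, largest first."""
    return conn.execute(
        "SELECT owner_uid, COUNT(*), SUM(size_bytes) FROM files "
        "GROUP BY owner_uid ORDER BY SUM(size_bytes) DESC"
    ).fetchall()


def stale_files(conn, older_than_days, now=None):
    """Files not modified within `older_than_days`, largest first."""
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    return conn.execute(
        "SELECT path, size_bytes FROM files WHERE mtime < ? "
        "ORDER BY size_bytes DESC",
        (cutoff,),
    ).fetchall()
```

A call such as `scan_tree("/data", sqlite3.connect("catalog.db"))` (both paths are placeholders) fills the catalog, after which `usage_by_owner` and `stale_files` run entirely against the database, so the queries place no load on the storage system itself.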