Curating, Discovering and Disseminating HPC Research Elements Using iRODS

By Dave Fellinger

Many research centers are primarily focused on specific areas of HPC. Today, much of that activity relates to data reduction of sensor and instrumentation data. Too often, the tasks of managing data moving into and out of the HPC center take second priority, and solutions range from “home grown” software to aging tools that track only file attributes such as time, date, extension, and permissions. Unfortunately, these solutions do not scale well in this age of Big Data.

Data librarians now manage far larger data libraries, and policy decisions regarding retention, quotas, file system performance, and similar concerns require the consensus of those tasked with curation. Decisions also have to be made regarding the federation of data sites to promote collaboration. In many instances, the results of these policy decisions must be auditable to satisfy funding organizations as well as government entities.

How does a data librarian cope with these challenges while always ensuring that institutional policies are adhered to? Once assured of a collaborative environment, how does a researcher discover pertinent materials in her field?

Challenges Addressed by iRODS 

The Integrated Rule-Oriented Data System (iRODS) is an open source initiative developed to manage data from raw instrument readings through publication, solving the challenges of curation and addressing secure collaboration. The iRODS Consortium is a community-supported organization that develops and supports the technology.

The four major challenges addressed by iRODS are:

  1. Data virtualization. This allows a diverse set of storage technologies, including local and cloud storage, to be used in one namespace.
  2. Data discovery. Rich user-defined or extracted metadata is stored in a catalog that can be queried to find relevant data.
  3. Workflow automation. Rules can be written to ensure adherence to site policies across the entire data workflow, from automated ingestion through tiering, retention, and publication.
  4. Secure collaboration. Data sets can be shared between individuals or sites while maintaining auditable domain-controlled security.
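
The first three ideas in the list above can be illustrated with a small toy model. This is a minimal sketch in plain Python, not the iRODS API: the `Catalog` and `DataObject` classes, the resource names, the metadata keys, and the 90-day threshold are all hypothetical and chosen only for illustration. It shows one logical namespace spanning several storage resources, a metadata query, and a retention rule that moves old data to an archive tier without changing its logical path.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataObject:
    """A catalog entry: one logical path, wherever the bytes actually live."""
    logical_path: str
    resource: str  # hypothetical storage resource name
    created: date
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy in-memory catalog; a stand-in for illustration only."""

    def __init__(self):
        self.objects = []

    def register(self, obj):
        self.objects.append(obj)

    def query(self, **criteria):
        """Return logical paths of objects matching every metadata criterion."""
        return sorted(
            o.logical_path
            for o in self.objects
            if all(o.metadata.get(k) == v for k, v in criteria.items())
        )

    def apply_retention(self, today, max_age_days, archive_resource):
        """Policy sketch: move objects older than max_age_days to an archive tier."""
        moved = []
        for o in self.objects:
            too_old = (today - o.created).days > max_age_days
            if too_old and o.resource != archive_resource:
                o.resource = archive_resource  # logical path stays the same
                moved.append(o.logical_path)
        return sorted(moved)


catalog = Catalog()
catalog.register(DataObject("/lab/run1/reads.fastq", "local_disk", date(2020, 1, 10),
                            {"species": "maize", "instrument": "sequencer_a"}))
catalog.register(DataObject("/lab/run2/reads.fastq", "cloud_bucket", date(2020, 7, 1),
                            {"species": "maize", "instrument": "sequencer_b"}))
catalog.register(DataObject("/lab/run3/image.tif", "local_disk", date(2020, 7, 20),
                            {"species": "rice", "instrument": "microscope"}))

# Discovery: find data by what it describes, not by where it is stored.
maize = catalog.query(species="maize")

# Automation: a site policy applied uniformly to the whole collection.
archived = catalog.apply_retention(date(2020, 8, 6), 90, "tape_archive")

print(maize)
print(archived)
```

In this sketch, the query result is the same before and after the retention rule runs, because researchers address data by logical path and metadata while the policy decides where the bytes reside.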

iRODS Use Cases 

Large-scale iRODS deployments have enabled multinational collaborations of scientists and researchers. In the US, the iPlant Collaborative was formed in 2008 with funding from the National Science Foundation. Data management was based on iRODS from the start of the project, which initially served the plant science communities, primarily in the US. From its inception, iPlant quickly grew into a mature organization providing powerful resources and offering scientific and technical support services to researchers nationally and internationally. In 2015, iPlant was rebranded as CyVerse to emphasize an expanded mission to serve all life sciences [1]. Today CyVerse serves over 80,000 users with 5,690 participating academic institutions and 2,438 non-academic institutions [2]. A major feature of the collaborative is the Discovery Environment (DE), which allows researchers to quickly find files of interest relating to their life science discipline. The primary site is in Tucson, Arizona, with a mirror at the Texas Advanced Computing Center (TACC) in Austin, Texas. Both data management and workflow control are enabled by the use of iRODS [3].

In Europe, the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of over 50 universities and research institutions in the European Union. The infrastructure is managed under iRODS, and the data covers over 30 scientific disciplines from atmospheric research to physics, hydro-meteorology, genomics, and ecology. As with CyVerse, a major feature of EUDAT is data discovery across the entire geography of the EU. The goal is to provide both data access and re-use for near-term needs as well as data preservation to build a long-term archive [4].

In the Netherlands, SURF has built a research data management (RDM) framework based on iRODS. Data from several universities across the country is stored at its data site [5]. Besides data storage and management, SURF also offers data processing and analysis as well as compute and federated services [6]. All of the data at the site is moved to various platforms and tiers using iRODS. SURF has recently focused on FAIR data principles and practice [7] and is a member of the iRODS Consortium, as are several universities in the Netherlands.

In Sweden, the Swedish National Infrastructure for Computing (SNIC) is a national research infrastructure that makes large-scale high-performance computing resources, storage capacity, and advanced user support available to Swedish researchers. The service is managed under iRODS control [8] and uses the Swedish University Network (SUNET), which links the infrastructure at the KTH Royal Institute of Technology to other universities in Sweden with a 100Gbps link to facilitate data movement [9]. Parallel data migration between large parallel file systems at KTH is facilitated by the iRODS rule engine [10].

In the state of Victoria, Australia, the Department of Agriculture is using iRODS to capture, manage, and analyse data from “smart farms” in order to define new and more efficient farming methods [11].

In the United States, the National Institute of Environmental Health Sciences (NIEHS) has adopted iRODS technology to better identify and organize their research materials through the use of catalogued rich metadata to enable collaborative discovery [12].

These are just a few of the iRODS deployments in both the academic and commercial sectors. The use of iRODS and its discovery capabilities accelerates scientific research, allowing researchers to quickly find relevant materials and build on them. The power of iRODS to manage data based on collection policies cannot be overstated as data sets grow and automation becomes a requirement. Many universities, libraries, museums, and companies worldwide have chosen iRODS as a technology that allows the “future proofing” of data collections independent of the evolution of storage, networking, authentication, and compute technologies. These institutions have realized that their data policy decisions can be maintained by iRODS at any scale, regardless of the inevitable migration of the rest of their infrastructure over time.

About the author

Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. He has over three decades of engineering experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and presenting the technology of the company at conferences with a storage focus worldwide.

In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a range of use cases can be addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a member of the founding board.

He attended Carnegie Mellon University and holds patents in diverse areas of technology.

References

  1. The history of CyVerse is available from: https://www.cyverse.org/about, accessed 5 August 2020
  2. A presentation describing iRODS history and usage at CyVerse is available from: https://irods.org/uploads/2020/Edgin-Skidmore-CyVerse-Data_Store_and_the_Future-slides.pdf, accessed 6 August 2020
  3. A presentation describing iRODS usage at TACC is available from: https://irods.org/uploads/2020/Jordan-TACC-The_Past_Present_and_Future_of_iRODS_at_TACC-slides.pdf, accessed 6 August 2020
  4. Information regarding EUDAT is available from: https://www.eudat.eu/eudat-cdi, accessed 5 August 2020
  5. Information regarding SURF is available from: https://www.surf.nl/en/research-ict, accessed 5 August 2020
  6. A presentation describing SURF and federation is available from: https://irods.org/uploads/2020/Cacciari-SURF-Federated_Identity_Authentication-slides.pdf, accessed 6 August 2020
  7. The Wikipedia entry for FAIR Data Principles is available from: https://en.wikipedia.org/wiki/FAIR_data, accessed 6 August 2020
  8. Information regarding SNIC is available from: https://www.snic.se/, accessed 5 August 2020
  9. Information regarding SUNET is available from: https://www.sunet.se/about-sunet/, accessed 5 August 2020
  10. A presentation describing data management at KTH is available from: https://irods.org/uploads/2020/Korhonen-KTH-Migration_Between_GPFS_Filesystems-slides.pdf, accessed 6 August 2020
  11. A presentation describing iRODS usage at the Victoria Department of Agriculture is available from: https://irods.org/uploads/2020/Murphy-AgVic-SmartFarm_Data_Management-slides.pdf, accessed 6 August 2020
  12. A presentation describing iRODS usage at NIEHS is available from: https://irods.org/uploads/2020/Conway-NIEHS-Applications_of_iRODS-slides.pdf, accessed 6 August 2020