How to Manage Exabytes of Distributed Data?

In this special guest feature from Scientific Computing World, Reagan W. Moore from RENCI argues that policy-based data-management systems are the next stage in the evolution of large-scale data management.

With the exabytes of data that are being generated today, it has become essential to integrate networking technology and data management technology in order to manage the movement and storage of the data. Policy-based data-management systems provide a way to proceed. They represent perhaps the latest stage in the evolution of data-management systems from file-based systems, to information-based systems, and now to knowledge-based systems.

File-based systems focused on the management of bits, and provided a standard I/O interface for reading and writing files. Information-based systems added support for information about the files, including provenance, descriptive, and structural information, all stored as metadata. Knowledge-based systems add support for procedures that either extract or generate information, and enable the processing of data within the storage environment.

For the past sixteen years, the Data Intensive Cyber Environments (DICE) group at the University of North Carolina at Chapel Hill has been developing data-management systems called data grids. This is software that makes it possible to organize distributed data into sharable collections, while enforcing access controls. The original system, the Storage Resource Broker (SRB), focused on ensuring consistency across all operations performed in a distributed environment. Implemented as middleware, the SRB was installed at each location where data would be stored.

Applications of the technology included: the BaBar High Energy Physics project, which moved two petabytes of data between Palo Alto, California and Lyon, France; the US National Optical Astronomy Observatory, which managed the migration of data from telescopes in Cerro Tololo, Chile, to archives in Illinois; and the United Kingdom’s e-Science data grid, which federated storage resources across institutions. The SRB provided a standard I/O interface, while managing metadata about the distributed files. The applications managed petabytes of data, and hundreds of millions of files, in international collaborations.

Despite SRB’s success in managing both data and information, users requested the ability to modify consistency constraints, and implement multiple types of data-management policies. A driving requirement from the UK e-Science data grid, for example, was to create a collection in which files were permanently managed and could never be deleted by anyone. But, at the same time, it was desirable that administrators should be able to replace corrupted files, and users update their own files in their own collections. This implied the need to manage at least three different consistency constraints on data deletion within the same data-management system: no deletion allowed; deletion by administrator; and deletion by file owner.
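The three deletion constraints described above can be illustrated with a minimal Python sketch. The names `DeletionPolicy` and `may_delete` are hypothetical and are not part of the SRB or iRODS; the sketch only shows how a single system might apply a different deletion rule to each collection.

```python
from enum import Enum


class DeletionPolicy(Enum):
    """Hypothetical labels for the three constraints named in the text."""
    NO_DELETION = "no deletion allowed"
    ADMIN_ONLY = "deletion by administrator"
    OWNER_ONLY = "deletion by file owner"


def may_delete(policy, user, owner, is_admin=False):
    """Return True if `user` may delete a file owned by `owner` under `policy`."""
    if policy is DeletionPolicy.NO_DELETION:
        return False                 # nobody may delete, not even an administrator
    if policy is DeletionPolicy.ADMIN_ONLY:
        return is_admin              # only administrators may delete
    if policy is DeletionPolicy.OWNER_ONLY:
        return user == owner         # only the file owner may delete
    raise ValueError(f"unknown policy: {policy!r}")
```

Because the policy is a value passed in rather than logic fixed in the software, one collection could be marked `NO_DELETION` while a user's own collection is marked `OWNER_ONLY`.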

The DICE group developed a policy-based system to extract knowledge about management policies from the software, and apply the knowledge via computer-actionable rules stored in a rule base. Effectively, every software-encoded consistency constraint was replaced by a policy-enforcement-point. Actions by clients were trapped at the policy-enforcement-points. By searching the rule base, an appropriate rule could then be identified, which controlled the execution of a workflow that applied the required management policy. This meant that the knowledge needed to manage the system could be captured in computer-actionable rules. The system was no longer restricted to managing files and static representations of information. Instead, a data-management system could use rules that controlled the behaviour of the system and dynamically change the rules in a rule base. It became possible to use generic infrastructure to implement archives, digital libraries, data grids for sharing data, project collections, and processing pipelines — simply by changing the rules and procedures enforced by the system.
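The mechanism described above (trap an action at a policy-enforcement-point, search the rule base, run the workflow of the matching rule) might be sketched as follows. This is a toy model under stated assumptions, not the iRODS rule engine or its rule language: `RuleBase`, `enforcement_point`, the event name "delete", and the context keys are all invented for illustration.

```python
class RuleBase:
    """Hypothetical rule base: an ordered list of (event, condition, workflow)."""

    def __init__(self):
        self._rules = []

    def add_rule(self, event, condition, workflow):
        self._rules.append((event, condition, workflow))

    def clear(self, event):
        """Drop all rules for an event, so the policy can be changed at run time."""
        self._rules = [rule for rule in self._rules if rule[0] != event]

    def find(self, event, context):
        """Return the workflow of the first rule that matches, or None."""
        for rule_event, condition, workflow in self._rules:
            if rule_event == event and condition(context):
                return workflow
        return None


def enforcement_point(rule_base, event, context):
    """Trap a client action and apply whichever policy the rule base selects."""
    workflow = rule_base.find(event, context)
    if workflow is None:
        return "no matching rule"
    return workflow(context)


# Illustrative rules: files in the "archive" collection can never be deleted.
rules = RuleBase()
rules.add_rule("delete", lambda ctx: ctx["collection"] == "archive",
               lambda ctx: "denied")
rules.add_rule("delete", lambda ctx: True,
               lambda ctx: "deleted " + ctx["path"])
```

In this sketch, changing the behaviour of the system means editing the contents of `rules`; the code of `enforcement_point` stays the same.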

The integrated Rule Oriented Data System (iRODS) was developed over the past seven years, and has replaced the SRB. Within iRODS, policies can be enforced for: preservation (authenticity, integrity, chain of custody, original arrangement, retention, disposition); or for data publication in a digital library (descriptive metadata annotation, arrangement, creation of presentation versions such as image thumbnails); or for sharing in a data grid (access controls, distribution, caching); or for reproducible data-driven research in a processing pipeline (workflow procedures, workflow provenance, workflow re-execution); or for validating assessment criteria (repository trustworthiness, compliance with regulations).

Today, viable data-management systems automate enforcement of management policies within storage controllers; automate administrative tasks such as data migration; automate validation of assessment criteria; capture knowledge (processes) associated with creating derived data products; capture knowledge (communication protocols) needed to interact with remote systems; and automate processing of data within workflow pipelines. The automation of these tasks corresponds to the creation of knowledge procedures that can be applied by a policy-based data-management system.

Through policy-based data-management systems, it will be possible to implement feature-based indexing of data collections. Discovery of data can be driven by the presence of desired features within the data set, instead of by descriptive metadata. This requires the ability to apply a procedure to the data, determine whether the desired feature is present, and build an associated index. Policy-based systems can control the execution of the associated procedures.
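As a rough illustration of feature-based indexing, the sketch below applies a set of detection procedures to every item in a collection and records which items contain each feature. The function `build_feature_index`, the file names, the sample values, and the two detectors are all hypothetical; a real system would run far more elaborate procedures close to the storage.

```python
from collections import defaultdict


def build_feature_index(collection, detectors):
    """Map each feature name to the set of items in which it was detected."""
    index = defaultdict(set)
    for name, data in collection.items():
        for feature, detect in detectors.items():
            if detect(data):              # apply the procedure to the data
                index[feature].add(name)  # record the item in the index
    return dict(index)


# Hypothetical collection: logical file names mapped to numeric samples.
collection = {
    "run1.dat": [0.1, 0.2, 9.5, 0.3],
    "run2.dat": [0.1, 0.2, 0.1, 0.3],
    "run3.dat": [-4.0, 0.2, 12.0],
}

# Hypothetical feature detectors, one procedure per feature.
detectors = {
    "has_peak": lambda samples: max(samples) > 5.0,
    "has_negative": lambda samples: min(samples) < 0.0,
}

index = build_feature_index(collection, detectors)
```

Discovery then becomes a lookup such as `index["has_peak"]`, with no descriptive metadata involved.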

Through policy-based data-management systems, it will be possible to link virtual collections to virtual networks, and to access data by name instead of by network location. A data-management system can be integrated with network routers, using technology such as OpenFlow, and can dynamically define the network path that is used to access a file. If a file is replicated within the logical collection across multiple storage locations, a request for the file can be automatically routed to the closest copy.
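The idea of resolving a logical name to the closest replica can be sketched in a few lines. This example makes no OpenFlow calls and does not model a real network; the catalog contents, the site names, and the cost figures are invented, and "closest" is assumed here to mean the lowest value of some precomputed cost such as latency.

```python
def closest_replica(catalog, logical_name, cost_to_site):
    """Return the storage site holding `logical_name` with the lowest access cost."""
    sites = catalog[logical_name]
    if not sites:
        raise LookupError(f"no replicas registered for {logical_name}")
    return min(sites, key=lambda site: cost_to_site[site])


# Hypothetical replica catalog: one logical name, several physical locations.
catalog = {
    "/zone/survey/image001.fits": ["chile", "illinois", "lyon"],
}

# Hypothetical access costs from the requesting client to each site.
cost_to_site = {"chile": 180, "illinois": 25, "lyon": 110}

best_site = closest_replica(catalog, "/zone/survey/image001.fits", cost_to_site)
```

The client asks for the file by its logical name only; the choice of physical location is made by the policy, which could be swapped for another selection rule without changing the client.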

These types of applications imply that policy-based systems will become pervasive, and migrate into storage controllers (for automated data processing) and into the internet (for intelligent networks). In each case, the knowledge required for processing or transferring data can be captured as procedures that are automatically applied under policy-based control.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
