OrangeFS: A Scalable, Parallel File System Whose Time Has Come

Sponsored Post

Many of the major advances in HPC have been the result of collaboration between academia and the big government labs. This has been the case with PVFS (Parallel Virtual File System) and its latest generation, the scale-out file system known as OrangeFS.

PVFS, developed at Clemson University and Argonne National Laboratory, focused primarily on a few large-scale scientific workloads. Researchers have also used it to explore different aspects of parallel file system design and implementation.

The development team tuned PVFS to work better with smaller I/O workloads and in the process initiated the next generation of the software – OrangeFS. This open source file system moves beyond PVFS’s scope by providing production-quality services for a broad range of data-intensive applications, such as HPC, Big Data, streaming video, genomics and bioinformatics.

Omnibond, founded in 1999 to provide infrastructure software development and support, got its start by licensing software developed by the Clemson IT department. The company’s current offerings include commercial-grade support and services for OrangeFS.

OrangeFS – A Storage System for Today’s HPC Environment

OrangeFS is a user-friendly, parallel file system designed specifically for today’s and tomorrow’s high-performance compute and storage clusters. Its distributed file structure provides outstanding scalability and capacity.

The system boosts I/O performance by storing files as objects across multiple servers and accessing those objects in parallel. This object-based infrastructure is what lets OrangeFS spread a single file’s I/O over many servers at once. Each file and directory consists of two or more objects: one kind containing file data and the other containing file metadata. Depending on their role in the file system, objects may contain both data and metadata. None of this is visible to end users, who see a traditional, logical file view.
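To make the idea of spreading a file across servers concrete, here is a minimal sketch of round-robin striping in Python. It is an illustration under stated assumptions, not OrangeFS code: the function names `locate` and `split_request`, the 64 KiB stripe size, and the four-server layout are all hypothetical, and the actual OrangeFS distribution logic may differ.

```python
def locate(offset, stripe_size, num_servers):
    """Map a logical file offset to (server index, offset inside that server's object).

    Assumes a simple round-robin layout: stripe 0 goes to server 0,
    stripe 1 to server 1, and so on, wrapping around.
    """
    stripe_index = offset // stripe_size
    server = stripe_index % num_servers
    # Each complete pass over all servers adds one stripe to every server's object.
    round_number = stripe_index // num_servers
    local_offset = round_number * stripe_size + offset % stripe_size
    return server, local_offset


def split_request(offset, length, stripe_size, num_servers):
    """Break a byte range into (server, local_offset, size) pieces.

    Pieces that land on different servers could be issued in parallel.
    """
    pieces = []
    end = offset + length
    while offset < end:
        server, local_offset = locate(offset, stripe_size, num_servers)
        # Never read past the end of the current stripe or the request.
        size = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((server, local_offset, size))
        offset += size
    return pieces


if __name__ == "__main__":
    # Hypothetical layout: 64 KiB stripes over 4 servers.
    for piece in split_request(0, 4 * 65536, 65536, 4):
        print(piece)
```

With this layout, a 256 KiB read starting at offset 0 becomes four 64 KiB pieces, one per server, which is why adding servers can raise aggregate throughput for large requests.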

Among the features unique to OrangeFS are:

  • Object-based file data transfer, allowing clients to work on objects without handling underlying storage details such as data blocks
  • Ability to have unified data/metadata servers (they can also be separated if needed; the default is unified)
  • Distribution of metadata across storage servers
  • Distribution of directory entry metadata (in v2.9)
  • Diverse client access methods including POSIX, MPI, Linux VFS, FUSE, Windows, WebDAV, S3, Hadoop and REST interfaces
  • Ability to configure storage parameters by directory or file, including stripe size, number of servers, and immutable file replication
  • Virtualized storage over any Linux file system as underlying local storage on each connected server
  • Replacement of Hadoop DFS using a MapReduce extension and JNI, with no modification of MapReduce code needed
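The distribution of metadata and directory entries listed above can be pictured with a small Python sketch. The hash-modulo placement rule and the names `server_for_entry` and `distribute` are assumptions made purely for illustration; this article does not describe the placement algorithm OrangeFS actually uses.

```python
import hashlib


def server_for_entry(name, num_servers):
    """Pick a server for a directory entry by hashing its name.

    Assumed placement rule (hash modulo server count), for illustration only.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def distribute(names, num_servers):
    """Group directory entry names by the server that would hold them."""
    placement = {server: [] for server in range(num_servers)}
    for name in names:
        placement[server_for_entry(name, num_servers)].append(name)
    return placement


if __name__ == "__main__":
    # Hypothetical directory with 1,000 entries spread over 4 servers.
    entries = [f"sample_{i:04d}.dat" for i in range(1000)]
    for server, held in distribute(entries, 4).items():
        print(server, len(held))
```

The point of the sketch is that a lookup in a very large directory touches only the one server that owns the entry, so no single metadata server has to hold or scan the whole directory.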

OrangeFS at Work

At Clemson, OrangeFS has a wide range of users from such diverse fields as bioinformatics, digital production, astrophysics, humanities, and cloud computing.

Other universities and research labs are migrating to OrangeFS as well. For example, at Johns Hopkins University, OrangeFS helps deliver an order-of-magnitude improvement in performance from the school’s existing HPC cluster, and a two-orders-of-magnitude speedup in its storage array.

Omnibond has also been working with users in various HPC industries, as well as a broad range of science and engineering firms. One large corporate client employs OrangeFS exclusively for data mining. Reference architectures are available for Dell PowerEdge R720 servers with PowerVault MD storage arrays and for the DDN SFA12k platform, both of which are used in a wide variety of high-performance, data-intensive computing environments.

OrangeFS is a key building block for Amazon Web Services Cloud, acting as a distributed file system for high-performance scratch storage. Recently, an AWS Cloud-based bioinformatics application was optimized by rewriting portions of the code in MPI-IO to leverage OrangeFS in AWS.

Recent tests at Clemson achieved a 25% improvement in Apache Hadoop Terasort run times by replacing the Hadoop Distributed File System (HDFS) with OrangeFS.

Benefits of Using OrangeFS

When compared to other file systems, OrangeFS provides users with two key benefits:

  • OrangeFS is one of the best-performing parallel file systems available. Built on the modular PVFS architecture, it incorporates distributed directories, optimized requests, and a wide variety of interfaces and features.
  • OrangeFS is an extremely easy file system to build, install, and run – it has proven to be a highly usable file system.

OrangeFS has been fine-tuned and hardened through several years of development, testing and support by a professional development team. Although the system remains open source, it is now being deployed with commercial support from Omnibond for a wide range of scientific and enterprise applications.

How to Give OrangeFS a Try

Simplicity itself – just go to the OrangeFS website download page to download a tarball, or obtain the latest changes to the system from the OrangeFS repository. Installation instructions are in the documentation section of the website.

Give OrangeFS a try and experience the advantages of a unique, parallel, high-performance file system whose time has truly come.