Introduction to the Lustre File System

Lustre is a leading parallel file system, used consistently at many of the Top500 sites. Its ability to handle billions of files at massive scale with top performance has enabled organizations, from research institutions to enterprise corporations, to deliver a state-of-the-art solution to their clientele. Although there are a number of truly huge Lustre implementations today, the community is still far from reaching the maximum configurations that the Lustre architecture is designed for. The whitepaper Inside the Lustre File System describes the basics of how the Lustre File System operates and covers its newest features.

Some of the main components of the Lustre File System include:

  • MDS (Metadata Server)
  • MDT (Metadata Target)
  • OST (Object Storage Target)
  • OSS (Object Storage Server)
  • LNET
  • Lustre Client
  • LNET Router

Inside Lustre

Lustre is based on Linux and uses kernel-based modules to achieve the expected performance. Lustre separates file metadata from file content and stores them on different systems. Although this is not unique, the manner in which Lustre does this has proven highly efficient and reliable. The basic write flow is:

  • Client needs to write a file
  • Contacts the MDS for authentication
  • Receives authentication and a list of the OSTs available for writing
  • Client interacts with the assigned OSTs directly and writes the file
  • If the network is InfiniBand, the data is transferred using RDMA
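
As a rough illustration of the steps above, the following Python sketch models the split between the metadata path and the data path. It is a toy in-memory model, not Lustre code: the names ToyMDS, ToyOST, and client_write, the tiny 4-byte stripe size, and the rule of assigning the first N OSTs are all assumptions made for the example. Authentication and RDMA are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOST:
    """Hypothetical stand-in for an OST: keeps chunks keyed by (file, offset)."""
    index: int
    chunks: dict = field(default_factory=dict)

    def write(self, name, offset, data):
        self.chunks[(name, offset)] = data


@dataclass
class ToyMDS:
    """Hypothetical stand-in for the MDS: records which OSTs hold each file."""
    ost_count: int
    layouts: dict = field(default_factory=dict)

    def open_for_write(self, name, stripe_count):
        # Assumption for the sketch: hand out the first N OSTs.
        layout = list(range(min(stripe_count, self.ost_count)))
        self.layouts[name] = layout
        return layout


def client_write(mds, osts, name, data, stripe_count=2, stripe_size=4):
    """Ask the MDS for a layout, then send the data directly to the OSTs."""
    layout = mds.open_for_write(name, stripe_count)  # metadata path
    for offset in range(0, len(data), stripe_size):  # data path
        stripe_no = offset // stripe_size
        target = osts[layout[stripe_no % len(layout)]]
        target.write(name, offset, data[offset:offset + stripe_size])
    return layout


if __name__ == "__main__":
    osts = [ToyOST(i) for i in range(3)]
    mds = ToyMDS(ost_count=len(osts))
    print(client_write(mds, osts, "demo.dat", b"abcdefghij"))
    for ost in osts:
        print(ost.index, ost.chunks)
```

The point of the sketch is that the MDS is consulted once for the layout and never touches the file data; every chunk travels straight from the client to an OST.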

In addition, the write is optimized according to the striping settings in effect for the file, whether specified explicitly or inherited from the file system or directory. These settings include:

  • Stripe size – The amount of file data written to one OST object before the write moves on to the next (a file usually consists of a number of stripes). The stripe size is usually set to 1 MB as this corresponds to the default RPC size in Lustre.
  • Stripe count – Determines how many OSTs are to be used for a single file. The default is 1, but it can be set arbitrarily.
  • Stripe index – Where to put the initial object of a file. This is usually left to the discretion of the MDS, which allows it to place files on OSTs with more free capacity than others and so maintain a more balanced system.
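
To make the three settings concrete, here is a minimal Python sketch of round-robin (RAID-0 style) striping arithmetic: given a byte offset in a file, it works out which OST holds that byte and where it sits inside that OST's object. The function name locate and its parameters are hypothetical. The assumption that OSTs are used consecutively from the starting index, wrapping around at total_osts, is a simplification for illustration and not a statement of how the MDS actually allocates objects.

```python
MIB = 1024 * 1024


def locate(offset, stripe_size, stripe_count, start_index=0, total_osts=None):
    """Map a file byte offset to (ost_index, offset_within_object).

    Assumes simple round-robin striping over consecutive OSTs starting
    at start_index (an illustrative simplification).
    """
    if total_osts is None:
        total_osts = start_index + stripe_count
    stripe_no = offset // stripe_size      # which stripe of the file
    slot = stripe_no % stripe_count        # which of the file's objects
    full_rounds = stripe_no // stripe_count  # stripes already in that object
    object_offset = full_rounds * stripe_size + offset % stripe_size
    ost_index = (start_index + slot) % total_osts
    return ost_index, object_offset


if __name__ == "__main__":
    # 1 MiB stripes over 4 OSTs: print where each of the first 6 MiB lands.
    for mib in range(6):
        print(mib, locate(mib * MIB, MIB, 4))
```

With a stripe count of 1, every offset maps to the same OST, which is why raising the stripe count is what lets a single large file draw on the bandwidth of several OSTs at once.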

The read process uses similar mechanisms and data flow.

One of the unique features of the Lustre File System is its ability to abstract the network layer using LNET. A number of network entities and legacy fabrics are supported. As mentioned, Lustre uses RDMA for InfiniBand file transfers, greatly speeding up the read or write process.

Lustre Version 2.x is available and includes a number of new features. Lustre changelogs are now part of the distribution and record events that change the file system namespace or file metadata. Data layout policies allow a file to be distributed across up to 2,000 OSTs. Reporting on directory and file sizes has also been improved and optimized. 4 MB I/O has now been implemented, improving performance significantly. These are just a sampling of the new features in Lustre 2.x.
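
One plausible reason larger I/O helps is that fewer requests are needed to move the same amount of data. The short Python sketch below shows only that arithmetic. It assumes the 4 MB I/O mentioned above refers to the size of a single bulk request, and the function name rpc_count is hypothetical. It says nothing about measured performance.

```python
MIB = 1024 * 1024


def rpc_count(transfer_bytes, rpc_bytes):
    """Number of requests needed to move transfer_bytes in rpc_bytes pieces."""
    return (transfer_bytes + rpc_bytes - 1) // rpc_bytes  # integer ceiling


if __name__ == "__main__":
    one_gib = 1024 * MIB
    print(rpc_count(one_gib, 1 * MIB))  # requests at 1 MiB each
    print(rpc_count(one_gib, 4 * MIB))  # requests at 4 MiB each
```

Under these assumptions, a 1 GiB transfer needs a quarter as many requests at 4 MiB as at 1 MiB.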

Lustre has been a mature and stable file system for a number of years, but its requirements and associated use model are changing. Simplicity of management is now an often-stated requirement in Requests for Proposals (RFPs), and customer requirement discussions increasingly center on the need for a fully functional and scalable graphical user interface for management (in addition to the ubiquitous command-line interface). Another trend in Lustre adoption is that Lustre is no longer used only for its original purpose as a scratch file system for HPC computations. Today, Lustre is increasingly used as a file system for home directories as well as for project space.

The whitepaper “Inside the Lustre File System” is an excellent introduction to the inner workings of Lustre without getting too deep. To better understand how Lustre works and how your organization can benefit from it, download it now.