In this special guest feature, Malcolm Cowe from Intel writes that the Lustre parallel file system has come a long way.
One of the great benefits of adopting the Lustre* parallel file system in any environment—HPC or enterprise—is that it’s an open source project with a deep and broad user community and a strong developer community behind it. That gives Lustre both the flexibility to respond quickly to features the user community requests and the development muscle to make those features innovative and the best they can be. As a result, Lustre has evolved from its HPC infancy to enterprise-class maturity. It has become extremely attractive to enterprises and has been adopted in enterprise HPC solutions across multiple industries, from Oil and Gas to Financial Services to science and technology, among others. According to Earl Joseph, a top HPC industry analyst with IDC,
“Along with IBM’s General Parallel File System (GPFS), Lustre is the most widely used file system. But Lustre is experiencing healthy growth in terms of market share while GPFS remains flat. Lustre is also supported by a large number of OEMs, providing the HPC community with a strong base for growth.”
However, it seems that some in the industry continue to think Lustre has not matured beyond its beginnings and is still a scratch file system. We in the Lustre community see quite a different picture of Lustre.
Lustre—Matured to Become Enterprise-Grade
Yes, historically, Lustre was developed for very high-performance workloads and deployed as a scratch file system for HPC, delivering the highest throughput then available to supercomputing centers for storing and managing data on extremely fast machines. That’s why, over the last decade, Lustre has been adopted—and continues to be adopted—in most of the fastest supercomputers in the world.
But, over the last several years, an enormous amount of development effort has gone into Lustre to address users’ enterprise-related requests. That work is not only keeping Lustre extremely fast (the Spider II storage system at the Oak Ridge Leadership Computing Facility (OLCF) that supports OLCF’s Titan supercomputer delivers 1 TB/s, and Data Oasis, supporting the Comet supercomputer at the San Diego Supercomputer Center (SDSC), supports thousands of users with 300 GB/s of throughput) but also making it an enterprise-class parallel file system that has since been deployed for many mission-critical applications, such as seismic processing and analysis, regional climate and weather modeling, and banking.
These are business domains that do not tolerate downtime, where established Service Level Agreements (SLAs) are closely watched and expected to be met. In Oil and Gas, for example, companies place high expectations on their HPC systems, including the file system. Any downtime carries a significant corporate cost. Therefore, solution providers and IT departments carefully choose the components and designs that go into their HPC solutions in order to meet SLAs continuously. And many have chosen Lustre for both its high performance and its enterprise-grade reliability and data availability features.
A Well-Understood, Efficient, and Cost-Effective High Availability Solution
Business continuity is a key driver for any mission-critical computing solution in an enterprise environment. Data availability ensures computing—and thus business operations—can continue in the presence of hardware failures, such as disk failures. Different file systems use different mechanisms to maintain data availability. The Hadoop Distributed File System (HDFS) and IBM’s General Parallel File System (GPFS) replicate data across multiple disks. Lustre uses a well-understood high availability design pattern, in which metadata servers (MDS) and object storage servers (OSS) are deployed in cooperative HA cluster pairs, with each pair attached to a reliable, scalable storage system. If one of the servers fails, the storage targets on that server are migrated to the surviving server. It’s a very mature design pattern, and a common model in IT centers around the world.
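As a sketch of how such an HA pair is set up, each storage target can be formatted so that either node of the pair is permitted to serve it. The file system name, device path, and network identifiers (NIDs) below are hypothetical placeholders; the real values depend entirely on the site:

```shell
# Hypothetical example: format OST 0 so either node of an HA pair can serve it.
# 10.0.0.1@tcp is the primary OSS, 10.0.0.2@tcp its failover partner,
# and 10.0.0.10@tcp the management server (MGS). Adjust NIDs and devices.
mkfs.lustre --fsname=demo --ost --index=0 \
    --mgsnode=10.0.0.10@tcp \
    --servicenode=10.0.0.1@tcp \
    --servicenode=10.0.0.2@tcp \
    /dev/mapper/ost0

# On failure of the primary, an HA manager (such as Pacemaker) mounts the
# target on the partner node, and clients reconnect automatically:
#   mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0
```

Because both nodes are declared as service nodes at format time, no reformatting or reconfiguration is needed at failover; only the mount moves.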
One of the main benefits—besides availability—of Lustre’s high availability (HA) design pattern is that it does not compromise file system performance. Replication methodologies can continuously degrade performance and increase latency in the file system, because every file or block replication consumes additional bandwidth, reducing the bandwidth available to the application. Synchronous replication is especially known for substantial latency degradation, because the application has to wait until all the targets acknowledge the data has been written before it can continue execution. The high availability method used in Lustre deployments makes more effective use of the available network bandwidth. When all of the servers are online, the maximum bandwidth is available to applications: there is no replication overhead, so no overprovisioning of network and servers is required to make up any shortfall in available throughput. In a failure scenario, applications may experience a pause in I/O until the targets have been migrated, but on Lustre this happens quickly.
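A back-of-envelope model makes the overhead concrete (the 300 GB/s figure is illustrative, not a measurement of any particular system): with N-way synchronous replication, every application byte is written N times, so application-visible write bandwidth is roughly the raw bandwidth divided by N, while the HA failover model leaves the full bandwidth available:

```shell
# Illustrative model, not a measurement: N-way synchronous replication
# writes every byte N times, so effective write bandwidth = raw / N.
raw_gbs=300   # hypothetical aggregate storage bandwidth in GB/s
for n in 1 2 3; do
    awk -v raw="$raw_gbs" -v n="$n" \
        'BEGIN { printf "replicas=%d  effective write bandwidth=%.0f GB/s\n", n, raw/n }'
done
# replicas=1 (the Lustre HA model) leaves all 300 GB/s to applications;
# replicas=3 leaves only 100 GB/s.
```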
Any consideration of costs for a design solution—including its reliability components—needs to look beyond initial hardware acquisition and evaluate Total Cost of Ownership (TCO) over the expected life-cycle of the system. Because it is built on shared storage arrays, Lustre’s HA design pattern allows more efficient use of storage. Combined with Lustre’s overall performance and its impact on business operations versus other architectures, the overall TCO and cost of data reliability over the lifetime of the system may, in fact, be lower than, say, a coarse-grained replication solution, where every terabyte of disk must be backed by one or possibly two additional terabytes for replicas. That is a potentially costly way to maintain data reliability, especially considering how such systems are often built.
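A rough capacity comparison illustrates the point (the numbers are illustrative and ignore hot spares and formatting loss): three-way replication yields about a third of the raw disk as usable space, whereas an 8+2 RAID-6 shared array of the kind commonly used in a Lustre HA building block yields 80 percent:

```shell
# Illustrative usable-capacity comparison per 1 PB of raw disk:
# 3-way replication vs. an 8+2 RAID-6 shared storage array.
awk 'BEGIN {
    raw_pb = 1.0                                # 1 PB raw (hypothetical)
    printf "3-way replication: %.2f PB usable\n", raw_pb / 3
    printf "8+2 RAID-6 array:  %.2f PB usable\n", raw_pb * 8 / (8 + 2)
}'
```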
A shared-nothing solution is predicated on the inherent unreliability of individual components and relies on over-provisioning of resources to provide reliability through redundancy. In such systems, to minimize acquisition costs, component-level reliability is often sacrificed by choosing lower-cost parts with weaker reliability characteristics. Administrators can end up with a large footprint of systems to manage, which can also drive up operational costs, like power and cooling. Certainly, any data availability solution carries extra up-front storage hardware costs, but with Lustre the serviceability of that storage tends to be easier and thus more cost-effective in the long term, leading to a good Return on Investment.
Innovative Replication on the Horizon
Nevertheless, it is recognized that in some deployments replication is going to be a requirement. That is a known request from Lustre’s user community.
Fundamentally, the Lustre community believes that, instead of being restricted in how their data will be stored, end users should be offered flexibility and options for how their data is recorded, in order to best meet the requirements of their application or project. So, developers are diligently working on a solution that is flexible and innovative. It’s one thing to integrate a fixed solution and claim the file system supports replication. It’s another to deliver an innovative solution that can accommodate different user needs for replication. Lustre has maintained a history of innovation. Integrating a replication methodology in Lustre is going to follow that tradition.
Thus, the Lustre project is working first on defining a strategy for file layouts that can be arbitrary and decided upon at run time, instead of requiring a pre-determined layout strategy when the file system is first set up. That will give users the flexibility to choose the best file layout for their workload when they’re ready to run it, and then to extend that layout to replicate data across the file system.
For example, an application that places emphasis on throughput performance above all other considerations—very large-scale streaming I/O workloads—is more likely to benefit from a striped file layout, equivalent to RAID 0 and essentially the storage structure employed in Lustre today. The emphasis is on persisting data to storage as quickly as possible. The data may be further refined and processed as part of a pipeline, but raw results may have limited utility and will be purged at some future date. This is the classic scratch storage paradigm. The file system still needs to be reliable, but the data does not have any characteristics that might benefit from long-term persistence.
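On current Lustre systems, that striped layout is selected per file or per directory with the lfs setstripe command. A sketch, using a hypothetical mount point and directory:

```shell
# Hypothetical mount point; requires a mounted Lustre client file system.
# Stripe new files in this directory across all available OSTs (-c -1)
# with a 4 MiB stripe size, favoring large streaming I/O:
lfs setstripe -c -1 -S 4M /mnt/lustre/scratch/results

# Show the layout that new files created in the directory will inherit:
lfs getstripe /mnt/lustre/scratch/results
```

The run-time layout work described above would extend this same mechanism, letting users choose replicated rather than purely striped layouts where the data warrants it.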
However, an application that processes data that is vital to business operations, or, even more critically, upon which human life depends, will set different requirements on the persistence of the data, and will generally impose requirements on increased reliability through replication of the data. Examples range from numerical weather prediction and climate data processing applications to medical imaging and genome sequencing applications in the healthcare and life sciences industries. In these cases, loss of data can have immediate negative consequences, placing emphasis on data availability in the file system. The cost of restoring lost data and re-running applications is not about dollars, but about risk to human life. For example, if a weather forecast is late, there is an immediate effect on transport infrastructure, such as shipping; if the data sets required for processing disaster event modeling, such as tsunami early warning, are lost, then insight into where to position emergency services and focus evacuation efforts is compromised.
Critical data sets will also typically have longevity requirements, meaning that the information must be stored reliably for long periods of time. Ensuring that both availability and longevity requirements are met for permanent production data requires redundant data replication across multiple storage systems.
So, the Lustre developer community has not forgotten about replication, and Lustre is no longer just a scratch file system. Replication is in the works.
Developers have focused on adding reliability and availability features that enterprise users have wanted. And these features seem to have made Lustre very attractive to the businesses where reliability and availability are among the highest expectations from IT. Lustre has matured significantly over the years. It’s not your grandmother’s (or grandfather’s) file system.
The Proof is in the Market
Lustre’s maturity seems to be having an impact on Lustre momentum and adoption. As Earl Joseph pointed out, Lustre is gaining ground while the alternative remains flat. This sentiment is echoed in recent Intel statements. Bret Costelow, Director of Global Sales for Lustre Solutions at Intel, has said, “Intel has experienced considerable growth in sales, measured by support contracts for the Intel editions of Lustre, year-over-year from 2013 to 2014. This is representative of those Lustre adopters who have moved away from unsupported, roll-your-own versions, and the competition to Intel’s editions. And, we continue to see growth momentum in 2015.”
Enterprise HPC is investing in Lustre, as shown by deployments around the world, regardless of the distribution being used. Lustre’s increasing momentum makes it obvious to those who know and care that Lustre is a mature, effective solution: a fast, reliable, and scalable parallel file system for HPC.
Learn more about Intel Solutions for Lustre Software.
Malcolm Cowe is a product manager for Lustre solutions in Intel’s High Performance Data Division. He is based in Melbourne, Australia.