Live Report from LUG 2016 Day 3

In this special guest feature, Ken Strandberg offers this live report from Day 3 of the Lustre User Group meeting in Portland.

Moving and copying data was the topic of the first two presentations today. Frederick Lefebvre and Simon Guilbault from Calcul Quebec reported on their work on CopyTool for HSM on Lustre. Their users asked for cheaper tiers of storage while still being able to run parallel jobs, so their tool supports local tiers of low-cost storage as well as Amazon S3. CopyTool enables automatic, policy-driven data transfer from Lustre parallel file systems to cheaper object storage. More information can be found here.
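For readers unfamiliar with the Lustre HSM workflow a copytool plugs into, the sketch below shows only the policy side of the loop: a scan that flags cold files for archiving, after which a registered copytool (such as the one described above) performs the actual copy to the object store. The 90-day age policy, the scratch path, and the simple tree walk are illustrative assumptions, not Calcul Quebec's implementation.

```python
# Minimal sketch of a policy-driven archive scan, assuming a Lustre client
# with the standard `lfs hsm_*` commands and a copytool registered to move
# data to S3. Paths and the age cutoff are hypothetical.
import os
import subprocess
import time

AGE_CUTOFF = 90 * 86400  # illustrative policy: untouched for ~90 days

def archive_pass(root):
    now = time.time()
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime < AGE_CUTOFF:
                continue
            # Ask the HSM coordinator to archive the file; the registered
            # copytool then copies it to the object store.
            subprocess.run(["lfs", "hsm_archive", path], check=True)
            # Once `lfs hsm_state` reports the archive is complete, the
            # Lustre blocks can be freed with `lfs hsm_release`.

archive_pass("/lustre/scratch")
```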

Rick Wagner from the San Diego Supercomputer Center presented progress on his team’s replication tool, which copies large blocks of data between Lustre and their object-based disaster-recovery durable storage system. Because rsync is not suited to moving massive amounts of data, SDSC created recursive worker services that run in parallel, with each worker handling a directory or group of files. The tool uses available Lustre clients, a RabbitMQ server, Celery scripts, and bash scripts. They have achieved peak performance of 15 GB/s across 24 nodes while replicating about 1 PB of data. The current code base is available at github.com/sdsc/lustre-data-mover.
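As a rough illustration of that worker pattern, the sketch below uses Celery with a RabbitMQ broker to fan a tree copy out across workers, one directory per task. The broker URL, paths, and the use of rsync for individual files are assumptions made for illustration; SDSC's actual code lives in the repository above.

```python
# Illustrative sketch only: Celery tasks fanning out one directory per worker,
# in the spirit of the SDSC data mover. Broker URL and paths are hypothetical.
import os
import subprocess
from celery import Celery

app = Celery("mover", broker="amqp://guest@rabbitmq-host//")

@app.task
def copy_dir(src, dst):
    """Copy one directory's files and queue each subdirectory as its own task."""
    os.makedirs(dst, exist_ok=True)
    for entry in os.scandir(src):
        if entry.is_dir(follow_symlinks=False):
            # Recursion happens through the queue, so workers running on many
            # Lustre clients can process different directories in parallel.
            copy_dir.delay(entry.path, os.path.join(dst, entry.name))
        else:
            subprocess.run(["rsync", "-a", entry.path, dst + "/"], check=True)

# From any Lustre client: copy_dir.delay("/lustre/project", "/durable/project")
```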

Li Xi showed the work he and Shuichi Ihara from DDN in Japan have been doing to create a new quota type that supplements the existing user and group quotas. Storage administrators need a better tool to account for the different scenarios they encounter: user and group configurations do not change as frequently as projects do, so storage managers need a better way to plan capacity in their systems as projects evolve. They have been working on implementations for different file systems, and Lustre support is coming soon, according to Xi. Their current tests show minimal performance impact, and his team plans to push the solution into the Lustre 2.9 release.
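Project quotas had not yet landed in Lustre at the time of the talk, so any example is necessarily forward-looking. The sketch below wraps the `lfs project` and `lfs setquota -p` commands that the feature eventually exposed, tagging a directory tree with a project ID and capping the space it may consume; the project ID, paths, and limit are purely illustrative.

```python
# Sketch of applying a project quota once the feature is available, assuming
# the `lfs project` / `lfs setquota -p` interface; IDs, paths, and limits
# are hypothetical.
import subprocess

def set_project_quota(projid, directory, mountpoint, block_limit):
    # Tag the directory tree with a project ID, inherited by new files.
    subprocess.run(["lfs", "project", "-s", "-p", str(projid), directory],
                   check=True)
    # Cap the total space the project may consume, independent of user/group.
    subprocess.run(["lfs", "setquota", "-p", str(projid),
                    "-B", block_limit, mountpoint], check=True)

set_project_quota(1001, "/lustre/projects/genomics", "/lustre", "50T")
```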

Yan Li and his team from the University of California, Santa Cruz, an Intel® Parallel Computing Center site, have been working on a method to maintain consistent storage system performance as the number of clients increases. Their research focuses on managing contention across the system to deliver performance fairly as jobs grow. Their project, Automatic Storage Contention Alleviation and Reduction (ASCAR), uses client-side, rule-based I/O rate control, with the rules generated and tuned automatically using machine learning and heuristics. Their prototype results show performance increases of up to 36% across the workloads tested, with write-heavy workloads benefiting the most.
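The rules themselves come from ASCAR's machine-learning pipeline, but the shape of client-side, rule-based throttling can be sketched simply: map an observed congestion signal to a bandwidth cap and pace writes to stay under it. The latency thresholds and caps below are invented placeholders, not ASCAR's rules.

```python
# Toy illustration of client-side, rule-based I/O rate control in the spirit
# of ASCAR. The congestion signal, thresholds, and caps are made-up values.
import time

# Each rule maps an observed congestion level (here, mean RPC latency in ms)
# to a cap on this client's write bandwidth in MB/s.
RULES = [
    (5.0, 2000.0),          # little contention: run nearly unthrottled
    (20.0, 800.0),          # moderate contention: back off
    (float("inf"), 200.0),  # heavy contention: throttle hard
]

def current_cap(latency_ms):
    for threshold, cap_mb_s in RULES:
        if latency_ms <= threshold:
            return cap_mb_s

def paced_write(write_fn, data, latency_ms):
    """Issue one write, then pace so the effective rate stays under the cap."""
    cap = current_cap(latency_ms)
    start = time.time()
    write_fn(data)
    min_duration = (len(data) / 1e6) / cap
    remaining = min_duration - (time.time() - start)
    if remaining > 0:
        time.sleep(remaining)
```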

The final presentation for LUG 2016 was given by Rick Mohr of the University of Tennessee, which works closely with Oak Ridge National Laboratory. He has been working with Intel’s High Performance Data Division on evaluating progressive file layouts (PFL) for Lustre. PFL allows a single file layout whose striping changes as the file grows, which can simplify Lustre usage for novice users while giving advanced users more striping options and providing a stepping stone to future Lustre features. Mohr’s work to date shows performance on par with traditional Lustre layouts, and in some cases better. More information on his results can be found here.
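The PFL interface was still being evaluated at the time, but the idea can be illustrated with the extent-based `lfs setstripe -E` form the feature later took: each extent of the file gets its own stripe count, so small files stay on one OST while large files spread wide. The extent boundaries and stripe counts below are example values, not recommendations.

```python
# Illustrative only: create a file with a composite (progressive) layout using
# the extent-based syntax PFL later adopted.
import subprocess

subprocess.run([
    "lfs", "setstripe",
    "-E", "1M", "-c", "1",    # first 1 MiB: a single stripe, so small files stay simple
    "-E", "1G", "-c", "4",    # up to 1 GiB: stripe across 4 OSTs
    "-E", "-1", "-c", "16",   # remainder of the file: stripe across 16 OSTs
    "/lustre/myfile",
], check=True)
```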

OpenSFS, the group that organizes and presents LUG each year, is undergoing changes, and the directors encourage input from the membership. Anyone interested in contributing ideas and effort should contact OpenSFS at admin@opensfs.org.

Beyond the development work presented over the last three days, Intel announced publicly available Lustre training at learn.intel.com. Several modules are available on the training site for anyone interested in understanding Lustre.

Read the full report from LUG Day 1 and LUG Day 2.

Download LUG Slide Presentations
