Petascale Comet Supercomputer Enters Early Operations


Today SDSC announced that the 2-petaflop Comet supercomputer has transitioned into an early operations phase.

“Comet is really all about providing high-performance computing to a much larger research community – what we call ‘HPC for the 99 percent’ – and serving as a gateway to discovery,” said SDSC Director Michael Norman, the project’s principal investigator. “Comet has been specifically configured to meet the needs of researchers in domains that have not traditionally relied on supercomputers to solve their problems.”

Comet joins SDSC’s Gordon supercomputer as another key resource within the NSF’s XSEDE (eXtreme Science and Engineering Discovery Environment) repertoire, which comprises the most advanced collection of integrated digital resources and services in the world. Researchers may request allocations on Comet via XSEDE.

Comet was designed to address emerging research requirements often referred to as the ‘long tail’ of science: the idea that the large number of modest-sized, computationally based research projects still represents, in aggregate, a tremendous amount of research and resulting scientific impact.

“One of our key strategies for Comet has been to support modest-scale users across the entire spectrum of NSF communities, while welcoming research communities that are not typically users of more traditional HPC systems, such as genomics, the social sciences, and economics,” said SDSC Deputy Director Richard Moore, a co-PI for the new system.

A key strategy for Comet is to reach large communities of users via Science Gateways. A Science Gateway is a community-developed set of tools, applications, and data that is integrated through a web-based portal or a suite of applications. Gateways provide scientists access to many of the tools used in cutting-edge research – telescopes, seismic shake tables, supercomputers, sky surveys, undersea sensors, and more – and connect often diverse resources in easily accessible ways that save researchers and institutions both time and money. Moreover, researchers can focus on their scientific goals without having to know how supercomputers and other data cyberinfrastructures work.

“The variety of hardware and support for complex, customized software environments will be of particular benefit to Science Gateway developers,” said Nancy Wilkins-Diehr, an associate director of SDSC and co-director of XSEDE’s Extended Collaborative Support Services. “We now have more than 30 such Science Gateways running on XSEDE, each designed to address the computational needs of a particular community such as computational chemistry, atmospheric science or the social sciences.”

Comet is a Dell-integrated cluster using Intel’s Xeon Processor E5-2600 v3 family, with two processors per node and 12 cores per processor running at 2.5 GHz. Each compute node has 128 GB (gigabytes) of traditional DRAM and 320 GB of local flash storage. Since Comet is designed to optimize capacity for modest-scale jobs, each rack of 72 nodes (1,728 cores) has a full bisection FDR InfiniBand interconnect from Mellanox, with 4:1 over-subscription across the racks. There are 27 racks of these compute nodes, totaling 1,944 nodes or 46,656 cores.
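
As a quick sanity check on those figures (our arithmetic, not part of the announcement), the rack and node counts multiply out exactly as quoted:

# Back-of-the-envelope check of Comet's compute totals,
# using only the figures quoted above.
cores_per_processor = 12
processors_per_node = 2
nodes_per_rack = 72
racks = 27

cores_per_node = cores_per_processor * processors_per_node  # 24
cores_per_rack = cores_per_node * nodes_per_rack            # 1,728
total_nodes = nodes_per_rack * racks                        # 1,944
total_cores = cores_per_rack * racks                        # 46,656

print(f"{total_nodes} nodes, {total_cores} cores")          # 1944 nodes, 46656 cores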

In addition, Comet has 36 GPU nodes, each with four NVIDIA GPUs (graphics processing units) and two Intel processors, and will soon have four large-memory nodes, each with four Intel processors and 1.5 TB of memory. The GPU and large-memory nodes target specific applications such as visualization, molecular dynamics simulations, and de novo genome assembly.

Comet users will also have access to 7.6 PB (petabytes) of Lustre-based high-performance storage, with 200 GB/s of bandwidth to the cluster. It is based on an evolution of SDSC’s Data Oasis storage system, with Aeon Computing as the primary storage vendor. The system is split between a scratch file system and an allocated file system for persistent storage. These second-generation Data Oasis file systems bring significant improvements, beginning with a ground-up design based on ZFS-backed storage for both performance and data integrity. ZFS continually monitors and repairs low-level blocks of data stored on disk, avoiding the silent data corruption that can occur with storage as large as Comet’s. Comet will have a second level of data reliability as well, since the first-generation Data Oasis servers are being consolidated and re-deployed to create a ‘nearline’ replica of the active file systems.
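
For readers unfamiliar with what end-to-end checksumming buys, here is a minimal illustrative sketch in Python (a toy model, not SDSC’s tooling or the real ZFS implementation, which stores checksums such as fletcher4 or SHA-256 in block pointers): each block is stored alongside a checksum, and every read verifies the data against it, so a flipped bit on disk is detected rather than silently returned.

import hashlib

# Toy model of ZFS-style end-to-end checksumming (illustration only).
storage = {}  # block_id -> (data, checksum)

def write_block(block_id, data: bytes):
    storage[block_id] = (data, hashlib.sha256(data).hexdigest())

def read_block(block_id) -> bytes:
    data, checksum = storage[block_id]
    if hashlib.sha256(data).hexdigest() != checksum:
        # Real ZFS would repair the block from a redundant copy;
        # here we simply flag the silent corruption.
        raise IOError(f"checksum mismatch on block {block_id}")
    return data

write_block(0, b"simulation output")
storage[0] = (b"simulation outpuX", storage[0][1])  # simulate a bit flip on disk
try:
    read_block(0)
except IOError as e:
    print("detected:", e)  # corrupt data is caught, not returned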

“With its latest Lustre file system, SDSC is leading the way,” said Jeff Johnson, co-founder of Aeon Computing. “SDSC and Aeon Computing collaborated on the design of this new Lustre file system, and it is among the first large-scale Lustre file systems to run ZFS directly on disk drives, without any hardware RAID technology.”

Comet will feature a new 100 Gbps (gigabits per second) connection to Internet2 and ESnet, allowing users to rapidly move data to SDSC for analysis and data sharing, and to return data to their institutions for local use.
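
To put 100 Gbps in perspective, here is a rough, idealized estimate of wide-area transfer time (our arithmetic, assuming full line rate, no protocol overhead, and a hypothetical 10 TB dataset):

# Idealized transfer-time estimate over Comet's 100 Gbps wide-area link.
# Real transfers are slower due to protocol overhead and end-host limits.
link_gbps = 100
dataset_tb = 10  # hypothetical dataset size

bits = dataset_tb * 8 * 10**12
seconds = bits / (link_gbps * 10**9)
print(f"{dataset_tb} TB at {link_gbps} Gbps: ~{seconds/60:.1f} minutes")  # ~13.3 minutes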

Comet replaces Trestles, which entered production in early 2011 under an earlier NSF grant with the goal of not only providing researchers significant computing capability but also making them more computationally productive. Trestles and Gordon are the leading Science Gateway systems in the XSEDE portfolio, with more than 1,200 users per month accessing those systems through the popular CIPRES phylogenetics gateway alone.

“Trestles users have spanned a wide range of domains, including astronomy, biophysics, climate sciences, computational chemistry, and materials science, and we expect that Comet will attract researchers from many more domains,” added Moore.

In this video from LUG 2015 in Denver, Rick Wagner from SDSC presents “SDSC’s Data Oasis Gen II: ZFS, 40GbE, and Replication”:

“The second generation of SDSC’s Data Oasis Lustre storage is coming online to support Comet, a new XSEDE cluster targeted at the long tail of science. The servers have been designed with Lustre on ZFS in mind, and also update the network to use bonded 40GbE interfaces. The raw storage totals 7.7 PB and is again based on commodity hardware provided by Aeon Computing, maintaining our focus on cost. Most of the work in preparing Comet’s storage has focused on performance, and using this design we have achieved 7.5 GB/s per server sustained read bandwidth over a bridged Ethernet-to-InfiniBand fabric. I will describe the work required to get there, including the selective application of patches to the Lustre code, ZFS tuning, and the interplay of NUMA architecture with Lustre SMP node affinity and CPU partitioning. The second major feature of Data Oasis is the reuse of the first-generation servers as Durable Storage, i.e., a backup file system for the second-generation hardware. This is being implemented using Robinhood, along with some custom software to facilitate the initial transition from one generation of hardware to the next. All of the aforementioned is in progress and will be updated according to our experiences as the deployment of Comet proceeds.”
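
One way to read the 7.5 GB/s figure (our arithmetic, assuming the “bonded 40GbE interfaces” are a pair of 40 Gbps ports, which the talk does not spell out): the bond’s raw capacity is 10 GB/s, so 7.5 GB/s sustained is roughly 75 percent of line rate.

# Rough utilization check for a Data Oasis Gen II server.
# Assumes two bonded 40 Gbps ports per server (assumption, not stated above).
ports, port_gbps = 2, 40
raw_gbytes_per_s = ports * port_gbps / 8                    # 10.0 GB/s raw
sustained = 7.5                                             # GB/s, from the talk
print(f"utilization: {sustained / raw_gbytes_per_s:.0%}")   # 75%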
