

NNSA Unleashes Advanced Computing Capabilities to Serve Researchers at Three National Labs

Building on the experience of previous procurements, the National Nuclear Security Administration’s (NNSA) acquisition of next-generation Penguin Computing clusters is bolstering “capacity” computing capability. The new clusters under the administration’s Commodity Technology Systems (CTS) program will meet the evolving HPC needs of the NNSA’s three national security labs (Tri Labs)—Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratories.

Delivery of the systems begins this month (April) and will bring a total of 7 petaflops of computing power to the Tri Labs. The CTS procurements will run for three to four years. These machines will serve researchers in the Advanced Simulation and Computing (ASC) program, a cornerstone of nuclear stockpile stewardship, the program that ensures the safety, security, and reliability of the nation’s nuclear deterrent without underground explosive testing.

Commodity computing plays a critical role in stockpile science by providing broad access to researchers. Matt Leininger, the Deputy for Advanced Technology Projects at LLNL, points out that the majority of workloads run on the Tri Lab CTS machines are “lots and lots of small to medium-sized jobs running from a few cores to hundreds or even a few thousand cores. Scientists mostly run parameter studies to quantify the uncertainties in larger scale simulations. Predictive simulation involves both scaling the science and providing accurate error bars to those predictions.”

So, the scientists run simulations based on their parameters, change the parameters and run them again, perhaps hundreds of thousands of times. With the number of researchers at the three labs doing their science, CTS has to accommodate a significant number of simultaneously running workloads, and do it cost-effectively in a scalable manner.
CTS—Scalable Capacity Computing

CTS-1 is based on Penguin Computing’s Tundra Extreme Scale cluster platform. CTS-1 integrates the latest technologies from Intel, including the Intel® Omni-Path Architecture (Intel® OPA) and Intel® Xeon processors E5-2695 v4 based on the Broadwell architecture. Intel OPA is the company’s next generation High Performance Computing (HPC) and scale-out systems fabric, and a key component of the Intel® Scalable System Framework. CTS-1 joins the Bridges system at Pittsburgh Supercomputing Center (PSC) in one of the early deployments of the new fabric.

“Over the procurements of commodity clusters beginning in the mid-2000s with the Tri Labs Linux Capacity Clusters program (TLCC), these machines have evolved to allow us to maintain the advanced computing resources we need,” said Leininger. He manages the acquisition of systems under the new CTS program, the successor to TLCC. “The Labs adopted InfiniBand Architecture Single Data Rate (SDR) in previous clusters, even before TLCC, when it came out in early 2000.”

InfiniBand Double Data Rate (DDR) was part of the first procurement. In the second procurement, the Tri-Labs used Intel® True Scale fabric adapters on InfiniBand Quad Data Rate (QDR). “We saw phenomenal scalability with Intel True Scale,” added Leininger. “For CTS-1, we selected Intel OPA over the alternative. This decision was based on our performance benchmarking and assessment of any technical and schedule risks. Our expectation is that Intel OPA will continue to provide improved scalability in the new clusters.”

Next-Generation Technologies for Next-Generation Cluster Computing

The Labs developed the concept of a scalable unit (SU) to simplify the cluster design for multiple systems of different sizes. An SU is a basic “Lego®” building block that can be sited as an individual cluster or combined with other SUs to build multi-SU clusters involving thousands of nodes. The SU size has evolved over the last ten years, but it always involves a combination of nodes for compute, cluster management, gateway, and user login. The current CTS-1 SU has 192 nodes: 184 compute nodes, 6 gateway nodes, 1 cluster management node, and 1 user login node.
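The SU arithmetic above can be sketched in a few lines of Python. This is only an illustration: the node counts come from the article, but the `ScalableUnit` and `cluster_nodes` names are hypothetical, and the sketch assumes every node role scales linearly with the number of SUs, which a real multi-SU cluster may not do (for example, it might share management or login nodes across SUs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalableUnit:
    """Node mix of one CTS-1 scalable unit, using the counts quoted in the article."""
    compute: int = 184
    gateway: int = 6
    management: int = 1
    login: int = 1

    @property
    def total(self) -> int:
        # Sum of all node roles in a single SU.
        return self.compute + self.gateway + self.management + self.login


def cluster_nodes(num_sus: int, su: ScalableUnit = ScalableUnit()) -> dict[str, int]:
    """Node counts for a cluster built from `num_sus` identical SUs.

    Assumption (not from the article): each role is simply replicated per SU.
    """
    if num_sus < 1:
        raise ValueError("a cluster needs at least one SU")
    return {
        "compute": num_sus * su.compute,
        "gateway": num_sus * su.gateway,
        "management": num_sus * su.management,
        "login": num_sus * su.login,
        "total": num_sus * su.total,
    }


if __name__ == "__main__":
    for n in (1, 2, 6):
        print(n, cluster_nodes(n))
```

Under that linear-scaling assumption, six SUs would give 1,104 compute nodes out of 1,152 nodes in total.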

Several TLCC2 systems built on the SU concept were deployed at the Tri-Labs. Chama at Sandia, Luna at Los Alamos, and Zin at Lawrence Livermore were the largest. All three appeared in the top 100 of the Top500 list in November 2012, the year of their deployments.

All the clusters in the CTS are built on commodity technologies and software utilizing the same architecture. “Besides the hardware architecture, we use the same operating system and software stack across the three sites,” said Leininger. “So we have common hardware and software which allows us to leverage each site’s cluster computing expertise. Each lab has unique applications, and we run the machines pretty hard, but in slightly different ways.”

According to Leininger, each lab finds unique issues running their codes, so the Tri-Labs work with the vendor partners to fix and optimize the systems across the labs. “By doing this over and over, TLCC, and now CTS, have produced the most scalable and well run cluster systems that the three labs have ever seen,” he stated. “For the types of workloads we’re running, we expect to get a two to three times throughput increase from CTS with Intel Xeon processors and Intel OPA over the TLCC2 systems.”

Designing for the Highest Node Density

“The CTS-1 procurement was very focused on both performance and risk reduction,” said Sid Mair, Senior VP of Penguin Computing’s Federal Systems Division, a group dedicated to serving government computational needs. “Intel is the standard in HPC. So we chose Intel, because of the reliability and performance of their products, and their ability to deliver their new fabric in the timeframe needed.”

Intel just announced the Intel® Xeon® processor E5-2600 v4 product family used in the CTS-1 system, Intel’s first processor within Intel® Scalable System Framework. Based upon the “Broadwell” microarchitecture, the family combines microarchitecture improvements with increased core counts (up to 22 cores) and faster memory (up to DDR4-2400), which together offer HPC application performance improvements of up to 47%*. The company also reports that Intel® Omni-Path Fabric delivers up to 24% higher messaging rate when used in combination with the Intel® Xeon processor E5-2600 v4 product family*.

Mair said the new Intel OPA fabric with its 48-port switch was very beneficial for the CTS design. “With a 48-port switch, we could spread out the nodes wider and use fewer layers across the racks. On 1,000 nodes we needed only two layers of fabric instead of three. And, the bandwidth is so high that in some of the DOE installations they actually do a tapered fabric without impacting the running applications.”

Penguin expects that fewer switches will lead to lower costs and complexity for building out the new systems. “We have fewer cables, less parts, and a faster fabric. That became a very cost-effective, high-performance solution compared to the alternative with a 36-port switch,” added Mair.
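Mair’s two-layers-versus-three remark is consistent with standard fat-tree sizing, which the following Python sketch illustrates. It assumes a non-blocking fat tree built from identical switches, with half of each switch’s ports facing down and half facing up; the article does not describe the actual CTS-1 topology (and mentions that some installations taper the fabric), so the function names and the topology model here are illustrative assumptions only.

```python
def max_hosts(radix: int, tiers: int) -> int:
    """Maximum end nodes in a non-blocking fat tree of `tiers` switch layers.

    Assumes identical switches with `radix` ports. With one tier a single
    switch serves `radix` hosts; each added tier multiplies capacity by radix/2.
    """
    if radix < 2 or radix % 2 or tiers < 1:
        raise ValueError("radix must be an even number >= 2 and tiers >= 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(nodes: int, radix: int) -> int:
    """Smallest number of switch layers that can connect `nodes` end nodes."""
    tiers = 1
    while max_hosts(radix, tiers) < nodes:
        tiers += 1
    return tiers


if __name__ == "__main__":
    for radix in (48, 36):
        print(
            f"{radix}-port switch: two-tier capacity {max_hosts(radix, 2)} nodes, "
            f"tiers needed for 1,000 nodes: {tiers_needed(1000, radix)}"
        )
```

Under this model a two-layer fabric of 48-port switches tops out at 1,152 nodes, while 36-port switches reach only 648, so a 1,000-node cluster fits in two layers with the former and needs a third with the latter.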

The Tundra ES platform is based on the Open Compute Project (OCP) framework, which includes a 12-volt rail system. Penguin and Intel engineers worked together to integrate the CTS cluster with the Intel OPA fabric using AC power for the benchmarks and evaluation by the Tri-Labs. “Since Tundra ES is based on OCP, we recommended to Intel that we collaborate on an OCP 12-volt version of the Intel OPA switch for the framework,” said Mair. For large clusters like CTS, he explained, that means racks with denser compute capability that use a common OCP power distribution. “Now you have incredible scalability in a much, much smaller space. We took the OCP architecture and essentially increased the density by more than 50 percent for HPC applications. We use less floor space, built in redundancy (Telecom quality power), efficient power connection, and less cooling. A 12-volt Intel OPA option creates a significant benefit for OCP-based clusters. And, that’s what was presented to the CTS-1 program office.”

Flexible Framework Across the Labs

Each lab has a unique data center infrastructure, requiring different kinds of cooling for their systems. According to Leininger, some systems use only air-cooling and others a mixture of water-cooling for the CPUs and air-cooling for the other system components. “CTS is a common architecture. We buy multiple clusters of different sizes, but they’re essentially the same. So they need to accommodate the cooling needs in the data centers. CTS-1 has the same metal bending in all the racks, but Penguin designed conversions that easily fit into the structure to adapt the hardware to the cooling at the particular site.”

“In addition to requiring flexible cooling options,” added Dan Stuart, VP of Technology, Penguin Federal Systems Division, “each installation had other physical constraints such as weight density and available power. We designed each Tundra ES system to meet specific cooling, weight, and power requirements for the specific installation location.”

With CTS-1 installed in April, the NNSA scientists can continue their stewardship research and management on some of the most advanced commodity clusters the Tri Labs have acquired, ensuring the safety, security, and reliability of the nation’s nuclear stockpile.
