Intel® Architecture Deployment at Texas Tech University Relies on Intel® HPC Orchestrator

Alan Sill, Senior Director of the TTU HPCC, discusses Intel HPC Orchestrator

Sponsored Post

When it came time to perform a substantial upgrade of the High Performance Computing Center (HPCC) in the Texas Tech University (TTU) IT Division, the challenge was to easily manage an expansion that would effectively double the center’s parallel computing capacity. Since the existing 10,000+ core system includes several equipment generations as well as the latest technology from Intel, HPCC staff at TTU chose Intel® HPC Orchestrator for the task. In this article, Alan Sill, Senior Director of the TTU HPCC, explains why.


Alan Sill, Senior Director of the TTU HPCC

Six years ago, in early 2011, our high-performance computing deployment included a cluster that was at that time ranked number 111 in the world. But time moves on, and if you’re on the Top500 list in one iteration you’re very likely not on it in another. Within existing funding and physical infrastructure constraints, we realized we had a long gap to make up to get back onto the Top500 list. We needed to define a series of economical but substantial upgrades with a clear, consistent underlying technological roadmap and vision. As the first step, we aimed to double our existing computing capability with a path that could be upgraded further in the future, and the roadmap we defined led to our choice of Intel® Omni-Path Architecture.

The HPCC at TTU operates under the auspices of the Associate Vice President for Information Technology and TTU Chief Information Officer, Mr. Sam Segran. Given the importance of obtaining maximum value for the university from any such infrastructure investment, our approach to any provider claim is to test it with real benchmarks. This first step in the upgrade would give us the ability to test applications at scale before proceeding with further upgrades.

During the design of the new cluster addition, we compared available reports on various technologies for high-speed interconnect. We couldn’t find evidence of substantial differences in actual user applications between on-loading and off-loading fabric options, so we chose Intel Omni-Path primarily for cost reasons, and we’re very happy that we did.

Our new cluster, with only 8,748 cores of Broadwell Xeon processors so far, occupies just four racks, but delivers more than 250 teraflops of measured Linpack performance.

This compares to the 189 teraflops of theoretical capacity for over 10,200 cores of the older, mostly Westmere and Ivy Bridge Xeon-based cluster. With the old system in continued operation, the combined capability is more than 2x the original 189 teraflops, exceeding our goal of doubling the parallel computing capability in this initial step on our roadmap.

A number of other performance metrics have been beyond our expectations as well. For example, each 2.1-GHz core in the new cluster typically performs 30% to 40% faster in real user applications than the 2.6- to 2.8-GHz cores in the previous cluster. This is partly attributable to the newer, denser chip designs; the rest we attribute to the non-blocking 100-Gbps Omni-Path fabric, which is a distinct improvement over the mixed 20- and 40-Gbps InfiniBand fabric of the older cluster.
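As a rough sanity check on those figures, the sketch below estimates the new cluster’s theoretical double-precision peak and the implied Linpack efficiency. It assumes 16 double-precision floating-point operations per cycle per core (Broadwell AVX2 with two FMA units) at the 2.1-GHz base clock; the exact processor SKU and turbo behavior aren’t given in the article, so treat the result as approximate.

```python
# Rough estimate of Quanah's theoretical peak and Linpack efficiency.
# Assumption: 16 double-precision FLOPs per cycle per core (Broadwell AVX2,
# two FMA units) at the 2.1-GHz base clock; exact SKU and turbo behavior
# are not stated in the article.

cores = 8_748                  # Broadwell Xeon cores in the new cluster
base_clock_hz = 2.1e9          # 2.1-GHz base clock
flops_per_cycle = 16           # assumed DP FLOPs per cycle per core

peak_tflops = cores * base_clock_hz * flops_per_cycle / 1e12
measured_tflops = 250.0        # measured Linpack figure quoted above

print(f"Theoretical peak: {peak_tflops:.0f} teraflops")            # ~294
print(f"Linpack efficiency: {measured_tflops / peak_tflops:.0%}")  # ~85%
```

By that estimate, the quoted 250 teraflops corresponds to roughly 85% of peak, a plausible range for a well-tuned Linpack run on a non-blocking fabric.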

Introducing Quanah

Featuring Intel HPC Orchestrator, Quanah represents an important step on TTU’s HPC roadmap. Credit: TTU

We have three datacenter locations, and this new cluster fits into our largest. Named for Quanah Parker, a well-known and locally admired Native American historical figure here in Texas, the cluster currently consists of one login node and 243 worker nodes, configured in a non-blocking fat-tree architecture. The design can accommodate 288 nodes in its present configuration, expandable to 1,152 nodes using 48-port core and leaf switches. For many reasons, we’ve been fans of non-blocking architectures for a long time: for one thing, they greatly simplify scheduling for parallel jobs by avoiding the need to fit jobs into locally well-connected islands of connectivity within the fabric.
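For readers curious where the 1,152-node ceiling comes from: in a two-level, non-blocking fat tree built from k-port switches, each leaf switch dedicates half its ports to nodes and half to uplinks, giving a maximum of k²/2 nodes. The short sketch below reproduces that figure for 48-port switches and the leaf-switch count implied by the current 288-node design; it illustrates the topology arithmetic only and is not a description of Quanah’s actual switch inventory.

```python
# Capacity of a two-level, non-blocking fat tree built from k-port switches.
# Each leaf switch uses k/2 ports for nodes and k/2 uplinks (one to each of
# k/2 core switches); each core switch can reach up to k leaf switches.
# Illustration of the topology math only; Quanah's actual switch counts are
# not detailed in the article.

def fat_tree_max_nodes(k: int) -> int:
    """Maximum node count of a two-level non-blocking fat tree of k-port switches."""
    return k * (k // 2)        # k leaf switches x k/2 node-facing ports each

PORTS = 48
print(fat_tree_max_nodes(PORTS))   # 1152 -- the quoted expansion ceiling
print(288 // (PORTS // 2))         # 12 leaf switches implied by the 288-node design
```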

Each Quanah node has two processors yielding 36 cores per node, along with 192 GB of memory per node, for a total of 45.6 terabytes of memory cluster-wide. The internal management node, Charlie, is named for the rancher Charles Goodnight, another notable historical figure and friend to Quanah Parker. Given the success of our initial trial, our immediate plan is to replace the old InfiniBand-based cluster by expanding Quanah, retiring nodes of the older cluster as needed, with the goal of eventually reaching a much larger instantiation of the new resources.
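The cluster-wide totals follow directly from those per-node figures. Here is a quick arithmetic check, assuming the quoted 45.6 terabytes is expressed in binary terabytes (TiB):

```python
# Arithmetic check of the per-node and cluster-wide figures quoted above.
nodes = 243                  # worker nodes
cores_per_node = 36          # two processors per node, 18 cores each
mem_per_node_gb = 192        # memory per node

total_cores = nodes * cores_per_node       # 8,748 cores
total_mem_gb = nodes * mem_per_node_gb     # 46,656 GB
total_mem_tib = total_mem_gb / 1024        # ~45.6 when read as binary terabytes

print(total_cores, total_mem_gb, round(total_mem_tib, 1))   # 8748 46656 45.6
```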

We began the commissioning process for the new cluster with OpenHPC, a collaborative HPC community effort, but switched to the OpenHPC-based Intel® HPC Orchestrator product once it was announced. We have been working closely with the Intel® HPC Orchestrator team since then to pass on our early experience. One of the things you find during assembly of any cluster is that it’s like shopping in the grocery store: you can pick a range of ingredients and think they’ll make a really nice dish, but then you find you need sufficient expertise to put them together properly. And it’s the same with computing.

Our experience with Intel HPC Orchestrator began when we made sure to have front-row seats at the Intel developer forum at SC16 where it was announced. We got there bright and early for the announcement session, because Intel said that with HPC Orchestrator they wanted, among other things, to address the integration of multiple Intel hardware and software technologies, which matched our needs. As an academic shop, despite our previous experience, we don’t have a limitless budget in terms of either staff time or money, so the only two conditions under which we can justify expenditure on software tools are when they save staff time spent on maintenance or shorten the calendar time to deploy a given solution. Intel HPC Orchestrator held the promise of integrating our choices of technologies more smoothly, and it’s a promise that we’re now working closely with the Intel team to realize.

A coming together of technologies

Combining all the software and hardware technologies within a cluster to the point that you have a seamless upgrade and operational experience is definitely a work in progress for high performance computing in general. One way to reduce the knowledge and effort required is to use the open-source OpenHPC middleware stack, which provides all the components you need in a pre-integrated package. For use on generic hardware, you can get started and learn a lot by working with OpenHPC.

Intel also offers a commercially supported version of OpenHPC called Intel® HPC Orchestrator. Where we look to Intel HPC Orchestrator is for smoother integration with our actual choices of mostly Intel hardware and fabric. Our goals are less downtime between upgrades, better performance, and better and smoother overall system utilization.

We are essentially an all-Intel shop right now with respect to Quanah, since we use Dell products built on Intel technologies. The individual component choices—processors, fabric, storage, etc.—were made separately, but the combination provides us with an opportunity to work closely with Intel using HPC Orchestrator to make sure that everything works together smoothly.

It is a significant amount of work to do these integrations: making sure, for example, that MPI software components, Omni-Path fabric drivers, storage access software, compilers, and schedulers all work together and are upgraded at the same time when needed. We’ve started a series of monthly calls with the Intel HPC Orchestrator development team to ensure that they have our input into that process.

Of course, all of the advances we strive for depend on the availability of funding. We return a ratio of several to one in funded research supported compared to university investment, but we’re not the biggest or the only computing shop around. I spend a lot of time studying the practices of computing centers that are much larger, not necessarily to try to reach their scale, but to understand how best to run our own datacenters for productivity and efficiency. There are some excellent universities and supercomputing labs in this country and internationally, and one of the things they appear to do is allocate a significant amount of their support resources to users who aren’t the stereotypical science or engineering group that’s really good at supercomputing. This allows much greater overall productivity.

It really is striking that they devote a lot of time and effort to what they call the “broad tail” of users and fields that you don’t normally associate with HPC. We’d like to do this too, and the only way we can afford the time is if we have a supported, integrated, smoothly operating set of software and hardware. And as much as we computer types enjoy tinkering with stuff, you have to get the systems into operation if you want to spend your time working with users who need your help to get onto the path to success.

The choices we’ve made are all intended to arrive at a configuration for our central resources that is as simple as possible for us to operate, while providing the maximum amount of computing power we can afford.

This will allow us to spend more time with users and, we hope, in turn allow them to achieve better overall research success.

Alan Sill is Senior Director of the High Performance Computing Center at Texas Tech University and Co-Director of the multi-university NSF Cloud and Autonomic Computing Center.


Comments

  1. Diana Cavazos says

    Texas Tech University’s High Performance Computing Center professional staff work diligently to support academic research on campus. By providing administrative support, I’m proud to be a part of a dedicated team of people serving users with varied levels of experience in HPC.