Sponsored Post
Sharing a common architecture, Intel® HPC Orchestrator and OpenHPC are changing the face of HPC by providing a cohesive and comprehensive system software stack. Dr. Robert Wisniewski, Chief Software Architect for Extreme Scale Computing at Intel Corporation, discusses the advantages of this approach and how to leverage it to bring together HPC and the cloud.
When Intel first began developing the architecture behind Intel HPC Orchestrator, we did so with the vision of being part of an open community that would offer customers the ability to choose which components, or which instances of components, best met their needs. Furthermore, we wanted to ensure OEMs had the opportunity to add value through a differentiated product. The competing tension, however, was the need to provide a common stack for user applications and ISV codes in order to avoid time-consuming porting efforts.
These tensions were resolved by modularizing the stack while keeping the northbound interfaces (that is, the interfaces up to the application and out to the administrative controls) consistent. This allows OEMs or customers to choose components while still conforming to conventions. It is important to note that we use the term “conventions” rather than “standards” because the latter tends to connote long, drawn-out efforts to reach consensus. Instead, we are trying to use conventions so that OEMs and customers can easily replace components, and so that applications can run whether the stack is picked up from OpenHPC or comes from any of the OEMs Intel is partnering with to provide a supported Intel HPC Orchestrator stack.
We have been successful from an architecture point of view, but there are still challenges because in some areas well-defined conventions have not yet been widely adopted. As an example, resource manager components generally interact well because vendors such as Cray and IBM provide compatible runtime interfaces that have been accepted by the resource management community. However, the compatibility between resource managers and provisioning components (which load software on bare metal machines) is not currently as well-defined. Another example of an area that could benefit from broader acceptance of conventions is fabric management.
Interfaces and Contributions
A founding principle of OpenHPC was that it was intended to be a meritocracy. Much like Linux, we want it to be a vibrant community where those who contribute are the ones who drive it forward. For its part, Intel is taking an active role in continuing this conversation about conventions as well as contributing components and component enhancements. One of the contributions Intel is looking to make through OpenHPC and Intel HPC Orchestrator is the creation of adapters with common and consistent data access interfaces (DAIs) to help, for example, resource managers and schedulers interface more easily with provisioning components. For this vision of HPC to be successful, it needs to be a community effort, and OpenHPC has to become the gathering point for that interaction.
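To make the idea of a DAI adapter concrete, here is a minimal sketch in Python. All of the class and function names below are hypothetical illustrations, not part of OpenHPC or Intel HPC Orchestrator; the point is simply that a scheduler-side helper written once against a small provisioning interface can work with any provisioner whose adapter follows the same convention.

```python
# Hypothetical sketch of a data access interface (DAI) between a resource
# manager/scheduler and a provisioning component. Names are illustrative only.
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class NodeImage:
    name: str         # e.g. "compute-default"
    kernel: str       # kernel version baked into the image
    description: str


class ProvisionerDAI(ABC):
    """The convention each provisioning component's adapter would implement."""

    @abstractmethod
    def list_images(self) -> list[NodeImage]:
        """Return the node images this provisioner can deploy."""

    @abstractmethod
    def assign_image(self, node: str, image: NodeImage) -> None:
        """Associate an image with a node so it is used on the next boot."""


class ExampleProvisionerAdapter(ProvisionerDAI):
    """A stand-in adapter; a real one would wrap a concrete provisioner and
    translate these calls into that provisioner's own API."""

    def list_images(self) -> list[NodeImage]:
        return [NodeImage("compute-default", "4.18.0", "stock compute image")]

    def assign_image(self, node: str, image: NodeImage) -> None:
        print(f"assigning {image.name} to {node}")


def reprovision_for_job(dai: ProvisionerDAI, nodes: list[str], wanted: str) -> None:
    """Scheduler-side helper written once against the DAI convention."""
    image = next(img for img in dai.list_images() if img.name == wanted)
    for node in nodes:
        dai.assign_image(node, image)


if __name__ == "__main__":
    reprovision_for_job(ExampleProvisionerAdapter(), ["c01", "c02"], "compute-default")
```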
Progress
OpenHPC is heading in a couple of important directions. The Technical Steering Committee has a process by which contributions can be both requested and submitted for inclusion in the latest version. One of the attractions of OpenHPC is that there are myriad vendors, academic institutions, labs, and, so to speak, people in their garages contributing to the community as a whole, and we anticipate new contributions in the area of providing a richer build and test environment as many more components come in from different sources. The community is also working on extending the stack to meet the needs of new users who do not have much HPC experience.
Currently, installing an OpenHPC stack requires system administrator-level experience and understanding, so the community is working on simplified installation procedures and automatic discovery of the different hardware components.
There is also growing interest in being able to run OpenHPC as a container in a cloud environment. At the 2016 Supercomputing Conference (SC16), Intel presented a proof of concept (PoC) showing OpenHPC and Intel® HPC Orchestrator running in a cloud environment and leveraging both cloud and HPC capabilities.
The PoC was based on a use case in which a hypothetical organization had a lot of data coming from Internet of Things (IoT) sensors into a cloud infrastructure. This organization wanted to run HPC analysis on certain sets of that data. To satisfy the requirements of this use case, we used the OpenStack components Nova and Ironic to carve off a set of nodes and then used Glance to provision the nodes, effectively creating a sub-cluster that could run HPC jobs within the cloud environment. By doing that, the HPC job had access to the data and the resources that were gathering the information, enabling the analysis to be conducted on that machine. We then extended the PoC by combining those newly created HPC nodes with an HPC cluster. This allowed a single HPC job to run across the HPC cluster and the newly created HPC nodes, extending the computing power of the HPC cluster and giving it access to the IoT data.
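As a rough illustration of what carving off and provisioning nodes can look like from the cloud side, the following sketch uses the openstacksdk Python library. The cloud entry, image, flavor, and network names are assumptions made for this example; the article does not publish the PoC's actual configuration, and a real deployment would also need to register the new nodes with the HPC resource manager.

```python
# Hedged sketch of provisioning a bare-metal HPC sub-cluster in an OpenStack
# cloud with openstacksdk. All resource names below are assumptions.
import openstack

# Credentials are read from clouds.yaml or environment variables.
conn = openstack.connect(cloud="iot-cloud")   # assumed cloud entry

image = conn.get_image("ohpc-compute")        # assumed Glance image holding the HPC node stack
flavor = conn.get_flavor("baremetal.hpc")     # assumed flavor mapped to Ironic bare-metal nodes

# Carve off a small sub-cluster: Nova schedules the requests, Ironic deploys
# the Glance image onto bare-metal machines.
nodes = []
for i in range(4):
    server = conn.create_server(
        name=f"hpc-node-{i}",
        image=image,
        flavor=flavor,
        network="hpc-fabric",                 # assumed network name
        wait=True,
    )
    nodes.append(server)

# The new nodes would then be added to the HPC resource manager (for example,
# a Slurm partition) so a single job can span the existing HPC cluster and
# these nodes, right next to the IoT data.
print("provisioned:", [n.name for n in nodes])
```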
Convergence?
Although the concept of converged cloud and HPC has been talked about for a while, it is not clear that convergence is inevitable because, historically, HPC and the cloud have been far apart. Initially, cloud was disk-based, loosely coupled (i.e., connected via standard networks), and used to solve “large grain” problems, i.e., jobs that could be broken into relatively large, independent chunks. Instead of just using spinning disks, people then began to use more sophisticated file systems based on memory, which effectively eliminated two or more orders of magnitude of latency in getting to the data. By moving cloud data analysis into memory, the application was accelerated to the point that the bottleneck moved to the network, making cloud look more like HPC.
Early on, HPC was based on tightly coupled SMP systems but evolved to distributed-memory architectures, or clusters. As the network became the bottleneck, HPC architects invented low-latency, high-bandwidth fabrics to support fast data movement. In HPC, there has been a trend over the last five years towards more complex workflows, including multi-tenancy, that is, running more than one job on the same node at the same time. This allows several processing steps or applications to work closely together on a given data set, as well as providing real-time access to simulation data for computational steering and visualization. These types of capabilities are going to be important for some users, but there remain significant differences between jobs and requirements that mean cloud and HPC may not yet completely merge.
Users will continue to want a dynamic cloud environment that can handle cloud jobs. But organizations are increasingly realizing that HPC can improve product design and impact their bottom line in ways that were not possible even five years ago. So they are going to want both capabilities, and whether we condense them onto a single machine is probably less important than driving towards an infrastructure that allows cloud and HPC to coexist and complement each other.
Dr. Robert W. Wisniewski is an ACM Distinguished Scientist and the Chief Software Architect for Extreme Scale Computing and a Senior Principal Engineer at Intel Corporation. He is the lead architect for Intel’s HPC Orchestrator, a cohesive and comprehensive software stack, and responsible for the software for Aurora, the world’s largest announced supercomputer. He has published over 71 papers in the area of high performance computing, computer systems, and system performance, filed over 56 patents, and given over 49 external invited presentations. Before coming to Intel, he was the chief software architect for Blue Gene Research and manager of the Blue Gene and Exascale Research Software Team at the IBM T.J. Watson Research Facility, where he was an IBM Master Inventor and led the software effort on Blue Gene/Q, which was the fastest machine in the world on the June 2012 Top 500 list, and occupied 4 of the top 10 positions.