Critical Requirements for HPC in the Cloud

This is the third article in a series on Cloud Computing for HPC. The series is taken from an insideHPC guide that describes the challenges that users face and the solutions available to make running cloud-based HPC applications a reality. This week we look at the critical requirements for HPC in the Cloud.

Users in HPC environments have requirements for a cloud provider that differ from those of typical enterprise applications. When running an application that is designed to return an answer or solution in the fastest time, performance matters. An in-house data center may serve hundreds or thousands of users for a certain class of applications where absolute speed is not critical. However, those involved in HPC-type applications demand the fastest response. There are a number of considerations for ensuring maximum performance when running HPC applications in the cloud.

Right type of infrastructure – Make sure that an HPC infrastructure is designed in from the beginning, not cobbled together as an afterthought. This would include:

  • Latest servers, adequate number of servers – Make sure that any application that is submitted to the cloud can run, based on the characteristics of the application (some are more scalable than others) and the limits imposed by the user (maximum number of cores to be used).
  • Bare-metal for HPC – Users demand the fastest performance, which is achieved by running the application directly on the server’s OS. A virtual machine (VM), by contrast, adds another layer of software that can limit how directly the application uses the system’s capabilities. The “hypervisor tax” can reduce performance by up to 10 percent, according to some estimates (reference 3, reference 4).
  • Accelerators – GPUs can speed up the processing of certain types of applications. Incorporating a number of GPUs into the computer systems is increasingly important for reducing the time to completion of an application. For applications that can take advantage of one or more GPUs, it is critical for the cloud provider to offer this capability to the user.
  • Low latency and high-speed interconnect – One of the most important parts of an HPC cloud offering is the connection between systems. Much of the HPC software ecosystem can take advantage of hundreds to thousands of cores running simultaneously, and requires system-to-system communication at very fast speeds. The current standard for HPC interconnects is InfiniBand (IB), which is a requirement for high-performing applications that use many cores across many separate computer systems. Tightly coupled applications are often written using the Message Passing Interface (MPI) API or shared memory programming models. There are a number of MPI libraries available for use. While loosely coupled applications can get by using a GigE (1 or 10) network, tightly coupled applications require the low latency and high bandwidth that IB delivers.
  • Geo-distributed resources – For redundancy, many cloud providers maintain data centers in geographically different areas of the world. This can allow for the work and environments to be transferred to a data center that is up and running, should a significant event bring a data center down.
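
The interconnect point can be illustrated with the common "latency plus size over bandwidth" estimate of message transfer time. The Python sketch below is only a rough model, and the two sets of network figures are assumed, illustrative values rather than measurements of InfiniBand, GigE or any specific product.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate delivery time for one message: fixed latency plus time on the wire."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Assumed, illustrative figures only (not vendor or benchmark numbers).
LOW_LATENCY_FABRIC = {"latency_s": 2e-6, "bandwidth_bytes_per_s": 10e9}
COMMODITY_ETHERNET = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1.25e9}


def slowdown(message_bytes):
    """Ratio of commodity-network time to low-latency-fabric time for one message."""
    fast = transfer_time(message_bytes, **LOW_LATENCY_FABRIC)
    slow = transfer_time(message_bytes, **COMMODITY_ETHERNET)
    return slow / fast


if __name__ == "__main__":
    for size in (64, 64 * 1024, 64 * 1024 * 1024):
        print(f"{size:>10} bytes: commodity network is {slowdown(size):.1f}x slower")
```

Under these assumptions the penalty is largest for small messages, where latency dominates. That is the traffic pattern typical of tightly coupled MPI applications, which exchange many small messages between cores.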

Service – Although all cloud providers offer some sort of service to their customers, those who run HPC applications in a cloud environment must be able to rely on the expertise of an organization that has been involved with HPC for many years.

  • HPC experts – The requirements for those using HPC in a cloud environment will differ from those of an enterprise customer. The details of how to get maximum performance from a given set of hardware will be paramount for the user. Specific libraries, and the tuning and setup of the ISV application on the cloud provider’s hardware, require the knowledge of dedicated HPC experts.
  • Smaller company, each customer matters – Although a very large cloud provider can make available many types of instances and could scale into the millions of cores, most likely the provider is concerned with the larger enterprise software environment. Smaller, niche players can dedicate more expert resources to each customer, ensuring excellent customer satisfaction. In addition, if a customer requires a specific setup for their HPC application, a smaller provider will more likely be able to customize the environment for that specific user.
  • Cloud provider admins – Operating a compute farm specifically for HPC applications that can be run on bare-metal servers requires a different set of administrative expertise than a broad-based virtual offering environment. Providers that have previous experience in selling hardware to the HPC community will be able to work with the consumer to solve their most difficult challenges.

ISV licensing agreements – A large portion of the commercial HPC market uses ISV software, which carries licensing costs. In order to provide a smooth experience, licensing with an ISV must be as seamless as possible.

  • License maintenance on cloud provider side – The cloud provider works with an ISV to hold a license and then bills the user for the job completed or the resources used as defined by the license.
  • For those ISVs that don’t provide a cloud license or short-term lease, it is important that the cloud provider has the flexibility and expertise to incorporate traditional licensing mechanisms, either locally or via secure tunnels.
  • Wide range and number of agreements – When running an HPC cloud infrastructure in order to make the environment easy to use, the provider should have as many licensing agreements in place as possible. The customers should not have to manage the licensing process. With a wide range of offerings in terms of ISV software, the experience will be more fluid.
  • Libraries, dependencies in place, cloud provider owns – In most environments, the application requires a number of libraries to use, and depends on a number of factors for productive work. Dependencies (including specific versions) involving libraries and run-time software must be set up and integrated before actual runs take place.

Transparent cost model – Customers using a cloud provider for HPC applications acknowledge that a cost is involved. It is important to make the entire cost of running their applications in the cloud transparent to users, so that they can make valid and detailed comparisons between on-premises and cloud options.

  • CPU granularity billing – A physical server contains a number of sockets, each of which contains multiple cores, and an HPC application may run for some time, so how usage is metered has a direct effect on the cost to the user. Billing based on the cores actually used rather than the entire server will be more reflective of the actual use of the computer system.
  • Prices are public – A number of charges by the provider to the customer make up the final bill. When using a cloud provider for HPC applications, it is important that all costs are made public so that customers can make their own decisions about the benefits of using the cloud versus in-house options.
  • No transport costs – Since the data upload and download for an HPC application may be significant, some providers may charge for this service. The value of using the cloud for running HPC applications is the application itself, so transport costs and billing should be avoided.
  • Storage costs – As with any significant computing application, the amount of data produced can be large. It is reasonable to expect that there will be some sort of storage cost associated with HPC applications. Even if the transport costs are low to zero, so that data could be moved back and forth between the customer site and the cloud provider, doing so would delay the startup of future runs on the same data, because that data would then have to be transferred back up to the provider. Understanding the storage costs in terms of capacity and time frame is very important.
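
To make the billing comparison concrete, the following Python sketch contrasts per-core billing with whole-server billing and adds a simple storage charge. All prices, core counts and function names are hypothetical and chosen only for illustration; a real provider's published price list would replace them.

```python
import math


def per_core_cost(cores_used, hours, price_per_core_hour):
    """Bill only for the cores the job actually used."""
    return cores_used * hours * price_per_core_hour


def whole_server_cost(cores_used, hours, cores_per_server, price_per_core_hour):
    """Bill for every core of each server the job touches, used or not."""
    servers = math.ceil(cores_used / cores_per_server)
    return servers * cores_per_server * hours * price_per_core_hour


def storage_cost(gigabytes, months, price_per_gb_month):
    """Charge for keeping results on the provider's storage over time."""
    return gigabytes * months * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical job: 40 cores for 10 hours on 32-core servers.
    price = 0.10  # assumed price per core-hour
    print("per-core billing:    ", per_core_cost(40, 10, price))
    print("whole-server billing:", whole_server_cost(40, 10, 32, price))
    print("storage, 500 GB x 3 months:", storage_cost(500, 3, 0.03))
```

With these assumed numbers, the 40-core job occupies two 32-core servers, so whole-server billing charges for 64 cores rather than 40. Differences of this kind only become visible to the customer when the provider's prices are public.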

Simple GUI – Access to a cloud provider’s resources requires that the user interact in some way with those resources. The simpler the interface, the more time a customer can dedicate to using the cloud provider’s resources without frustration over having to learn a new and unfamiliar access model. Obviously, a simple GUI run within a web browser will allow for the maximum level of ease-of-use for new customers.

  • Made for users, not system admins – Users want to get their HPC applications run and not have to deal with complex access methods. Simple GUIs tailored to what the user must do to become productive are always more valuable than having to learn a new system.
  • Command line available – For power users, in addition to a simple GUI, a command line interface may be desirable, especially when having to create scripts for various actions.
  • ISV licenses easy to use – Accessing an ISV license should not have to involve setting path names and linking to remote systems. When an application starts, the cloud provider should be able to access the necessary licenses seamlessly.

Remote visualization – HPC applications will most likely produce significant amounts of data that must be post-processed in some way. Data from the application will be stored on a storage device and then can be either downloaded to the customer site or remotely visualized from the cloud site.

  • Post-processing after the HPC application completes – Visualization of the data that resides on the cloud provider site requires specialized hardware (high-end graphics cards) to produce accurate visualization at acceptable frame rates. The challenge then becomes sending the images over the network, back to the user’s desktop or other client device. The simpler this process is for the user, the more productive an engineer or analyst can be. Various techniques and compression technologies can be used. It is important to understand the requirements for the client and the limitations on performing remote visualization. By creating the visualization on the cloud-based server, significantly less data must be transported back to the user. As the simulation post-processing increases in size, the use of cloud-based visualization can dramatically reduce network requirements.
  • Fast – Experiments have shown that the maximum latency when performing interactive 3D graphics must be in the range of 100 to 120 milliseconds. If the delay from mouse movement to graphics screen update is more than this value, the user becomes frustrated and loses productivity. Thus, assuming an adequate network, the combination of hardware graphics on the remote end and display at the client end must be fast to enable a productive experience.
  • No added software – Various techniques exist to move the image data from the cloud provider to the client. In some cases, software must be installed on the client to display the images that are being streamed across the network. However, a simpler solution exists that enables the display of images in a browser tab, using an HTML5-compatible browser. No added software is necessary.
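
The 100 to 120 millisecond figure can be treated as a budget that every stage between mouse movement and screen update has to share. The Python sketch below adds up per-stage delays and checks them against the upper end of that range. The stage names and timings are assumptions made up for illustration, not measurements.

```python
INTERACTIVE_LIMIT_MS = 120  # upper end of the 100 to 120 ms range cited above


def total_latency_ms(stages):
    """Sum the per-stage delays between mouse movement and screen update."""
    return sum(stages.values())


def is_interactive(stages, limit_ms=INTERACTIVE_LIMIT_MS):
    """True if the end-to-end delay stays within the interactive limit."""
    return total_latency_ms(stages) <= limit_ms


if __name__ == "__main__":
    # Assumed, illustrative per-stage timings in milliseconds.
    session = {
        "input_to_server": 20,
        "render_on_gpu": 15,
        "encode_frame": 10,
        "frame_to_client": 25,
        "decode_and_display": 10,
    }
    print(total_latency_ms(session), "ms, interactive:", is_interactive(session))
```

Laying the budget out this way shows how much of the limit the network round trip consumes, and therefore how little time is left for rendering and compression on the remote end.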

API available – In some cases, the ability to script the entire session, without using a GUI for access to the cloud provider, can be valuable. Data uploading, job submission and notifications of job completion could all be automated and invoked automatically or with a simple action by the user on their local (client) system. However, this should be in addition to an easy-to-use GUI for a majority of the users. APIs also offer software vendors the ability to seamlessly plug a cloud resource into their software products and interfaces.
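
A scripted session of this kind might look like the Python sketch below. No real provider API is shown here: `FakeCloudClient`, its methods and the status strings are all hypothetical, in-memory stand-ins so that the upload, submit, poll and notify sequence can be run end to end.

```python
import time


class FakeCloudClient:
    """In-memory stand-in for a provider API; every name here is hypothetical."""

    def __init__(self, polls_until_done=2):
        self._files = {}
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def upload(self, name, data):
        self._files[name] = data
        return name

    def submit_job(self, input_name, cores):
        if input_name not in self._files:
            raise KeyError(f"unknown input: {input_name}")
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"polls_left": self._polls_until_done, "cores": cores}
        return job_id

    def status(self, job_id):
        # Pretend the job finishes after a fixed number of status checks.
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "RUNNING"
        return "COMPLETED"


def run_session(client, name, data, cores, notify, poll_interval_s=0.0, max_polls=100):
    """Upload the input, submit the job, poll until it completes, then notify."""
    client.upload(name, data)
    job_id = client.submit_job(name, cores)
    for _ in range(max_polls):
        if client.status(job_id) == "COMPLETED":
            notify(f"{job_id} completed")
            return job_id
        time.sleep(poll_interval_s)
    raise TimeoutError(f"{job_id} did not complete after {max_polls} polls")


if __name__ == "__main__":
    run_session(FakeCloudClient(), "mesh.dat", b"input data", cores=64, notify=print)
```

Because `run_session` depends only on the three client methods, the same script could be pointed at a real client object that exposes equivalent calls, or embedded by a software vendor inside its own product.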

Queueing systems – When a user uses a cloud provider for HPC applications, maximum performance from the hardware, network and storage is the goal. However, a public cloud provider is designed for a multi-tenant use case. Thus, it is important that an HPC cloud provider implement the tools and controls to make sure that each application has complete and exclusive use of the cores and sockets in a specific server. Modern resource management systems can assure this, but the cloud provider must create the policies that ensure this.
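
One possible exclusive-use policy is to grant whole servers to a single job at a time. The toy allocator below sketches that idea in Python. It is a simplified illustration with made-up class and node names, not the behavior of any particular resource manager.

```python
import math


class ExclusiveAllocator:
    """Toy policy: each server (node) is owned by at most one job at a time."""

    def __init__(self, node_names, cores_per_node):
        self.cores_per_node = cores_per_node
        self.owner = {name: None for name in node_names}

    def allocate(self, job_id, cores_requested):
        """Grant whole free nodes to the job, or return None so it waits in the queue."""
        needed = math.ceil(cores_requested / self.cores_per_node)
        free = [name for name, owner in self.owner.items() if owner is None]
        if len(free) < needed:
            return None
        granted = free[:needed]
        for name in granted:
            self.owner[name] = job_id
        return granted

    def release(self, job_id):
        """Return all of a finished job's nodes to the free pool."""
        for name, owner in self.owner.items():
            if owner == job_id:
                self.owner[name] = None


if __name__ == "__main__":
    cluster = ExclusiveAllocator(["n1", "n2", "n3"], cores_per_node=16)
    print(cluster.allocate("A", 20))  # needs two whole nodes
    print(cluster.allocate("B", 20))  # only one node left, so B must wait
    cluster.release("A")
    print(cluster.allocate("B", 20))
```

The trade-off in this policy is that cores on a partly used server stay idle rather than being shared with another tenant. That is the price of guaranteeing each application complete use of its sockets.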

Next week we’ll explore some of the best uses of cloud-based HPC. If you prefer, you can download the entire series as a PDF from the insideHPC White Paper Library, courtesy of Penguin Computing.

Download the insideHPC Guide to Cloud Computing for HPC