Taming the Cost of HPC in the Cloud


This timely article from our friends over at Univa examines methods for reining in the costs of managing HPC in the cloud, specifically moving beyond cloud automation to cloud expense management. Three proactive approaches for managing costs in the cloud are provided.

HPC in the cloud is here. According to recent research from Hyperion, 74% of HPC users now run at least some workloads in the cloud [1]. While HPC cloud usage is a small part of the roughly $40 billion cloud IaaS market, it’s among the fastest-growing segments. New workloads demanding specialized GPUs and increasingly capable HPC cloud capacity are helping accelerate cloud adoption.

Most HPC centers have found ways to automate the provisioning of cloud instances to simplify operations and leverage cloud capacity. While convenient, automated cloud bursting and cluster deployments increase the risk of cost overruns. HPC applications often require hundreds or even thousands of specialized machine instances, and without adequate management controls, organizations can find themselves facing hefty charges from their cloud service provider (CSP).

Cloud is Convenient, but Costs are Hard to Manage

On the surface, budgeting for cloud services seems simple. Machine instances are charged by the hour, making costs linear and predictable. The challenge for customers is that cloud service providers (CSPs) offer multiple rate structures, such as reserved instances, on-demand instances, spot instances and spot fleets, with costs that can vary between regions.

While machine instances represent about two-thirds of total cloud spending, there are additional sources of costs that are more challenging to model. These include object and block storage, file systems and network-related services such as VPCs, gateways, and VPNs. Each service has its own pricing model, and many services have multiple chargeable components. For example, a storage service may have a tiered pricing scheme based on capacity, IOPS, storage temperature, data availability, and per-transaction costs.

Complicating things further, multi-cloud deployments are increasingly a fact of life. Despite best efforts to standardize on a single CSP, multi-cloud environments arise organically through mergers and acquisitions, collaborations with third parties, SaaS or PaaS offerings unique to a specific CSP, and lines of business (LOBs) making independent purchasing decisions.

Most Cloud Users will Overshoot their Budgets

Gartner estimates that 80% of IaaS users will overshoot their budgets, mostly because they lack the process controls needed to deal with costs in the cloud [2]. Furthermore, according to Flexera’s RightScale 2019 State of the Cloud Report, as much as 35% of cloud spending is wasted [3]. While consuming cloud services is easy, accounting for, authorizing and optimizing their usage is surprisingly difficult. Estimates vary, but informal surveys of Univa customers using cloud for HPC workloads suggest that, without careful oversight, cloud can cost 4x more than on-premise infrastructure.

Existing Approaches to Cloud Expense Management

To proactively manage costs in the cloud, we see clients exploring multiple solutions:

  1. Cloud-specific management solutions – Most CSPs offer cloud monitoring and cost-management services to help customers manage and optimize expenses. While useful, these systems are cloud-specific, making them difficult to apply in multi-cloud environments, and most raise alerts rather than taking proactive steps to limit cost overruns. Often these cloud solutions don’t provide cost visibility at the workload or department level.
  2. Workload management controls – Another approach to cloud expense management is controlling consumption of cloud services via workload management policies – for example, making “burstable” queues accessible only to specific users, groups or project teams or allowing cloud-resources to be tapped only when on-premise assets are fully utilized. These policies help limit access, but the workload manager isn’t aware of the cost of resources or month-to-date spending against budgets, so the workload manager alone can’t manage spending by department, project or user.
  3. Cloud Service Expense Managers (CSEMs) – To deal with multi-cloud environments, there are a variety of solutions in the category of what Gartner calls Cloud Service Expense Managers (CSEMs) [4]. While capabilities vary, these multi-cloud tools provide:
    • Resource tagging to normalize terminology across CSPs
    • Cost optimization – auto-detecting orphaned or idle resources
    • Budget & discount management reflecting CSP Enterprise Agreements
    • Spending anomaly detection
    • Reserved instance management
    • Dashboarding, events, and notification services

CSEMs are a good step forward, but they too have limitations. Similar to cloud-specific solutions, when budgets are exceeded most tools simply raise an alert and throw the problem to a busy cluster administrator.

While a CSEM understands details about cloud resource consumption, it has no visibility into workloads, resource and data requirements, availability of on-premise resources, workload priorities, or project identifiers. Nor can it determine whether applications are actually consuming the cloud resources requested. This is a big problem since over-provisioned resources are one of the leading causes of cloud over-spending.
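To make the over-provisioning point concrete, here is a minimal Python sketch of the kind of check that becomes possible once requested resources can be compared with what a workload actually uses. The record fields (`id`, `cores_requested`, `avg_cores_used`) and the 50% threshold are assumptions chosen for illustration; they do not come from any particular CSEM or workload manager.

```python
def find_overprovisioned(instances, threshold=0.5):
    """Flag instances whose average core utilization is below `threshold`.

    `instances` is a list of dicts with hypothetical keys:
    'id', 'cores_requested' and 'avg_cores_used'.
    Returns a list of (instance id, utilization ratio) tuples.
    """
    flagged = []
    for inst in instances:
        requested = inst["cores_requested"]
        if requested <= 0:
            # Skip malformed records rather than divide by zero.
            continue
        utilization = inst["avg_cores_used"] / requested
        if utilization < threshold:
            flagged.append((inst["id"], utilization))
    return flagged


if __name__ == "__main__":
    sample = [
        {"id": "node-a", "cores_requested": 32, "avg_cores_used": 8.0},
        {"id": "node-b", "cores_requested": 16, "avg_cores_used": 15.0},
    ]
    for inst_id, ratio in find_overprovisioned(sample):
        print(f"{inst_id}: using {ratio:.0%} of requested cores")
```

A check like this needs both sides of the comparison: the request (known to the workload manager) and the measured usage (known to monitoring), which is why a tool that sees only cloud billing data cannot perform it.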

Without visibility to workload details, and the ability to automate actions, even the most capable CSEM lacks the information needed to manage cloud-spending against budgets.

Cloud Automation to the Rescue

A better solution is a cloud-automation platform that has all the capabilities of a CSEM and a multi-cloud provisioning engine, but is also application- and workload-aware.

By fusing metrics from multiple data sources, a cloud automation platform can take automated steps to control spending. For example, before considering whether cloud resources should be provisioned, a cloud automation platform might consider not just whether the application is eligible for bursting, but its priority, whether on-premise resources are available, the cost of resources to be provisioned, and month-to-date spending against budget for the cost center associated with the project. It might also take proactive steps such as provisioning additional cloud services needed by an application, automating data migration, or shutting down idle or orphaned instances to reduce costs.
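The decision logic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not how Navops Launch or any other product implements it: the `BurstRequest` fields, the `should_burst` function, the `min_priority` threshold and the order of the checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BurstRequest:
    """Hypothetical description of a job asking for cloud capacity."""
    burst_eligible: bool   # is the application allowed to run in the cloud?
    priority: int          # higher number = more important
    slots_needed: int      # cores/slots the job requires
    est_cost: float        # estimated cost of the instances to provision
    cost_center: str       # cost center associated with the project


def should_burst(req, free_onprem_slots, budgets, spent_mtd, min_priority=5):
    """Return (decision, reason) for a single bursting request.

    `budgets` and `spent_mtd` map cost center -> monthly budget and
    month-to-date spending, in the same currency as `req.est_cost`.
    """
    if not req.burst_eligible:
        return False, "application is not eligible for bursting"
    if free_onprem_slots >= req.slots_needed:
        return False, "on-premise resources are available"
    if req.priority < min_priority:
        return False, "priority is below the bursting threshold"
    remaining = budgets.get(req.cost_center, 0.0) - spent_mtd.get(req.cost_center, 0.0)
    if req.est_cost > remaining:
        return False, "estimated cost exceeds the remaining monthly budget"
    return True, "approved"


if __name__ == "__main__":
    budgets = {"genomics": 10000.0}
    spent = {"genomics": 9200.0}
    job = BurstRequest(True, 8, 256, 1200.0, "genomics")
    print(should_burst(job, free_onprem_slots=64, budgets=budgets, spent_mtd=spent))
```

The point of the sketch is that each check draws on a different data source: eligibility and priority from the workload manager, free capacity from the on-premise cluster, and cost and budget figures from expense management. No single one of those systems can make the decision alone.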

Univa’s Navops Launch is among the first of a new generation of HPC cloud automation tools that combine multi-cloud provisioning, cloud-service expense management, and tight integrations with the Univa Grid Engine and Slurm workload managers, both widely used in HPC centers.

Navops Launch can help enterprises migrate HPC workloads to the cloud, control spending, rightsize cloud resources and automate decisions based on real-time cloud, application and workload-related metrics via composable automation applets.

To learn more about Navops Launch, visit http://www.univa.com/products/navops.php

References:

[1] Hyperion Research Opinion – March 2019 – https://d1.awsstatic.com/HPC2019/Amazon-HyperionTechSpotlight-190329.FINAL-FINAL.pdf

[2] How to Identify Solutions for Managing Costs in Public Cloud IaaS – https://www.gartner.com/en/documents/3847666/how-to-identify-solutions-for-managing-costs-in-public-c0

[3] RightScale 2019 State of the Cloud Report from Flexera – https://info.flexerasoftware.com/SLO-WP-State-of-the-Cloud-2019

[4] How to Identify Solutions for Managing Costs in Public Cloud IaaS – https://www.gartner.com/en/documents/3847666/how-to-identify-solutions-for-managing-costs-in-public-c0