Cloud Automation: Five Ways to Contain HPC Cloud Spending

This timely article from our friends over at Univa examines how the cloud is attractive for organizations that need fast access to specialized resources, and that want to avoid the cost and complexity of on-premise infrastructure. The article then offers five important ways that Cloud Automation can reduce cost.

While HPC users rely on local infrastructure for most workloads, cloud spending is on the rise. According to Hyperion, 70% of HPC sites are running at least some portion of their workloads in the cloud [1], and growth in HPC cloud is outpacing overall HPC spending growth. Cloud is attractive for organizations that need fast access to specialized resources, and that want to avoid the cost and complexity of on-premise infrastructure.

To manage HPC cloud spending, cluster administrators employ a variety of techniques. These include:

  • Using cloud-specific tools to manage resource usage and track spending against budgets
  • Using workload managers to automate cloud bursting so that only selected workloads can run in the cloud
  • Using third-party cloud service expense management (CSEM) tools to monitor spending and budgets across multiple clouds

Current solutions fall short

While these measures help, they only address part of the problem. Expense management tools can report, forecast, and alert on cloud usage and spending, but they have no visibility into applications, their business priority, or their actual resource and data requirements.

Similarly, a workload manager can determine cloud-eligible workloads, and even be extended to provision cloud instances automatically, but the workload manager has no visibility into details such as the cost of cloud resources, cost centers, and month-to-date spending against budgets.

When a decision is required, often the best these systems can do is raise an alert – throwing the problem to an already busy cluster administrator. For example, if a high-priority workload has a deadline, and the associated cost-center has already exceeded its monthly budget, what should be done? Provision cloud resources regardless and overspend? Compromise the SLA? Pre-empt other running workloads? What about data?

The issues are complicated, and in an ideal NoOps world, human administrators can’t get in the middle of every decision. This is one reason Gartner predicts that 80% of IaaS users will overshoot their cloud budgets [2].

HPC Cloud Automation

Fortunately, a new generation of cloud automation tools promises to automate these complex decisions common in most HPC centers. HPC Cloud Automation tools provide:

  • Multi-cloud provisioning and monitoring
  • Cloud service expense and budget management
  • Workload and resource-aware automation
  • A flexible automation engine with customizable actions

Cloud automation works by coupling a multi-cloud system with an automation engine. The engine gathers multiple categories of metrics from multiple cloud providers and from HPC workload managers such as SLURM and Univa Grid Engine, and makes decisions based on them. It also provides the capabilities of a cloud service expense management tool, including allocation and tagging, budget management, spending anomaly detection, dashboarding, and reporting.

With visibility into cloud costs, resource utilization, queue information, workload consumption, and workload characteristics across multiple clouds, the automation engine can make decisions that trigger built-in or user-defined “automations”.

Five ways that Cloud Automation can reduce cost

Detect and shut down idle instances – According to InfoWorld, as much as 35% of cloud spending is wasted [3], and a major contributor to this problem is orphaned or idle machine instances. While most cloud service expense managers can detect idle instances, cloud automation provides needed sophistication. For example, before shutting down an instance, it may determine that the instance is not idle but is hung and needs rebooting. Before shutting down an idle storage service, it might proactively retrieve needed data or move it to a lower-cost object store.
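
The idle-versus-hung distinction can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the metric names, the thresholds, and the rule that a missed heartbeat means "hung" are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    REBOOT = "reboot"
    SHUT_DOWN = "shut_down"


@dataclass(frozen=True)
class InstanceStatus:
    """Hypothetical per-instance metrics an automation engine might collect."""
    running_jobs: int               # jobs the workload manager has placed here
    cpu_utilization: float          # 0.0 to 1.0, averaged over a sampling window
    minutes_since_heartbeat: float  # time since the node last reported in


def choose_action(status: InstanceStatus,
                  idle_cpu: float = 0.05,                  # assumed threshold
                  heartbeat_timeout_min: float = 10.0      # assumed threshold
                  ) -> Action:
    # A node that has stopped reporting is treated as hung, not idle.
    if status.minutes_since_heartbeat > heartbeat_timeout_min:
        return Action.REBOOT
    # No jobs and almost no CPU activity: the instance can be released.
    if status.running_jobs == 0 and status.cpu_utilization < idle_cpu:
        return Action.SHUT_DOWN
    return Action.KEEP


# An instance with two jobs assigned that has been silent for 30 minutes
# is flagged for a reboot rather than a shutdown.
print(choose_action(InstanceStatus(2, 0.0, 30.0)))
```

A real policy would likely add further checks before shutdown, such as the data retrieval step described above.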

Auto-provision the most cost-effective cloud and instance types – Cloud instances are usually provisioned based on the workload requirements. For example, a workload might require a specific number of cores, a certain amount of memory, or a specific type of GPU. Requirements can get complex. For example, a workload may deliver optimal cost-efficiency when no more than two jobs run per instance, and all instances have access to a parallel file system. An automation platform understands these detailed requirements and can ensure that workloads are matched to instance types. This results in workloads that run faster and more efficiently, and a reduction in the time that cloud instances are deployed.
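
One way to picture this matching is as a search for the lowest cost per job among instance types that satisfy the workload's constraints. The sketch below assumes a small, made-up catalog; the instance names, sizes, and prices are hypothetical placeholders, not real cloud offerings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    mem_gb: int
    gpus: int
    has_parallel_fs: bool
    hourly_cost: float


@dataclass(frozen=True)
class JobRequirements:
    vcpus: int
    mem_gb: int
    gpus: int = 0
    needs_parallel_fs: bool = False
    max_jobs_per_instance: int = 2  # the "no more than two jobs" rule above


def jobs_that_fit(inst: InstanceType, job: JobRequirements) -> int:
    """Number of concurrent jobs one instance can host (0 if unsuitable)."""
    if job.needs_parallel_fs and not inst.has_parallel_fs:
        return 0
    fit = min(inst.vcpus // job.vcpus, inst.mem_gb // job.mem_gb)
    if job.gpus:
        fit = min(fit, inst.gpus // job.gpus)
    return min(fit, job.max_jobs_per_instance)


def cheapest_per_job(catalog: Sequence[InstanceType],
                     job: JobRequirements) -> Optional[InstanceType]:
    """Pick the suitable instance type with the lowest hourly cost per job."""
    candidates = [i for i in catalog if jobs_that_fit(i, job) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda i: i.hourly_cost / jobs_that_fit(i, job))


# Hypothetical catalog: names and prices are invented for illustration.
catalog = [
    InstanceType("small-a", 8, 32, 0, True, 0.40),
    InstanceType("large-b", 32, 128, 0, True, 1.20),
    InstanceType("cheap-c", 8, 32, 0, False, 0.30),
]
job = JobRequirements(vcpus=4, mem_gb=16, needs_parallel_fs=True)
print(cheapest_per_job(catalog, job).name)
```

Because of the two-jobs-per-instance cap, the larger instance type cannot amortize its price over more jobs, so the smaller type with parallel file system access is selected here.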

Employ smarter bursting and cloud-scaling – While workload managers often support simple cloud bursting, automation brings more sophistication to scaling decisions. For example, a cloud automation platform can check spending against budgets before a workload scales and either scale up cloud services or pre-empt lower-priority workloads running elsewhere. For embarrassingly parallel workloads (where individual jobs are re-queued if they fail), the automation engine may decide to deploy lower-cost Spot instances or pre-emptable VMs, thereby further reducing costs with minimal impact on service levels.
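
A budget-aware bursting rule of this kind might look like the following Python sketch. The decision names, the request fields, and the ordering of the checks are assumptions chosen for illustration; an actual automation engine would weigh more inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    BURST_ON_DEMAND = "burst_on_demand"
    BURST_SPOT = "burst_spot"
    PREEMPT_LOWER_PRIORITY = "preempt_lower_priority"
    HOLD_IN_QUEUE = "hold_in_queue"


@dataclass(frozen=True)
class BurstRequest:
    """Hypothetical inputs for one scaling decision."""
    estimated_cost: float       # projected cloud cost of running the workload
    month_to_date_spend: float  # cost center's spending so far this month
    monthly_budget: float
    high_priority: bool
    requeue_safe: bool          # failed jobs are simply re-queued


def decide(req: BurstRequest) -> Decision:
    remaining = req.monthly_budget - req.month_to_date_spend
    if req.estimated_cost <= remaining:
        # Within budget: use interruptible capacity when the workload tolerates it.
        return Decision.BURST_SPOT if req.requeue_safe else Decision.BURST_ON_DEMAND
    if req.high_priority:
        # Over budget but urgent: free up existing capacity instead of spending more.
        return Decision.PREEMPT_LOWER_PRIORITY
    return Decision.HOLD_IN_QUEUE


# A high-priority workload whose cost center has already exceeded its budget.
print(decide(BurstRequest(500.0, 10_500.0, 10_000.0, True, False)))
```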

Right-size instance selection based on actual resource usage – HPC users often overshoot budgets because they overstate resource requirements. For example, a workload may require 8GB of RAM and 3 or 4 vCPUs at most, but users don’t want jobs to fail at runtime. They may indicate that the job needs 8 cores and 32GB, providing a buffer and thinking that the job will run faster. If an AWS c5n.9xlarge instance is used (36 vCPUs and 96GB of RAM), this instance should, in theory, support nine concurrent jobs. Inflating the resource requirement limits the number of concurrent jobs to four by core count (and to three once the 32GB memory request is counted), increasing the cost per job by at least 125%! A cloud automation platform can measure and track both resources requested and resources used, providing the opportunity to right-size resource requirements for significant savings.
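
The arithmetic in this example can be checked with a short Python sketch. The instance size and job figures come from the text; the helper names are hypothetical. Counting cores alone gives four concurrent jobs for the inflated request, which is where the 125% figure comes from; counting memory as well (96GB / 32GB) leaves room for only three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """A resource shape: an instance's capacity or a job's request."""
    vcpus: int
    mem_gb: int


def jobs_per_instance(instance: Shape, job: Shape) -> int:
    """Concurrent jobs that fit, limited by the scarcer of cores and memory."""
    return min(instance.vcpus // job.vcpus, instance.mem_gb // job.mem_gb)


def cost_increase_pct(jobs_right_sized: int, jobs_inflated: int) -> float:
    """Percent increase in per-job cost when fewer jobs share one instance."""
    return (jobs_right_sized / jobs_inflated - 1) * 100


c5n_9xlarge = Shape(vcpus=36, mem_gb=96)  # capacity quoted in the article
actual = Shape(vcpus=4, mem_gb=8)         # what the job really uses
requested = Shape(vcpus=8, mem_gb=32)     # what the user asked for

fit_actual = jobs_per_instance(c5n_9xlarge, actual)        # 9 jobs
fit_requested = jobs_per_instance(c5n_9xlarge, requested)  # 3 jobs (memory-bound)
fit_by_cores = c5n_9xlarge.vcpus // requested.vcpus        # 4 jobs (cores only)

print(cost_increase_pct(fit_actual, fit_by_cores))   # 125.0
print(cost_increase_pct(fit_actual, fit_requested))  # 200.0
```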

Reduce data storage costs with policy-based data movement – The optimal location to run a workload often depends on proximity to data. For example, a life sciences application that references cloud-resident genomes might run most effectively in the cloud. An oil & gas simulation accessing terabytes of local seismic data may be more time- and cost-efficient to run locally. Also, storage requirements vary depending on the application. A life sciences application may require storage that supports millions of files (very high IOPS), whereas a parallel CFD simulation may require high-bandwidth parallel scratch storage. A cloud automation platform can make decisions at runtime, considering the workload type, the location of data, the time and cost involved in moving data, and the storage required, and then select the most time- and cost-efficient option.
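
The trade-off between compute cost and data movement can be sketched as a simple comparison. Everything in the Python example below is an assumption for illustration: the site names, prices, bandwidth figures, and the rule of choosing the cheapest site that still meets a deadline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Site:
    name: str
    compute_cost_per_hour: float
    data_is_local: bool
    transfer_gb_per_hour: float  # bandwidth for moving data into this site
    transfer_cost_per_gb: float


def plan(site: Site, dataset_gb: float, runtime_hours: float) -> Tuple[float, float]:
    """Return (total hours, total cost) including any data movement."""
    move_gb = 0.0 if site.data_is_local else dataset_gb
    transfer_hours = move_gb / site.transfer_gb_per_hour
    total_hours = runtime_hours + transfer_hours
    total_cost = (runtime_hours * site.compute_cost_per_hour
                  + move_gb * site.transfer_cost_per_gb)
    return total_hours, total_cost


def best_site(sites: Sequence[Site], dataset_gb: float,
              runtime_hours: float, deadline_hours: float) -> Optional[Site]:
    """Cheapest site that finishes before the deadline, or None."""
    feasible = [s for s in sites
                if plan(s, dataset_gb, runtime_hours)[0] <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda s: plan(s, dataset_gb, runtime_hours)[1])


# Hypothetical sites: all prices and bandwidths are invented placeholders.
on_prem = Site("on-prem", 5.0, True, 1.0, 0.0)
cloud = Site("cloud", 3.0, False, 500.0, 0.02)

# Terabytes of local data favor running locally; a small dataset favors the cloud.
print(best_site([on_prem, cloud], 10_000, 10, 48).name)
print(best_site([on_prem, cloud], 100, 10, 48).name)
```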

Learning more

Univa® Navops is a multi-cloud provisioning, monitoring, and management platform that helps enterprises integrate HPC environments with the cloud, control spending, and right-size cloud resources. It automates decisions based on real-time cloud, application, and workload-related metrics using composable automation applets.

Learn more about Navops Launch and Univa HPC hybrid cloud solutions at http://www.univa.com/products/navops.php .

About the Author

Robert Lalonde is Vice President and General Manager, Cloud for Univa. Rob brings over 25 years of executive management experience to lead Univa’s accelerating growth and entry into new markets. He has held executive positions in multiple successful high-tech companies and startups. Rob possesses a unique and multi-disciplined set of skills, having held positions in Sales, Marketing, and Business Development, as well as CEO and board positions. He has completed MBA studies at York University’s Schulich School of Business and holds a degree in computer science from Laurentian University.

 

[1] https://hyperionresearch.com/wp-content/uploads/2019/06/Hyperion-Research-ISC19-Breakfast-Briefing-Presentation-June-2019.pdf

[2] https://www.gartner.com/en/conferences/apac/infrastructure-operations-cloud-australia/why-attend/event-resources/research-lower-aws-costs

[3] https://www.infoworld.com/article/3344477/why-35-percent-of-cloud-spending-is-wasted.html