Getting Smart About Slurm in the Cloud

This timely article from our friends over at Univa takes a look at how often the popular HPC workload manager Slurm (Simple Linux Utility for Resource Management) is used in the cloud. In a recent insideHPC survey sponsored by Univa, all Slurm users surveyed reported using public cloud services to at least some degree.

It will be no surprise to HPC users that Slurm (the Simple Linux Utility for Resource Management) is a popular HPC workload manager, but what may come as a surprise is the degree to which Slurm is used in the cloud. In a recent insideHPC survey sponsored by Univa, all Slurm users surveyed reported using public cloud services to at least some degree, with some spending over USD 250K per month⁽¹⁾. While cloud spending represents a small fraction of HPC infrastructure budgets, it is one of the fastest-growing expense lines for most organizations – hence the need to get a grip on cloud spending.

Cloud is a Winner for Slurm Users

Sustained use of cloud services is expensive compared to on-premise alternatives. To combat this, most HPC centers run local infrastructure and tap cloud-based capacity only when needed. The convenience of cloud makes it compelling – users can access needed compute resources instantly and take advantage of specialized hardware often unavailable in local data centers, boosting productivity and reducing wait times for busy on-premise equipment. 70% of Slurm users report

The way customers deploy cloud-resident clusters and extend Slurm clusters to the cloud varies widely. Most customers use custom scripts or manual processes, or rely on HPC-oriented managed service providers (MSPs) to manage cloud deployments for them.

Cloud Spending: A Slippery Slope

While cloud computing can be economical for short-term use, costs can easily get out of control. According to Gartner, 80% of IaaS users will overshoot their cloud budgets through 2020 because they lack adequate process controls⁽³⁾. Making matters worse, according to a 2019 RightScale survey, a staggering 35% of IaaS cloud spending will be wasted⁽⁴⁾. Cost overruns occur because of orphaned instances, users leaving cloud services running when not in use, and users requesting more resources than are necessary.

Cloud provisioning and bursting solutions make it easy to consume cloud resources, but fall short when it comes to tracking and controlling spending by group, project and business unit. Also, existing solutions tend to focus on managing compute costs but fail to address other cost drivers such as block storage, file systems, archival storage, and network charges for data movement.

Budgeting for cloud services can be surprisingly complex. Cloud services are offered under different fee schedules (on-demand, reserved, spot instances and spot fleets, for example), prices can vary by region, and fees can be based on multiple parameters including capacity, IOPS, network traffic and API calls. As if this weren’t hard enough, many organizations deal with multiple cloud providers, making costs even more difficult to manage.
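To illustrate why compute-only estimates can mislead, here is a minimal sketch of a multi-parameter cost estimate. The `RateCard` fields, the `estimate_cost` helper, and all prices are hypothetical placeholders chosen for illustration; they do not reflect any provider's actual fee schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Hypothetical per-unit prices in USD (not real provider rates)."""
    compute_per_hour: float
    storage_per_gb_month: float
    egress_per_gb: float


def estimate_cost(rates, compute_hours, storage_gb_months, egress_gb):
    """Return a per-category cost breakdown plus the total."""
    breakdown = {
        "compute": rates.compute_per_hour * compute_hours,
        "storage": rates.storage_per_gb_month * storage_gb_months,
        "egress": rates.egress_per_gb * egress_gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Same workload priced under two assumed fee schedules.
on_demand = RateCard(compute_per_hour=2.0, storage_per_gb_month=0.25, egress_per_gb=0.125)
spot = RateCard(compute_per_hour=0.5, storage_per_gb_month=0.25, egress_per_gb=0.125)

for label, rates in (("on-demand", on_demand), ("spot", spot)):
    print(label, estimate_cost(rates, compute_hours=100, storage_gb_months=40, egress_gb=80))
```

In this made-up example, switching to the cheaper compute rate leaves storage and data-movement charges unchanged, so they account for a much larger share of the bill. That is the kind of cost driver a compute-only view would miss.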

HPC centers need to be able to accurately measure and forecast cloud spending by group and project, throttle cloud usage automatically when limits are reached, and automate actions to avoid wasteful cloud spending.
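As a rough illustration of these requirements, the sketch below tracks spending for one project, forecasts month-end spend from the current run rate, and refuses a cloud burst that would exceed the budget. The `ProjectBudget` class, its fields, and the linear forecast are assumptions made for this example, not features of Slurm or any particular product.

```python
from dataclasses import dataclass


@dataclass
class ProjectBudget:
    """Hypothetical monthly cloud budget for one project, in USD."""
    name: str
    monthly_limit: float
    spent_to_date: float = 0.0

    def forecast(self, day_of_month: int, days_in_month: int = 30) -> float:
        """Naive linear forecast: assume spending continues at the current daily rate."""
        if day_of_month <= 0:
            raise ValueError("day_of_month must be positive")
        return self.spent_to_date / day_of_month * days_in_month

    def may_burst(self, estimated_job_cost: float) -> bool:
        """Allow a burst only if the job fits within the remaining budget."""
        return self.spent_to_date + estimated_job_cost <= self.monthly_limit


budget = ProjectBudget("genomics", monthly_limit=10_000.0, spent_to_date=6_000.0)
print(budget.forecast(day_of_month=15))  # projected month-end spend at the current rate
print(budget.may_burst(3_500.0))         # fits within the remaining budget
print(budget.may_burst(4_500.0))         # would exceed the limit, so the burst is throttled
```

A real deployment would need actual billing data and a better forecast than a straight line, but even this simple check is forward-looking: it blocks the overspend before it happens instead of reporting it afterwards.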

Cloud-spend Association is Key

Associating cloud-related charges with workloads is a major challenge for Slurm users. While 82% of Slurm users see value in associating cloud spending with workloads, only 25% have solutions in place that can reconcile workloads against cloud billing automatically⁽¹⁾. Most administrators use cloud-specific reporting tools and reconcile cloud billing and workload data manually or using spreadsheets. These solutions are usually “backward-looking” – they can report on overspending after it occurs, but they provide no mechanism to prevent it in the first place.
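To make the reconciliation problem concrete, here is a minimal sketch that splits each instance's billed cost across projects in proportion to the job runtime recorded on that instance. The data shapes, instance IDs, and the `attribute_costs` function are hypothetical; in practice the job records might come from Slurm accounting and the costs from a provider's billing export, and matching the two is usually messier than shown here.

```python
from collections import defaultdict


def attribute_costs(billing, jobs):
    """Split billed instance costs across projects by share of job runtime.

    billing: {instance_id: billed_cost_usd}
    jobs:    list of (project, instance_id, runtime_hours) tuples
    Instances that were billed but ran no jobs are reported as "unattributed".
    """
    hours_by_instance = defaultdict(float)
    for _project, instance_id, hours in jobs:
        hours_by_instance[instance_id] += hours

    cost_by_project = defaultdict(float)
    for project, instance_id, hours in jobs:
        total_hours = hours_by_instance[instance_id]
        if instance_id in billing and total_hours > 0:
            cost_by_project[project] += billing[instance_id] * hours / total_hours

    for instance_id, cost in billing.items():
        if hours_by_instance.get(instance_id, 0.0) == 0:
            cost_by_project["unattributed"] += cost

    return dict(cost_by_project)


# Made-up billing line items and job records for illustration.
billing = {"i-a": 12.0, "i-b": 8.0, "i-c": 5.0}
jobs = [("chem", "i-a", 3.0), ("bio", "i-a", 1.0), ("bio", "i-b", 2.0)]
print(attribute_costs(billing, jobs))
```

The "unattributed" bucket is a deliberate design choice in this sketch: cost from instances that ran no jobs is surfaced separately, which is one way orphaned or idle instances could be spotted.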

Solutions for workload management, cloud service provisioning, resource monitoring, and cost accounting are frequently siloed and poorly integrated. Slurm can’t manage what it can’t measure, and the scheduler has no visibility into budgets, rates for different cloud services, or spending to date by cost center, group, or project. Without this information, it’s not possible to automate decisions related to spending and cloud bursting.

Automation to the Rescue

Fortunately, a new breed of workload-aware cloud automation tools can help automate the complex decisions common to most HPC environments. Cloud automation can help users burst when appropriate, manage cloud spending to budget, and right-size instance selection to minimize cost.

Univa® Navops Launch is a multi-cloud provisioning, automation, and spend management platform that helps enterprises integrate HPC environments with the cloud, control spending, right-size cloud resources and automate decisions based on real-time cloud, application and workload-related metrics with composable automation applets.

Unlike simple cloud bursting solutions that provision cloud resources based on static policies or a cloud queue, Navops Launch is application-, resource-, and budget-aware. It can adjust workload and resource deployments on the fly, taking into account planned spending by project and department, on-premise capacity, actual cloud resource usage by application, and data access patterns.

With Navops Launch, Slurm and Univa Grid Engine users can maximize and balance the utilization of on-premise and cloud resources, reduce total spending, and boost business performance and end-user productivity by ensuring that the right resources are available for the right application at the right time.

Learn more about Navops Launch and Univa HPC hybrid cloud solutions at http://www.univa.com/products/navops.php.

References

1. insideHPC survey results, sponsored by Univa
2. Slurm Elastic Computing (Cloud Bursting) – https://slurm.schedmd.com/elastic_computing.html
3. Gartner – Ten moves to lower your AWS IaaS costs – https://www.gartner.com/en/documents/3847666/how-to-identify-solutions-for-managing-costs-in-public-c0
4. RightScale 2019 State of the Cloud Report from Flexera – https://info.flexerasoftware.com/SLO-WP-State-of-the-Cloud-2019
