insideHPC Guide to Composable Disaggregated Infrastructure (CDI) Clusters – Part 2

Sponsored Post

This technology guide, “insideHPC Guide to Composable Disaggregated Infrastructure (CDI) Clusters,” will show how the Silicon Mechanics Miranda CDI Cluster™ reference architecture can be a CDI solution blueprint, ideal for tailoring to specific enterprise or other organizational needs and technical issues.

The guide is written for technical readers, especially system administrators in financial services, life sciences, or similarly compute-intensive fields. These readers may be tasked with making CDI work with a realizable ROI, or simply with finding a way to extend the value of their IT investment while still meeting their computing needs.

Technology Use Case Examples

CDI is appropriate for several use case areas. As composable architecture continues to evolve, a variety of technology areas across a vast range of industries stand to benefit from CDI deployment, including AI, HPC, and accelerated data analytics.

It is best practice to keep such systems on-premises. On-premises compute is more cost-effective than cloud-based compute when highly utilized. It’s also important to keep primary storage close to on-premises compute resources to maximize network bandwidth while limiting latency. A range of networking options is possible; however, typical recommendations are high-speed fabrics like 100 gigabit Ethernet or HDR 200Gb/s InfiniBand.

Another important consideration is that the size of the data set is just as important as the quality of the model, so allowance for a modern AI-focused storage architecture should be a priority. Traditional storage like NAS often can’t keep pace: its bandwidth is limited to around 10 gigabits per second, and it is not scalable enough for AI workloads. Similarly, the workaround of fast local storage doesn’t work for modern parallel problems, because it requires constantly copying data in and out of nodes, which congests the network.
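To put the bandwidth gap in perspective, here is a rough back-of-the-envelope sketch. The 100 TB data set size is a hypothetical figure chosen for illustration, not a number from this guide, and the calculation assumes an ideal link with no protocol overhead or contention, so real transfers would take longer.

```python
def transfer_hours(dataset_terabytes: float, link_gbps: float) -> float:
    """Ideal hours to move a data set over one link, ignoring protocol overhead."""
    bits = dataset_terabytes * 1e12 * 8   # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)    # gigabits per second -> bits per second
    return seconds / 3600


DATASET_TB = 100  # hypothetical data set size, not a figure from the guide

# Link speeds mentioned in the text, in gigabits per second.
LINKS_GBPS = {
    "NAS at about 10 Gb/s": 10,
    "100 gigabit Ethernet": 100,
    "HDR 200Gb/s InfiniBand": 200,
}

for name, gbps in LINKS_GBPS.items():
    print(f"{name}: {transfer_hours(DATASET_TB, gbps):.1f} hours")
```

Under these assumptions, a single pass over the data takes roughly 22 hours at 10 Gb/s, compared with roughly 2 hours at 100 Gb/s and about 1 hour at 200 Gb/s.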

AI-optimized storage should be parallel and support a single-namespace data lake. This enables the storage to deliver large data sets to compute nodes for model training. AI-optimized storage must also support high-bandwidth fabrics like 100 gigabit Ethernet or HDR 200Gb/s InfiniBand. A good storage solution should also enable object storage tiering to remain cost effective and to serve as an affordable, scalable long-term storage option for regulatory retention requirements.

As compute power has increased, two common challenges have emerged around networking and data. Many organizations are generating data faster than ever before, so ensuring both throughput and uptime over the network is key. Modern networks can intelligently route around issues and optimize data flow between endpoints. Higher-speed networking is matched by higher-performing storage solutions that provide high throughput by using NVMe SSDs as the primary storage tier, coupled with spinning disks for long-term data retention.

Optimal utilization of infrastructure, especially GPUs, is more achievable now than ever before. For many use cases like AI and HPC workloads, performance is still the top priority, and on-premises hardware will always provide peak performance, with the ability to burst to the cloud on an as-needed basis. With a powerful CDI infrastructure, it’s also possible to provide employees at home with the same level of compute that was previously only available in the office or data center.

CDI and the Enterprise Infrastructure of the Future

CDI is critical to the enterprise infrastructure of the future for many reasons. In this section, we’ll drill down into each, along with supporting details on why they matter.

Scalability

Given increasing business requirements, accelerating collection of data, and the dynamic nature of today’s applications, IT and database administrators are facing difficulty in scoping their future infrastructure needs. This is especially true as enterprises prepare their infrastructure to manage massive, and potentially uneven, AI and HPC workloads.

Some solutions can be limited to specific compute, storage, and network configurations. This can create obstacles when additional resources are required for a specific application but can’t be provisioned on demand. Such obstacles are largely eliminated with a composable infrastructure built with the future in mind.

Administrators are able to dynamically and quickly configure and provision everything from bare metal servers and network resources, to FPGAs and GPUs, to entire racks of equipment to adapt to the need for scalability.

Cloud computing is maturing as a concept, and enterprises are discovering the dividing line between what it’s good for and what it’s not. There are always going to be certain processes that don’t work well in a cloud environment.

Most enterprises are not going to build a top-performing supercomputer. Rather, their infrastructure demands simply require incrementally more than what the cloud can provide, at least in a cost-effective way. CDI is the best solution for the use cases that are too complex or too high performance for the cloud.

Often, outside of proof-of-concept deployments and bursting, the cloud is not cost effective for HPC or AI. In fact, ROI for the cloud is non-existent for many enterprises.

Utilization rates drive ROI. Enterprises often face performance limitations in the cloud, whereas with on-premises solutions higher utilization generates higher ROI. Lower utilization, on the other hand, favors the cloud. In fact, for the cloud, the higher the utilization rate, the less cost effective it is. For example, a cloud deployment that needs 24-hour operation of a system is going to be very expensive under a pay-per-use model, whereas a 4-hour operation is much more cost effective.
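The relationship between utilization and cost can be sketched with simple arithmetic. Every price below is a hypothetical placeholder rather than a quote from any cloud provider or hardware vendor, and the model deliberately ignores factors such as data egress fees, reserved-instance discounts, and staffing.

```python
DAYS_PER_YEAR = 365


def cloud_cost_per_year(hourly_rate: float, hours_per_day: float) -> float:
    """Pay-per-use cost: you pay only for the hours the system runs."""
    return hourly_rate * hours_per_day * DAYS_PER_YEAR


def on_prem_cost_per_year(purchase_price: float, lifetime_years: float,
                          operating_cost_per_year: float) -> float:
    """Fixed yearly cost: amortized purchase plus power, cooling, and support."""
    return purchase_price / lifetime_years + operating_cost_per_year


def break_even_hours_per_day(hourly_rate: float, on_prem_yearly_cost: float) -> float:
    """Daily usage above which on-premises becomes cheaper than pay-per-use."""
    return on_prem_yearly_cost / (hourly_rate * DAYS_PER_YEAR)


# Hypothetical figures for a comparable GPU system.
CLOUD_RATE = 30.0  # dollars per hour
ON_PREM_YEARLY = on_prem_cost_per_year(300_000, 3, 50_000)

for hours in (4, 24):
    cost = cloud_cost_per_year(CLOUD_RATE, hours)
    print(f"cloud at {hours:>2} h/day: ${cost:,.0f} per year")
print(f"on-premises:        ${ON_PREM_YEARLY:,.0f} per year")
print(f"break-even: {break_even_hours_per_day(CLOUD_RATE, ON_PREM_YEARLY):.1f} h/day")
```

With these made-up numbers, the 4-hour workload costs $43,800 per year in the cloud against $150,000 on-premises, while the 24-hour workload costs $262,800 in the cloud. The crossover sits at about 13.7 hours of use per day; the actual break-even point for any organization depends entirely on its own prices.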

CDI has the flexibility to imitate a cloud-like infrastructure and be more cost effective over time. Additionally, once other issues such as data locality, vendor lock-in, and the cost of data ingress and egress are taken into account, the cloud becomes expensive. In contrast, these issues are not prevalent with on-premises solutions, and it’s possible to be more agnostic and change direction more easily than with a cloud deployment. This is one of the most important arguments in support of CDI versus the cloud.

Flexibility

Composable infrastructure is all about flexibility, but there’s nothing flexible about locking an enterprise in to a single server vendor’s technology offering, and then limiting the capabilities of the infrastructure from both a provisioning and fabric perspective. CDI provides enterprise IT with the ability to use the equipment vendors of their choice.

Composable cluster solutions are vendor and fabric agnostic in that they do not necessitate any drivers, agents, or software modules on the compute nodes themselves. They manipulate resources across bare metal compute nodes through CDI software so an enterprise can run any higher-order applications on any hardware, using any fabric. Further, CDI provides the enterprise with the flexibility to leverage the equipment they choose, and then orchestrate that equipment to best fit business needs.

With the rise of AI, there is a need to leverage many identical compute nodes backed by a parallel file system like Lustre or General Parallel File System (GPFS), especially in the research space, where multiple researchers may combine budgets to purchase a cluster. This presents a design challenge, as workloads are becoming both more complex and more diverse. For those who have experienced this, it has likely led to purchases of heterogeneous node types, or of homogeneous nodes packed with components chosen as a best-fit compromise across all workloads.

The problem with this approach is that ROI is reduced as money is spent on equipment that isn’t needed for all jobs. For example, not every workload benefits from GPU acceleration. This leads to a cluster that is not configured to run all jobs optimally. It can also be very difficult to manage a heterogeneous cluster and support diverse workloads. CDI technology is a proven solution for solving these issues.

Dynamic Provisioning

Another major advantage that CDI affords is rapid dynamic provisioning. When an enterprise requires a software application that is not integrated into its current infrastructure, IT generally must allocate staff resources to add storage, reconfigure servers, or change the networking. A composable solution, on the other hand, adapts the provisioning of those physical computing, storage, and network resources through management software to fit the needs of relevant applications, quickly making the goal of software-defined infrastructure a reality.

With CDI, enterprises can right-size configurations as workload requirements evolve, bringing resources together and then reassigning them in response to changing application and business requirements. In this way, resources become elastic building blocks for delivering optimal environments that are provisioned and configured to support a specific workload, without having to wait for lengthy IT allocation processes.

Composable infrastructure offers benefits for streamlining and accelerating IT deployment in virtually every category of business. Instead of over-provisioning infrastructure to meet IT needs, pools of resources like compute, storage, and network are automatically composed in near real-time.

Effective resource utilization is critical for an enterprise to grow sustainably. Composable platforms are key to advancing this benefit. As an enterprise migrates from a static infrastructure to a dynamic infrastructure based on CDI, the resource utilization benefits become apparent. In a typical enterprise, resource utilization can experience a 2-3x gain with CDI. This improvement immediately translates into business and bottom-line benefits.
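One way to see why pooling can raise utilization is to compare the capacity needed when each workload owns enough hardware for its own peak with the capacity needed when all workloads draw from one shared pool. The sketch below uses a constructed demand table (GPUs needed per time slot); the workload names and numbers are invented for illustration and are not measurements.

```python
# Hypothetical GPU demand per workload across four time slots.
demand = {
    "training":   [8, 6, 2, 0],
    "analytics":  [0, 2, 6, 2],
    "simulation": [2, 0, 2, 8],
}


def static_capacity(demand: dict) -> int:
    """Each workload is provisioned separately for its own peak."""
    return sum(max(series) for series in demand.values())


def pooled_capacity(demand: dict) -> int:
    """One shared pool only has to cover the busiest combined time slot."""
    return max(sum(slot) for slot in zip(*demand.values()))


def utilization(demand: dict, capacity: int) -> float:
    """Fraction of available GPU-slots that are actually used."""
    used = sum(sum(series) for series in demand.values())
    slots = len(next(iter(demand.values())))
    return used / (capacity * slots)


static = static_capacity(demand)
pooled = pooled_capacity(demand)
print(f"static: {static} GPUs, utilization {utilization(demand, static):.0%}")
print(f"pooled: {pooled} GPUs, utilization {utilization(demand, pooled):.0%}")
```

In this constructed case, static provisioning needs 22 GPUs at 43% utilization, while the shared pool needs 10 GPUs at 95% utilization. The size of the gain depends on how much the workloads' peaks overlap; if every workload peaked at the same time, pooling would save nothing.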

Resource Pools

Due to its ability to disaggregate hardware components into resource pools, composable infrastructure can deploy a heterogeneous cluster and assign resources on-demand for specific jobs. This provides the flexibility to dynamically provision bare metal instances to run jobs on best-fit hardware by abstracting the physical compute, storage, and network hardware to make them available as services that can be accessed as needed. Furthermore, a CDI solution goes above and beyond these classes of resources to include the ability to compose services from pools of CPU, GPU, FPGA, NVMe, and NICs regardless of the type of underlying fabric. When an application no longer requires the resources, they’re returned to the resource pool and become ready for use by other applications.
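The compose-and-release lifecycle described above can be modeled in a few lines. This is a toy in-memory sketch, not the API of any CDI product: the `ResourcePool` class, its `compose` and `release` methods, and the inventory counts are all hypothetical names and values introduced for illustration.

```python
from collections import Counter


class ResourcePool:
    """Toy model of a disaggregated pool (hypothetical, not a real CDI API)."""

    def __init__(self, inventory: dict):
        # Free device counts by type, e.g. {"gpu": 8, "nvme": 16}.
        self.free = Counter(inventory)

    def compose(self, request: dict) -> dict:
        """Reserve devices for one job, or fail without changing the pool."""
        short = {kind: n for kind, n in request.items() if self.free[kind] < n}
        if short:
            raise RuntimeError(f"insufficient free resources: {short}")
        self.free.subtract(request)
        return dict(request)

    def release(self, allocation: dict) -> None:
        """Return a job's devices so other applications can use them."""
        self.free.update(allocation)


pool = ResourcePool({"cpu": 4, "gpu": 8, "fpga": 2, "nvme": 16, "nic": 4})

# A GPU-heavy job takes what it needs from the shared pool...
training_node = pool.compose({"cpu": 1, "gpu": 4, "nvme": 2, "nic": 1})
print("free GPUs while job runs:", pool.free["gpu"])

# ...and hands everything back when it finishes.
pool.release(training_node)
print("free GPUs after release:", pool.free["gpu"])
```

Checking the whole request before subtracting anything keeps the pool consistent: a job that cannot be fully satisfied leaves no partial reservation behind. A real composability layer would also have to handle fabric topology, concurrency, and device health, none of which this sketch attempts.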

Download the complete insideHPC Guide to Composable Disaggregated Infrastructure (CDI) Clusters courtesy of Silicon Mechanics