Two Key Considerations of a Composable Infrastructure Cluster

These days, we're seeing a lot of interest from our clients in composable disaggregated infrastructure (CDI), including questions about the most critical elements of CDI-based clusters.

Successful deployments are more likely when clients understand why their design team focuses on certain areas more than others, and how those design decisions can affect the end-user experience. With that in mind, we wanted to outline some key elements of CDI-based clusters.

At its simplest, CDI is a software-defined method of disaggregating compute, storage, and networking resources into shared resource pools. These disaggregated resources are connected by an NVMe over Fabrics (NVMe-oF) solution, so you can dynamically provision hardware and optimize resource utilization. Because CDI decouples applications and workloads from the underlying hardware, it allows you to redeploy resources to new workloads wherever they're needed.
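To make that concrete, here's a minimal sketch in Python of how a CDI orchestrator might model shared resource pools and compose a bare-metal node from them. Every class, method, and device name here is a hypothetical illustration of the pattern, not any vendor's actual API.

```python
# Minimal illustration of the CDI idea: shared pools of devices that can be
# composed into (and released from) logical bare-metal servers on demand.
# All class and method names are hypothetical, not a real CDI API.
from dataclasses import dataclass, field

@dataclass
class Device:
    kind: str   # e.g. "gpu", "nvme", "fpga"
    ident: str

@dataclass
class ComposedServer:
    name: str
    devices: list[Device] = field(default_factory=list)

class ResourcePool:
    """A shared pool of disaggregated devices reachable over the fabric."""
    def __init__(self, devices: list[Device]):
        self.free = list(devices)

    def allocate(self, kind: str, count: int) -> list[Device]:
        picked = [d for d in self.free if d.kind == kind][:count]
        if len(picked) < count:
            raise RuntimeError(f"pool exhausted for {kind}")
        for d in picked:
            self.free.remove(d)
        return picked

    def release(self, devices: list[Device]) -> None:
        self.free.extend(devices)  # devices return to the pool for reuse

pool = ResourcePool([Device("gpu", f"gpu{i}") for i in range(8)] +
                    [Device("nvme", f"ssd{i}") for i in range(16)])

# Compose a training node from the pool, run the workload, then give the
# hardware back so another workload can claim it.
node = ComposedServer("train-01",
                      pool.allocate("gpu", 4) + pool.allocate("nvme", 2))
pool.release(node.devices)  # redeploy the same hardware to the next workload
```

The point is the lifecycle: devices leave the shared pool when a workload needs them and return to it the moment the workload is done, ready to be composed into the next node.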

In this way, the CDI design provides the flexibility of the cloud and the value of virtualization, but with the performance of bare metal. CDI offers the ability to run diverse workloads on a cluster while still optimizing for each workload. There are two key elements to consider for an optimized CDI cluster.

Element 1 – Software

The software-defined nature of CDI means the software the cluster runs on must be best-in-class. Beyond that, however, you need to examine each software option's specific areas of focus and what it brings to the cluster.

The two software providers we believe meet the rigors of CDI-based clusters are Liqid and GigaIO. Each has its own adherents, often because of small differences in their areas of focus. Below is a quick overview of each, but you should work with your cluster design partner to dive more deeply into how the choice of CDI software aligns with your particular use case:

Liqid

Liqid Command Center™ is powerful resource orchestration software that dynamically composes physical servers on demand from pools of bare-metal resources. Command Center provides:

  • Policy-based automation and dynamic provisioning of resources
  • Advanced cluster, machine, and device statistics and monitoring
  • Scalable architecture supporting high availability (HA)
  • Multiple control methods, including a GUI and a RESTful API (sketched below)

This flexibility is paired with powerful improvements in performance, optimization, and efficiency.
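To give a feel for what driving composition through a RESTful interface might look like, below is a hedged sketch. The host, endpoint paths, and payload fields are our own illustrative assumptions, not Liqid's documented API; consult the Command Center documentation for the real calls.

```python
# Hypothetical sketch of composing a machine through a REST orchestration
# API such as Liqid Command Center's. The host, endpoints, and JSON fields
# below are assumptions for illustration only.
import requests

BASE = "http://command-center.example.com:8080/api"  # placeholder address

def compose_machine(name: str, gpus: int, nvme_drives: int) -> dict:
    """Ask the orchestrator to assemble a bare-metal machine from pooled devices."""
    spec = {"name": name, "gpus": gpus, "nvme": nvme_drives}
    resp = requests.post(f"{BASE}/machines", json=spec, timeout=30)
    resp.raise_for_status()
    return resp.json()  # e.g. the composed machine's ID and device list

def decompose_machine(machine_id: str) -> None:
    """Release the machine's devices back to the shared pool."""
    requests.delete(f"{BASE}/machines/{machine_id}", timeout=30).raise_for_status()

machine = compose_machine("inference-02", gpus=2, nvme_drives=4)
decompose_machine(machine["id"])  # "id" is an assumed response field
```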

GigaIO

GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack. FabreX allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe. FabreX enables true server-to-server communication across PCIe and makes cluster-scale computing possible, giving an individual server direct memory access to the system memory of every other server in the cluster fabric.
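To illustrate what that memory-semantic, server-to-server communication means, here's a toy Python model of one server reading directly from another server's memory across a shared fabric, with no network stack in between. It's a conceptual sketch of the idea, not GigaIO's actual interface.

```python
# Toy model of memory-semantic communication: one server reading directly
# from another server's memory region across a shared fabric, rather than
# exchanging messages over a network stack. Conceptual illustration only.
class FabricMemory:
    """Stand-in for a memory region exposed to the PCIe fabric."""
    def __init__(self, size: int):
        self.buf = bytearray(size)

class Server:
    def __init__(self, name: str, fabric: dict):
        self.name = name
        self.mem = FabricMemory(4096)
        fabric[name] = self.mem  # publish this server's region on the fabric

    def remote_read(self, fabric: dict, peer: str, offset: int, length: int) -> bytes:
        # Direct load from the peer's memory: no send/receive, no TCP/IP.
        return bytes(fabric[peer].buf[offset:offset + length])

fabric: dict[str, FabricMemory] = {}
a, b = Server("node-a", fabric), Server("node-b", fabric)
b.mem.buf[0:5] = b"hello"
assert a.remote_read(fabric, "node-b", 0, 5) == b"hello"
```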

Element 2 – Networking Technology

The right high-performance, low-latency networking is the second critical element of an optimized CDI cluster. That's because the networking technology of a CDI cluster is a fixed resource with a fixed effect on performance, unlike the resources that can be disaggregated. You can disaggregate compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), GPU accelerators (NVIDIA GPUs), and more however you see fit, but the networking underneath all of those components stays the same.

A sound network strategy is essential for a CDI deployment to ensure consistent performance no matter how you deploy resources to accommodate your workflows. Depending on the use case, we use NVIDIA HDR InfiniBand or NVIDIA Spectrum Ethernet switches: InfiniBand is ideal for large-scale or performance-critical clusters, while Ethernet is an ideal choice for smaller ones. Either way, as you expand over time, the underlying network is built to support future needs across the lifecycle of the system.
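As a rough way to capture that selection logic, here's a hedged sketch. The node-count threshold is purely an illustrative assumption, not a Silicon Mechanics sizing guideline.

```python
# Hedged sketch of the selection rule described above: NVIDIA HDR InfiniBand
# for large-scale or latency-critical clusters, NVIDIA Spectrum Ethernet for
# smaller deployments. The node-count cutoff is an assumption for
# illustration only.
def pick_fabric(node_count: int, latency_critical: bool) -> str:
    LARGE_CLUSTER_NODES = 32  # assumed cutoff, not a real guideline
    if latency_critical or node_count >= LARGE_CLUSTER_NODES:
        return "NVIDIA HDR InfiniBand"
    return "NVIDIA Spectrum Ethernet"

print(pick_fabric(node_count=8, latency_critical=False))   # Spectrum Ethernet
print(pick_fabric(node_count=64, latency_critical=True))   # HDR InfiniBand
```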

A CDI Cluster for Demanding Workflows

One of the reasons CDI is generating so much buzz is that it's a compelling option for demanding, complex workloads, such as HPC and AI, that require massive amounts of costly resources.

The optimal design for a CDI cluster is one that effectively manages on-premises data center assets while delivering the flexibility typically provided by the cloud. That design work requires significant engineering expertise and a great deal of time, which is why looking for CDI-based reference architectures is a great idea.

That’s why Silicon Mechanics has created the Miranda CDI Cluster reference architecture as the ideal starting place for clients who want to take advantage of CDI. The Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.

Get a comprehensive understanding of CDI clusters like the Miranda CDI Cluster and what they can do for your organization by downloading the insideHPC white paper on CDI.