In this insideHPC technology guide, “How Expert Design Engineering and a Building Block Approach Can Give You a Perfectly Tailored AI, ML or HPC Environment,” we present key considerations for building a customized supercomputer-in-a-box system with the help of experts from Silicon Mechanics.
When considering a large, complex system, such as a high-performance computing (HPC) system, supercomputer or compute cluster, you may think you have only two options: build from the ground up, or buy the same pre-configured supercomputer-in-a-box from a major technology vendor that everyone else is buying. But there is a third option that takes a best-of-both-worlds approach. It gives you “building blocks” expertly designed around network, storage and compute configurations that are balanced, yet flexible enough to scale with your specific project needs.
Introduction
Across the artificial intelligence (AI), machine learning (ML) and HPC landscape, organizations are moving from proof-of-concept to production projects that require software and hardware beyond off-the-shelf components or cookie-cutter server infrastructures. Most AI and ML projects demand that computing power, storage capacity and network infrastructure work seamlessly together to help avoid bottlenecks. For example, the fastest processors available won’t matter if your storage network is too slow.
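The bottleneck point can be made concrete with a little arithmetic. The sketch below assumes a simple serial data pipeline in which sustained throughput is capped by the slowest stage; the stage names and the GB/s figures are hypothetical values chosen for illustration, not measurements of any real system.

```python
def find_bottleneck(stage_rates_gb_per_s):
    """Return (slowest stage name, its rate) for a serial pipeline.

    Assumes each stage must handle every byte, so the sustained
    end-to-end rate equals the rate of the slowest stage.
    """
    name = min(stage_rates_gb_per_s, key=stage_rates_gb_per_s.get)
    return name, stage_rates_gb_per_s[name]


# Hypothetical rates in GB/s, for illustration only.
stages = {"storage": 5.0, "network": 25.0, "compute": 40.0}

bottleneck, rate = find_bottleneck(stages)

# Fraction of the compute stage's capacity that can actually be kept busy.
compute_utilization = rate / stages["compute"]

print(f"Bottleneck: {bottleneck} at {rate} GB/s")
print(f"Compute utilization: {compute_utilization:.1%}")
```

With these made-up numbers, raising the compute rate changes nothing until storage is upgraded, which is the balance argument the guide makes.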
Several companies, including NVIDIA, offer complete supercomputer-in-a-box systems that harness the power of the NVIDIA HGX™ A100 GPU and its related components. The NVIDIA DGX™ A100 POD, for example, is a complete system that provides great performance and several options for those looking for the latest features.
In many cases, however, customers end up with features or hardware they don’t need, or fall short in areas where they could use extra power. That’s where working with the expert design engineers at Silicon Mechanics can help. Alternatives to the DGX A100 POD exist that can provide the same level of performance, with the added benefit of customizations tailored to a company’s AI, ML, or deep learning project.
An Example of a Flexible Configuration Using Building Blocks
Silicon Mechanics has developed a specific configuration, the Silicon Mechanics Atlas AI Cluster™, which can be modified to meet a range of high-end HPC and AI workload needs. While this configuration is designed as a turnkey system that can be plugged in and running quickly, it can also be scaled up to supercomputer level, customized to optimize specific workloads or data types, or both. Whatever project you have in mind, this Linux-based, building-block-style configuration provides enough power for state-of-the-art AI, ML or deep learning (DL) projects.
This complete, rack-scale system starts with the NVIDIA A100 GPU, which offers amazing acceleration to power scalable applications for AI, data analytics, ML and HPC environments. With up to 20 times higher performance over the previous generation, the A100 and its Ampere architecture can run the largest models and datasets for even the most demanding organizations.
- The Silicon Mechanics Atlas AI Cluster integrated product simplifies DL and AI deployments by using Silicon Mechanics’ Rackform A354NV or Rackform A380A servers (depending on the cluster size), NVIDIA Mellanox networking, and NVIDIA NGC software with the goal of minimizing time to production.
- The Silicon Mechanics configuration is based on AMD EPYC™ CPUs and 8X NVIDIA HGX A100 GPUs using both NVIDIA NVLink® and NVIDIA NVSwitch™. It also supports optional GPUDirect RDMA with up to 8 Mellanox ConnectX®-6 Virtual Protocol Interconnect® (VPI) HDR InfiniBand adapters. In addition, this system supports optional storage nodes based on NVMe storage servers and NVIDIA Mellanox® Spectrum® 2000 Gigabit Ethernet HDR switches.
- Software for the Silicon Mechanics Atlas AI Cluster configuration includes Silicon Mechanics’ AI Stack, Silicon Mechanics’ Scientific Computing Stack, and support for popular HPC frameworks. This includes software such as Weka, Lustre, S3-compliant object storage, the Ubuntu 20.04 operating system, TensorFlow, PyTorch, Keras and R, as well as NVIDIA software tools including CUDA, cuDNN, and NGC GPU-accelerated containers.
Over the next few weeks we’ll explore Silicon Mechanics’ new insideHPC Guide:
- Introduction, An Example of a Flexible Configuration Using Building Blocks
- Key Consideration #1: Scalability, Key Consideration #2: Storage, Key Consideration #3: Networking
- Taking a Holistic Approach – The Silicon Mechanics Perspective
Download the complete guide, “How Expert Design Engineering and a Building Block Approach Can Give You a Perfectly Tailored AI, ML or HPC Environment,” courtesy of Silicon Mechanics.