In this insideHPC technology guide, “How Expert Design Engineering and a Building Block Approach Can Give You a Perfectly Tailored AI, ML or HPC Environment,” we will present things to consider when building a customized supercomputer-in-a-box system with the help of experts from Silicon Mechanics.
When considering a large, complex system such as a high-performance computing (HPC) cluster or supercomputer, you may think you have only two options: build from the ground up, or buy the same pre-configured supercomputer-in-a-box from a major technology vendor that everyone else is buying. But there is a third option that takes a best-of-both-worlds approach: “building blocks” expertly designed around network, storage and compute configurations that are balanced, yet flexible enough to scale with your specific project needs.
Key Consideration #1: Scalability
Design flexibility
Whether you’re building a small, proof-of-concept project or aiming for something bigger from the start, you want to protect your investment and be sure that the system will adapt and grow as your project grows. Design flexibility is key here, with the ability to add nodes or racks to the hardware as needed.
Investment
If you use a customized configuration, your initial investment goes further – you don’t need to spend extra money on overhead built into similar, more well-known all-in-one systems that may not be as easy to expand upon incrementally.
Intelligent scalability
It’s important to scale intelligently. It does you no good to have a ton of compute nodes with no ability to feed them the data required for training the model. A customized configuration lets you pay for what you need and avoid a very expensive solution that gets bottlenecked on compute, storage, or networking.
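To make the balance argument concrete, here is a minimal sketch that compares the aggregate data rate each subsystem can sustain and reports the slowest one. The `ClusterPlan` fields and every number in the example are hypothetical placeholders chosen for illustration, not Silicon Mechanics or vendor specifications.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    """Hypothetical sizing inputs; every figure is a placeholder, not a vendor spec."""
    gpu_count: int
    ingest_per_gpu_gbs: float         # GB/s of training data each GPU can consume
    storage_nodes: int
    read_per_storage_node_gbs: float  # GB/s each storage node can serve
    network_links: int
    bandwidth_per_link_gbs: float     # usable GB/s per storage-to-compute link


def find_bottleneck(plan):
    """Return (subsystem name, GB/s) for the subsystem that caps the cluster."""
    capacity = {
        "compute": plan.gpu_count * plan.ingest_per_gpu_gbs,
        "storage": plan.storage_nodes * plan.read_per_storage_node_gbs,
        "network": plan.network_links * plan.bandwidth_per_link_gbs,
    }
    name = min(capacity, key=capacity.get)
    return name, capacity[name]


# Assumed example: 32 GPUs that can each ingest 2 GB/s, 4 storage nodes
# serving 10 GB/s each, and 8 links carrying 12.5 GB/s each.
plan = ClusterPlan(gpu_count=32, ingest_per_gpu_gbs=2.0,
                   storage_nodes=4, read_per_storage_node_gbs=10.0,
                   network_links=8, bandwidth_per_link_gbs=12.5)
print(find_bottleneck(plan))  # ('storage', 40.0)
```

In this assumed plan the GPUs could consume 64 GB/s but storage can deliver only 40 GB/s, so adding more GPUs would not help until storage is expanded.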
Future growth
These are some of the reasons why the flexible Silicon Mechanics Atlas AI Cluster configuration is designed to support future growth. Storage nodes and compute nodes can be added seamlessly down the road, and the performance of the cluster scales linearly with each node that you add. As your problem set grows, or as compute and storage requirements change, the system is designed to scale along with them.
Key Consideration #2: Storage
Scalability without limitations
Large data sets are required to deliver accurate AI results. Having this data drives incredibly large storage demands, and managing these data sets requires a system that can quickly scale without limitations.
“AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. The rocket engine is the learning algorithms but the fuel is the huge amounts of data we can feed to these algorithms.” – Andrew Ng, “The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future”
This often means lots of compute, but it also means being able to feed that compute. Traditional Network Attached Storage (NAS) is bandwidth-limited, so AI projects need an AI-first storage solution to pull in data effectively. Because the compute is so powerful, you need a storage solution that is purpose-built for AI training scenarios.
HPC-focused systems
High-performance computing has similar issues, but can use traditional parallel file systems that are capable of handling large streaming data sets. While the two storage systems might end up looking similar physically, an HPC-focused system is more likely to use a Lustre solution, whereas an AI system might use an AI-specific storage solution, such as one provided by Weka, together with an S3-compliant object storage tier.
Storage tiering
Storage tiering is another area that companies need to consider, since it helps minimize cost while maximizing availability, performance and recovery. However, not all storage tiering is equal. The key to tiering is to keep things as cost-effective as possible without suffering a performance penalty. Keep in mind that not every system needs Ferrari-like storage. An optimized system will help make sure your project has enough fast storage for hot data, balancing the rest with less expensive storage to meet regulatory or persistence requirements as needed.
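One simple way to picture a tiering policy is a rule that keeps recently accessed data on the fast tier and moves everything else to a cheaper capacity tier. The sketch below is illustrative only: the 30-day window, the tier names, and the dataset names are assumptions, and real tiering software typically weighs more signals than last-access time.

```python
from datetime import date, timedelta


def assign_tier(last_accessed, today, hot_window_days=30):
    """Place recently used data on the fast tier, everything else on the capacity tier."""
    age = today - last_accessed
    return "hot" if age <= timedelta(days=hot_window_days) else "capacity"


# Hypothetical datasets with the date each was last read.
today = date(2024, 6, 1)
datasets = {
    "training_set_current": date(2024, 5, 28),
    "training_set_previous": date(2024, 2, 10),
    "compliance_archive": date(2022, 11, 3),
}

placement = {name: assign_tier(seen, today) for name, seen in datasets.items()}
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

Under these assumptions only the current training set stays on the fast tier; the older set and the compliance archive land on less expensive storage.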
Key Consideration #3: Networking
Consider leading-edge technology that can help you get the best possible I/O for all that data. Two examples are below:
NVIDIA GPUDirect®
When moving data through an AI or ML algorithm, or training a neural network, you need the highest data throughput possible. GPUs are able to consume data much faster than CPUs, and as GPU computing power increases, so does the demand for I/O bandwidth. NVIDIA GPUDirect® can enhance data movement and access for NVIDIA GPUs. With GPUDirect, network adapters and storage drives can read and write GPU memory directly. This eliminates unnecessary memory copies, decreases CPU overhead, and reduces latency, all resulting in significant performance improvements. Through a comprehensive set of APIs, customers can access GPUDirect Storage, GPUDirect Remote Direct Memory Access (RDMA), GPUDirect Peer to Peer (P2P) and GPUDirect Video.
Connect-IB InfiniBand
Connect-IB InfiniBand adapter cards from Mellanox provide the highest performing and most scalable interconnect solution for server and storage systems. Maximum bandwidth is delivered across PCI Express 4.0 leveraging HDR 100 or 200 Gbps InfiniBand, together with consistent low latency across all CPU cores. The cards also offload protocol processing and data movement from the CPU to the interconnect, maximizing CPU efficiency and accelerating parallel and data-intensive application performance. They support data operations such as noncontinuous memory transfers, which eliminate unnecessary data copy operations and CPU overhead. Storage nodes also see improved performance with the higher bandwidth, and standard block-and-file access protocols can leverage InfiniBand RDMA for even more performance.
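As a rough worked example of what the 100 and 200 Gbps link speeds mean for data movement, the sketch below converts a line rate in gigabits per second to gigabytes per second and estimates the best-case time to move a dataset over a single link. The 10 TB dataset size is a hypothetical figure, and the `efficiency` parameter is an assumption that defaults to an ideal 1.0; real transfers lose some bandwidth to encoding and protocol overhead.

```python
def gbps_to_gigabytes_per_s(gbps):
    """Convert a line rate in gigabits per second to gigabytes per second."""
    return gbps / 8


def transfer_seconds(dataset_tb, link_gbps, efficiency=1.0):
    """Best-case seconds to move dataset_tb terabytes (decimal) over one link."""
    dataset_gb = dataset_tb * 1000
    return dataset_gb / (gbps_to_gigabytes_per_s(link_gbps) * efficiency)


# Hypothetical 10 TB dataset over a single link at each rate.
for rate in (100, 200):
    print(f"{rate} Gbps: {transfer_seconds(10, rate):.0f} s")
# 100 Gbps: 800 s
# 200 Gbps: 400 s
```

Doubling the link rate halves the ideal transfer time, which is why interconnect bandwidth matters as much as raw storage throughput when feeding GPUs.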
Over the next few weeks we’ll explore Silicon Mechanics’ new insideHPC Guide:
- Introduction, An Example of a Flexible Configuration Using Building Blocks
- Key Consideration #1: Scalability, Key Consideration #2: Storage, Key Consideration #3: Networking
- Taking a Holistic Approach – The Silicon Mechanics Perspective
Download the complete guide, “How Expert Design Engineering and a Building Block Approach Can Give You a Perfectly Tailored AI, ML or HPC Environment,” courtesy of Silicon Mechanics.