Things to Know When Assessing, Piloting, and Deploying GPUs – Part 3


In this insideHPC Guide, “Things to Know When Assessing, Piloting, and Deploying GPUs,” our friends over at WEKA suggest that when organizations decide to move existing applications or new applications to a GPU-influenced system there are many items to consider, such as assessing the new environment’s required components, implementing a pilot program to learn about the system’s future performance, and considering eventual scaling to production levels.

Simply starting to assemble hardware systems and layering software solutions on top of them without a deep understanding of the desired outcomes or potential bottlenecks may lead to disappointing results. Planning and experimentation are essential parts of implementing a GPU-based system, a process that is ongoing and requires continual evaluation and tuning.

Scaling and Implementation

After the assessment has been performed and the pilot program has shown acceptable results, it is time to move on to the full-scale implementation phase. There are many factors to consider in the move to a production system.

  1. Acquire the necessary new hardware and gather the existing, compatible equipment. If an organization already owns hardware similar to the pilot program’s systems, it can be reused in the production environment. The delivery model can now also include using systems, storage, and networking at a public cloud provider.
  2. Ensure that the network infrastructure can handle the higher workloads that a successful system will require as the data, the user base, and the applications grow.
  3. Plan for future growth and scale. If your successful pilot program can model the full-scale implementation, then a serious implementation plan should include projected data workflows and infrastructure requirements that extend 5-10 years into the future. Fundamentally, the system’s architecture should remain the same if the homework has been done in advance; adding more resources should then allow the project to scale transparently and proceed successfully.
  4. Understand and monitor where the bottlenecks are today, and scope out the possibility that the bottlenecks will move as the implementation grows. For example, the storage system may deliver data quickly in a small-scale implementation, but it might not keep up with additional scaling. Bottlenecks can affect the CPUs (not enough horsepower or clock rate), the GPUs (waiting for data), the storage systems (inability to deliver the data or insufficient capacity), or the network (insufficient bandwidth for GPU-based computing). A simple monitoring sketch follows this list.
  5. Perform an in-depth investigation to determine whether the end installation should be housed within a corporate data center or with a cloud provider. For various reasons, this decision affects many levels of the organization from the top down. Not only will the costs differ, but the implementation team needs a clear understanding of whether and how a cloud provider can house the required hardware, storage, and networking infrastructure, and any on-premises support must be planned for. While each part of an organization will have to decide for itself, careful consideration of the following can help:
  • On-Premises – If the most recent releases of CPUs and GPUs are needed, and an organization wants to use features of these new components, even pre-release, then on-premises housing might be the correct choice. Be sure to discuss fast networking when exploring the viability of the on-premises vs. cloud options, however, as public cloud providers might not be able to supply the desired instances and the required networking concurrently, whereas an on-premises installation can provide this combination. Also, if the costs of the storage requirements and of running the servers full-time are high, hosting on-site in a corporate data center might be the correct choice. In other words, account for both CAPEX and OPEX when considering this model. While data security might have been a reason to remain on-premises previously, this could be less of an issue moving forward for some organizations.
  • Cloud Provider – Many organizations do not have the in-house expertise to install and maintain high-end servers that contain acceleration technology. This expertise is paramount when working with a storage system that relies on a wide range of technologies. For such companies, using a public cloud provider might be the best choice. Smaller organizations might not be able to access the latest technology on their own and might have to rely on public cloud providers. In most cases, using a cloud provider will increase OPEX, while CAPEX will be relatively minor.
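As a concrete illustration of the bottleneck monitoring called out in item 4 above, the minimal sketch below polls GPU utilization through NVIDIA’s NVML Python bindings. This is our own illustrative example, not something the guide prescribes: it assumes NVIDIA GPUs and the pynvml package, and the threshold and sampling interval are placeholders. Sustained low utilization during a data-heavy run is a strong hint that the GPUs are waiting on storage or the network rather than on compute.

```python
# Minimal GPU-starvation check using NVIDIA's NVML bindings (pynvml).
# Threshold and sampling interval are illustrative placeholders.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    for _ in range(60):                      # sample for roughly one minute
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            if util.gpu < 30:                # sustained low utilization suggests the
                                             # GPU is waiting on data, not compute
                print(f"GPU {i}: only {util.gpu}% busy -- check the storage/network feed")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

Run alongside a representative workload, a simple loop like this makes it obvious whether the bottleneck sits with the accelerators themselves or with the components feeding them.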

Related Considerations—Storage

The performance of a large HPC or AI system that is based on GPUs or other accelerators usually depends on the utilization rates of the GPUs themselves. Another critical component in the efficient running of these systems is the choice of storage. If CPUs or GPUs are starved for data, then expensive resources are not being used efficiently. Feeding the hardware that processes the data should be an upfront design decision, not an afterthought.

In the past, storage hardware relied on spinning disks with mechanical parts to retrieve data. These hard disk drives (HDDs) have been available for decades. While their capacity has increased over time (although more slowly than Moore’s law) and is expected to keep increasing, the latency and bandwidth between the HDD and main memory have not improved as quickly. Solid-state drives (SSDs), based entirely on electronics rather than physically rotating components, have changed the storage landscape quite quickly. Many organizations’ storage systems have been based on various applications sending data sequentially to a disk or a set of disks. In high-performance environments, writing to a single disk drive creates a significant bottleneck and slows down the entire system.

Parallel file systems have been developed and used for quite some time in HPC environments. While a parallel file system reduces bottlenecks, traditionally these file systems have been difficult to deploy, requiring storage experts to install and monitor them within complex environments. Also, legacy parallel file systems could not tier storage for new and innovative applications. Tiering refers to placing the most frequently used data closer to the processing units and the less frequently used data on slower, less expensive storage devices.
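To make the tiering idea concrete, the toy sketch below demotes files that have not been accessed recently from a fast tier to a cheaper one. The paths and the 30-day policy are hypothetical, and real parallel file systems implement tiering internally and far more intelligently; this only illustrates the basic policy of placing cold data on less expensive storage.

```python
# Toy age-based tiering: files untouched for 30+ days are moved from a fast
# (e.g. NVMe) tier to a cheaper, slower tier. Paths and policy are hypothetical.
import shutil
import time
from pathlib import Path

FAST_TIER = Path("/mnt/nvme/hot")       # hypothetical fast tier
SLOW_TIER = Path("/mnt/archive/cold")   # hypothetical cheap/slow tier
MAX_AGE_SECONDS = 30 * 24 * 3600

now = time.time()
for f in FAST_TIER.rglob("*"):
    if f.is_file() and now - f.stat().st_atime > MAX_AGE_SECONDS:
        dest = SLOW_TIER / f.relative_to(FAST_TIER)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(dest))  # demote the cold file
```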

Different implementations may have varying ratios of CPUs to GPUs. Depending on this ratio and the workloads, the requirements on the file system may vary. An implementation with just a few hundred CPU cores assigned to process older data may be able to wait for data to arrive from less performant storage devices. In contrast, an implementation with many thousands of GPU cores needs data from higher-performance devices.
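A rough back-of-envelope calculation like the one below can help size the storage system for a given GPU count. The per-GPU ingest rate is entirely workload-dependent, so the figures here are placeholders to be replaced with numbers measured during the pilot, not recommendations.

```python
# Back-of-envelope sizing: aggregate read bandwidth the storage system must
# sustain to keep every GPU fed. All figures are illustrative placeholders.
num_gpus = 256              # planned GPU count at full scale
gb_per_sec_per_gpu = 2.0    # per-GPU ingest rate observed in the pilot
headroom = 1.5              # margin for growth and contention

required_gb_per_sec = num_gpus * gb_per_sec_per_gpu * headroom
print(f"Storage must sustain roughly {required_gb_per_sec:.0f} GB/s of reads")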

Getting data to the GPUs or other accelerators directly from the storage system requires taking advantage of the latest technology. After all, the GPU controls the input and the output. “Talking” directly to the storage sub-system understandably speeds up performance, because I/O does not have to move through the main CPU. The advantage of this direct setup with a parallel file system is two-fold:

  1. Keeping the GPUs busy
  2. Allowing the CPUs to perform other tasks and not be slowed down by I/O traffic management

For example, with applications that rely heavily on the GPUs, a speedy parallel file system must deliver data to the GPUs directly, often without involving the CPU.
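One way to see what this direct path looks like in practice is NVIDIA’s GPUDirect Storage. The minimal sketch below uses the RAPIDS kvikio bindings (an assumption on our part; the guide does not name a specific library) to read a file straight into GPU memory, so the data does not have to bounce through a CPU buffer on platforms that support it. The file path is hypothetical.

```python
# Minimal GPUDirect Storage-style read via the cuFile API (RAPIDS kvikio
# bindings): file contents land directly in GPU memory. The file path is
# hypothetical; where the direct path is unavailable, kvikio can fall back
# to a conventional copy through host memory.
import cupy
import kvikio

n_floats = 1_000_000
gpu_buffer = cupy.empty(n_floats, dtype=cupy.float32)

f = kvikio.CuFile("/mnt/parallel_fs/training_shard.bin", "r")
try:
    bytes_read = f.read(gpu_buffer)   # data moves storage -> GPU memory
    print(f"Read {bytes_read} bytes directly into GPU memory")
finally:
    f.close()
```

Paired with a parallel file system that supports this direct path, reads like this keep the GPUs busy while leaving the CPUs free for other work, which is exactly the two-fold advantage described above.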

Summary

When organizations decide to move existing applications or new applications to a GPU-influenced system there are many items to consider, such as assessing the new environment’s required components, implementing a pilot program to learn about the system’s future performance, and considering eventual scaling to production levels. Simply starting to assemble hardware systems and layering software solutions on top of them without a deep understanding of the desired outcomes or potential bottlenecks may lead to disappointing results. Planning and experimentation are essential parts of implementing a GPU-based system, a process that is ongoing and requires continual evaluation and tuning.

Over the last few weeks, we have explored Weka’s new insideHPC Guide.

Download the complete Things to Know When Assessing, Piloting, and Deploying GPUs courtesy of Weka.