Things to Know When Assessing, Piloting, and Deploying GPUs – Part 2

In this insideHPC Guide, “Things to Know When Assessing, Piloting, and Deploying GPUs,” our friends over at WEKA suggest that when organizations decide to move existing or new applications to a GPU-accelerated system, there are many items to consider, such as assessing the new environment’s required components, implementing a pilot program to learn about the system’s future performance, and considering eventual scaling to production levels.

Simply starting to assemble hardware systems and layering software solutions on top of them without a deep understanding of the desired outcomes or potential bottlenecks may lead to disappointing results. Planning and experimentation are essential parts of implementing a GPU-based system, a process that is ongoing and requires continual evaluation and tuning.

Pilot Program

Designing and implementing a full-scale system that uses GPUs can be complex, expensive, and prone to mistakes. A pilot program using a small number of systems with a reduced amount of data can lead to better outcomes once full workloads are implemented. Algorithm and accelerator choices, which depend on the CPU and GPU combinations and on where the data lives, are better understood and tuned in a pilot program than in a full-blown system.

  • On-premises vs cloud – Should the pilot program be implemented on an on-premises system or on a system provided by a cloud provider? With GPUs available on many instance types across many cloud providers, using these resources for a pilot program usually makes sense. Alternatively, purchasing a system that contains the necessary hardware and software for an on-site data center can be a cost-effective addition to existing infrastructure.
  • Flexibility – A pilot program gives the flexibility to experiment with CPU and GPU combinations. While assessing CPU-to-GPU ratios on paper may suggest how a running system will deliver results to the end user, hands-on experimentation with different combinations of these components gives the organization far more confidence.
  • Best choices – Which specific GPUs and CPUs are best for the software system? Many CPUs and GPUs are available today, each with varying core counts, clock rates, instruction set implementations, and I/O bandwidths. The highest-performing products come at additional cost but might not deliver better results. For example, a system that sends a lot of data to the GPUs for analysis and keeps the GPUs busy might not require the latest CPU, because CPU performance might not matter for that specific application. The pilot program can identify not only bottlenecks but also the best component choices (the first sketch after this list shows a simple way to measure this).
  • Applications – Do you use greenfield applications (those that are entirely new to an organization) or brownfield applications (those that are part of an existing infrastructure)? Many organizations already have applications that use GPUs but are looking to scale them, improve performance, or implement new features. A pilot program is ideal for this scenario, but the developer will need to “peel off” the code for the new feature or the area whose performance is under investigation. Additional data might also need to be collected for the pilot. Moving the entire data set to a public cloud provider is not necessary and could be expensive; only the small portion used to validate the new model or the software algorithms needs to move. Greenfield applications pose a different set of issues: beyond the required algorithms, where do you get the pilot data? Do you purchase it? Synthesize it? Borrow it? These choices will lead to decisions in a pilot program that might have future implications, so consider them early.
  • Timing – The length of the pilot program is also a required part of planning. Just getting a system running will not lead to conclusive results; weeks to months may be needed to understand the algorithms’ bottlenecks, the correct hardware sizing, and the anticipated storage needs. As with any new technology, the possibilities are endless: once customers see the opportunities that an accelerated system opens up, especially for AI applications, they can add capabilities to the pilot, gain experience with the new implementation, and even discover possibilities that had not been considered previously.
  • Education – Learning from a pilot program is an essential part of the eventual full-scale implementation of a system that utilizes GPUs. In fact, expect a continuous learning process throughout the pilot, because bottlenecks can occur even in a pilot environment. The GPUs might not be kept busy (the second sketch below shows a simple way to spot-check this), or the system might not scale as expected at first. Implementation teams might change their minds about which components to use, or, during their discovery in the pilot phase, decide to re-assess what success will look like at the end of the pilot program. That is why a flexible pilot is essential for meeting long-term goals.
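
As mentioned in the “Best choices” item above, a pilot is the right place to measure whether a workload is limited by data movement or by GPU compute. Below is a minimal, hypothetical sketch of such a check, assuming PyTorch and a CUDA-capable GPU; the matrix sizes and the matrix multiply are stand-ins for a real workload’s data and kernels.

```python
import time
import torch

assert torch.cuda.is_available(), "pilot host has no visible GPU"

# Host-side batch and device-side weights; sizes are arbitrary stand-ins.
batch = torch.randn(4096, 4096)
weights = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
batch_gpu = batch.to("cuda")        # host -> device transfer
torch.cuda.synchronize()
t1 = time.perf_counter()
result = batch_gpu @ weights        # device-side compute
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"transfer: {(t1 - t0) * 1e3:.1f} ms, compute: {(t2 - t1) * 1e3:.1f} ms")
# If transfer dominates, a faster GPU will not help much; look at I/O,
# storage, and the CPU-side pipeline instead.
```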

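And as the “Education” item notes, idle GPUs are one of the most common pilot findings. The following sketch, assuming NVIDIA GPUs and the pynvml bindings (the Python interface to NVML, the same library behind nvidia-smi), samples per-GPU utilization for about a minute; sustained low numbers point at the data pipeline rather than the accelerator.

```python
import time

# The nvidia-ml-py package (pip install nvidia-ml-py) provides pynvml.
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
)

nvmlInit()
try:
    handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]
    for _ in range(12):  # twelve 5-second samples, about one minute in total
        for i, handle in enumerate(handles):
            util = nvmlDeviceGetUtilizationRates(handle)
            print(f"GPU {i}: {util.gpu}% busy, {util.memory}% memory activity")
        time.sleep(5)
finally:
    nvmlShutdown()
```
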
Over the next few weeks we’ll explore Weka’s new insideHPC Guide in more detail.

Download the complete “Things to Know When Assessing, Piloting, and Deploying GPUs” guide, courtesy of Weka.