Things to Know When Assessing, Piloting, and Deploying GPUs

In this insideHPC Guide, “Things to Know When Assessing, Piloting, and Deploying GPUs,” our friends over at WEKA suggest that when organizations decide to move existing or new applications to a GPU-based system, there are many items to consider, such as assessing the new environment’s required components, implementing a pilot program to learn about the system’s future performance, and considering eventual scaling to production levels.

Simply starting to assemble hardware systems and layering software solutions on top of them without a deep understanding of the desired outcomes or potential bottlenecks may lead to disappointing results. Planning and experimentation are essential parts of implementing a GPU-based system, a process that is ongoing and requires continual evaluation and tuning.

Introduction

Traditionally, graphics processing units, or GPUs, were used to process 3D content in gaming. Today, GPUs have become a key technology for applications in manufacturing, life sciences, and financial modeling. GPUs speed up simulations thanks to their fast parallel processing capabilities and are now used extensively in Artificial Intelligence (AI) and Machine Learning (ML) applications. Creating a large-scale environment that utilizes GPUs takes planning, piloting, implementing at scale, and, finally, evaluation.

For an effective production environment using GPUs, organizations need a defined process that contains the following activities:

  • Assessment
  • Pilot Program
  • Scaling implementation for anticipated workloads

Within the assessment and pilot activities, project leaders should expect to complete multiple iterations to achieve a clear understanding of the components and how to tune them. Figure 1 represents the process and shows that iterating on these steps is key to a successful implementation at scale.

Assessment

Creating a large-scale system that responds to user demands starts with understanding the challenges that a company or organization is trying to solve. Depending on the industry and algorithms used, a number of areas need to be understood when planning and deciding whether a GPU-based installation is right for the workloads.

  1. What is the expected result from a new system design? Customers need to understand what the business needs and wants. Better decisions? More accurate voice recognition? Assistance with guiding software simulations? The goals must be understood and attainable before starting a new project and designing a working system.
  2. What are the fundamental algorithms that will be used, and what is the nature of the software? If the software chosen is provided by a third party and has already been optimized to use GPUs, then the project and implementation can proceed more quickly. However, if the implementation will use new algorithms that require software development, the implementation team needs a deep understanding of the software requirements to determine whether GPUs will shorten the application’s run time.
  3. How much data will be used? Even if GPUs might lower an application’s running time, they may not be the correct solution if the quantity of data is low. When determining how much data a new software system will analyze, looking back at the amount of data collected in the past day, week, month, or year is extremely useful. Just as important is looking at what types of data are helpful for the task at hand. Just because 1 TB of data per day has been collected does not mean that a new system must analyze all of it. Specific tools may be needed to separate the data that will actually be used from the data that is merely captured. More data does not equate to better data.
  4. A key to any system is understanding the data itself: where it is coming from and what insights it might contain. The metadata for any system needs to be examined to understand various properties of the data. Implementing software that can open up the metadata is not a one-time effort; comparing the metadata over a period of time, whether hours or weeks, will give insight not only into data growth but also into trends in data types, complexity, and relevance to the study at hand.
  5. Understanding the type of data to be used is critical to developing an optimized system. For example, when analyzing images or videos, GPUs would be the natural choice, while making sense of IoT information may be better served by CPUs. Much of this will depend on a combination of the amount of data that is anticipated to be of value and, of course, the type of data.
  6. Besides the CPU and GPU choices, other infrastructure components need to be understood at the time of planning. Network bandwidth and storage types are integral parts of a smooth-running system that must be mapped out early to identify which of these pieces must be upgraded or purchased new.
  7. The eventual sizing of the infrastructure (CPUs, accelerators, networking, and storage capacity) should support the data volumes, the latencies required by the SLAs, and the available budget. A sizing exercise starts at the desired result and works backward to the required hardware and framework. All aspects should be understood, including network latencies and bandwidths and the storage system’s ability to deliver data quickly to where it is needed. An example of incorrect sizing would be connecting GPUs that can consume data at 12 GB/s during inferencing to remote or local storage that can deliver only 2-6 GB/s (see the first sketch after this list).
  8. When assessing performance, evaluate pipeline bottlenecks, not just performance bottlenecks. For example, how do the GPU servers share the data? Do they copy it to local drives first? That copy takes time and is a pipeline bottleneck rather than a strict performance bottleneck (see the second sketch after this list).
  9. When planning for performance, keep in mind that the required performance on Day 1 is not the same as the requirement for Day 2, Day 30, Day 100, and so on. Set performance requirements for several stages of growth (e.g., more GPUs or more data scientists), because the requirements at each stage will be different. Soliciting expert input helps the implementation team understand the current situation and the future state in no uncertain terms. When planning a new or upgraded system to analyze or learn from existing processes, getting help from specialists with experience in the domain should be standard practice. An objective look at the amount of data in the system, the possible algorithms, and the various options will ultimately lead to a better implementation at scale.
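
As a rough illustration of the sizing exercise described in item 7, the sketch below compares the aggregate read throughput a GPU server needs against what the storage tier can sustain. The GPU count, per-GPU ingest rate, and storage throughput are hypothetical placeholders, not measurements from any particular system.

```python
# Back-of-the-envelope check: can the storage tier keep the GPUs fed?
# All figures are hypothetical placeholders for illustration.

NUM_GPUS = 8                    # GPUs in one server
GPU_INGEST_GBPS = 12.0          # per-GPU data consumption while inferencing (GB/s)
STORAGE_THROUGHPUT_GBPS = 6.0   # sustained read throughput of the storage tier (GB/s)

required = NUM_GPUS * GPU_INGEST_GBPS
shortfall = required - STORAGE_THROUGHPUT_GBPS

print(f"GPUs need  : {required:.1f} GB/s aggregate")
print(f"Storage has: {STORAGE_THROUGHPUT_GBPS:.1f} GB/s sustained")

if shortfall > 0:
    # The GPUs will sit idle waiting for data; size storage and network up,
    # or accept a utilization ceiling of roughly this fraction.
    utilization = STORAGE_THROUGHPUT_GBPS / required
    print(f"Storage-bound: GPU utilization capped near {utilization:.0%}")
else:
    print("Storage can keep up with the GPUs at these rates")
```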

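To make the pipeline-bottleneck point in item 8 concrete, here is a minimal sketch that estimates end-to-end job time when the dataset is first staged to local drives versus read directly from shared storage. The dataset size, copy bandwidth, and compute time are assumptions chosen only to illustrate the comparison.

```python
# Minimal sketch: end-to-end job time including data staging (a pipeline
# bottleneck) versus pure compute time (a performance bottleneck).
# All figures are assumptions for illustration.

DATASET_GB = 2000.0       # total data the job reads
COPY_BW_GBPS = 2.5        # bandwidth when copying to local drives first
SHARED_READ_GBPS = 6.0    # bandwidth when streaming from shared storage
COMPUTE_HOURS = 1.5       # GPU compute time once data is available

def hours(seconds: float) -> float:
    return seconds / 3600.0

# Option A: stage the dataset to local drives, then compute.
stage_time = hours(DATASET_GB / COPY_BW_GBPS)
option_a = stage_time + COMPUTE_HOURS

# Option B: stream from shared storage, overlapping I/O with compute;
# the job takes whichever is slower, the reads or the compute.
stream_time = hours(DATASET_GB / SHARED_READ_GBPS)
option_b = max(stream_time, COMPUTE_HOURS)

print(f"Stage to local drives: {option_a:.2f} h ({stage_time:.2f} h spent copying)")
print(f"Stream from shared   : {option_b:.2f} h")
```
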
Over the next few weeks we’ll explore Weka’s new insideHPC Guide:

  • Introduction, Assessment
  • Pilot Program
  • Scaling and Implementation, Related Considerations – Storage, Summary

Download the complete Things to Know When Assessing, Piloting, and Deploying GPUs courtesy of Weka.