In this insideHPC Guide, “Things to Know When Assessing, Piloting, and Deploying GPUs,” our friends over at WEKA suggest that when organizations decide to move existing or new applications to a GPU-accelerated system, there are many items to consider: assessing the new environment’s required components, implementing a pilot program to learn about the system’s future performance, and planning for eventual scaling to production levels.
Simply starting to assemble hardware systems and layering software solutions on top of them without a deep understanding of the desired outcomes or potential bottlenecks may lead to disappointing results. Planning and experimentation are essential parts of implementing a GPU-based system, a process that is ongoing and requires continual evaluation and tuning.
Pilot Program
Designing and implementing a full-scale system that uses GPUs can be complex, expensive, and prone to mistakes. A pilot program using a small number of systems with a reduced amount of data can lead to better outcomes once full workloads are implemented. Algorithms, accelerator choices, CPU-and-GPU combinations, and the location of the data are better understood and tuned in a pilot program than in a full-blown system.
- On-premises vs cloud – Should the pilot program be implemented on an on-premises system or on a system provided by a cloud provider? With GPUs available on many instance types across many cloud providers, utilizing these resources for a pilot program usually makes sense; purchasing a system with the necessary hardware and software for an on-site data center may not be a cost-effective addition to the infrastructure for a pilot alone.
- Flexibility – A pilot program gives an organization the flexibility to experiment with CPU and GPU combinations. While assessing the ratio of CPUs to GPUs may suggest how a running system will deliver results optimally to the end user, experimenting with different combinations of these components gives the organization greater confidence in its final design.
- Best choices – Which specific GPUs and CPUs are best for the software system? There are many CPUs and GPUs available today, each with varying core counts, clock rates, instruction sets, and I/O bandwidths. The highest-specification products come at additional cost but might not deliver better performance for a given workload. For example, a system that streams a lot of data to the GPUs for analysis and keeps the GPUs busy might not require the latest CPU, as CPU performance might not matter for that specific application. The pilot program can identify not only bottlenecks but also the best component choices.
- Applications – Will you use greenfield applications (those that are entirely new to an organization) or brownfield applications (those that are part of an existing infrastructure)? Many organizations already have applications that use GPUs but are looking to scale them, improve performance, or implement new features. A pilot program is ideal for this scenario, but the developer would need to “peel off” the code for the new feature or the area whose performance is under investigation. Additional data might also need to be collected for the pilot program. Moving the entire data set to a public cloud provider is not necessary, and it could be expensive; only the portion used to validate the new model or the software algorithms needs to be moved. Greenfield applications pose a different set of issues: beyond the required algorithms, where do you get the pilot data? Do you purchase it? Generate it synthetically? Borrow it? These choices will shape decisions in a pilot program that might have future implications, so consider them early.
- Timing – The length of the pilot program is also a required part of planning. Just getting a system running will not lead to conclusive results. Weeks to months may be the optimal amount of time needed to understand the algorithms’ bottlenecks, the correct hardware sizing, and anticipated storage needs. As with any new technology, the possibilities are endless. After customers see the opportunities that an accelerated system offers, specifically with AI applications, they are able to add capabilities to the pilot so that they can gain experience with the new implementation and even discover possibilities that had not been considered previously.
- Education – Learning from a pilot program is an essential part of the full-scale implementation when working with a system that utilizes GPUs. In fact, expect a continuous learning process throughout the pilot, because bottlenecks can occur even in a pilot environment. The GPUs might not be kept busy, or the system might not scale as expected at first. Implementation teams might change their minds about which components to use, or during the discovery of the pilot phase they may decide to re-assess what success will look like at the end of the program. That is why a flexible pilot is essential for meeting long-term goals.
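One practical way to catch the “GPUs might not be kept busy” problem during a pilot is to sample GPU utilization while the workload runs: sustained low utilization usually means data movement or the CPU, not compute, is the bottleneck. Below is a minimal sketch in Python, assuming an NVIDIA system where the `nvidia-smi` tool is on the path; the function names and the 10% idle threshold are illustrative choices, not part of the guide.

```python
import subprocess

def parse_utilization(csv_text):
    """Parse the CSV output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    into a list of per-GPU utilization percentages."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def idle_gpus(utilizations, threshold=10):
    """Return indices of GPUs below the utilization threshold (percent).
    Persistent entries here during a pilot run suggest the pipeline is
    starving the GPUs rather than keeping them busy."""
    return [i for i, u in enumerate(utilizations) if u < threshold]

def sample_gpu_utilization():
    """Take one utilization sample via nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(out.stdout)
```

Sampling this in a loop over a representative pilot run, and logging which GPUs stay idle, gives concrete evidence for the component and sizing decisions discussed above.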
Over the next few weeks we’ll explore Weka’s new insideHPC Guide:
- Introduction, Assessment
- Pilot Program
- Scaling and Implementation, Related Considerations – Storage, Summary
Download the complete “Things to Know When Assessing, Piloting, and Deploying GPUs” guide, courtesy of Weka.