[SPONSORED GUEST ARTICLE] These are the times that try the souls of IT staffs tasked with deploying AI at scale. The neophyte services provider and the distracted support manager will, in the midst of rapid change, shrink from the complexities at hand; but they that meet the challenges head on and successfully deploy AI deserve admiration and gratitude.
With apologies to Thomas Paine*, getting “Big AI” up and running truly is a crisis too much of the time. A recent Harvard Business Review article estimated the AI project failure rate to be as high as 80 percent.
Effectively implementing AI at scale involves a whole host of design, build, deployment, and management considerations. It requires extensive experience with, and knowledge of, large computing clusters in high performance computing (HPC) and AI technologies; advanced networks, storage and integration technologies; and data center strategies for containing costs in energy-efficient, high-performance operating environments. Much as Paine’s The Crisis bolstered the case for change and rallied people at a pivotal time, today’s IT leaders need inspiration, guidance and great support.
Of the few companies that possess this distinctive mix of experience, expertise, and capabilities, Penguin Solutions is leading the way. With over 25 years of experience delivering HPC clusters at scale and a solid and growing track record in AI deployment and management, Penguin Solutions is a trusted partner for AI-powered firms across the globe.
A conversation with Patrick Ward, senior director of services at Penguin Solutions, helps explain why. Talking with Ward, you quickly get the sense that he’s the one you want with you in the trenches. He has deep technical knowledge and three decades of experience in IT, which give him a strong grasp of what organizations and their end users need. He clearly draws pleasure from his work, approaching new challenges with enthusiasm and a solutions-oriented perspective. What’s more, he exudes empathy, an essential attribute for any effective services professional. He understands the pains caused by systems breakdowns and refuses to accept IT downtime as a normal part of work life.
“Hey, if I’m a user, I just want to submit jobs,” Ward said. “The cluster should just work. This is no different from making a call. You pick up your phone and expect a connection. You don’t ever think about it. Users shouldn’t be worried if their cluster is going to work today. It should just work for them. That’s the experience Penguin provides through our services organization.”
Delivering on that mission is no small feat. Infrastructures for AI-at-scale are large and complex, with almost unlimited opportunities for something to go wrong. A cluster that “just works” is the product of continuous vigilance and comprehensive support services that keep everything running smoothly.
Penguin’s track record on this front is impressive. Building on its established prowess in HPC, the company has developed proficiency in the design, build, deployment, and management of AI clusters at scale. Starting in 2017, Penguin Solutions enabled Meta AI cluster deployments. Then, in 2021, Penguin Solutions provided AI-optimized architecture and managed services for Meta’s AI Research SuperCluster (RSC), which Meta said is “among the fastest AI supercomputers running today.” The RSC is composed of 2,000 NVIDIA DGX A100 systems that together contain 16,000 NVIDIA GPUs and 46 petabytes of cache storage.
Penguin has tens of thousands of GPUs under management across its many customers, which also now include Voltage Park and Georgia Tech. While the total volume is impressive, Ward is equally proud of the more modest clusters that his team deploys and manages.
“Everyone loves to talk about the 20,000-plus core clusters. Those are exciting, but there’s some great applications that are being used even in those lower levels, too, that we’re very proud of,” he noted. “We have many customers with small and mid-sized systems who are driving solutions to challenging computational problems, and those are great customer opportunities with great customer value.”
In many cases, AI adoption involves the largest IT investments that these companies have ever made. By partnering with Penguin Solutions, firms can deploy and achieve maximum workload throughput at speed so that they experience both accelerated computing and accelerated ROI.
Penguin Solutions also is known for its alignment of design and installation services, which helps firms avoid the hidden landmines that can stymie AI deployments.
“If there’s one remarkable area where we stand out, it’s the strong linkage between system design-to-deployment,” Ward remarked. “Those two working in concert, we’re able to stand up very large clusters faster than most.”
Ward has seen too many companies “get into trouble” by trying to handle design and installation in-house. AI differs from both HPC and traditional enterprise compute environments, with unique requirements related to power and cooling, network performance management, non-invasive system stability monitoring, and more. Penguin Solutions excels in these areas, which means that companies can avoid potential pitfalls and get their AI infrastructure up and running in short order.
Penguin Solutions offers services for every aspect of AI-at-scale:
- Professional Services: Advanced and emerging technologies are complex, difficult to configure, and require special care to deploy. Lack of familiarity can quickly escalate into delays and cost overruns, which can be devastating to budgets and project deadlines. Penguin Solutions engineers are certified in the latest technologies, with years of experience in data center infrastructure, HPC, AI accelerators, data storage, networking, and management software. Their expert configuration, installation, and deployment of AI at scale ensure that systems can be used right away.
- Managed Services: Managing HPC clusters, storage systems, private clouds, and other complex Linux-based environments can quickly consume IT budgets. Penguin Solutions offers a proactive IT services subscription that augments customers’ in-house IT teams to keep the compute environment running at peak performance. These services are powered by Penguin ClusterWare™, a cluster management software platform that can be integrated with familiar DevOps tools like Ansible and Git.
- Data Center Services: Penguin Solutions delivers the best design, deployment, and management experience in cloud, hybrid-cloud, and fully managed hosting situations. Its extensive experience designing optimized data centers, hosted or hybrid, and its strong partnerships with world-class data centers allow Penguin Solutions to deliver a variety of solid co-location hosting options.
- Design Services: The broad use of software-defined technologies has made the IT landscape highly fluid, placing new design, integration, and support burdens on in-house IT teams. Choosing the right technology design partner is essential for companies looking to unlock the promise of the latest innovations. At Penguin Solutions, experienced solutions architects help customers harness the power of rapidly changing complex technologies and provide agile design services that help customers solve their biggest technical challenges.
- Project Management Services: Expert project coordination and oversight ensure successful engagements with customers. Penguin’s certified project managers help customers address the full gamut of challenges that can arise, including downward budgetary pressures, “scope creep,” shifting priorities, multiple third-party vendors, complex software stacks, and competing stakeholder expectations. With support from a centralized Project Management Office, Penguin Solutions provides knowhow, tools, and resources suited to the specific needs of each project.
Organizations seeking to deploy HPC-class AI should partner with a vendor with the technology components and the support services to make it happen smoothly, efficiently and in alignment with the organization’s objectives. Penguin has a heritage of know-how with big clusters and big infrastructure. It’s a company and services organization that’s been there and continues to do it better than most.
* Thomas Paine, The Crisis