Sun’s HPC community portal has a YouTube clip of an interview with Tommy Minyard and Kelly Gaither of TACC about Ranger.
Kelly and I were in grad school together…she’s good people.
(Tip o’ the hat to Sun’s HPC Watercooler for the pointer.)