PSC Accelerates Machine-learning Algorithm with CUDA


Researchers at the Pittsburgh Supercomputing Center and HP Labs have achieved an unprecedented speedup of nearly 10X on a key machine-learning algorithm, k-means clustering. A branch of artificial intelligence, machine learning enables computers to process and learn from vast amounts of empirical data through algorithms that can recognize complex patterns and make intelligent decisions based on them. For many machine-learning applications, a first step is identifying how data can be partitioned into related groups, or “clustered.”

HP’s Ren Wu and PSC’s Joel Welling ran the test on the latest “Fermi” generation of NVIDIA GPUs. Using MPI between nodes (three nodes, with three GPUs and two CPUs per node), they observed a speedup of 9.8 times relative to an identical distributed k-means algorithm (written in C+MPI) running on all CPU cores in the cluster, and a speedup of thousands of times relative to the purely high-level-language implementations commonly used in machine-learning research. With their GPU implementation, the entire dataset, with more than 15 million data points and 1,000 dimensions, can be clustered in less than nine seconds. This breakthrough in execution speed will enable researchers to explore new ideas and develop more complex algorithms layered atop k-means clustering.
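
For readers unfamiliar with how k-means is spread across nodes, the following is a minimal Python sketch of the textbook data-parallel scheme. It is not the researchers' CUDA and MPI code, whose details are not given here. It assumes each worker holds one shard of the data, computes per-cluster sums and counts locally, and that those partial results are then combined to update the centroids. The function names (partial_stats, kmeans_step, kmeans) are hypothetical, and a plain loop stands in for the reduction across nodes.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def partial_stats(shard, centroids):
    """One worker's share: per-cluster coordinate sums and point counts."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in shard:
        # Assign the point to its nearest centroid.
        j = min(range(k), key=lambda c: dist(p, centroids[c]))
        counts[j] += 1
        for i, x in enumerate(p):
            sums[j][i] += x
    return sums, counts


def kmeans_step(shards, centroids):
    """Combine every worker's partial results and recompute the centroids."""
    k, d = len(centroids), len(centroids[0])
    total_sums = [[0.0] * d for _ in range(k)]
    total_counts = [0] * k
    for shard in shards:  # assumption: stands in for a reduction across nodes
        sums, counts = partial_stats(shard, centroids)
        for j in range(k):
            total_counts[j] += counts[j]
            for i in range(d):
                total_sums[j][i] += sums[j][i]
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / total_counts[j] for s in total_sums[j]]
        if total_counts[j] else list(centroids[j])
        for j in range(k)
    ]


def kmeans(shards, centroids, max_iters=100):
    """Repeat the step until the centroids stop changing."""
    for _ in range(max_iters):
        updated = kmeans_step(shards, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In this scheme only the small per-cluster sums and counts need to travel between workers, not the data points themselves, which is one reason k-means is a natural fit for clusters of GPUs.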