Blog Post Looks at Managing High Performance GPU Clusters

HP’s vice president of HPC, Marc Hamilton, blogs on the importance of monitoring and managing GPUs in a cluster to ensure optimal system performance. HP has updated its Cluster Management Utility (CMU) for just that purpose.

Twitter follower HPCGuru recently asked what could be monitored besides GPU temperature. Two of the more important things that CMU is configured to monitor automatically are GPU and IOH temperature. Given that the M2070 spec sheet lists a power consumption of 225 watts, it is no surprise that GPU temperature is something you want to monitor (most x86 CPUs, by comparison, consume between 95 and 130 watts). But the IOH doesn’t stand out as a big heat source. As it turns out, when you are driving two GPUs at full speed, along with a QDR IB link, the IOH runs consistently hot.
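
For readers who want to try this kind of check by hand, here is a minimal sketch of per-GPU temperature and power polling. It is not how CMU works; it assumes the NVIDIA `nvidia-smi` tool is installed on the node and uses its CSV query output. The 85 °C limit and the function names are hypothetical, chosen only for illustration, and IOH temperature is not covered because `nvidia-smi` does not report it.

```python
import subprocess

# Hypothetical alert threshold in degrees Celsius (an assumption, not an M2070 spec value).
TEMP_LIMIT_C = 85

# nvidia-smi query that prints one "index, temperature, power" CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_number(field, cast):
    """Convert a CSV field, returning None for values such as '[N/A]'."""
    try:
        return cast(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'index, temp, power' lines into a list of dicts."""
    readings = []
    for line in text.strip().splitlines():
        index, temp, power = [field.strip() for field in line.split(",")]
        readings.append({
            "index": int(index),
            "temp_c": _to_number(temp, int),
            "power_w": _to_number(power, float),
        })
    return readings


def hot_gpus(readings, limit=TEMP_LIMIT_C):
    """Return the indices of GPUs whose temperature is at or above the limit."""
    return [
        r["index"]
        for r in readings
        if r["temp_c"] is not None and r["temp_c"] >= limit
    ]


def read_gpu_stats():
    """Run nvidia-smi on the local node and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a node with NVIDIA drivers installed, `hot_gpus(read_gpu_stats())` would list the GPUs currently at or above the limit. A cluster tool would collect the same readings from every node and plot them over time.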