A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers

In this video from the Switzerland HPC Conference, Maxime Martinasso from CSCS presents: Best Practices: A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers.

“MeteoSwiss, the Swiss national weather forecast institute, has selected densely populated accelerator servers as their primary system to compute weather forecast simulation. Servers with multiple accelerator devices that are primarily connected by a PCI-Express (PCIe) network achieve a significantly higher energy efficiency. Memory transfers between accelerators in such a system are subjected to PCIe arbitration policies. In this paper, we study the impact of PCIe topology and develop a congestion-aware performance model for PCIe communication. We present an algorithm for computing congestion factors of every communication in a congestion graph that characterizes the dynamic usage of network resources by an application. Our model applies to any PCIe tree topology. Our validation results on two different topologies of 8 GPU devices demonstrate that our model achieves an accuracy of over 97% within the PCIe network. We demonstrate the model on a weather forecast application to identify the best algorithms for its communication patterns among GPUs.”
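To give a rough feel for what a congestion factor on a PCIe tree could look like, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes a deliberately simple rule in which every concurrent transfer that crosses the same directed link shares it equally, so a transfer's factor is the highest number of transfers found on any link along its path. The names `congestion_factors`, `directed_links`, and the four-GPU topology are hypothetical and chosen only for illustration.

```python
from collections import Counter


def path_to_root(node, parent):
    """Return the nodes from `node` up to the tree root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def directed_links(src, dst, parent):
    """Directed links crossed by a transfer src -> dst in a tree topology."""
    up = path_to_root(src, parent)
    down = path_to_root(dst, parent)
    down_set = set(down)
    # Lowest common ancestor: first node above src that is also above dst.
    lca = next(n for n in up if n in down_set)
    links = [(n, parent[n]) for n in up[:up.index(lca)]]  # upstream hops
    links += [(parent[n], n) for n in reversed(down[:down.index(lca)])]  # downstream hops
    return links


def congestion_factors(transfers, parent):
    """Map each (src, dst) transfer to its assumed congestion factor.

    Assumption (not from the paper): transfers sharing a directed link split
    it equally, and the most heavily shared link on a path dominates.
    Transfers are assumed to be distinct (src, dst) pairs.
    """
    paths = {t: directed_links(t[0], t[1], parent) for t in transfers}
    usage = Counter()
    for links in paths.values():
        usage.update(links)
    return {t: max((usage[l] for l in links), default=1)
            for t, links in paths.items()}


# Hypothetical topology: one root complex, two switches, two GPUs per switch.
parent = {
    "sw0": "root", "sw1": "root",
    "gpu0": "sw0", "gpu1": "sw0",
    "gpu2": "sw1", "gpu3": "sw1",
}
transfers = [("gpu0", "gpu2"), ("gpu1", "gpu3"), ("gpu0", "gpu1")]
factors = congestion_factors(transfers, parent)
```

Links are treated as directed because PCIe is full duplex, so traffic going up and traffic going down the same physical link do not compete in this sketch. A real model would also have to account for the arbitration policies the abstract mentions, which this equal-share rule ignores.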

Maxime Martinasso is a computer scientist in the Future System group at CSCS. His interests focus on performance modeling, data science, and HPC technology. Previously, Maxime worked as an HPC specialist for a major oil company. He obtained his PhD in 2007 from Joseph Fourier University in France.

Download the Paper

