In this video from the 2018 OpenFabrics Workshop, Brian Barrett from Amazon presents “Amazon and Libfabric: A case study in flexible HPC Infrastructure.”
“Amazon Web Services’ EC2 Cloud Computing infrastructure allows users to dynamically build a variety of compute environments. Continual improvements in compute performance, available accelerators, and network performance have led to EC2 being an attractive platform for many HPC use cases. As network performance becomes a larger bottleneck in application performance, AWS is investing in improving HPC network performance. Our initial investment focused on improving performance in open source MPI implementations, with positive results. Recently, however, we have pivoted to focusing on using libfabric to improve point-to-point performance. Libfabric provides a number of features that make it ideal for Amazon’s development: changes in libfabric apply to the majority of MPI implementations, libfabric’s interface allows customers to experiment with programming interfaces other than MPI, and, most importantly, the hardware-agnostic interface of libfabric allows Amazon room to innovate across an ever-evolving set of hardware capabilities. We’ll talk about our current experiences in getting started with libfabric, capabilities we’d like to add in 2017, and how we think about HPC networking in the Cloud.”