At Virtual SC20: An Update on the Fraunhofer Institute’s Carme, Where HPC Meets Interactive Machine Learning


At Virtual SC20, we spent time with Philipp Reusch, Scientific Assistant at Fraunhofer ITWM in Kaiserslautern, Germany. Reusch is closely involved in the development of the institute’s Carme (“kar-mee”), a framework for managing resources for multiple users running interactive AI jobs on a cluster of (GPU) compute nodes. Carme, by the way, is the name for a cluster of moons around Jupyter – sorry, Jupiter – all of them orbiting the planet in syncopation.

Carme started as a research project to enable a group of users to work interactively with JupyterLab and Jupyter Notebook on an HPC system. The name was a good fit, Reusch said, because it refers to a succession of moons around Jupiter, which matches what the project is doing and aiming for.

Machine learning on HPC clusters, Reusch said, presents challenges, and Carme is designed to combine the best of both AI and HPC. On the one hand, it builds on solid and proven HPC tools, such as batch systems and parallel file systems; on the other hand, it uses newer tools, such as containers and Web IDEs.

Achieving such a combination raises several questions: How should existing resources be managed? How can an application be made to scale across several GPUs? How can data be stored and continuously fed to the program? How can users be trained to use the hardware effectively?