What is the best way to manage an HPC cluster serving a multi-tenant user base? We asked David Gignac, Senior Systems Administrator at the Texas Advanced Computing Center (TACC). Gignac is responsible for managing “Alamo,” a 96-node cluster that’s part of FutureGrid, a high-performance grid test bed for new approaches to distributed computing. Funded by the National Science Foundation, FutureGrid comprises 920 nodes distributed across eight clusters at sites in the U.S. and Germany, including TACC. Gignac has managed Alamo for three years as part of the five-year FutureGrid study, giving him unique insight into the challenges of managing an advanced multi-tenant HPC cluster.
insideHPC: What do you do for FutureGrid?
David Gignac: The FutureGrid Project is a distributed test bed for software developers and systems administrators focused on grid and cloud computing. It is designed to better understand the behavior of various cloud computing approaches, and to allow researchers to tackle complex projects. Anyone interested in testing code can join the effort and request FutureGrid resources online. Researchers may request up to five nodes configured with a specific kernel to test distributed file systems. A single request may specify 20 different software components. To meet their specific requirements, I generate a new image with each request. The crucial part of my job is capturing an image of each configuration, so the user can get back to where they started when the system is rebooted.
insideHPC: What do you do when you’re not managing FutureGrid?
David Gignac: In addition to FutureGrid, I am also responsible for managing more than 2,600 servers for a variety of other research projects at TACC. As with any network administrator, there are only a certain number of boxes I can realistically manage effectively. When you talk about clusters, the management requirement goes through the roof. I need to have a good solution to help me manage this complexity.
insideHPC: How do you keep up?
David Gignac: I depend on good cluster management applications. When I took on administration for Alamo, I reviewed a number of advanced management suites. With all of my other responsibilities, my top criterion was minimizing the amount of time I spend managing each cluster. I looked at cluster management software from all the major vendors, including Bright Cluster Manager, Cobbler/LOSF, Platform Computing products, Rocks and xCAT.
insideHPC: How did you choose the cluster management solution for Alamo?
David Gignac: My decision was based on minimizing the time required to manage the cluster, by automating time-consuming tasks and reducing complexity, balanced with providing a high level of service to our users. Drilling down, I needed a solution that would minimize the number of custom scripts I was required to write and something that would provide maximum ‘at a glance’ visibility into the health and operations of each cluster. In addition, I looked for something that would integrate seamlessly with Alamo’s job schedulers (Moab, Torque, Slurm and SGE) yet be nimble enough to accommodate simultaneous requests from researchers. In the end, I selected Bright Cluster Manager.
insideHPC: Three years later, how’s it going?
David Gignac: It’s been a great run. Bright and Fedora EPEL distros have saved a tremendous amount of time for me. Bright’s image-based provisioning lets me reconfigure Alamo on the fly to meet the specific needs of each researcher’s compute jobs. I click on a check box and the cluster management suite installs a server, sets up a client and I’m done. Further, Bright’s ease of use and full integration with job schedulers have produced major time savings. I don’t need to spend hours writing and maintaining scripts because everything just works. I get dedicated product support, so I don’t waste time searching forums and message boards for answers. In addition, I can easily reproduce any testing environment in minutes and rapidly deploy a new environment.
insideHPC: What’s next?
David Gignac: Cloud bursting. I think there’s an opportunity to experiment with hybrid cluster solutions. Bright lets me manage on-premises and remote cloud-based clusters seamlessly. It all looks the same through the management suite portal. I want to work with FutureGrid participants to test it in the program’s next two years.
insideHPC: And in your free time?
David Gignac: I certainly have more of that now, in spite of all the clusters I manage. Because of the time savings, I am spending more time making improvements on the clusters.