Over at TACC, Faith Singer-Villalobos writes that the new Chameleon Testbed is blazing new trails to HPC in the cloud.
In the world of advanced computing, computer scientists commonly use supercomputers to explore new technologies. Without supercomputers, the field of computer science would not make progress in developing better and more efficient algorithms, methods, and tools that help advance technology. However, supercomputing environments can have restrictive systems and software in place that cannot be modified to develop new, customized environments.
In the rapidly emerging and flexible computing paradigm of cloud computing, a new system was needed to address the academic research community’s needs to develop and experiment with novel cloud architectures and pursue new applications of cloud computing in customized environments. Chameleon, launched in 2015, was designed to do just that.
Funded by the National Science Foundation (NSF) and built in partnership by the University of Chicago and the Texas Advanced Computing Center (TACC), Chameleon is TACC’s first system focused on cloud computing for computer science research. The $10 million system is an experimental testbed for cloud architecture and applications, aimed specifically at the computer science domain.
“Cloud computing infrastructure provides great flexibility in being able to dynamically reconfigure all or parts of a computing system so that it can best suit the needs of the applications and users,” said Derek Simmel, a 15-year veteran of the Advanced Systems Group at the Pittsburgh Supercomputing Center (PSC). “With this flexibility, however, comes considerable complexity in monitoring and managing the resources, and in determining how best to provision them. This is where having an experimental facility like Chameleon really helps.”
Simmel is also an XSEDE (Extreme Science and Engineering Discovery Environment) expert who works on PSC’s Bridges, an NSF-funded XSEDE resource for empowering new research communities and bringing together high performance computing (HPC) and Big Data. Bridges operates in part as a cluster but also has the ability to provide cloud resources including virtual machines (VMs) and other dynamically configurable computational resources.
According to Simmel, Bridges presented new challenges because it is a non-traditional system deployed using OpenStack.
“The cloud infrastructure software itself (OpenStack) is also evolving rapidly, as computer scientists work to improve and expand its capabilities,” Simmel said. “Keeping up with new developments and changes in the way one operates all the component cloud services is a considerable burden to cloud system operators — the learning curve remains fairly steep, and all the expertise required for a traditional computing facility needs to be available for cloud-provisioned systems as well.”
Managing cloud computing infrastructure is as complex as managing an entire supercomputing machine room — all the software and services required for computing, networking, scheduling, monitoring, security, and software management are represented in a layer of cloud services that operates between the physical hardware and the virtual systems accessed by users.
PSC started using Chameleon in August 2015 and tested OpenStack on it for five to six months before the first Bridges hardware arrived in early 2016 (Bridges entered full production in July 2016). In addition to providing bare-metal reconfiguration capabilities, a modest partition of Chameleon has been configured with OpenStack KVM to provide a ready-made cloud for researchers interested in experimenting with cloud computing. This allowed Simmel and others to experiment with and optimize SLASH2, a PSC-developed distributed file system.
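On an OpenStack KVM partition like Chameleon’s, launching a test VM follows the standard OpenStack CLI workflow. A minimal sketch is below; the image, flavor, network, and key-pair labels are illustrative placeholders, not Chameleon’s actual catalog entries, and the credentials file name will vary by site:

```shell
# Placeholder names throughout -- substitute your project's actual
# image, flavor, network, and key-pair labels.
source openrc.sh                     # load OpenStack credentials for the project

# Boot a CentOS 7 instance on the KVM partition
openstack server create \
    --image "CentOS-7" \
    --flavor "m1.medium" \
    --network "sharednet1" \
    --key-name "psc-key" \
    slash2-client-01

# Attach a floating IP so the node is reachable for filesystem testing
openstack floating ip create public
openstack server add floating ip slash2-client-01 <allocated-ip>
```

Repeating the `server create` step with different instance names is how a fleet of identical client VMs, like the ones PSC used for filesystem testing, would typically be stood up.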
They tested deployment of the SLASH2 distributed filesystem on CentOS 7.x virtual machines provisioned using the Chameleon OpenStack environment. “A primary challenge was to identify and understand what happens to SLASH2 filesystems as the number of client systems mounting the filesystem scales up into the hundreds of nodes, and as data intensive applications and access patterns on some nodes affect the availability and performance for others,” Simmel said.
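PSC’s production tests ran against SLASH2 mounts across hundreds of VMs; the underlying measurement idea — watch how per-client performance degrades as concurrent clients on a shared filesystem scale up — can be sketched in miniature with ordinary local files. Everything here (function names, client counts, block sizes) is illustrative, not PSC’s actual test harness:

```python
import os
import tempfile
import threading
import time

def run_clients(num_clients, writes_per_client=20, block=4096):
    """Simulate `num_clients` concurrent writers against one shared
    directory and return each client's total wall-clock write time."""
    shared_dir = tempfile.mkdtemp(prefix="fs-scale-sim-")
    timings = [0.0] * num_clients
    payload = b"x" * block

    def client(idx):
        start = time.perf_counter()
        path = os.path.join(shared_dir, f"client-{idx}.dat")
        with open(path, "wb") as f:
            for _ in range(writes_per_client):
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force real I/O, as a networked mount would
        timings[idx] = time.perf_counter() - start

    threads = [threading.Thread(target=client, args=(i,)) for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings

# Sweep the client count upward and report how the slowest client fares
for n in (1, 4, 16):
    worst = max(run_clients(n))
    print(f"{n:3d} clients: slowest client took {worst:.3f}s")
```

Plotting the worst-case client time against the client count exposes the kind of contention hot spots the SLASH2 developers were looking for, except that in their case the shared resource was a distributed filesystem rather than a local disk.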
According to Simmel, these observations showed the SLASH2 developers which areas of the system are sensitive to scaling, to data-access patterns at scale, and to load management. They were able to implement improvements to SLASH2 that reduce contention and add configuration controls to sustain availability as the number of clients rose toward 1,000 and beyond, as on the new Bridges system. These improvements are now in production on the Bridges /pylon2 filesystem.
“In preparation for the Bridges supercomputer being fully deployed in July, it was very convenient for Chameleon to be ahead of us so we could use their resources to test configurations, deployment scenarios, and the scalability of SLASH2 on VMs that were provided by Chameleon,” Simmel said. “It was the right system available at the right time.”
PSC employs OpenStack to provision the Bridges system itself, and also to provide VMs and VM-based services for users. Now that Bridges is in full production, the lessons learned on Chameleon about OpenStack and SLASH2 deployment are being put to work for the domain scientists using Bridges and Bridges-provisioned VMs to run their HPC simulations and data analyses. “OpenStack is one of the leading software collections in the cloud arena right now – it’s developing very quickly. Chameleon gave us a working OpenStack environment at a time when we needed one. We needed to focus on developing solutions for SLASH2 and Bridges rather than tackling the difficulties of getting a large OpenStack system stabilized,” Simmel said.
“It’s often a challenge to test the scalability of system software components before a large deployment, particularly if you need low-level hardware access,” said Dan Stanzione, Executive Director at TACC and a Co-PI on the Chameleon project. “Chameleon was designed for just these sorts of cases – when your local test hardware is inadequate, and you are testing something that would be difficult to test in the commercial cloud – like replacing the available file system. Projects like SLASH2 can use Chameleon to make tomorrow’s cloud systems better than today’s.”
Users don’t notice the preparation behind the curtain, Simmel said, but it was helpful to know in advance what challenges PSC would face in scaling SLASH2 and where to prioritize efforts in solving problems. SLASH2 now runs in production on Bridges as one of the two primary file systems. “And access to that from VMs was greatly facilitated by our up-front testing on Chameleon before the machine was here,” Simmel said.
“It’s novel that there was another NSF resource elsewhere for us to use for HPC infrastructure development and testing,” Simmel concluded. “In the past, we’ve acquired equipment to experiment with before deploying a production HPC system, but it has been very limited — those machines didn’t allow us to try the higher-level testing that we needed to do with OpenStack and SLASH2. It was really convenient to have Chameleon available rather than having to home-grow our own system. We greatly appreciate the resources and service provided to us by the Chameleon project.”