One of the first supercomputers installed by Lenovo following its acquisition of IBM’s x86 business recently started up at the University of Oxford in the UK. Unusually in the highly devolved structure of the University, the Advanced Research Computing (ARC) machine, a 5,280-core cluster, is a central facility serving all departments.
The new cluster, named Arcus Phase B, had been under test since January but was officially unveiled on 14 April. Originally, Oxford had placed the order with IBM but, as Dr Andrew Richards, Head of Advanced Research Computing at the University, pointed out in an interview, the transfer of the x86 business meant that: “Our ongoing relationship is with Lenovo. We dealt directly with Lenovo and have a close working relationship with them.” Scientific Computing World had reported just a month earlier on Lenovo’s determination to pursue the high end of the x86 market, in an article headlined “Lenovo gets serious about HPC”.
The UK-based company OCF acted as integrator for the cluster, which consists of Lenovo NeXtScale servers with Intel Haswell CPUs connected by 40Gb/s InfiniBand to an existing Panasas storage system. OCF also upgraded the storage system, adding 166TB to give a total capacity of 400TB. Existing Intel Ivy Bridge and Sandy Bridge CPUs from the University of Oxford’s older machine are still running and are being merged with the new cluster.
According to Dr Richards: “HPC at university level is more off-the-shelf than it used to be, but we still had special requirements.” One of those was to add 20 Nvidia Tesla K40 GPUs at the request of the Networked Quantum Information Technologies Hub (NQIT), which is modeling possible quantum computers of the future. This was a co-investment, according to Dr Richards, whereby instead of buying a few nodes for its own dedicated machine, the NQIT benefits from being part of a bigger machine. They “had a large capital budget and chose to go with us, so they are supported as a high-priority customer.”
Other researchers using Arcus are involved in preparations for the Square Kilometre Array, “looking at how you process some of the large data streams – it’s a software and hardware design problem,” according to Dr Richards.
But among the 120 or so active users per month are some unexpected users of HPC, including anthropologists using agent-based modeling to study religious groups. Opening up to non-traditional HPC users means that demands on the machine are different, Dr Richards explained. These users tend to do data processing rather than straight number-crunching, so their jobs hold large amounts of data in memory and need nodes with more memory.
As a facility intended to support the work of all departments, the capital expenditure came from central university funds (with the addition of funding from NQIT) but operational expenditure is not financed centrally. Three out of the four academic divisions in the university have opted to pay a “subscription” so that their staff can access the machine free at the point of use, whereas the medical sciences division has opted for a “pay as you go” model.
The installation of the new system also gave ARC the opportunity to improve the system management software in order “to bring consistency to the user experience.” Part of its request to OCF was for the installation of the Simple Linux Utility for Resource Management – more usually known as the Slurm Workload Manager. This can support GPUs and the heterogeneous mix of different CPUs (as noted above, there are three generations of Intel CPUs within the cluster) – something the previous Torque scheduler was unable to do.
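To illustrate what scheduling across GPUs and mixed CPU generations can look like from a user’s side, here is a minimal Python sketch that assembles a Slurm batch script. It is not taken from ARC’s configuration. The `--gres` and `--constraint` options are standard `sbatch` options, but the job name, the command and the `haswell` feature label are hypothetical: feature labels are defined by each site’s administrators, and `build_sbatch_script` is a helper invented for this example.

```python
def build_sbatch_script(job_name, command, gpus=0, cpu_feature=None, nodes=1):
    """Return the text of a Slurm batch script (illustrative helper, not an ARC tool)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
    ]
    if gpus > 0:
        # --gres requests generic resources; the "gpu" resource must be
        # configured by the site for this request to be accepted.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    if cpu_feature:
        # --constraint restricts the job to nodes carrying an
        # administrator-defined feature label (assumed here: one per CPU generation).
        lines.append(f"#SBATCH --constraint={cpu_feature}")
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical job: two GPUs on nodes labelled with an assumed "haswell" feature.
script = build_sbatch_script(
    "example-sim", "srun ./simulate", gpus=2, cpu_feature="haswell"
)
print(script)
```

A job that does not care which CPU generation it lands on would simply omit the constraint, leaving the scheduler free to place it on any suitable node.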
Slurm is open source and is widely used on machines in the Top500 list of the world’s fastest supercomputers. ARC has also gone for the open source xCAT (Extreme Cloud Administration Toolkit) cluster management software. According to Dr Richards: “We couldn’t justify the expense of commercial software solutions.” Open source software seemed the right way to address the problem.
He pointed out that the purpose of ARC was to underpin research and that within a university, “HPC is an at risk environment.” If the machine has to come down for maintenance, or suffers an unexpected outage, that is something the users will accept. Commercial companies using HPC may well opt for commercial software rather than open source because of the support that is offered, as continuous operation of the computer will be vital to the business.
But for the university, one of the service’s remits is to help students build their knowledge of HPC, so Dr Richards and his colleagues at ARC sit down with them and get involved with their research workflow, their datasets and how those datasets need to be processed. In this way, students gain the experience they need to go on to bid for time on national facilities like Archer.
“I don’t see our facility as just running a big machine. We’re here to help people do their research,” Dr Richards concluded.