Helping Scientists with System Management Software

In this special guest feature from Scientific Computing World, Tom Wilkie writes that while end-user scientists and engineers fear the complexity of running jobs in HPC, there are software toolkits available to help.

Scientists and engineers want results – fast and without fuss. Hardware suppliers and systems integrators are making computers faster, by increasing the number of cores, adding accelerators or co-processors, and speeding up the interconnects. But these hardware innovations inevitably make HPC clusters more complicated, and so less easy to use. Meanwhile, governments are trying to push the technology down the path to Exascale computing. In the outside world, data-centric computing is becoming as important as traditional numerical computation (see Why storage is as important as computation), leading companies to try to integrate resources from the cloud as a way of complementing and extending their in-house compute resources. And one issue of growing importance, lying behind the whole of high-performance computing, is that of rising energy bills (see Power and the processor).

Caught in the middle are the companies who supply the software tools that make configuration of a cluster easier, or that manage the scheduling, resourcing, and the workload of compute jobs running on the machines and out into the cloud. In the face of all these pressures, the software tool-makers are innovating furiously.

Innovative software tools

At the International Supercomputing Conference (ISC’14) in Leipzig at the end of June 2014, Adaptive Computing offered delegates a sneak preview of the latest release of its Moab optimization and scheduling software: Moab HPC Suite-Enterprise Edition 8.0 (Moab 8.0). A month later, at the end of July 2014, Bright Computing, which provides cluster management software, announced that $14.5 million of venture capital funding was being invested in the company to develop its products further and to expand its markets.

By September, Altair had a major release of its workload management software, PBS Professional, in beta testing, with a release date for the new software – version 13 – early in 2015. Meanwhile SysFera, which was spun out of the French computing research organization INRIA in 2010, timed its next release for SC14, the supercomputing conference and exhibition in New Orleans in November 2014. SysFera stresses its innovation credentials, with participation both in national research projects, such as SOP (global Services for persOnal comPuter) run by the French Agence nationale de la recherche, and in the EU Framework Programme project PaaSage, on Model-based Cloud Platform Upperware.

The competitive landscape is changing too, according to Matthijs van Leeuwen, CEO and founder of Bright Computing, to the point where Bright is one of the few independent companies left offering cluster management tools. Platform Computing has been absorbed by IBM, while StackIQ appears to be repositioning itself more towards the Big Data market. According to its website, StackIQ’s software “automates the deployment, provisioning, and management of Big Infrastructure [for] fully configured Big Data clusters.”

Managing resources and managing workloads

Van Leeuwen stressed the distinction between software tools that are intended to manage the compute resources – hardware, operating system, middleware, libraries, security, etc – and those that are focused on optimizing the way in which the application software runs on the underlying resources. Bright Computing is distinctive in that it focuses on managing the resources, but offers more than that by ensuring that, once installed, Bright Cluster Manager (BCM) integrates with, manages, and monitors all common HPC workload managers.

Bright Computing is already ahead on the issue of the cloud, according to van Leeuwen: “This is a trend we have seen coming for quite some time. We have made it very easy to stand up a complete HPC cluster inside the cloud and manage it the same way as you would manage your own on-premises cluster.” However, he believes that the software is particularly innovative in the way that it has made it possible to extend an existing, on-premises cluster into the cloud. For example, if a company has 50 nodes on premises but needs to double its capacity over 30 days to accomplish a particular task: “We make it almost trivial to add another 50 nodes inside a public cloud like Amazon. And to my knowledge, we are unique there – I am not aware of anyone else that can do that.”

And there is a demand for this, especially in the life sciences, van Leeuwen continued. “All the pharmaceutical companies are running some of their workload in the cloud. Most of them run it separately from their on-premises resources, so they need to take several manual steps to add capacity in the cloud, and to move applications from on-premises to the cloud,” he said. Bright Cluster Manager, in contrast, gives them the advantage of running across on-premises and the cloud. Ease of use was key, he said: “They don’t have to move the workload over to the cloud manually – they can do it automatically, and even dynamically based on policies.”
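
The policy-driven bursting van Leeuwen describes can be pictured with a small sketch. It is purely illustrative and is not Bright Cluster Manager's interface: the `BurstPolicy` class, its fields, and the `cloud_nodes_to_add` function are hypothetical names, and the rule itself (rent cloud nodes only when queued demand outgrows the free on-premises nodes by more than a tolerance, and never beyond a fixed cap) is an assumed policy chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPolicy:
    """Hypothetical bursting policy; field names are illustrative assumptions."""
    max_cloud_nodes: int   # hard cap on nodes rented from the public cloud
    queue_tolerance: int   # shortfall (in nodes) tolerated before bursting


def cloud_nodes_to_add(queued_nodes: int, free_on_prem: int,
                       current_cloud: int, policy: BurstPolicy) -> int:
    """Return how many extra cloud nodes the assumed policy would request."""
    shortfall = queued_nodes - free_on_prem
    if shortfall <= policy.queue_tolerance:
        return 0  # on-premises capacity is (nearly) enough, so do not burst
    headroom = policy.max_cloud_nodes - current_cloud
    # Never request more than the shortfall, and never exceed the cap.
    return max(0, min(shortfall, headroom))


policy = BurstPolicy(max_cloud_nodes=50, queue_tolerance=5)
print(cloud_nodes_to_add(queued_nodes=80, free_on_prem=10,
                         current_cloud=0, policy=policy))
```

With a cap of 50 cloud nodes, a backlog of 80 queued node requests against 10 free on-premises nodes produces a request for 50 cloud nodes, which mirrors the example of doubling a 50-node cluster. A real policy would also weigh cost, data location, and how long the extra capacity is needed.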

While Bright Computing is well known in the HPC market, van Leeuwen sees the recent capital injection as allowing it to develop further into areas such as Hadoop and OpenStack. “Our customers pulled us into those markets,” he said. “People think that Hadoop and OpenStack install on bare metal – on empty servers. But they are unpleasantly surprised when they discover that they first need to prepare the servers, install the operating system, configure the operating system, configure a network, and only then are they ready to install Hadoop or OpenStack. And then they discover that it is really hard to do. OpenStack is really a bunch of independent services that need to be configured individually and with each other. There’s millions of ways of getting it wrong.”

With OpenStack as an add-on on top of BCM, however: “We have made it much easier to get started with OpenStack, while still allowing, afterwards, any expert change that you would like – but most users don’t need these expert options.”

Hadoop is simpler to install and is not as extreme as OpenStack but, according to van Leeuwen: “You have still to do all the preparation work yourself unless you use BCM.” He views Hadoop as partially a form of HPC: “A lot of people attracted by Hadoop don’t need Hadoop; they need traditional HPC but they don’t realise it. When you do Big Data analysis, it is very close to traditional HPC data analysis. We can combine traditional HPC workloads with Hadoop workloads. If you want to do that on a cluster that is deployed and built with one of the management tools provided by Hadoop distributions, you are in trouble. If you take Cloudera Manager, for example, that is really focused on managing Hadoop clusters – it takes full control of your cluster, it’s very possessive, and does not allow you to do any other types of workload anymore.” He cites Cray as a solid HPC company that is also moving into the Big Data, Hadoop world. Cray is a significant Bright partner, using BCM to manage all the external service nodes.

One ecosystem for converging technologies

The convergence of Big Data, cloud, and HPC lies behind Moab 8.0, the new release of Adaptive Computing’s optimization and scheduling software tools. Speaking at the preview at ISC’14 in Leipzig in late June, Jill King stressed the extent of the innovation in the new release, with its focus on enabling companies to take “data-driven decisions for competitive advantage.” The software would allow users “to improve results without having to install new infrastructure” – by a factor of between two and three in terms of performance, according to the company. And Adaptive Computing too is integrating OpenStack into its package.

The new release is a single ecosystem, King said, that provides the wherewithal to unify resources: in HPC and Big Data; across private and public clouds; and across virtualised and bare metal machines. As engineers and scientists want to process intensive simulations and carry out Big Data analysis to accelerate insights, this convergence of Big Data and better workflows leads to the concept of ‘Big Workflows’, which is at the heart of the new release, she said. In creating the Big Workflow solution, she continued, Adaptive Computing sought to understand the dependencies and thus remove log jams in the workflows. The result is dynamic scheduling, provisioning, and management of many applications running across HPC, cloud, and Big Data computing.

Power consumption is directly addressed, she continued. King said that the software can slow the CPUs, by decreasing the clock frequency, and this can be done through a policy set out for all similar jobs. In addition, it offers power states and options including ‘suspend’ and ‘hibernate’ that go beyond simple ‘on/off’ control. The company believes that Moab 8.0 can cut energy costs by between 15 and 30 per cent.

Scalability, reliability, and performance

Power consumption is also one of the items on the agenda for Altair in developing the latest version of its PBS Works software suite, according to Bill Nitzberg, the chief technical officer. “Power really resonates,” Nitzberg said. “We are rolling out and will be demonstrating a whole new power management framework, and set of capabilities.” Operators will be able to look at the status of the system and see how the power increases in the same way that they can see the CPU time increase, while the jobs are running. “In any sort of science, you need to be able to measure,” Nitzberg said, “and we’re rolling out the tools to be able to measure power usage. When people see what they are using, they will change their behavior.” It’s moving along the road to treating power and energy as a resource, he said: “You could even start charging people based on their power usage.”
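
Treating energy as a chargeable resource, as Nitzberg suggests, comes down to simple accounting once power readings exist. The sketch below is a hedged illustration of that arithmetic only: it does not use any PBS Works interface, the function names are hypothetical, and both the sample format (time in seconds, power in watts) and the flat price per kilowatt-hour are assumptions. Energy is estimated by integrating the power samples over time with the trapezoidal rule.

```python
from typing import Sequence, Tuple

JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def energy_kwh(samples: Sequence[Tuple[float, float]]) -> float:
    """Estimate energy from (time_s, power_w) samples sorted by time.

    Uses the trapezoidal rule between consecutive samples.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / JOULES_PER_KWH


def energy_charge(samples: Sequence[Tuple[float, float]],
                  price_per_kwh: float) -> float:
    """Charge for a job under an assumed flat tariff."""
    return energy_kwh(samples) * price_per_kwh


# A job drawing a steady 1,000 W for one hour uses 1 kWh.
job_samples = [(0.0, 1000.0), (1800.0, 1000.0), (3600.0, 1000.0)]
print(energy_kwh(job_samples))
```

The same per-job figure could sit next to CPU time in an accounting report, which is the comparison the passage draws.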

Just as Jill King stressed the innovation within Moab 8.0 for Adaptive Computing, so Nitzberg said that version 13 of PBS Professional, currently in beta testing and due for release in the first quarter of 2015, “is probably one of our biggest releases.” While the original focus on the Exascale front – providing new ways to deal with scalability, reliability, and performance – has been retained, the software’s reach has been extended in the course of its development to embrace power management, usability, and improvements to the Windows version.

All those extra nodes and cores that are being added as HPC moves down the road towards Exascale are a challenge that this version of the software is addressing, according to Nitzberg. “We are targeting more than 100,000 nodes, so millions of cores; hundreds of jobs a second throughput – that’s millions of jobs a day.” Altair has replaced PBS Professional’s current communications system with a new, more modern, tree-based connection fabric that will foster fully multithreaded, fully non-blocking communication between those parts of the software that need to talk to each other.
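
A back-of-the-envelope check makes these targets concrete. The sketch below converts a sustained jobs-per-second rate into jobs per day, and estimates how many relay levels a tree needs before its fan-out covers a given number of endpoints. The fan-out values are assumptions for illustration only; the article does not describe the shape of the tree in PBS Professional, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def jobs_per_day(jobs_per_second: float) -> float:
    """Daily throughput if the per-second rate is sustained all day."""
    return jobs_per_second * SECONDS_PER_DAY


def tree_levels(endpoints: int, fanout: int) -> int:
    """Smallest number of levels d such that fanout**d >= endpoints."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels, reach = 0, 1
    while reach < endpoints:
        reach *= fanout
        levels += 1
    return levels


for rate in (100, 200, 500):
    print(f"{rate} jobs/s -> {jobs_per_day(rate) / 1e6:.2f} million jobs/day")

for fanout in (2, 32):  # assumed fan-outs, not taken from the article
    print(f"fan-out {fanout}: {tree_levels(100_000, fanout)} levels")
```

Even the low end of “hundreds of jobs a second” works out to 8.64 million jobs a day, and a tree reaches 100,000 endpoints in a handful of levels, which is why a tree-shaped fabric is a natural fit at this scale.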

As HPC machines become ever larger and more complex, there will always be some components that fail, so “We have put in a lot of additional node health-check infrastructure,” Nitzberg continued. But sometimes nodes may have failed without this being detected until a very large job is started and the job then aborts prematurely. However, “with some of the non-blocking stuff, we’re also targeting the ability to start up very large jobs – half the size of the system or larger – first time, despite finding failures we did not know about.” To achieve this, the protocol in PBS Professional for how jobs start has been changed so that very large jobs can start again using different nodes. “It’s a fantastic release that we have architected for Exascale,” he said. “As we scale up to 100 Petaflop, I think we’re in great shape. Going a further factor of 10, I don’t know what I’m going to find.”
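
The idea of restarting a very large job on different nodes can be sketched in a few lines. This is not the PBS Professional start-up protocol, whose details the article does not give; it is a minimal illustration under stated assumptions. The function `start_large_job` and the `launch_on` callback are hypothetical, and the callback stands in for whatever launch-time step reveals a node that had failed without being detected.

```python
from typing import Callable, List, Sequence, Set, Tuple


def start_large_job(nodes: Sequence[str], needed: int,
                    launch_on: Callable[[str], bool]) -> Tuple[List[str], int]:
    """Try to start a job on `needed` nodes, swapping out nodes that fail.

    Returns the nodes the job finally started on and the number of attempts.
    Raises RuntimeError when too few working nodes remain.
    """
    excluded: Set[str] = set()
    attempts = 0
    while True:
        candidates = [n for n in nodes if n not in excluded]
        if len(candidates) < needed:
            raise RuntimeError("not enough working nodes to start the job")
        selection = candidates[:needed]
        attempts += 1
        # Failures that were unknown beforehand surface only at launch time.
        failed = {n for n in selection if not launch_on(n)}
        if not failed:
            return selection, attempts
        excluded |= failed  # retry without the nodes that just failed


# Ten-node system, job needs six nodes; two nodes are silently broken.
cluster = [f"n{i}" for i in range(10)]
broken = {"n2", "n4"}
print(start_large_job(cluster, 6, lambda node: node not in broken))
```

Here the first attempt trips over the two broken nodes, and the second succeeds on six healthy ones. Each failed attempt removes at least one node from consideration, so the loop always ends, either with a running job or with a clear error.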

Getting the message out about the importance of a piece of infrastructure software such as PBS Professional can be complicated. Nitzberg said: “Where our value is, is that we make the lives of end-users more productive, more natural, easier, and more friendly. To tell that value story, we need to tell the end-users, but they’re not the ones installing the software. It’s not end-user desktop software, so we really end up talking to the system-administration teams, the system managers, and then a lot of times to the engineering managers – the folks whose teams they want to make more productive.”

End users and system administrators

David Loureiro, CEO of the French software company SysFera, also sees both communities as beneficiaries of his software tools: SysFera-DS provides ease of use for end-users, but also information for project managers and IT administrators. For companies that are running a large number of simulations and where R&D is important: “We are providing a way of getting rid of all the complexities of technical computing, because those people are engineers and scientists, not IT specialists or computer scientists,” he said. Through a web-based interface, SysFera-DS can provide “all the things that people want, who are running simulations, performing large-scale computations, and have a workflow of pre-processing, data staging, batch processing.” But, he went on, while the software was providing a way to run simulations, it was also “providing to system administrators and project managers a unified interface to see how the projects are running; how the resources are used; and how to provision new resources seamlessly – with the capability of managing large-scale multisite infrastructure, that could gather a classical HPC cluster, a supercomputer or even a cloud platform.”

There are two elements to the SysFera-DS package, Loureiro explained. “Through the web-based interface, users can submit jobs, manage their data, manage projects, get some monitoring and reporting, visualise their data remotely. All the web-based solution is proprietary,” he said. “But you may need to manage other resources, and to manage all this heterogeneity we do provide, within SysFera-DS, software called Vishnu, which is open source. This middleware is here to abstract all the complexity.”

Customers feel that buying new HPC resources implies that the end-user will have to learn a new way to access resources and, he said, that is putting a lot of pressure on administrators, because they have to standardize how to use applications. “We are providing through our software a way to do this, so they can concentrate instead on the core business, and so the set-up time is really short when new resources are put into production.”

Ultimately, Loureiro envisages removing the complexities for end-user scientists and engineers so as to create “an HPC desktop” for them. He pointed to initiatives, both in France and internationally, to integrate ‘the HPC way of doing things’ with the work of the independent application-software vendors, to remove the mystique from HPC and thus widen its uptake among end-user scientists and engineers – providing HPC as a service and helping the end-user to concentrate on their core business. “We are not helping people to parallelize an application; we are helping them to be the most efficient they can with their existing infrastructure. People have to take time setting things up rather than using them.”

Loureiro concluded: “The problem we are trying to solve is that the users have applications, but it is really hard to run them on the right machine at the right time. They need something that will allow them to concentrate on what they really want to do.”

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
