In this special guest feature, Dr Rosemary Francis from Ellexus describes why the customized nature of HPC is not a sustainable path forward for the next generation.
More and more industries are turning to HPC to be able to do their jobs. As the need to access, move and process big data becomes a mainstream activity, supercomputers are needed to allow people to work. Often, though, the people using them don’t know it’s ‘HPC’ at all.
There are pros and cons to this evolution. The main gain is that together, with the input of more industries and funding, we can continue to work towards faster compute, faster storage and that distant horizon, exascale. The HPC industry can serve a much wider pool of users and grow with them.
The downside is that many of our systems and tools are inaccessible to non-expert users. For example, deep learning is bringing more and more scientists closer towards HPC, but while they bring their knowledge, they also bring their high expectations for what they believe IT can do and not necessarily an understanding of how it works.
This isn’t sustainable, either for the new users of our systems who want to get going or for the IT managers who will be inundated with calls for help. Not only do we need to make everything – our tools, analytics, systems – more accessible, we also need to help educate those entering our domain.
Consider this: at the moment, all HPC setups are custom. Do they have to be? Every HPC company has a good reason not to use this software setting or that hardware vendor, but it means that every cluster is a bespoke offering. That is not scalable or sustainable. A lot of industry tools require specialist knowledge to use, and it’s not possible to roll them out to wider audiences. This holds back healthy workforce migrations: you don’t want good people to go, but you don’t want everyone to stay either. Workforce stagnation leads to a drop in skills and efficiency.
Cloud platforms could help make HPC more uniform, but they also pose new challenges. AWS might have a large market share, but lots of customers are turning to smaller cloud vendors for specific needs or setups. The reasons cited echo the reasons for needing a different scheduler or a mix of storage solutions. This repeats the patterns of vendor patchwork that you get today.
Whether or not our tools can be made more uniform, the skills challenge must be addressed. If we do create a more accessible HPC industry, we will be rewarded by a new wealth of knowledge and interest from the emerging workforce. There has always been competition to recruit talent into the HPC sector and as the sector grows and investment in HPC increases, we’re going to need more people on board.
As time runs out on Moore’s law, perhaps the same scaling issues also apply to the number of developers in the world. There is certainly a long lead time on ordering new ones. Coding courses help, but the step from taking a course to being a standalone programmer with the expertise and experience to write code that works is huge. Who is helping developers to cross that chasm?
These questions need answers: how can we make sure that HPC is included in the effort to bring up a more tech-savvy generation? How can we deliver an education that covers hardware and the full-service stack on a range of architectures: apps to racks? Take distributed computing: apps are often thin client architectures with big data back ends, but who is explaining that to children?
These are large questions indeed, but as our industry edges slowly but steadily towards a wider range of sectors, we will need to solve them. There are initiatives already. Raspberry Pi carries out HPC rack simulations and projects aimed at children and young adults. Events such as ISC run student cluster competitions. But this needs to go much further. It’s up to us to knock down the walls surrounding our sector.
Dr Rosemary Francis is CEO and founder of Ellexus, the I/O profiling company. Ellexus makes application profiling and monitoring tools that can be run on a live compute cluster to protect from rogue jobs and noisy neighbors, make cloud migration easy and allow a cluster to be scaled rapidly. The system- and storage-agnostic tools provide end-to-end visibility into exactly what applications and users are up to.