In this special guest feature, Joe Landman from Scalability.org writes that the move to cloud-based HPC is having some unexpected effects on the industry.
I am planning on getting back to a more regular cadence of writing. I enjoy it, and hopefully, I don’t annoy (all) readers.
To start, a brief retrospective on the last decade.
First, having experienced this first hand, it’s important to talk about the use pattern shifts. In 2010, cloud for HPC workloads wasn’t even an afterthought. Basically the large cloud provider (AWS) and the wannabes were building cheap machines and maximizing tenancy. Their (correct) argument was to gain scale and users first; more use cases and differentiation would follow.
For traditional HPC vendors, this meant that on-prem was not only still alive and growing, but doing so without anything approaching a competitive scenario. Competition was strictly for on-prem, with much of the “cloud” or “grid” or “utility” computing going on in the hosted data centers. Everyone ran or contracted for their own DCs.
The speed race was all about time to solution for a problem set. Not time to solution for building a system. That could still take weeks to months.
During this time, users played with cloud solutions, and until about 2017 or so, found most of them wanting. I had heard some serious complaints about performance from many users. I was not, however, complacent in my thinking. I knew that eventually, the CSPs (cloud service providers) would gain a toehold.
This said, some use cases, usually high-throughput processing workloads that are not performance sensitive, found a home in the cloud. It turns out that the important metric to design for, for these groups, is cost per unit of throughput. If you can minimize that, they can achieve their goals in the manner they wish.
The conclusion I had drawn from that is that different metrics apply to different HPC user communities. Price performance, or how much your performance will cost, isn’t the only metric. Nor is it a particularly good one.
My company (and numerous other smaller companies) died or were acquired during this time. The market consolidated as the HPC market started becoming aware of this competition.
Today, we have machines capable of HPC operations in clouds. In some cases, they are configured well, in many cases they aren’t. The latter group are still usable to some extent, but this has resulted in an interesting dynamic playing out. More on this below.
Second, the speed of setup of resources has become important. This has always been an issue to some degree, but the cloud has fundamentally transformed expectations of time-to-operations, configurability, etc.
A typical HPC environment often takes a while to build before it can provide services to users. For HPC users used to a batch environment, this involves a whole supply chain aspect, a facilities and installation aspect, a post-install configuration aspect, and a testing aspect. All of this happens prior to use. The cloud has effectively eliminated that for the user. That is, when you purchase a cloud “HPC” product, you can achieve productivity in time scales measurable in hours to days, where previously weeks to months was common.
It cannot be overstated how important this is. There is literally an opportunity cost associated with the non-compute/storage/network parts of HPC, that you pay while you are working on setting this all up. Or pay while you are waiting for it to be set up.
This does come at an economic cost. But that is all part of a reasonable economic model … do you want this now, or later, and what is the cost to you for later vs now? Amortise this across the time spent waiting, and you realize that it’s not a terrible cost.
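To make the now-versus-later trade concrete, here is a minimal sketch of the arithmetic. The function names and every number in the usage note are hypothetical, chosen only for illustration; plug in your own estimate of what a day of waiting costs you.

```python
def waiting_cost(value_per_day: float, days_waiting: float) -> float:
    """Value forgone while the system is not yet usable (hypothetical model)."""
    return value_per_day * days_waiting


def premium_per_day_saved(premium: float, days_onprem: float, days_cloud: float) -> float:
    """Amortise the extra cost of 'now' across the waiting days it removes."""
    days_saved = days_onprem - days_cloud
    if days_saved <= 0:
        raise ValueError("the faster option must actually save time")
    return premium / days_saved


def faster_option_pays_off(premium: float, value_per_day: float,
                           days_onprem: float, days_cloud: float) -> bool:
    """True when the opportunity cost avoided exceeds the premium paid."""
    avoided = (waiting_cost(value_per_day, days_onprem)
               - waiting_cost(value_per_day, days_cloud))
    return avoided > premium
```

As an illustration with made-up numbers: if a day without the system costs you 5,000, on-prem takes 60 days to stand up, the cloud takes 2, and the cloud carries a premium of 100,000, then the avoided waiting cost is 290,000 and the premium works out to roughly 1,724 per day saved.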
Third, HPC workloads have changed … expanded. I’ve been talking about this for decades. HPC goes down market. To service more possibilities. Currently AI and its sub-areas in use and being developed (ML/DL, etc.) are computationally intensive and can leverage accelerators well.
So HPC as a workload has expanded to include these, and many similar efforts. I don’t see this reversing course. I see this expanding, rapidly.
On-prem the way we did it in 2010 isn’t going to solve this today in 2020. We’d need a new model. I know a couple of models for this, and I can’t comment on some of them, as they are directly related to my day job. Since I am not paid to blog here, no adverts either, I won’t discuss them. I will simply point back to the fact that status quo for on-prem isn’t a model that will work well.
Fourth, and continuing the point from the first section, architectural changes continue. Not just machines with a substrate processor and accelerators. Application level architectural changes. Application design as a coupled set of services/functions distributed amongst a network.
HPC has, in the preceding N years (for N ~ 20+ years), been mostly about a single process, or a single (HPC) system (from an administrative view), minimizing time to solution for a single computation. MPI and other tools figured strongly into this. MPI, as one may recall, doesn’t have a great track record when ranks fail. You need checkpoints in order to survive longer runs on many computational elements.
Today, and in the future, virtual network meshes, often private/encrypted, are in use on shared computational resources. Service meshes and health checks are employed to manage individual functions and services. Failure of part of a network means spinning up replacement parts quickly and automatically. Root cause analysis (RCA) may be done post mortem, or not.
Applications are written in (gasp) Python, usually with calls out to C/C++/Fortran libraries for performance. Though my personal preference is for Julia (more in future posts), Python is adequate as a glue language.
More important than these is how cloud system constraints are fundamentally changing the architecture of the underlying system choices. Looking at various cloud service providers’ offerings, you realize that you do not get the full capability of the underlying system. You get a quality-of-service (maybe) managed version of that across compute, IO, and network. That is, if you need X units of some resource capability from machine instances that supply Y each, you need more than X/Y machines to provide it.
This is similar to issues with on-prem systems, though often the value of Y is a small fraction of what an on-prem system is capable of. This is a point for a smart competitor to provide differentiation.
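The X/Y sizing argument can be sketched in a few lines. This is only an illustration: `delivered_fraction` is a hypothetical parameter standing in for whatever share of the nominal Y an instance actually sustains under the provider's quality-of-service limits, and the numbers in the usage note are invented.

```python
import math


def instances_needed(required: float, nominal_per_instance: float,
                     delivered_fraction: float = 1.0) -> int:
    """Whole instances needed to supply `required` units of a resource.

    `delivered_fraction` (hypothetical) models the share of the nominal
    per-instance capability that is actually sustained.
    """
    if required < 0 or nominal_per_instance <= 0:
        raise ValueError("required must be >= 0 and nominal_per_instance > 0")
    if not 0 < delivered_fraction <= 1:
        raise ValueError("delivered_fraction must be in (0, 1]")
    effective = nominal_per_instance * delivered_fraction
    # Instances come in whole units, so round up.
    return math.ceil(required / effective)
```

For example, needing 100 units from instances rated at 8 units each gives a naive X/Y of 12.5, so 13 instances at best. If each instance only sustains 75% of its rating, the count rises to 17.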
Another aspect of this is that some CSPs have machine instances where they have made a specific choice about how to allocate resources that work well in general cases, but poorly for HPC applications. I’ve worked with a few of them, and they’ve discussed their choices with me. I know how to fix these things, but it would require a change in their process/design, and they are unwilling to do so. This is another point for a smart competitor to provide differentiation.
Fifth, system and job management … cluster operating systems are on the decline. Now we have container management systems. Job management systems have coalesced to a number which are container aware or container exclusive. Jobs themselves, which were batch files in the 2010s, are now largely “self contained” containers.
The latter is a good thing. A side effect of the switch in the 80s/90s to dynamic linking was dependency-radius hell. Modules were created as a way to manipulate environments to help users manage this. Containers are a somewhat more modern view, providing a “complete” set of dependencies with the application. This is because, in 2020, memory and disk are relatively cheap, as compared to 1990s when they were not.
That is the fundamental point of this post. Economics and more importantly, what users value, have changed for the HPC market over time. They will continue to change. There is nothing wrong with that.
The challenge for vendors is to figure out where the users want to be, and to enable them to get there better than the rest, while making sure the economics are viable. This is non-trivial.
The above is also why there has been significant consolidation in the market. My own employer was purchased by another, while the latter has laid out a plan to become an *-as-a-Service provider. The Cray brand name is tremendous, and hopefully will continue on.
AMD is re-emergent as a CPU force. I have used 2x Ryzen 5 APUs (a term I coined in 2005 writing white papers for AMD) recently, and am convinced that they are onto something really good with this architecture and product. Extremely good performance, excellent price. My next laptops/desksides/servers will likely be AMD based.
NVIDIA dominates all in GPUs for HPC. AMD still doesn’t have a CUDA API compatible system (I know about ROC/HIP/…). I still think that is important. Intel is about to launch Xe graphics, and has launched oneAPI, neither of which I’ve looked into in depth. We’ve seen significant complacency from Intel when they didn’t have significant competition. Now that they do, things are getting better for HPC users.
ARM. According to many pundits, this is the year of ARM. Or the decade of ARM. Or something.
Color me skeptical. It’s all about the toolchain, and the least common ISA that the toolchain gets to leverage. I know my employer has shipped and will ship ARM machines. I just don’t think they will become commonplace. Happy to be wrong, though I don’t think I am.
Similar comments for RISC-V, though I think it probably has a better shot if the ISA doesn’t become as fractured as ARM has been in the past.
Storage … near and dear to my heart. Very few people followed my architectural example from the 2010s. Which means in the 2020s, they are just now catching up with the performance that my systems had back then. smh.
What continues to be a problem and will get worse over time is data motion. I’ve been saying this for more than 25 years. Data motion is hard. Data motion is expensive.
This isn’t just network speed, but that is part of it. It’s 2020 now, and 200Gb HDR IB is here and working. I do know of a few people still spec’ing out on-prem 10GbE links (seriously???) in this day and age, when faster (in both bandwidth and latency) systems are available at effectively the same cost. IMO, this would be considered malpractice, as this infra remains unchanged as your data grows. If you don’t design for what you need in 18-36 months, you’ve hobbled yourself, significantly.
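A back-of-the-envelope sketch shows why the link choice matters. It assumes ideal line rate (no protocol overhead, no storage bottleneck) and decimal units, so real transfers will be slower; the function name and the 100 TB dataset in the usage note are made up for illustration.

```python
def transfer_seconds(data_terabytes: float, link_gigabits_per_s: float) -> float:
    """Ideal time to move a dataset over a link, ignoring all overheads.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits per second.
    """
    if link_gigabits_per_s <= 0:
        raise ValueError("link speed must be positive")
    bits = data_terabytes * 1e12 * 8
    return bits / (link_gigabits_per_s * 1e9)


def transfer_hours(data_terabytes: float, link_gigabits_per_s: float) -> float:
    """Same estimate, expressed in hours."""
    return transfer_seconds(data_terabytes, link_gigabits_per_s) / 3600
```

Moving a hypothetical 100 TB dataset takes about 22 hours at 10 Gb/s under these ideal assumptions, and a little over an hour at 200 Gb/s. As the data grows, the slower link's penalty grows with it.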
Wetware, meatspace, however you prefer to call it. These are eyes and experience in applications, domain knowledge, and HPC, or people who can be trained in or learn the domain. I suspect that this will be one of the hardest problems to solve.
This isn’t just a data scientist or engineering problem. I am talking also about people who understand the computational flow, how the elements tie together, and recognize problems when they arise. That is, domain knowledge and HPC system specialists. There aren’t many good ones out there, and they don’t usually have certifications. They have battle scars. They have solved difficult problems under less than ideal circumstances, with severe deficits of time, budget, and people. We are out there, and we need more of us.
Basically, I think the rise of the CSPs in HPC will open up new capabilities for users. But they will also open up new attack surfaces for smart competitors. Architectural changes for applications, systems, etc. will continue to push a holistic minimization of time to solution, with the idea that longer times to solution increase cost of solution. People who understand HPC, and how to make it work in the context of users’ goals, will likely be more highly prized.
This should be an interesting decade to be in HPC.
Joe Landman is the editor at Scalability.org. His day job is Director of Cloud Solutions and DevOps at Cray.