

Slidecast: Cycle Computing Powers 70,000-core AWS Cluster for HGST


Jason Stowe, CEO, Cycle Computing

Has Cloud HPC finally made its way to the Missing Middle? In this slidecast, Jason Stowe from Cycle Computing describes how the company enabled HGST to spin up a 70,000-core cluster from AWS and then return it 8 hours later.

One of HGST’s engineering workloads seeks an optimal advanced drive head design. In layman’s terms, the workload runs 1 million simulations of designs based on 22 different design parameters across 3 drive media types. Run on an in-house, specially built simulator, it takes approximately 30 days to complete on an internal cluster.

Full Transcript:

insideHPC: Welcome to the Rich Report, a podcast with news and information on high performance computing. Today, my guest is from Cycle Computing. We have Jason Stowe. He is the CEO of the company. Jason, how are you doing today?

Jason Stowe: Rich, it’s great to be here. I’m doing great. Thanks for having me.

insideHPC: Welcome back. I know you guys were at the AWS re:Invent conference this week. How did that go?

Jason Stowe: re:Invent was pretty awesome actually. There was a pretty surprising number of folks. I think it approximately doubled in size from the prior year. So it’s very clear there’s a lot of folks paying attention to cloud now. And we’re obviously really excited about the capability of using that to accelerate science, to accelerate engineering. We’re definitely pretty psyched about the level of participation at the conference. There are companies from biosciences to manufacturing, a lot of universities, a lot of government research sites. It was actually a very impressive breadth and diversity, between the US and globally, in terms of the number of people that were here, where they are from, and what industry. It was a shock to me.

insideHPC: My buddy tried to get a ticket a month ago and the thing was completely sold out, so I believe it. Wow.

Jason Stowe: It’s not quite up to Supercomputing; Supercomputing is a little bigger [laughter]. But it’s getting there, and I definitely think, between that and some of the other cloud providers that have very large events as well, there’s a lot of momentum behind the three public majors: AWS, Google, and Azure. So I think there is a lot of really exciting stuff going on in this space and we’re just happy to be here at the right place, at the right time.

insideHPC: Well in the meantime, you guys have a pretty exciting announcement this week about a world record. And I brought up your slides, why don’t we go through that and then we’ll do a Q&A at the end?

Jason Stowe: You got it, that sounds like a plan. Essentially at Cycle, we really believe that access to cloud cluster computing is going to be the single largest accelerator of science, engineering, and risk management over the next ten to 20 years. I think it removes the constraints on the kinds of questions that you can ask, by making it really efficient and really cost-effective for someone to grab large amounts of capacity and ask questions that normally they wouldn’t be able to ask because, basically, they wouldn’t fit on the cluster that they have in-house. It’s going to be a huge change for the industry as a whole, and this workload is really an example of that. So if you go to the next slide, this is basically a Western Digital subsidiary named HGST.

HGST was basically transforming a particular workload that had to do with drive head design. As heads float above the platter, the heads themselves are made of a significant number of materials and different configurations. And so on the science side, they had a 30-day workload, if they ran it in-house, that essentially explored a million different permutations of these potential designs, since this is a parameter sweep across 22 different parameters and three different media types for the platters themselves. And at HGST, these guys are really amazing. They’re really just trying to enable an increase in throughput for their engineering folks. They want the folks to be able to get engineering and science done significantly faster than they were able to do before.
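Editor’s note: the parameter sweep described here can be pictured with a minimal Python sketch. The parameter names, values, and media labels below are hypothetical stand-ins; the transcript gives only the counts (22 parameters, three media types, about a million design points) and does not say how the points were selected, so a small full grid is assumed purely for illustration.

```python
import itertools

# Hypothetical stand-ins: the interview gives only counts, not names or values.
MEDIA_TYPES = ["media_a", "media_b", "media_c"]
PARAMETERS = {
    "param_1": [0.1, 0.2, 0.3],
    "param_2": [10, 20],
    "param_3": [1.0, 1.5, 2.0, 2.5],
}


def sweep(parameters, media_types):
    """Yield one independent task description per design point."""
    names = list(parameters)
    for media in media_types:
        # Every combination of parameter values for this media type.
        for values in itertools.product(*(parameters[n] for n in names)):
            yield {"media": media, **dict(zip(names, values))}


tasks = list(sweep(PARAMETERS, MEDIA_TYPES))
print(len(tasks))  # 3 media types x (3 * 2 * 4) combinations = 72
```

Each dictionary is a self-contained job with no dependency on any other, which is what makes this kind of workload easy to spread over tens of thousands of cores.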

And a lot of the work that we’ve been doing with them over the last year and a half, two years has just been amazing to watch. There’s a great YouTube video by Steve Hynes on production HPC in the cloud at re:Invent last year. And this is really a case study of kind of the next level that they’ve gotten to. On the business side though, essentially what they’re trying to do is just innovate, right? They want to be able to get iterations of designs done and make them better and make them faster. They currently have the highest-capacity drives, which are helium-filled, kind of hermetically sealed ten-terabyte drives. And the way they figured out how to do that was by doing computational fluid dynamics of the platter spinning. This workload is more of a materials science than an engineering workload, just making it so that those drive heads are as efficient as possible.

And so if we go to the next slide, we found out about this workload last Wednesday actually. So, we actually didn’t know this existed a little over a week ago. On Tuesday of last week we had no idea we were going to do this. On Wednesday, we found out about the workload, and over the weekend we ran what we’re calling the Gojira run. Gojira is a synonym for Godzilla, so this was a monster run. It ran across three different regions. So we basically did the 70 and 3/4 years of computing required for this drive head design. This parameter sweep was essentially seven decades of computing. It was a million drive head designs and we actually ran that in eight hours instead of the 30 days that it would take in-house. And so you can imagine the throughput increase in terms of design. They had an in-house application called MRM that had kind of a MATLAB post-process on it, and they used our software and we used the open source tool Chef to help with software configurations, so those were some of the applications that were involved.

And essentially across three regions, we got 50,000 cores in the first 23 minutes and topped out at around 70,908 cores across those three regions. If you essentially use the Intel LINPACK Benchmark on each node and add it together, so kind of a half cousin of Rpeak and Rmax, you get 729 teraFLOPS Rpeak, which is more than number 63 on the TOP500 list published last June. And essentially the infrastructure cost, all of this was done for $5,594.

insideHPC: Wow.

Jason Stowe: It’s radical acceleration and increase in throughput for these guys, and this is the kind of stuff that Steve and David’s team, they’re just doing an amazing job of leveraging external capacity to increase the throughput of their engineers. It’s unbelievable.
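Editor’s note: the figures quoted above support a quick back-of-envelope check. The Python sketch below uses only numbers stated in the interview (30 days in-house, 8 hours in the cloud, 1 million designs, $5,594, 70,908 peak cores, 728.95 teraFLOPS); the derived ratios are rough, since the run did not hold peak core count for the full eight hours.

```python
# Figures as stated in the interview.
IN_HOUSE_DAYS = 30
CLOUD_HOURS = 8
DESIGNS = 1_000_000
COST_USD = 5_594
PEAK_CORES = 70_908
TFLOPS = 728.95

# Wall-clock speedup: 30 days expressed in hours, divided by 8 hours.
speedup = IN_HOUSE_DAYS * 24 / CLOUD_HOURS

# Infrastructure cost attributed evenly to each simulated design.
cost_per_design = COST_USD / DESIGNS

# Aggregate per-node LINPACK result spread over the peak core count.
gflops_per_core = TFLOPS * 1000 / PEAK_CORES

print(f"speedup: {speedup:.0f}x")
print(f"cost per design: ${cost_per_design:.4f}")
print(f"GFLOPS per core: {gflops_per_core:.1f}")
```

On these numbers the run is roughly 90 times faster in wall-clock terms, at a little over half a cent of infrastructure cost per design.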

In the next slide, we talk a little bit about the value of timing, and there’s a few points around this that I’d love to discuss with you, Rich. But basically, we can actually get this up and running in a day or two at this large scale; this is really a push-button operation at this point. I remember probably when I talked to you, we did that 10,000-core run four years ago with Genentech. And that was a real all-hands activity; it took months of preparation. Now, the last 156,000-core run we did took about a month of preparation, just four weeks of getting ready to grab every spare core we could get out of AWS spot [inaudible]. And now to be able to do this in two to three days with no notice and then publish it, talk about it the next week with you, it’s a crazy acceleration and really increases the agility. It makes it so people can react to new data that they get and basically say, “Oh jeez, I got to go redo this design,” and push the button and get it done, and that was kind of a really big deal.

And the last couple of things are just really about capacity, so you can get 50,000 cores in 23 minutes. That means the kinds of problems you can ask and solve are 50,000 cores in scale now. And that’s normally a couple orders of magnitude bigger than what most of us have in our internal clusters. And essentially being able to get the result back so much faster allows you to do an iterative design and increase your throughput. And then there’s the Ivy Bridge processors. Obviously, there’s the new C4 instance types this week, and AWS has even newer processors. But the Ivy Bridge actually did a great job of giving us a great FLOP count against the problem, at an exceptionally reasonable cost.

So it was a really big deal from a timing perspective and it was one of the things that we really wanted to showcase around this use case, is that you can get up and running quickly, you can approach bigger problems fast and you can increase your throughput in a way that you could never do before.

So in terms of what’s different about this run, if you go to the next slide, there was the new scale. And there’s the new industry, the new agility, and the new processor that we were using. So on the scale side, we essentially had a much larger customer than we’ve announced before. So this is a Fortune 500. And now R&D basically has the scale they need to ask the question that will actually change their design process, rather than the one that fits on the cluster. And this is also manufacturing, and I think that’s a big deal because we’re entering the early majority. We’re no longer in the early adopters and the innovators, in the Crossing the Chasm, Inside the Tornado process here. We’re no longer in the earliest part of the market. Life sciences was definitely that in 2008, 2009, 2010; they were the first into the cloud. But manufacturing generally moves a lot slower. Steve’s obviously really innovative. But from a practical standpoint, this is really a clear indicator that cloud is getting broadly adopted for technical computing.

And again, getting those cores as quickly as we did and being able to be as cost-effective on a FLOP count basis were really different than the last run. We actually had 50% more FLOPS per core out of the Ivy Bridge nodes than we did in the 156,000-core run before. And that’s really the story. There’s a lot of great pictures. If you go to the next slide, here’s the 70,908 cores running and the 728.95 teraFLOPS associated with it. There’s just a very large set of processing power that’s available to you if you dip in and get it. And really we’re just excited to be a part of this.

insideHPC: That’s amazing. You’re really talking about a petascale capability there if you had measured it in peak FLOPS, right? Last I checked, there were maybe 35 of those on the entire planet, machines of that grandeur, right? And you spun this up in minutes or hours, whatever it was.

Jason Stowe: In about 30 minutes we got about 2/3 or 3/4 of the core count. By the time an hour had passed, we had a majority of the 70,000 cores. It was fairly fast, a lot faster than waiting for it to arrive, getting it off the loading dock, and then racking, stacking, and cabling. I can tell you that.

insideHPC: Well that’s an important point, Jason. Because when you think about how long it takes somebody like Oak Ridge to get Titan up and fully productive, it is months if not years. And a lot of smart people working long hours.

Jason Stowe: Yeah, there’s definitely an economy of scale you’re taking on. There are of course trade-offs. [crosstalk] my inner HPC nerd. I worked at the Pittsburgh Supercomputing Center, part of the Mellon Institute, in the comp chem group there under Charlie Brooks back in the early 90s, on a Cray T3D. And so my inner nerd is saying, “Well, the interconnect is not as fast,” and that’s true. But at the same time, a lot of the problems that are in the newer sciences, in analytics, in design optimization, et cetera, are really geared towards throughput. They’re not geared toward….

insideHPC: Capability?

Jason Stowe: That’s not to say capability isn’t important; we still need those really large, fast-interconnect machines for what they do. But there are new classes of workload where you want to find the material that has the right solar panel property. And so it makes sense to create a 156,000-core cluster for that, but it’s loosely coupled because the problem’s loosely coupled. I don’t want anybody to think that this is going to replace the capability machine tomorrow. That’s not the goal; this is a different problem set.

But we do see more and more use cases where, even if on the inner loop you have an MPI job that maybe runs on one machine or four machines or eight machines, there’s still a lot of them. The outer loop almost always has a throughput orientation, so there’s the evaluation of different initial conditions on the simulation or the evaluation of different designs. These kinds of patterns, the needle-in-the-haystack problems if you will, are things that we see really over and over again.

insideHPC: Yeah, you think about a company the size of Western Digital. Certainly if they had the will, they could pay 30 million for a machine of this size, right? And probably another ten for the building to power and cool it, right? But the 40 million outlay versus, what, 5500 bucks [laughter].

Jason Stowe: Yeah, it’s kind of hard to argue.

insideHPC: [laughter] Yeah, yeah.

Jason Stowe: Especially for something you’re going to run once a month.

insideHPC: Yes, right.

Jason Stowe: They’re going to run this once every three, four weeks. And they’re going to pull the trigger on it, and then be done on it.

insideHPC: You would never– it just doesn’t pencil out. All right, so what’s the net out of this? You’re proving a really industrial use case here within a very short amount of preparation time. Is Cycle ready to say Come get your FLOPS?

Jason Stowe: We definitely feel like we are. We’ve been seeing a really rapid adoption increase across multiple different industries, so now we have a significant number of, the majority of the top ten pharma basically. A large number of insurance and finance, so we have hedge funds and banks, and we have a Fortune 100 insurance company that hopefully we’ll be able to name at some point in the future, but they’re running a regulatory workload that’s tens of thousands of cores at the end of every month. These kinds of workloads are just getting more and more common, and we’ve got it so it’s pretty push-button to do it.

But as much as I like talking about Cycle, obviously, and I’m really excited about where we are, and we’ve been doubling in size and seeing all kinds of different customer use cases come through, and the fact that manufacturing is now in the door to me indicates that we’re in the majority now, this is no longer early adopters, the thing I think is most important about this is actually the higher-order bit. Again, everybody should be sizing the question to what is actually going to change their business. They should not be asking the question that fits on their cluster; they should really be asking the question that will actually make it so your research is that much better, your designs are improved, your risk management is better. Those use cases are things where extra compute is now cost-effective and able to be consumed at vast scales compared to what we’re normally able to do.

The message out of this is really: don’t think about capacity, think about the science. Think about the engineering, go invent and discover. Do better work, and we’ll build the infrastructure to answer that question. Whatever that question is, we can actually achieve the scale required in order to be able to answer it. Now obviously this has got a throughput orientation, but I think from a practical standpoint, we’re seeing really awesome scientific and engineering results come out of these kinds of computers, and I’m just excited that it’s kind of proliferating.

insideHPC: Well Jason, help me put this in perspective. If there was no Cycle Computing in the world, how hard would it be to do what you guys just did this week?

Jason Stowe: That’s a great question. So as I mentioned, we did the first 10,000-core run and it was several months of preparation. We have folks that we’ve interacted with that come to us after they’ve spent 12 months trying to get a production system up and running with several thousand cores in it, like four to six. We just had one of those come through actually not too long ago. The net-net I guess is that there’s a lot in terms of security automation, making sure the encryption keys are managed properly, doing proper data scheduling, and dealing with different deployments across virtual private networks. We actually ran this entire workload inside of a VPC, so it’s all inside the virtual private cloud. That means that across the different AZs that we used in this environment (we used three regions), we’re using multiple sets of data centers and the networks affiliated with them. And when we did that, we were actually routing the work centrally, putting it into each of the different subnets, and grabbing the results back out separately.

That’s a really complicated thing to do well and get working in the first place, honestly. So from a practical standpoint, without us, you basically would either have a lot of duct tape and bubble gum holding different tool chains together or you’d have a lot of software to write. I think over the last eight years, we’ve spent about $19 million on building our product and engineering. We’ve got about 120 man-years of software written to basically solve these problems. Getting it up and running for one use case is probably a doable thing in six to 12 months. Getting it working really well and being very, very reliable, getting it done for all of your workloads, getting it done across regions, those are the things that are definitely more challenging, but where there’s a lot of benefit because you can achieve scale. And I think that’s where we closed the gap; we make that very easy.
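Editor’s note: the central routing Jason describes (one dispatcher placing work into subnets in three regions) can be illustrated with a toy Python sketch. This is not Cycle Computing’s software or algorithm. The region names and core counts are hypothetical, and the least-loaded-relative-to-capacity rule is simply one plausible way to balance independent tasks.

```python
from collections import defaultdict


def route(task_ids, cores_by_region):
    """Assign each task to the region with the lowest load per core.

    Illustrative only: region names and capacities are hypothetical,
    and real schedulers also handle failures, retries, and data movement.
    """
    assigned = defaultdict(list)
    for task in task_ids:
        # Pick the region whose queue is shortest relative to its capacity.
        region = min(
            cores_by_region,
            key=lambda r: len(assigned[r]) / cores_by_region[r],
        )
        assigned[region].append(task)
    return dict(assigned)


# Hypothetical capacities: one region has twice the cores of the others.
capacity = {"region_a": 2, "region_b": 1, "region_c": 1}
plan = route(range(8), capacity)
print({region: len(tasks) for region, tasks in plan.items()})
```

With these made-up capacities, the larger region ends up with half of the tasks and the other two split the rest evenly. Collecting results back to a central store, which the interview also mentions, is left out of the sketch.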

insideHPC: Well this is very exciting, Jason. I guess congratulations are in order.

Jason Stowe: Thanks a lot Rich. I definitely look forward to seeing you in New Orleans. I imagine you have a very full roster.

insideHPC: [chuckles] I think I have 50 interviews scheduled or something ridiculous.

Jason Stowe: Well, I’m going to probably be watching almost every single one of them [laughter], so I really appreciate you fitting me in and covering everything.

insideHPC: Well I’m one of your biggest fans, as well. So thank you for coming on the show, Jason.

Jason Stowe: Thanks for having me, Rich. All right, take care.

insideHPC: You bet. All right folks. That’s it for the Rich Report. Stay tuned for more news and information on high performance computing.
