Transcript of Green HPC Episode 4: Stop pampering your processors!
This is the transcript for episode four of the Green HPC podcast series, Stop pampering your processors! You can find out more about the series at the Green HPC podcast series home page, and you can listen to the audio of this episode, find out more about the speakers, and get access to links and presentations that they’ve suggested at the episode 4 homepage.
[0:01] This is Green HPC, an exclusive podcast series from insideHPC.com, the web’s fastest growing source for HPC news, information and commentary.
This episode proudly sponsored by Convey Computer
[0:11] Convey’s hybrid core technology reduces the number of servers in your HPC environment without sacrificing performance. One rack of Convey’s servers replaces up to eight racks of other servers, which means you can rack up 92% lower power and cooling costs.
[0:32] Overall, if you look around the country we’ve seen, in the high-end space in almost every single computer center that is supporting big computers, a building boom.
[0:41] [musical interlude]
[0:50] This is insideHPC’s podcast series on green HPC. Welcome. I’m your host, John West, the editor of insideHPC.com. The quote you heard at the top of today’s show was Horst Simon talking about the building boom in data centers hosting high-end computing around the world. This building boom is one of the big external effects of the increasing standardization of HPC clusters around mass market components that are cheap to buy. Computers are cheaper, so you can buy more of them as building blocks when you’re putting together your cluster.
[1:21] But cheaper to buy, the capex part of a project’s life cycle if you’re an accountant, is different from cheaper to run. In fact, the operating expenses associated with high-performance computing have grown tremendously. Just in my own career, I’ve gone from a computing center with a single 100-kilowatt supercomputer to a data center that has several megawatt-scale machines all running at the same time.
[1:44] As we’ve heard in previous episodes, energy awareness is growing in computing, but that’s not necessarily because we want to help save the planet. Here’s Horst Simon again.
[1:54] I don’t think that the HPC community – and I may be too cynical here – is green at heart. The HPC community has been about computing, and wants to do more computing, will continue to do more computing. So, the HPC community is I believe interested in energy efficiency simply because by reducing the electricity bill you can buy more computers and you can do more computing.
[2:18] Horst frames this discussion about reducing the operating expenses of running a high-end data center for scientific computing as a hunt for opportunities that takes you from bits, the power technology in the hardware that generates the solutions, to the buildings that host the hardware itself. We have the link to Horst’s "Bits to Buildings" presentation from Salishan ’09 up on the Episode four home page at insideHPC.com.
[2:43] But today, on the Green HPC podcast, we’re going to focus on the buildings part of Horst’s spectrum, and in particular, we’re going to talk about how you are probably over-designing your machine rooms and pampering your processors. Picture your machine room. It’s cold, right? And clean. If it’s not new, it’s certainly well kept, probably a tour stop for company execs, investors, or generals and congressmen, if you work in the public sector.
[3:10] When you think about it, your computer has a nicer office than you do. The trouble is we don’t think about it very much at all. All that pampering takes time, people, and planning; and all of that takes money.
[3:22] Now picture something along the other end of the spectrum: a tent out behind your building hosting running servers in the out-of-doors. This isn’t just a thought experiment. Microsoft’s Christian Belady and his colleague, Sean James, ran a test like this last year with a couple of decommissioned servers that they put in a rack under a large metal-frame tent behind their data center in the fuel yard.
[3:45] No filtered air, no filtered power, no backup or UPS, no cooling, no heating – just a tent, some servers, and the great outdoors.
[3:56] You can check out pictures of this experiment on the episode four home page at insideHPC.com, along with a pointer to Christian’s own blog post about the experiment with more details and a lot more pictures. Christian and Sean ran five servers from November to June with 100 percent uptime – zero failures. During those eight months, water leaked on the servers and a wind storm blew a section of fence into the rack, but the equipment continued to run without incident.
[4:24] So, in talking to Christian about this demonstration, the first thing you notice is he is always at pains to say that they didn’t ever intend for everyone to run their computers outside in tents.
[4:36] We just did it to illustrate the extreme, and once you illustrate the extreme, then all of a sudden now people are like, well, OK, maybe it’s not as crazy to think about just using outside air even in our current data centers, which will be much more well protected, and that, yes, servers should be able to live.
[5:00] So then, I asked him, ‘Did the experiment have the desired effect? Did outside air economization become a more normal part of the data center discussion?’
[5:10] The industry was already starting to toy with the idea, and there were cases where people were starting to pull in outside air.
[5:17] But, what it really illustrated was that not only can we pull in outside air, because what happened is people, like in California, they were experimenting using outside air. But, it was still within the very narrow range of data center environments today, which made them kind of difficult to use all year-round or could only be used in very specific locations on the globe so the adoption was maybe not as broad as it could have been.
[5:52] By showing this experiment, it’s all of a sudden started the dialog to really look at this as maybe a technology with much broader adoption in the future and a much more global impact, and, truly, a big efficiency gain, whereas before it was more of a niche gain. Now, we could look at it as an impact across the globe.
[6:24] OK, so back to the tent city computing experiment: running computers in the out-of-doors in a tent through the winter. How did it end? It was dirty.
[6:36] We actually had the vendor take a look at… We eventually did have some failures, and they were failures of our power supply. We sent it back to have a root cause analysis of what failed in the power supply, and the vendor said, "We don’t understand," and was surprised.
[6:56] But, the whole thing essentially was coated with dirt and mud. What had happened was, again, because it was outside pulling in whatever, the power supply just got full of dirt. It was a vacuum cleaner, essentially. Well, it took almost a year for it to fail. So, again, if we put it into a more benign environment where we do a better job filtering, certainly we would expect better results.
[7:31] What we are going to talk about today then are some of the opportunities that you have specific to the rooms where you are currently hosting computers, and ways that you can start to think about the facilities and the computers together as a complete system, a system to be managed all at once rather than two separate activities in your organization, or maybe even two separate activities in two different organizations.
[7:54] We talked with HP’s Steve Cumings, who we heard from in Episode 2, to get his take on the idea of thinking about and managing the facilities and the computer hardware together.
[8:05] You have to start on a basis of energy reduction and making sure you’re being as efficient as possible within the IT hardware itself, because as you heard, customers no longer care about just performance per dollar. They’re also very interested in how much it costs them to power that footprint and how much capacity they have within their data center.
[8:25] So, a second part of the IT is how you manage it and what you do in terms of capping the IT and making sure that it’s in its lowest possible power state for the amount of performance needed. When you spend a lot of time talking to customers you quickly realize that, and they realize that, the rest of the data center has to keep up with the IT. Previously data centers were basically a house where you pump in as much power and cooling as necessary to keep the IT footprint happy. If you don’t also address the efficiency of the data center, then that leaves a lot of money on the table and a lot of capacity on the table that could otherwise be used.
[9:03] A big opportunity in thinking about the buildings where we house supercomputers is the operating temperature. Your data center is probably cold. Mine certainly is.
[9:13] Generally, today data centers are following ASHRAE guidelines, even if they don’t realize it. ASHRAE is the American Society of Heating, Refrigerating and Air-Conditioning Engineers. They’ve specified that data centers should operate between 20 and 25 degrees C, or about 68 to 77 degrees Fahrenheit. But, as Christian Belady explained to me, this operating range comes from a time when computers didn’t need cold rooms at all. They were cooled by water. Doing that was expensive, so designers said, well, we’ll just pull cool air from the room and blow that over the processors. Of course these rooms were designed for people – full-time operators, and techs, and folks that had their desks next to these computers – and so they were designed for people comfort.
[9:57] It’s really been interesting to me, and this was even before I came to Microsoft, when I was with a server manufacturer. When you talked to customers, they would feel really nervous if they operated outside of the 77 Fahrenheit.
[10:12] I remember being in a discussion with Lawrence Berkeley Labs. When they were looking at building their new data center, they were really considering not using chillers and just using outside air. We were arguing for an hour about, well, if you go outside of the temperature range, then you take a risk and lower the reliability, perhaps more failures. We’re all standing around talking about why we can’t go to higher temperatures, and then finally I asked the question, "Well, what temperature do you want to go to? What would enable you to go use outside air as opposed to having active cooling?" The answer was 80 degrees Fahrenheit. I just sat there and said, "80 degrees Fahrenheit? You’re only three degrees Fahrenheit above, and it’s only maybe a couple of times a year."
[11:03] At the end of the day, what’s happened is this range that’s defined by ASHRAE has become the bible, and people are even afraid of small, little excursions occasionally outside of that range. Contrarily, if you take a look at the telco environment, the telco environment has always historically specced a much broader range of environments in requiring equipment to operate at 40 C, which 40 C is 104 Fahrenheit, and with occasional excursions to 55 C, which is I think… I can’t do the math right now, but it’s on the order of 130 degrees Fahrenheit. So, there is equipment that could do it. It’s just all a matter of what we all hyperoptimize on moving forward.
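The temperature figures quoted in this part of the episode can be checked with a quick conversion. The following is a minimal sketch, not part of the original audio; the function name c_to_f and the labels in the dictionary are illustrative assumptions chosen here for readability.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Temperatures quoted in the episode, in degrees Celsius.
# The labels are illustrative, not terms used by the speakers.
quoted = {
    "ASHRAE range, lower bound": 20,
    "ASHRAE range, upper bound": 25,
    "telco operating temperature": 40,
    "telco occasional excursion": 55,
}

for label, celsius in quoted.items():
    print(f"{label}: {celsius} C = {c_to_f(celsius):.0f} F")
```

Under this conversion, 55 C works out to 131 Fahrenheit, consistent with the "on the order of 130 degrees" estimate given from memory, and the 80-degree target from the Berkeley discussion sits three degrees above the 77 Fahrenheit upper bound.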
[11:52] Here’s another interesting example. Everything you see in the data center, what you see in these servers, is also in PCs or vice versa. So, you have processors. They’re essentially going through the same fabs. It’s the same parts. We have a billion of those PCs across the globe. How many of them are in controlled environments?
[12:15] That uncontrolled environment that Christian is talking about includes dust. Have you ever cracked open the back of your home PC to upgrade the RAM after a couple years and had to vacuum out the case just to be able to find where the RAM goes? Yes, I have too.
[12:28] Another part of the uncontrolled environment that servers today are really designed for is straight house power. But, in many cases, large HPC installations are fed by online battery banks and inline generation to clean the power, as if the act of putting these robust servers together in one place somehow made them brittle as fine china. Christian looks at the extremes that we often go to today to clean the power and clean the air as opportunities to reduce the operating costs.
[12:57] The fascinating thing to me is server designers design servers so that they can handle dirty power and that they could handle maybe even missing a few cycles of AC, because they’re designing the servers to operate under the desk of a cash register in some video chain store. As a result, that doesn’t have filtered power in that store.
[13:24] Well, that same server is used in the data center. The data center guys, on the other hand, assume that the servers cannot deal with anything, that if there’s a loss in the AC cycle or anything that they can’t deal with these transients. So, both are designing independently of each other, just making assumptions.
[13:47] All right. Now hopefully you’re convinced that you are indeed spoiling your supercomputers rotten. Their offices are way too big, they’re way too clean, and they’re way too cold, and you, in turn, are wasting too much money on them.
[14:01] So, you’ve decided to cut them off and force them to start making do with less. Good for you. You’re a tough-love parent. But wait. How far can you go? If 20 degrees C is unnecessarily plush, what’s a good number? 40? 50? Or is that pushing it too far?
[14:17] In order to save everyone from having to go explore this territory on their own, a plan that is almost sure to either under-optimize, or be way too expensive in terms of failed hardware, or even both, what we’d really like to have is the facilities community and the IT computing community get together and do some bridge building to create new standards in the industry that everyone can agree result in sound management of the computing building ecosystem. We’re in luck, because that’s what Christian’s group is trying to do.
[14:48] One of my roles in coming to Microsoft was to see what are the opportunities for helping create this handshake really instead of having this interface and this boundary where there are people not talking.
[15:01] Quite honestly, there are no specs in the industry to define these things. My team’s role is to actually go in and define them, first of all, and then even make it more fuzzy to see what are the opportunities that we can take advantage of. So, online UPS, when you talk to folks, they say, well, we need it because it’s a nice filter for the power. But, have you asked the IT guys, do they need it to be nicely filtered, or is there something we could even do? Maybe there’s just a small tweak we’d do on the server side to clean up the power that had we had the handshake or had we had a spec or negotiated a spec between the two, then we actually could optimize the total solution of the data center, drive costs down substantially. That’s what we’re trying to do now.
[15:52] Convey Computer’s hybrid core technology improves application performance with – are you ready for this? – less servers and less energy. On key HPC workloads, Convey reduces the number of servers required dramatically, without impacting performance. One rack of Convey servers, powered by energy-efficient FPGAs, replaces up to eight racks of other servers. This means you can rack up lower infrastructure costs, save on utilities, and increase performance up to 25 times. For more information, visit Convey at ConveyComputer.com.
[16:34] Well, that’s it for this episode of the Green HPC podcast series. You can find out more about the topics and people in this episode by going to insideHPC.com and clicking on the link for the Green HPC podcast series.
[16:47] In the next two episodes, we’re going to talk with data center engineers and managers about the specific things they are doing today to reduce energy use in their computing environments. We’ll learn from them what’s worked and what hasn’t worked. Until next time, I’m John West. For all of us here at insideHPC.com, thanks for listening.
[17:06] [closing music]