The Rich Report: The 16 Terabyte PC – SGI Bets on Exascale

It has been over a year since SGI’s merger with Rackable Systems. The two companies came from different camps, so I was curious to learn where they are today and where they’re headed in the HPC space. I caught up with the company’s Chief Technology Officer, Eng Lim Goh, to discuss the company’s new products and its plans for Exascale computing.

insideHPC: How long have you been at SGI?

Dr. Eng Lim Goh: Over 20 years now. I started as a systems engineer in Singapore working on the GT workstation.

insideHPC: As CTO, what does a typical day look like for you?

Dr. Eng Lim Goh: These days I spend about 50-60 percent of my time outside with customers. That’s particularly important now given the fact that we are a new company, with Rackable having acquired us and then renaming the company “SGI.” So I’m going out communicating not only about the new company, but also about the new line including products in the Internet Cloud space, which are less familiar to our HPC customers. And I’m also going out to our Cloud customers who are not familiar with our HPC line and storage lines.

So, it’s a lot of work to bring the community up to speed on both sides, the Cloud side and the HPC side, and that has been going on for a year now. I think we have now settled into more of a run-rate-like scenario.

insideHPC: That’s interesting. I remember when I first read about the acquisition. I wasn’t familiar with Rackable, so I looked at a corporate overview video that highlighted all their key customers. The list was a who’s-who of heavy-hitter Internet companies like Amazon, Facebook, and Yahoo, and I thought, my gosh, SGI has become the new “Dot in Dot Com” just like Sun was ten or twelve years ago.

Dr. Eng Lim Goh: That’s very complimentary of you to say. In fact our latest win was with Amazon.com with their EC2 and S3 cloud. They’re one of the biggest cloud providers today and we supply the majority of systems to that enterprise.

insideHPC: That brings up my next question. You have these distinct customer segments: the Cloud/Internet providers and the typical big HPC clusters. They’re both filling up rooms with x86 racks, but how do their needs differ?

Dr. Eng Lim Goh: The differences are as follows. On the Internet/Cloud side, they have the same 500 racks of computer systems in their datacenter, but they run tens of thousands of different applications like MapReduce, MemcacheDB, and Hadoop that are highly distributed. And then on the other extreme, in the HPC world, you may have 256 racks and you may even be thinking of running just one application across all of that. I’m just talking extremes here, of course. There are overlaps, but given these extremes, you see that the needs are different.

On the HPC side, interruption to services on any node in the entire facility can affect productivity. For example, you may have checkpoint restart, but it still takes time to do the checkpoint and then restart. That is, unless the user has intentionally gone into the code to more seamlessly tolerate a node failure while an MPI program is running. So a node failure can be more disruptive in the HPC world than on the Cloud side. On the Cloud side, their usage makes them inherently and highly tolerant of node failures. As such, the focus is different.

Now let’s look at some of the similarities, like power. This is one area where we have been learning a lot from the Cloud side to bring over to the HPC side. Their Internet datacenters are on the order of 10 to 20 or even 50 Megawatts, while in the HPC space a 50 MW datacenter is considered extreme. So in this sense, I’d say the Cloud world actually scales bigger.

insideHPC: So they are facing a lot of the same challenges in terms of power and cooling. What did Rackable bring to the table in this area?

Dr. Eng Lim Goh: With regards to power and cooling on the Cloud side, one of the key requirements Rackable addressed was efficiency. In the early days, when datacenters were on the order of a Megawatt, customers had power efficiency specifications at the tray level. And then more recently they were set at the rack level. So if they were ordering 400 racks like one of our cloud customers, they stopped specifying at the chassis level and started specifying at the rack level.

So that gave us the opportunity to optimize at the rack level: removing power supplies in every chassis and doing AC to DC conversion in the infrastructure at the rack level. Later, with our CloudRack design, we removed fans at the chassis level as well. In fact, some Internet datacenters are demanding that those racks be able to run extremely warm, as high as 40 degrees Centigrade, in order to reduce energy consumption on the cooling side.

So then as they move to even larger scales, with Cloud datacenters that run tens of Megawatts, they are moving to the next level up of granularity and specifying efficiencies at the container level. At that level, we essentially have a modular datacenter, and this is where they started to specify a PUE requirement for each container that we ship. Today the standard requirements are on the order of 1.2 PUE, with more recent acquisitions demanding even more efficiency than that.

So on the Internet/Cloud side, yes, the expertise brought by Rackable was to be able to scale with the customer’s requirements as they went from 1 Megawatt to tens of Megawatts and keep up with these datacenters’ demands for higher and higher efficiencies.
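As a rough illustration of what the PUE figures in this answer imply, the sketch below uses the standard definition of Power Usage Effectiveness (total facility power divided by IT equipment power). The 1 MW container load is a hypothetical number chosen for the example, not a figure from SGI.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(it_equipment_kw: float, target_pue: float) -> float:
    """Power budget left for cooling, power conversion, etc. at a given PUE."""
    return it_equipment_kw * (target_pue - 1.0)


if __name__ == "__main__":
    it_load_kw = 1000.0  # hypothetical: 1 MW of IT load in one container
    for target in (1.2, 1.1):
        budget = overhead_kw(it_load_kw, target)
        print(f"PUE {target}: about {budget:.0f} kW allowed for non-IT overhead")
```

Under these assumptions, moving from a PUE of 1.2 to 1.1 halves the power that can be spent on anything other than the computers themselves.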

insideHPC: You mentioned container-based datacenters. I came from Sun where we never seemed to make hay with our Project Blackbox. How well is SGI doing with its ICE Cube container datacenters?

Dr. Eng Lim Goh: We have shipped containers to a number of customers, and we also have a couple of Cloud providers who are evaluating ICE Cubes for wider deployment.

insideHPC: Are the HPC customers interested in containers, or are they still on the fence?

Dr. Eng Lim Goh: This is where I think the combination of the two companies, Rackable and SGI, has strong leverage, because the HPC world is coming up to where the Internet datacenters are in terms of scale. So when we’re talking about Exascale computing here, and they are specifying 20 MW for a future Exascale system, this is something that the Rackable side is familiar with in terms of power. So for HPC, we are actually drawing a lot on our expertise of delivering to Internet datacenters at that scale and at that requirement for efficiency.

For example, say there is someday an HPC datacenter requiring an extreme PUE number of, say, 1.1, in addition to meeting other Exascale requirements. For that we have drawn from the Cloud datacenter side, where they already have such requirements for an air-cooled container that just takes in outside air through a filter to cool your systems. We have one such system now that has passed the experimental stage and is ready for deployment. And in many places in the world, if we build a system that can tolerate, say, 25 degrees Centigrade, you can get free cooling most of the year. However, for those places averaging higher than 25 degrees C, this wet-cooling system essentially uses a garden hose (I’m simplifying it) type connection to wet the filter just like a swamp cooler. Depending on humidity levels, you can get a five to ten degree Centigrade cooling result.

insideHPC: So that brings up another issue. When you have that kind of scale going on, system management must be a huge undertaking.

Dr. Eng Lim Goh: Absolutely. We have hierarchical systems management tools with a user interface to manage all the way from the compute side to the interconnect side and then all the way to facility power consumption. And of course, at the container level, we have a modular control system that handles temperature, humidity, pressure, and outside air. And that modular system feeds upward to the hierarchical systems management tools.

insideHPC: Since we’re talking about big scale, I think we should dive into the new Ultra Violet product, SGI Altix UV, that you announced at SC09. Is that product shipping now?

Dr. Eng Lim Goh: We began shipping the Altix UV a few weeks ago. We now have a number of orders, so there is a lot of interest in the system.

In terms of its use, there are two areas in which the Altix UV is of great interest. On the one hand, you have customers who are interested in big, scale-up nodes. You know, with today’s Nehalem EX you can get two, four, and eight socket systems. If you think in that way, the Altix UV scales beyond that eight socket limit all the way to 256 sockets and 16 Terabytes of memory. So that’s one way to look at the Altix UV. The 16 Terabyte memory limit is because the Nehalem core only has 44 bits for physical address space.

So that’s one of the ways of looking at Altix UV. And the reason people buy that, for example, is heavy analytics where they load in 10 Terabyte datasets and then use the 256 sockets, which equates to up to 2000+ cores, to work on that dataset.
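The 16 Terabyte and 2000+ core figures can be checked with a little arithmetic. The sketch below assumes byte-addressable memory and binary terabytes (2^40 bytes), and takes the 8 cores per socket from the 8-core Nehalem EX mentioned later in the interview.

```python
PHYSICAL_ADDRESS_BITS = 44  # physical address width quoted in the interview
SOCKETS = 256               # maximum socket count quoted in the interview
CORES_PER_SOCKET = 8        # 8-core Nehalem EX, per the interview

TIB = 2 ** 40  # one binary terabyte in bytes (assumed unit)


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with the given address width, one byte per address."""
    return 2 ** address_bits


max_memory_tib = addressable_bytes(PHYSICAL_ADDRESS_BITS) // TIB
total_cores = SOCKETS * CORES_PER_SOCKET

if __name__ == "__main__":
    print(f"{PHYSICAL_ADDRESS_BITS} address bits -> {max_memory_tib} TiB")
    print(f"{SOCKETS} sockets x {CORES_PER_SOCKET} cores -> {total_cores} cores")
```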

insideHPC: And that’s a single system image for all those cores?

Dr. Eng Lim Goh: Yes. It runs as a Single System Image on the Linux operating system, either SUSE or Red Hat, and we are in the process of testing Windows on it right now. So when you get Windows running on it, it’s really going to be a very big PC. It will look just like a PC. We have engineers who are compiling code on their laptops and the binary just works on this system. The difference is that their laptops have two Gigabytes of memory and the Altix UV has up to 16 Terabytes of memory and 2000+ physical cores.

So this is going to be a really big PC. Imagine trying to load a 1.5 Terabyte Excel spreadsheet and then working with it all in memory. That’s one way of using the Altix UV.

insideHPC: Did you develop a new chip to do the communications?

Dr. Eng Lim Goh: Yes. We are leveraging the ASIC that we developed. You can call it a node controller, but we call it the Altix UV Hub (HUV). Every Hub sits below two Nehalem EX (8-core) sockets. And this Hub essentially talks to every other Hub in every node in the system and fuses the memory in those nodes into one collective. So when the Linux operating system or Windows operating system comes in, it thinks that this is one big node. That’s how it works.

So all the cache coherency is done by that chip in hardware. Even the tracking of who is sharing what in the shared memory system is all registered in hardware on that chip, and that chip carries its own private memory to keep track of all these vectors.

insideHPC: So how does this kind of Big Node change the way scientists can approach their problems?

Dr. Eng Lim Goh: This is a brilliant question. Although the Altix UV is a great tool for large-scale analytics, we are starting to see a lot of interest from the scientists and engineers. There are many scenarios, but let me describe to you one scenario.

If you take typical scientists, the chemists, physicists, or biologists, they do research in the labs and write programs on their laptops to experiment with ideas. So they work with these ideas on their laptop, small scale, but what do they do today when they need to scale up their problems? Today what they have to do is either recode it in MPI themselves, or try to get computational scientists in from a supercomputer center or university to code it for them and run it in parallel. And this transition takes weeks, if not months. So what we envision is that the scientist will plug into the Altix UV instead of just waiting. The Altix UV will plug into the middle here by giving the scientists a bigger PC; it does not replace the MPI work.

Let’s look at a very common example. If you take a cube model with 1000 grid points in the X direction and 1000 grid points in each of the Y and Z directions, and then you march this cube through 1000 time steps, that would be a 1 trillion-point (Terapoint) dataset. Now if every grid point were a double-precision number, this would result in an 8 Terabyte dataset.

At this size, you will typically go to MPI. However, with UV you now have an alternative. We can supply a 10 Terabyte PC to run problems like these. My suspicion is that they will still eventually move to MPI as they run more rigorous simulations. So rather than replace MPI, Altix UV gives them a more seamless research bridge as scientists scale their simulations.
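The cube example above works out as follows. This is only a back-of-the-envelope sketch: it assumes one 8-byte double-precision value per grid point per time step, decimal terabytes, and that every time step is kept in memory.

```python
def dataset_points(nx: int, ny: int, nz: int, time_steps: int) -> int:
    """Total number of stored values: one per grid point per time step."""
    return nx * ny * nz * time_steps


def dataset_bytes(nx: int, ny: int, nz: int, time_steps: int,
                  bytes_per_value: int = 8) -> int:
    """Dataset size in bytes; 8 bytes corresponds to double precision."""
    return dataset_points(nx, ny, nz, time_steps) * bytes_per_value


if __name__ == "__main__":
    points = dataset_points(1000, 1000, 1000, 1000)
    size_tb = dataset_bytes(1000, 1000, 1000, 1000) / 10 ** 12
    print(f"{points:,} points, {size_tb:.0f} TB at double precision")
```

A dataset of this size fits within the 10 Terabyte configuration mentioned in the answer, which is the point of the example.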

insideHPC: What other ways might they use Altix UV?

Dr. Eng Lim Goh: There is another way to use the Altix UV. We envision using it as a front end to an Exascale system. Imagine your Exascale, albeit tight, cluster with tens or hundreds of Petabytes of distributed memory and you’re using Message Passing or some other kind of API to run a large application. Since this system is going to generate massive amounts of data, it would be good to have a head node that could handle that data for your analysis work. You can’t use a PC any more in the Exascale world; you need something bigger.

insideHPC: So there is a lot of talk these days about Exascale in the next eight or ten years. Where do you see SGI playing a role in that space?

Dr. Eng Lim Goh: I think our role in Exascale will be two-fold. The first will be to use this Big PC concept, with 16 Terabytes going to 64 Terabytes in 2012, and use it as the front end to an Exascale system. We would like the next generations of Altix UV to be the front end of every Exascale system that’s out there. Because if you are already spending tens of millions or hundreds of millions of dollars to build an Exascale system, it’s worth spending a little more so that you can get better use and be more productive with the output of that Exascale system.

Another role for SGI is developing the Exascale system itself. And this is where we are looking at providing a partitioned version of the Altix UV to be the key Exascale system.

So let’s look at Exascale systems now: If you look at what the top research priorities are to achieve Exascale within this decade, you can see that in general those are power/cooling as number one; and how do you get an Exaflop with 20 Megawatts? Number two would be resilience; can the Exascale system stay up long enough to at least do a checkpoint? (laughs) And on these two we are looking closely with microprocessor and accelerator vendors.

But the next two priorities are what we are focusing on ourselves: communications across the systems (essentially the interconnect) and usability. As I’ve described on the usability side, we will be looking at the Altix UV as a big head node.

In the communications area, we believe the interconnect needs to be smarter for an Exascale system to work. Why? Because you cannot get away from global collectives, for example in an MPI program, unless you code specifically for Exascale applications to avoid them. Many of the applications that try to run on a system as large as an Exascale machine will have global collectives and will need to do massive communications in the course of running.
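To put the power priority mentioned in this answer into numbers, the sketch below works out the energy efficiency implied by one Exaflop within 20 Megawatts. It assumes the whole power budget goes to computation, which is a simplification.

```python
EXAFLOP = 1e18           # floating-point operations per second
POWER_BUDGET_W = 20e6    # the 20 Megawatt target quoted in the interview


def required_gflops_per_watt(target_flops: float, power_watts: float) -> float:
    """Sustained GFLOPS needed per watt to hit a target within a power budget."""
    if power_watts <= 0:
        raise ValueError("power budget must be positive")
    return target_flops / power_watts / 1e9


if __name__ == "__main__":
    needed = required_gflops_per_watt(EXAFLOP, POWER_BUDGET_W)
    print(f"Required efficiency: {needed:.0f} GFLOPS per watt")
```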

insideHPC: So how do you propose to reduce communications overhead in an Exascale system?

Dr. Eng Lim Goh: We sat down and worked out that to cut down that overhead, we need a global address space. With this, memory in every node in the Exascale system is aware (through the node controller) of every other memory in the entire infrastructure. So when you send a message, do a synchronization, or do a GET PUT to communicate, you do it with little overhead.

But I must emphasize, as many even well-informed HPC people misunderstand, that this global address space is not shared memory. This is the other part of Altix UV that has not been understood well. Let me therefore lay it out.

At the highest level you have shared memory. At the next level down you have global address space, and at the level below that you have distributed memory. Distributed memory is what we all know; each node doesn’t know about its neighboring nodes and what you have to do is send a message across. That’s why it’s called Message Passing.

Shared memory then is all the way up. Every node sees all the memory in every other node and hears all the chatter in every other node. Whether it needs it or not, it will see everything and hear everything. That’s why Linux or Windows can just come in and use the big node.

However, for all the goodness that big shared memory brings you by hearing and seeing everything, it cannot scale to a billion threads. It’s just like being in a crowded room and trying to pay attention to all the chatter at once, even though it is not meant for you. You would get highly distracted.

So if you go to the other extreme, to distributed memory, you sit in a house, sound-proof the walls, and shutter the windows. As such you see nothing and you hear nothing of your neighbors. The only way you can get a communication across is to write a letter or email and send it to a neighbor.

So we decided that a global address space is the best middle ground. In that analogy, global address space sees everything, but does not hear the chattering amongst neighbors. All it wants to do is see everything so that it can do a GET PUT directly, do a SEND RECEIVE directly, or it can do a synchronization expediently. So a hardware-supported, global address space is one way to get the communications overhead lowered in the Exascale world. And this is especially important when you’re talking about a billion threads. Imagine trying to do a global sum on a billion threads. I hope we can code around it, but my suspicion is that there will still be applications needing to do it.
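To give a feel for why a global sum across a billion threads is a concern, here is a minimal sketch of a generic pairwise (tree) reduction. It is not SGI’s implementation or any particular MPI library’s algorithm; it simply models each round of pairwise additions as one communication step, an assumption made for illustration.

```python
def tree_sum(values):
    """Sum values pairwise in rounds; return (total, number_of_rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Each neighbouring pair is combined "in parallel" within one round.
        combined = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            combined.append(level[-1])  # odd element carries over unchanged
        level = combined
        rounds += 1
    return (level[0] if level else 0), rounds


def rounds_needed(participants: int) -> int:
    """Rounds a pairwise tree reduction needs: ceil(log2(participants))."""
    return (participants - 1).bit_length() if participants > 1 else 0


if __name__ == "__main__":
    total, rounds = tree_sum(range(1, 9))
    print(f"sum of 1..8 = {total} in {rounds} rounds")
    print(f"a billion participants need {rounds_needed(10 ** 9)} rounds")
```

Even with a logarithmic number of rounds, every round is a synchronized exchange across the whole machine, so the per-message overhead that a global address space aims to reduce is paid at every level of the tree.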

insideHPC: I can tell by your voice that you have great passion for this subject. It sounds like the next ten years are going to be very exciting for SGI.

Dr. Eng Lim Goh: Thank you, and I believe so. We sit here working with the industry, looking at the lay of the land, saying that we need to go to Exascale. And at the same time people are realizing that, OK, we first have to do R&D on power, cooling, and resiliency. Sure, SGI is there with the others working on this first set of problems, but we have also been focused on alleviating many-threaded communications overhead for 10 or 15 years already. Moreover, we now also have what we think is a practical solution to the usability problem of an Exascale system, with our big PC head node concept. So in summary, we believe SGI can be a major contributor there.

The Rich Report is produced by Rich Brueckner at Flex Rex Communications. You can follow Rich on Twitter.
