Slidecast: John Gustafson Explains Energy Efficient Unum Computing

Print Friendly, PDF & Email

51owjrzzWpLIn this slidecast, John Gustafson presents: An Energy Efficient and Massively Parallel Approach to Valid Numerics.

Written by one of the foremost experts in high-performance computing and the inventor of Gustafson’s Law, The End of Error: Unum Computing explains a new approach to computer arithmetic: the universal number (unum). The unum encompasses all IEEE floating-point formats as well as fixed-point and exact integer arithmetic. This new number type obtains more accurate answers than floating-point arithmetic yet uses fewer bits in many cases, saving memory, bandwidth, energy, and power.”

Read a Sample Chapter from the bookView the SlidesDownload the MP3Subscribe on iTunes * Subscribe to RSS

Want the book? For a limited time, insideHPC readers can save 20% on their purchase and receive free shipping by entering promo code AVP17 during check out.

Full Transcript:

insideHPC: Welcome to The Rich Report, a podcast with news and information on high performance computing. Today, my guest is John Gustafson. He is, of course, the father of Gustafson’s law, among other things, but he’s also the author of a new book. It’s called the The End of Error: Unum Computing. John, what can you tell us about this?

John Gustafson

John Gustafson

John Gustafson: I’m going to talk about a new way to do computation on computers – any numeric computation – that is more energy efficient and gets much better answers, which is, I think, what we really most need right now.

insideHPC: Yeah. John, this is exciting. I think you and I have been talking about this for a couple of years, and I remember at the time when we first brought it up that you weren’t able to break it at the time. You were very excited. That seemed to be your next step. What was the genesis of this?

I was always trying to fix all the traditional problems with things like interval arithmetic that promise that they were going to get valid answers and guaranteed answers, and you won’t have to know numerical analysis anymore, and they just didn’t work. Everything had something wrong with it. Everything had a gotcha. And I finally found one that didn’t have any gotchas and that’s when I got so excited.

insideHPC: So tell me about the book. You did two things here, right? You came up with this new method and then you had to go through that whole authoring process, didn’t you?

Yes, that’s right. CRC Press consented to do the book and they did it in 400 pages of full color, which is very generous of them, so I was able to use a lot of color illustrations to make the points clear. I wanted to have enough examples of how it actually works that it wouldn’t look like it was just a special case where only in this case or that case does it work better than existing floating-point arithmetic. It really completely covers the entire field.

insideHPC: I remember us talking about how you wanted to give this to the community, basically, this new way of doing things. Were you able to open source it? How did you go about that process?

Yes. Actually, we’re giving away the development environment which is written in Mathematica. It very clearly has an MIT license on it that gives it away, says you can sub-license it, you can sell, you can give it away for free, you just can’t sue me for anything that goes wrong. [laughter] That’s about the one thing that you can’t do.

insideHPC: But it does get the right answers, right?

That’s the whole point, right? Well, it gives bounds on answers if it can’t get you the exact answer. And that’s the real thing is that you know when– right now, floats have no indication of when they’re exact and when they’re not, and so they’re very treacherous. People don’t realize just how dangerous they are to use for serious calculations.

insideHPC: Well, great. Why don’t we go into it, John? You’ve got a lot of slides here, and I’ll put them up on our screen. I’ll just follow along, and let’s learn some more.

John Gustafson: Okay. The title is An Energy Efficient and Massively Parallel Approach to Valid Numerics. I didn’t want to use the jargon term, but it could have a subtitle of Unum Computing. The big problems facing computing, on the second slide – let me just flip through them. I think we all know what these are – is too much energy and power needed per calculation. We run out of electricity before we run out of floor space or money when we’re building a big super computer. We’ve also got a lot more hardware parallelism than we know how to use. People are proposing ten billion processors, or ten billion core machines for the exascale. I rarely see people use more than a few hundred on a given job. There’s also not enough bandwidth, and this has just been getting worse as long as you and I have been in super computing. People keep complaining that they never get enough IO to keep the arithmetic busy. It’s actually getting worse every year. Rounding errors are really a lot more dangerous than people realize. They think it’s something over in the last few decimal places, but sometimes it can be in the very first decimal place. When you have these rounding errors, it has a subtle effect that when you re-write something to run in parallel, you get a slightly different answer or sometimes a much different answer. And that means people go back to the serial method because that’s the one they trust. They don’t understand what’s going on, they don’t know whether they made a bug or whether they are just getting a rounding error.

And then there’s another kind of aspect to this, which is all the numerical methods that we have. The sampling errors in numerical methods are the other kind of error that I’m trying to eliminate. When you’re doing a physical simulation, you’re just kind of guessing your way through time steps, and you get advice and you always want more accuracy, but it’s not a proof that anything’s going to happen. It’s just guesswork. All these numerical methods that we’ve piled up over the years are really– it’s just like a bunch of recipes. And so, you have to be an expert to know exactly which recipe do you use in which situation. And that’s not a very good situation for programmer productivity either. And there’s a final problem with the IEEE floating-point standards. People don’t realize that it gives different answers on different platforms. There’s nothing in that standard that assures you you’re going to get the same bitwise identical result on different machines. Probably because there’s all these different optional things you can or can not– you don’t have to support. It’s more like the IEEE advice list instead of a standard.

Now if you take a look at – on the next slide – the ones vendors care most about, they definitely care about energy and power and about not having enough bandwidth, and the fact that people aren’t going parallel because they don’t understand what’s happening to their answers. But the IEEE floats give different answers in different platforms. But I find it’s very difficult to convince computer company executives that they need to get better answers. What they do care about is energy and power. So what I was looking for was something where– yeah. You can convince them that we want more battery life or we want to use fewer megawatts from the data center. That, they’re interested in. But they don’t hear anybody complaining that they’re getting wrong answers. Well, that’s because we think it’s kind of a hopeless thing to deal with. And it’s not hopeless at all. We can actually fix both things at once.

In the next slide, there’s just way too much power and heat needed. You see these heat sinks? They’re hundreds, even thousands of times bigger than the chip that they cool these days, so when you talk about a tiny little integrated circuit, you really got many ounces of cooling and equipment and air space that has to go with that. Right now, the DOE has proposed a 20 megawatt limit on an exascale computer just because there has to be a power budget, and making people work within that constraint is both reasonable and good discipline. But I know that when Microsoft puts a big data center up – or Google – they easily spend tens of millions of dollars just on the electricity for a data center like that one shown in the lower right. Everybody worries about their battery life – the cellphones, the tablets. Believe it or not, even [chuckles] something like Angry Birds uses too much electricity because it’s doing things with floating-point that it doesn’t need to.

So this is a very broad application area. When you get, of course, a lot of heat, it means you have to make the machine bigger, or else it will melt. And as soon as you make things bigger, you got less of a super computer because that increases the latency, and latency limits your speed. We’re all up against the speed of light, and the only real way to beat the speed of light right now is to put things closer together, and you can’t put them closer together because of the heat.

The next slide has got a beautiful picture I got from Jack Dongarra, of the Tianhe-2 super computer in China, and I think that’s right now still the number one on the Linpack list, HPL, for top 500.

insideHPC: Yes, it is.

John Gustafson: Yeah. But all these big clusters just get split up into partitions. They run Linpack, they get the press release, and then they’re immediately used for smaller jobs, because very few times do we know how to use all those cores, all those servers, on a single problem, for a single question. And so what used to be capability machines are turning into capacity machines, and that’s not a substitute.

In the next slide, we’ve got a table about not enough bandwidth – the memory wall. This is from a couple years ago, data I got from Intel. The amount of time and the amount of energy needed to do things like the arithmetic. This is kind of counter intuitive because you’d think to multiply two 15-digit numbers together and add another 15-digit number would take a long time compared to writing them down because that’s our human experience but, in fact, it’s just the backwards for modern computers. They can go hundreds of times faster at doing the multiplies and adds than just moving the data around, and that’s where all the energy is too. If you actually have to go off the chip to DRAM, that’s the expensive part. Now you’re taking about nanojoules, not picojoules, and it takes 70 times longer than to actually do the arithmetic. Your intuition is wrong if you think that the work is in the multiplies and the adds and that the goal should be more flops. It’s really data motion that’s counting everything. It got me thinking, is there a way to represent numbers that make every bit count and make them more representative? Now if you see that at the bottom of the slide there – 1200 picojoules at 3 gigahertz – that’s 36 watts, so that’s already about a third of the total power consumption of a chip. And a lot of that is because we’re using 64-bit precision everywhere. We just throw double precision at a problem, say, “I’m not sure. I don’t have any numerical analysis experts on my team. Let’s just use double precision, cross our fingers, and hope for the best.” But that’s one size fits all, it’s overkill, and it really wastes everything, and we got to stop doing that.

In the next slide, it’s got a birthday cake for floating-point, and maybe you should say 1.01 times 10 to the 2nd now because it’s 2015, but believe it or not, our current standards for floating-point come from an idea from 1914. Leonardo Torres y Quevedo, I think, was the guy who proposed having a fraction factor on automatic computing equipment. So, here we are, still using a format designed for World War I hardware capabilities. Pretty ridiculous. Now, back in those days– and you’d think people–you go to the next slide and they’ve got a picture of the Silent Speed Marchant calculator, the desktop calculator. You saw everything you were going with your calculations was right in front of you, and when you rounded, or you ran out of digits, or you overflowed, you knew it, and you could do something to correct for it because you were painstakingly going through every calculation, typing in the numbers by hand. But when you actually go automatic and you start doing millions of calculations per second – or trillions nowadays – then you don’t know what’s going on and you just hope that somebody, a long time ago, did the analysis to figure out that it was safe to do that. Sometimes it is, sometimes it’s not. Now there are flags on the processors. In the IEEE standard, there’s something called the inexact bit, there’s the overflow bit, the underflow bit. There are things like that are stored in the processor. Where are those? Rich, have you ever seen a language that lets you to get to the overflow bit [chuckles] on processors?

insideHPC: No, they don’t dive down that deep, do they?

John Gustafson: No. It’s not in C. You have to got to assembler in order to figure out a way to find out what’s that processor flag doing, and you’d have to do it after every operation. So that hasn’t worked. The other thing about these floats is that they don’t obey the laws of algebra. There’s a big difference between real math and floating-point. And knowing what the exceptions are is hard. There’s something called a NaN – that’s not a number. You need a couple of different examples of not a number, like when you divide by zero or you take the square root of negative one. You either want to stop the computer or you want to keep on going but say that’s nonsense, it’s going to be nonsense from now on, and I will give you nonsense answer. You only really need two different kinds of NaN. Well, floats have billions – depending on the precision, trillions – of kinds of NaN. And they’re all different bit patterns that could be used for real information, but instead they’re redundantly expressing a nonsense number. Yeah. That’s something I found out just in the last few years too. You really notice it when you go down to the low precision, like half precision floats, and you find that you’re wasting 6% of the bit patterns just representing NaN, and you’d like to have some more numbers in there. And as I said before, the IEEE 74 standard is really a guideline because of all the optional rules spoiling consistent results. Let me give you an analogy for printing. If you’re old enough to remember what printouts look– like they do on the left hand side here. 30 seconds per page. Remember those?

insideHPC: Oh, yeah. Big capital letters, fan-fold paper. From a fax machine. That’s how you interfaced with the device.

John Gustafson: Exactly. You waited for the thing to come out and you had 80 characters across and 66 lines down. And all you had was capital letters and numbers, and punctuation, and it all had to be mono spaced. And you waited about 30 seconds for printout. Now when IBM realized they could use lasers to print pages, all a page at a time, their vision was that they would do exactly what you see here with the black, ugly-looking capital letters, but they produced three pages per second. And they actually built laser printers that did that. But what do we have now? Well, we’ve got modern printers print about 30 seconds per page, but they print beautiful stuff with color, full color, full bleed, both sides, and all the different fonts you could possibly imagine. That’s what we did with the speed. Now why is it we can’t do that with numbers? Why is it that we can’t have better representations for arithmetic that make use of this higher speed of computers to do some of the numerical analysis for us and track their own accuracy, and actually relieve some of these other problems?

I have a kind of a funny slide here – a receipt from a retailer. Somebody tried to using floating-point [chuckles]. Oh, god. Can you imagine getting a receipt like that, with 15 decimals in what you owe? But people don’t know when to use fixed point, when to use floating-point, and they just reach for something.

Talking about the floats again, how they prevent the use of parallelism, the associative law is what’s showing here. A plus B plus C plus D. How you group that in floats, it matters a lot. Of course, it shouldn’t matter mathematically, but you’ll get a different answer in general when you put that in parallel, and so people think the parallel answer’s always the wrong one. They don’t trust that one. It’s actually usually has less error in it than the serial answer, so they go back. You think about Intel trying so hard to persuade people to accept multi-core, and even with two cores you can get into trouble like this where you get a change in the answer. Let me show you, my signature page is this picture of the ocean with the volcano erupting on it. You see there?

insideHPC: Yeah.

John Gustafson: A new number format – the unum. I’ll tell you why I used that picture in a second. I call these things unums because it’s short for universal numbers. Sort of the way the word bit is short for binary digit. I guess people have forgotten where that came from. It’s the super set of all the IEEE floating-points, both the 754 standard and the new 1788 interval standard, but it’s a lot more than that. It has all the different precisions and way beyond, both smaller precisions and larger precisions – smaller than half precisions and larger than quad precision floats – and by very, very similar rules for operation. It’s kind of a logical progression in computer history that we’ve gone from integers to floats, and now to unums. The remarkable thing about these is they don’t round, they don’t overflow, they don’t underflow. They don’t do any of the things you usually associate with floats getting you the wrong answers, and they obey the laws of algebra because they’ve got some extra [steels?] that annotate what’s going on. That makes them safe to parallelize. And this is the remarkable thing too, is that even though you’ve added some description to the field, it actually takes fewer bits on the average than a floating-point number. So why don’t we use these? Well, they’re new. And some people don’t like new [laughter]. At Intel, they said, “You can’t boil the ocean,” when I propose something like this, and so I went to the web and googled boiling the ocean and I discovered there’s all kinds of ocean boiling going on. You certainly can [laughter].

insideHPC: Very cool. John, let me ask you, are we talking about intervals here? Is that what this is based on?

Well, intervals are one thing you can represent, but it was classic interval arithmetic. You have a closed number range, like A to B, like two to three, and it’s all the numbers from two to three inclusive. But sometimes you want an open interval, and sometimes you want a half-open interval, and sometimes you want an exact number. And if you just add one bit, it turns out to the representation. You can encompass all of those possibilities with very economic storage, and it’s an option whether you work with intervals or not. You don’t have to, but when you do work with intervals, I can do it in less than half as many bits as current interval arithmetic, and I don’t get near the explosion of range that you do with interval arithmetic. That’s the real problem. People will always try interval arithmetic once, and that’s the last time they ever try it. The answer comes out. “Oh, your answer is somewhere between minus infinity and infinity,” and you say, “Thank you very much. I think I’ll go back to using floats now.” The breakthrough was finding a raft of ways to make sure that does not happen and you still get tight bounds, and you don’t have to be an expert in order to get those tight bounds.

Let me just show you why these things are more economic with three ways to express a big number. Avogadro’s number is like 6 times 10<sup>23</sup> atoms or molecules. Kind of a standard chemist big number there. And if you had to write that with an integer, look how many bits it would take you. It would take like 80 bits. Lots of digits. Now with an IEEE standard float, even though you’ve added a field to say, “Here’s the exponent. Here’s the scale factor,” It actually saves you bits because now you don’t have those nasty trailing zeroes. Next step is the unum, which only takes 29 bits. You see those three extra fields over there on the right? There’s a magenta field called the u-bit – that’s the uncertainty bit – there’s the size of the exponent in bits, and there’s a fraction size. That is self-descriptive bits that track and manage the uncertainty and let you actually automate the process of adjusting the precision to whatever accuracy you’re trying to get. There have been things a little bit like this before, like significance arithmetic and radius arithmetic. I’ve gone through them all, and they all have bugs [chuckles], let me tell you. This is the first one I was able to reduce to a mathematically perfect closed set, even when it’s only got four bits in the total expression for the number, and that’s an awfully small floating-point number, four bits [chuckles]. But it still works.

insideHPC: John, are we limited by that exponent size there? Or that’s just the expression for this particular number?

John Gustafson: You set the environment as to how many bits you want in that green and the gray field throughout the application, or you can change it during the application. But that label then goes with that block of numbers. Sort of like right now we keep track of whether a bunch of numbers are single precision or this whole block of numbers is double precision. Generally speaking, if you just add 11 bits to the end of the number, you can do all of the floating-point types and all the integer types and a whole lot of different fixed point types as well.
On the next slide, I state why unum has used fewer bits than floats. And the exponent’s generally the wasteful part of a floating-point number because you’ve got these huge ranges that they’re capable of expressing like 600 orders of magnitude in double precision. You really usually only need maybe 20. And sometimes your temporary values go up, but immediately the result goes back down. Like you square a number, add it to another, and then take a square root. The dynamic range really isn’t that big in the final result. But you can also get rid of all the trailing zeroes and the fraction. And when you have more common values, like when you use just a floating-point number to represent a half or negative one, then those can take very, very short bits. And when you subtract two similar numbers, it actually cancels a lot of similar bits. These things automatically then shrink. They reflect the fact there’s been cancellation, which informs the user, and it informs the computer that there’s been a loss of significance, but it tracks exactly what that loss is. I got a graph of what the difference is kind of as a– this is kind of a peculiar graph to explain, with these ranges as well as exact points on the next slide. Here’s a bunch of magenta points. What you’re doing with most floating-point representation is you are picking an exact number. It’s a rational number that you are representing with an exponent and a fraction. And in the case of unums, what you’re doing alternating between exact numbers and the interval between exact numbers. So if the last bit – the uncertainty bit – is a 1, then it’s a lot like the dot dot dot you write when you write 3.14… You’re saying that’s not exactly pi. Pi is 3.14… It’s not exact. There’s more digits after that. So all it takes is one bit to say there’s a dot dot dot, and it completely covers the interval between 3.14 and 3.15. I leave that bit off.

insideHPC: What we used to call significant digits. We acknowledge that they’re there.

John Gustafson: Yeah. People have asked me, “Why don’t we just add that bit to the end of the fraction like I do, and not have all that other fancy stuff?” But it’s going to be floating back and forth. You see, just as the exponent floats in the floating-point number, the significant floats, how many significant digits you have floats, too, in a unum. You have to be able to track it. That’s why you need to have all three of those fields. That, oddly enough, gives you representation of all real numbers with a finite number of bits. It’s not that they’re exact, but I completely covered the real number line from minus infinity to infinity. I have special bit strings from infinity and minus infinity because those aren’t really real numbers, but they behave very nicely in this format and still are closed under it. And then, of course, two more for the NaN numbers. You see, I’ve got just one bit string means silent NaN or signaling, and another one means signaling NaN. It means stop the calculation. I’ve got a black NaN and a red NaN there.

Here’s how you do intervals on the next page. They’re called u-bounds because you can make open-ended or closed-ended intervals, and that’s very powerful mathematically, because right now, we don’t have a way of expressing sets of real numbers on a computer [laughter]. You could haul up something like Mathematica or Maple, but if I say, “Show me all the numbers that are bigger than three.” How do you express that? Interval arithmetic would say three comma infinity, but it would include the three. Wait a minute [chuckles]. I said bigger than three [chuckles]. You lied.

insideHPC: You did.

John Gustafson: If you can’t do simple things like intersect, and union, and compliment, then you don’t really have much of a number system. With unums, you have a closed number system under set operations. That’s got all the plus or minus infinities, and the empty set, quiet signaling NaN.

I distinguished between – on the next slide – the three layers of computing because there’s the layer that we understand as humans, and it’s not necessarily printouts of numbers. They don’t have to be decimals. It could be the output you see from a video game or the sensation you get in your steering wheel when you’re driving your car. It’s all kinds of ways we use real numbers. But it’s the point where they’re perceived. We have our own grammar for expressing in exact, like I said, the dot dot dot or the maybe a little plus sign. But now you go into a computer and a computer has its representation, which right now is floats, but I’m proposing what I call the U layer, which is all unums and u-bounds, and there are very high efficiency ways of computing with floats and with real valued intervals, but every computer has a scratch pad hidden in it. They all do, and they don’t tell you what they’re doing, but like when you do a multiply of a double precision number, it actually does the multiplication to 106 bits a fraction, then it rounds and it gives you the 53. That’s all going on behind the curtain. The lack of standardization in that is what causes bitwise different results. What I’ve done is I’ve standardized what I call the G layer, which is real, mathematical, general intervals. And if you do that, it solves all kinds of problems with parallel processing giving different answers. They don’t give different answers anymore. It doesn’t matter how you order the answers because it’s mathematics instead of floats. And yet it allows you to produce a standard that, finally, for the first time, is a real standard, where you get exactly the same answer on all different computers, just like you do right now with integers. And that’s kind of revolutionary. We’ve always wanted that. Never had it.

On the next slide, I’ve got a picture of the Warlpiri. The world Warlpiri are an aboriginal people in Australia – Northern Australia – and when they first had contact with other civilizations they could count two. Anything more than two was many. So they just said, “One, two, many.” I always [laughter] thought that was funny. First heard about them in the Guinness Book of World Records. The world’s smallest [chuckles] society for mathematical counting numbers. Well, let’s see. What can you do with one, two, and many while you’ve also got need one, need two, and need many, so there’s negative numbers. Surely they’ve got a word for none and they’ve got a word for all, and maybe a word for nonsense, junk, and stop. I got it covered [chuckles]. I’ll add few, and some, and now I’ve got the whole real number line represented down there on the bottom. And guess what? It follows the floating-point convention for what it a hidden bit does, and a sign bit, and now I’ve added the uncertainty bit – the magenta one on the end – and I can cover the whole real number line in four bits.

Four bits. And that sucker is closed. It is very neat and clean under floating-point operations. It’s not very accurate [laughter] but it never lies to you. It doesn’t round. On the next one, I’ve got a more of a real sized unum that’s also fixed size. People are frightened by the fact that these are variable size. They say, “We’re going to have to manage strings of bits, and it’s going to be really hard.” They don’t have to be variable size. They can be easily compressed without loss when you want to get rid of all the crud. But if you want to expand them into a 64-bit register, then you can still save energy because you only use a fraction of the bits and you know which bits are being used and you can turn off the other ones. CPU designers are pretty good about figuring how to do something about that. But because you’ve got bits left over, you can have summary bits. Over there on the left, you see, is this number less than zero? Is it a nonsense number? Is it one of the infinities? Things like that. Or is it half of an interval? Is there another number associated with this value? That sounds like a funny thing to do just to have a single bit for all those, but it’s like a warning label on the number, and all you have to do is check one bit. Now on the bottom, I’ve got the circuit required for the question, is this an IEEE 16-bit float that equal to infinity? You actually have to go through that entry of all the input bits in 16-bits in order to figure out if that is exactly the bit pattern for positive infinity.

insideHPC: That’s a lot of energy going through all those gates, right?

John Gustafson: Yes. It’s a lot of clock delays too. And that’s just half precision, but it’s 30 times less energy and space on the chip to do it for an expanded unum here, where you just test the single bit. And you can maintain these unpacked bits very, very easily from one operation to another. In some ways, it’s making things much faster on chip as well as reducing the traffic of bits to and from memory. Once I got these things done– well, let me tell you about the Wrath of Kahan. Now that’s actually the way William Kahan pronounces his name. Berkeley professor. It’s not Kahan, it’s Kan. It’s exactly the Wrath of Kahan. Every time anybody comes out with an idea to fix floating-point arithmetic, he writes a ferocious article showing how all the ways that they’ve messed things up and made them even worse than floats. Because he was very careful in leading the IEEE 754 standards to be as good as it possibly could be. But you’ve got to remember, if you go back to the mid ’60s, the cost of four transistors was about the same as the cost of putting a new set of tires on your car. Can you believe that?

insideHPC: For four transistors?

John Gustafson: Four transistors, replace all your tires on your car. About the same cost [laughter]. Now imagine what kind of circus you’re going to build if a transistor costs as much as a tire. You’re going to be really, really careful with them. And now we’ve got transistors that cost just unbelievably small amounts of money. I guess it’s probably in the trillionths of a penny for some transistors [chuckles]. And we’re still acting like they’re really, really precious and you have to be careful how you spend them. The floating-point unit on a typical chip is less than a quarter of a square millimeter in size now. You can’t even see it. It’s getting smaller than the period at the end of a sentence. And we’re worried about using too many transistors? It’s just silly. The Unum idea is going to take more transistors, but they will typically not be as many used at a time, but sparingly. The overall effect is going to be faster execution and much less memory motion. It’s exactly what you want. You want to put more burden on the CPU and less on the memory bandwidth. And so it accomplishes that. I was excited. I think I talked to you just after August 13, 2013, when I finished the environment and actually was able to take it out for a spin, and the first thing I wanted to know is, can unum survive the Wrath of Kahan, because he’s got all these counter-examples that break everybody else’s number system. They’re designed to show terrible answers when you plug in your ideas.

On the next page, I’ve got a typical challenge. Actually, it wasn’t Kahan that came up with this. It was a French researcher, but he is the one who’s often quoted coded for it. You have this small– it’s just a one-liner program that you can define, and you try it for different values of x, and the correct answer is one no matter what. And sometimes, for some real numbers, you get one. But most of the time, if you put IEEE 32-bit numbers in there, you get zeroes. About a different a number as you can get. So try it again with 64-bit and you get zero. Fail.There’s an old adage that if you want to know something is correct, you just do it in double precision. If you get the same answer, then your single precision answer is right. But here’s a good counter-example to that. It’s a myth that you can just increase the precision and get the same answer to prove– kind of a poor man’s numerical method. You can try it quad precision, you still get zeros, unbelievably. There’s something in JavaScript, or Java, that’s called – what is that – big decimal? Yes. You can try that and even they, with hundreds and hundreds of significant digits, get the wrong answer on this. Let’s bring out interval arithmetic to the rescue. I tried that, and guess what I got? Somewhere between minus infinity and infinity [laughter]. Another epic fail for interval arithmetic. So I say, “Well, I know it will never work, but let’s try those 4-bit Warlpiri numbers that I came up with and see what they say.” I just fell off my chair because it got 1111 correct because they are completely honest about what they know, what they don’t know, which side of zero you’re on, the difference between an open interval and a closed interval, which is exactly what makes this example break. And on the average, I only used six bits per number [laughter]. So I think I’m on to something.

insideHPC: Fascinating.

John Gustafson: I tried on a bunch of Kahan problems and I could not find anything where his acid test would break my math. There’s something on the next page, and it’s a graphical version. Find the minimum of this function that’s got a log in it, but mostly it looks like x squared plus one, and so you plot that, and even with a half a million double precision IEEE floating-point numbers, it shows that the minimum is probably around x equals 0.8. See that over there on the left?

There’s that funny little dip there, and even with half a million points, it doesn’t look like it’s anything to worry about if you didn’t think too hard about the function. Actually, what happens is when you hit the number four-thirds, you’re taking the log of zero, but four-thirds, you can’t get very close to it with floats because that’s not a representable number. So what happens when you try it with unums – and I just used 30 unums or something like that – very low precision. Right at the point where it spans four-thirds, it recognizes that it goes to minus infinity, and you wouldn’t make that mistake. Over on the right. So it’s something like several tens of thousands less effort to get the right answer with unums in this case than it is with floating-point numbers. Yeah.

insideHPC: Spinning the wheels, as we would say, right?

John Gustafson: Yeah. Exactly. If it’s not worth doing, it’s not worth doing well [laughter].

insideHPC: Right. Not worth doing.

John Gustafson: There’s this statement all over the web about taking the power of a number, y to the w power. I guess Kahan uses y and w because x looks like too much like a multiply symbol. But he says no one knows how to do that. And I saw this quote on the web after I solved the problem, and I had already written a chapter in the book on it [laughter]. I was kind of surprised to see that quotation in the web. He and I had been going back and forth with email, and he was very cordial, very helpful. He sent me back what he thought was a reference to a very good way to compute y to the w. He thought it was the best way out there, and he said, “Well, can you do this?” And I sent him this example. You see this 15-decimal number to the power of 0.875. There’s another 15-decimal number. That’s exact. So if you did that with your method, wouldn’t it say it was a rounded answer and it’s not [laughter]? It’s sort of like the joke about you ask an engineer, “What’s two times three?” And they pull out a slide rule, and after a couple of seconds they say, “It’s about 6.0.” [laughter] It’s not about anything. It’s exactly six.

insideHPC: It’s exact. Yeah.

John Gustafson: Well, that’s another problem with floats. They don’t know when they’ve got the exact answer. For some reason, that was the last time I heard from him. He has not been emailing me since, so I’m afraid my dialogue has ceased and I’m really sorry to see it because I couldn’t have done any of this without his help. I really learned numerical analysis from William Kahan, starting from using his HP design calculators. He did a lot of the algorithms for the early Hewlett-Packard scientific calculators, the handheld ones from the ’70s on.
Sure. That’s the industry standard for forever. Yeah.

I think I’ll just flip to these next slides quick because I know we’re taking a while, but I said, “Two can play this game professor.” [laughter] I gave him an example that I can solve easily with unums and he can’t solve with floating-points. And if you get just a little bit of wobble in the place where they intersect, his fixed point methods fall apart badly. But if you want to use unums to imitate floating-point arithmetic because you think you can do something with them, then they can, because all you have to do is use the guess function. The guess function applied to a unum does exactly the rounding that it would do if it were a float. And it’s very easy to convert a unum back to a float. You just throw away a lot of the extra information, and what’s left is a float. So they’re a super set, and they’re a super set in the way they behave too. But you don’t have to do that. If you wanted to convert a big code to run with unum arithmetic, you could start out by doing exactly the same operations and gradually try to remove the guessing. And eventually you wind up with a code that was rigorous. And you could do it one step at a time. The guy named Siegfried Rump had talk about Rump’s Royal Pain because you get, once again, almost the same answer in various precisions, and it’s like 1.17, and with IEEE double precision, I got 1.18 times 10 to the 21st. I don’t know where the 10 21st came from, but wildly wrong. The correct answer is actually negative point eight something, so you don’t even get the correct sign.

insideHPC: Yeah, the wrong side of zero. Interesting.

John Gustafson: I tried it with unums and I got the right answer with the 23 decimals, with the 75th 5 bits per number, and it found the decimals. It found the sides to use automatically. I didn’t have to tell it. It just searched for it, and it raised the exponent side, it raised the right fraction side, reduced them when it could, and came out with the answer. That’s the way computing should be – everything adjusted automatically. The fundamental principle is that you bound the answer as tightly as possible and within the numerical environment, or else you say, “I don’t know how to do this problem.” You don’t guess anymore. We’ve got a lot of mathematical stuff that said, “Oh, this is order h to the n power error” when you do a differential equation. But you don’t know what the error is. You just know it’s the order of something and that it’ll get better if you make the spacing smaller. I can actually do much better than that with a really precise bound that says how big the error is, and measure that, and I count information as the reciprocal of the error bound size. If you define it that way, then it doesn’t matter how you get the answer, it’s how well can you bound it. And now you can talk about information per second, and information per bit, and information per watt. You’ve finally got a measure of what we’re actually computing in terms of knowledge instead of what activities are we doing, like how many flops per second are we doing. That’s why I just don’t think we need exascale computers not to do flops. We need exascale computers to give us better answers, but not flops.

I’ve got an example of polynomials. It’d be nice to be able to do polynomials right. You try that with interval arithmetic and they slop all over. Those gold rectangles I show there are where they produce a bigger bound than they should, and all you’re doing is just a quadratic. Even if you could do it perfectly, because they’ve got closed end points, they’re including a bunch of regions that you don’t want. So when I use them with unums, they give exact answers when– negative two squared is exactly four, for example. And so it doesn’t land on a range near four, but it’s very, very careful about the stuff in between negative two and negative one and a half, and it never lands on four, and it never lands on positive 2.25.

Here’s another example on the next page of a sloppy polynomial and a unum polynomial valuation there in two colors of magenta. What we’ve always wanted was math that we can actually trust on a computer. I’m running out of time here. I don’t really want to unnecessarily go into U-boxes, but this is really the way to do scientific computing, is to treat all of your answer as a multidimensional set of intervals that are one unit and last place wide. And what you get is massive amounts of parallelism when you do that. But that’s exactly what we have got. We got lots of data parallelism. And my calculus considered harmful slide is because calculus and computers really make kind of strange bed fellows. Computers are discreet numbers – stores – and calculus is continuous, and for many people, hard to understand. And that just ensures that you’re going to get into trouble because you try to take a derivative on a computer, and there’s no derivative operation on a computer, so you use differences instead, and it’s wrong. As is changing the problem to fit the tool. We’ve got a lot of wonderful mathematical machinery in calculus, but you don’t have to do things that way. You can actually keep things in discrete form, solve a discrete form on the computer, and let the discrete intervals get as small as possible, but never give up exactness. Never go to the approximation and the limits that calculus wants you to use.

On the next page, I’ve got how the error bounds are so unsatisfying that– here’s trying to just bound the value of Pi by using a quarter circle, and if you try to use the standard textbook error bound, it’s got that expression there shown with the orange background, and you really don’t know what that Greek letter [?] is. It’s somewhere in the interval, and you don’t necessarily know how to find the second derivative. There might not even be a second derivative, so that’s not much of a bound. And so [chuckles] I got no bound, which means I really have no information. But if you use the U-box method– there’s two methods called the Paint Bucket Method and the Try The Universe Method, which means try everything. You can actually bound the quarter circle example without knowing anything. Just no expertise at all. It’s just a really brute force, dumb method that makes use of the power of the computer.

And I’ve got some pictures here. I’m flipping through my slides really quick. I’m up to status connected need a seed at this point. And that’s where you start with just one point there, shown in green, surrounded by some amber trial boxes. And you test them, and the ones that turned red do not satisfy the equation. The ones that turned kind of blue green are okay. It’s sort of like the color of a go light in a traffic light. I’ve not calculated square root of one minus x squared. I just tried to see if it satisfied x squared plus y squared equals one to the precision we have to compute with. Then I’ll use that to make another trial set, and it grows across all the way until it completes the quarter circle. It looks like a fuse burning. It’s very cool to watch on the screen. And now I’ve got all these green U-boxes that tell me exactly what is the shape of the quarter circle. If you were sending this to a graphics processor like you send it to a frame buffer, it turns out you could send after you’ve coalesced– on the next slide, you’ve got to show the compressed final result. It’s like six times fewer bits than what we used to send to a frame buffer right now. So this could really accelerate all the graphics in the world and the video gaming and so on. So instead of the units of least precision, the ULP, being the source of error is actually what we’re computing with, it’s the atomic unit of computation.

How you do stuff like solve fifth degree polynomials, which is, of course, analytically there is no solution. Numerically, there’s lots of errors from rounding. But I plug in a nasty polynomial like this, it gets really close to the axis and it gives me exactly minus one. Exactly two as the solution. It’s just kind of like being a kid in a candy store and you’ve got a tool that’s powerful. All the methods that you wished would work just do, finally. It’s really the power of open closed end points.

Just a couple of examples from classical and numerical analysis, where you use time steps to estimate force – something that’s called particle pushing. You go through time step, time step, time step, and each time you’re guessing the new velocity and the new acceleration, and you accumulate rounding and sampling error, both of them not exactly known. And it might be really far away from the actual behavior which I show in green. The worst thing is you can’t do it in parallel. So what I found was a way to do all those kinds of calculations in parallel with space steps instead of time steps. You bound acceleration to velocity and it can compute all of them at once. I don’t have time to go into it here, but it’s massively parallel even when you’re doing an ordinary differential equation with like one degree of freedom. I don’t think anyone ever saw that one coming. It’s critical to be able to do this with U-boxes. I have a chapter on pendulums done right. They teach you to use a harmonic oscillator, and that’s just to fit calculus. It’s not a harmonic oscillator. It’s an approximation. It’s only a harmonic oscillator if it has almost zero amplitude swings. Everybody who’s ever used a playground swing knows that’s not the way it works. If you really solve it with unums, you get the physical truth versus the forced fit solution used to fit calculus. You can see the orange is what they tell you when you’re first taking physics, is the answer. But without using any calculus and any real sophisticated concepts, but just using the computer and basing principles of time and motion, I got lower and upper bounds there that show the real value for starting with a big swing, like a 60 degree swing. In many examples like this that I go into detail, show you can get the right answer for the first time with less effort and without even having to use calculus. Linear solver’s another example like that. It doesn’t even have to be linear. It works on nonlinear problems too. The correct answer to a set of two equations and two unknowns is shown with those two bright green boxes, and the red boxes on the outside are places where it definitely does not meet the right answer. And that little tiny black dot in the center is what a floating-point calculation will tell you. And it would only give you, effectively, one of the answers, when it’s really a region that solves the problem for a given precision.

That grey rectangle on the outside, that’s what you get with interval arithmetic. Really approximate. You can see that with these U-boxes you get the absolute best possible answer for a given amount of finite precision on a computer. I’ve got some other examples in this. Getting close to the end here. Apps with U-box solutions are photo realistic computer graphics, end body problems, structural analysis, even a perfect guess model that doesn’t use statistical mechanics. So finally, we can actually have provable bounds and all these simulations instead of having guesses.

In revisiting those big challenges from the beginning, too much energy and power, I’ve found I can cut the main energy hog, which is the data transfers, by about half. More parallelism than we know how to use. A few boxes you’ve got all the parallelism you could possibly want. You can dial it up and down according to the accuracy you want. It gives you that trade off dial. As for not enough bandwidth, well we use more CPU transistors to reduce the amount of memory so that improves the ratio now. It goes to finally the right direction. Rounding errors being treacherous? No more rounding error. Automate the precision choice. Multi-core methods? Well, we’ve got algebraic laws back, so you can do things in any order and still get the right answer. It allows you to really parallelize things you couldn’t parallelize before.

As for the sampling errors and physics simulations, I can take the guess-work and put provable bounds on them, and the bounds do not grow exponentially like they do with interval arithmetic. And the numerical methods being hard to use? Well, I’ve got just two methods, and they both work – Paint Bucket and Try The Universe. You might have a choice between which one’s faster, and that’s about it. But they both work.

Finally, the book is out as of ten days ago. It’s called The End of Error: Unum Computing, and I’ve got the link there for the CRC press site. It’s also on Amazon. This is very much aimed at the general reader. All you need is high school math, and it’s actually lots of illustrations, lots of figures. It took me about 6,000 hours to write this book. I started many, many years ago, but really didn’t get the fundamental breakthrough until August 2013. And, of course, as I said, you get the complete prototype environment. That’s for free. They’re not charging for that and you don’t have to buy the book to get it.

insideHPC: Yeah, this is awesome. John, what would it take to build a unum-based computer? Would you have to rewrite the logic or can the compilers do this?

John Gustafson: I’ve got a high-level prototype that demonstrates mathematically how it should behave. The next step is to build like a C library or C++ library and get down to the– it would specify every single shift and hand or operation you need to make logic do these things. Just the way they used to simulate floating-point when you didn’t have floating-point, you’d actually have to write a library to do the operations. And once you’ve got C, then you can start thinking about building an FPGA that does those things and see if it actually gives you some acceleration for these kinds of operations. Maybe it will be a little slower than hardware floats, but because you’ve got the assurance of the accuracy of the answer without the explosion interval arithmetic, you still might want to use that. And the last step is somebody says, “Why don’t we just make this a basic data type inside the processor?” And a lot of my book talks about how you build hardware for unum arithmetic. How you can make this real. I can’t wait to get it started, frankly [laughter].

insideHPC: Let’s say we’ve done all that, John, and you want to build a x type machine with that kind of capability. Because you’ve got all those energy parameters and everything, you’ve got meet those requirements. It seems odd, but wouldn’t this be the good place to start, do you think?

John Gustafson: I’d need to start right away because we’re trying to make it to an exascale machine by 20 whatever. You can pick the number. I hate to see that we get there and we realize that a lot of it was wasted just trying to do IEEE 64-bit floating-point operations as fast as possible. What we really needed was like five decimals and the answer. Let the computer worry about how to get it.

insideHPC: Absolutely. We’ve talked about how you might go about implementing this, but what about community? John, you put this put there for people, right? You made it understandable with high school level math. What do you think are the next steps for building a group of people who can go out and start carrying the message?

John Gustafson: That’s a great question. I don’t necessarily know the answer. I don’t want to dictate the answer. I hope that a cottage industry springs up with people trying to build their own libraries. And because I doubt that we’ll get it right the first time we do it. There’s a few choices to make, not as many as there are with like an IEEE standard. It’s much more like programming with integers. There’s one way to do it, and it’s the right way. But as people find better ways to interface it to existing languages, then I think it’ll guide the way forward. And I’m happy to help anybody who wants to create such an effort. I know of at least two appeals for government grants to put unums into a more real– put them on a real [chuckles] hardware ground, and eventually I think it’s going to happen because it’s just a good idea. I think it takes about ten years to go from having a good idea to actually people using it in general [chuckles]. That’s what I’ve seen historically. We haven’t changed that much over technological history that we are adopting things any faster. It’ll just take a while.

insideHPC: It’ll take time. Yes. John, this has been fascinating and I’d like to thank you once again for coming on the show today.

John Gustafson: I really appreciate the chance to come on. Thanks.

insideHPC: You bet. Okay, folks. That’s it for The Rich Report. Stay tuned for more news and information on high performance computing.

Sign up for our insideHPC Newsletter.

Comments

  1. Jesko Schwarzer says

    Hi John,

    interesting, but obviously I have to buy the book to implement it !?
    I don’t have the book yet, but I assume several operations to determine exact bit count etc. to get right precision and I suspect that current hardware (even FPGAs) are too unflexible due to current architecture to do it (see DSP48 from Xilinx). If you do everything in software you have the flexibility but in hardware this fades away …
    I still don’t know your solution: I would like to dig deeper to get all needed insights …

    /Jesko

  2. Hi Jesko,

    A low-precision unum environment should be easy to do on an FPGA. They can be made fixed-size for simplicity, too.

    The last time I checked, a good FPGA costs about a hundred times as much as my book, so if you’re serious about trying this, you might want to spring for a copy. 😉

    • Hi John,

      why only low precision? I would be happy to have same or better precision than double with same or similar hardware resources …
      Is there a .pdf version of you book available?

      • Both Amazon and CRC Press offer electronic versions of the book.

        You may think you need higher precision, and unums can go crazy high precision if you need them to, but the accuracy-tracking metadata often reveals that only a few spots in the calculation need that precision, and the rest can be done far more economically without information loss.