Entries filed under “HPC Hardware”

Hardware news and announcements in technologies related to HPC.

Swimming in Sensors, Drowning in Data

Dan Olds from Gabriel Consulting Group shares his amazement from a GTC 2012 talk by MotionDSP.

“Cleaning up and enhancing video is a tall order, compute-wise… But I just saw a demo of that in a GTC12 session run by MotionDSP. Their specialty is processing video streams from mobile platforms (think drones and airplanes) on the fly. We’re talking full motion, 30 frames per second video streams that are enhanced, cleaned up, and highly analyzable in real time. The amount of processing they’re doing is incredible. Lighting is enhanced, edges are enhanced, jitter is taken out, and the on-screen metadata (time, location, speed, etc.) is masked… The effect is profound. In the demo, what was once just a vague gray ship (which seemed to be vibrating like a can in a paint shaker) was clarified so that you could easily see what kind of ship it was and also see two suspicious figures milling around on deck. To me, it looked like there were enough pixels to enhance the video even further – to the point where we could identify the figures.”

Read the Full Story.

 

Also posted in Events, GPUs, GTC - GPU Technology Conference, HPC, Visualization | Leave a comment

BOXX Demos Deskside Supers using Nvidia Maximus Platform

This week BOXX Technologies is demonstrating their 3DBOXX workstation computers featuring NVIDIA Maximus technology at GTC 2012. Maximus technology enables engineers to complete simulation or rendering plus visualization simultaneously on the same workstation.

“In the past, content creation that employed both visual design and physical simulation often resulted in these tasks occurring on different systems or at different times,” said Shoaib Mohammad, BOXX VP of Marketing and Business Development. “Our integration of NVIDIA Maximus technology enables users of Autodesk, NVIDIA iray, SolidWorks, CATIA, Bunkspeed and other professional applications to design and render simultaneously, resulting in a faster, seamless workflow essential to increasing productivity.”

Read the Full Story.

Also posted in Compute, Events, GPUs, GTC - GPU Technology Conference, HPC | Leave a comment

How Computer Games Help HPC

In this special guest feature, Tom Wilkie from Scientific Computing World clears his head from all the technical information gathered at yesterday’s GTC2012 keynote.

The latest processor from Nvidia will lead to ‘the democratisation of computing happening in front of us,’ according to Jen-Hsun Huang, president and chief executive of the company.

He unveiled the new chip, known as ‘Kepler’, to an audience of nearly 3,000 scientists and engineers at Nvidia’s GPU Technology Conference in San Jose, California, on 15 May. It was, he said, more than three times as energy efficient as its predecessor.

Nvidia specialises in graphics processing units and is one of the major suppliers of computer graphics cards for PCs, but the technology is now widely used as an accelerator in high-performance computers. Kepler was, he said, the most energy efficient GPU ever built and he expected it to advance high-performance computing, computer graphics and cloud computing. In HPC, he said, ‘We know that ultimate performance is limited by energy efficiency and at the chip architecture level we have had to design for energy efficiency and this is a huge step forward.’

Among the applications in HPC that he demonstrated was a massive simulation of the collision between our own galaxy, the Milky Way, and the nearby Andromeda galaxy – an event expected some three billion years or so into the future. The simulation involved a many-body problem of millions of gravitationally interacting stars – a highly intensive computational problem.

But according to Sumit Gupta, head of Nvidia’s Tesla high-performance computing business, supercomputing will be the beneficiary of the other applications for the Kepler chip – in gaming, virtualisation and cloud computing. It is because Nvidia has such a strong presence in these high-volume consumer markets that it is able to produce its processors so cheaply. And it is this aspect, according to Gupta, that is leading to the ‘democratisation of high performance computing’ proclaimed by Huang.

‘With the same GPU,’ Gupta said, ‘we can go into many different markets. Cloud gaming will be a huge market – we are able to leverage all of these high volume markets and get into HPC at a price point other people cannot.’

Nvidia is launching two versions of the processor. One, available almost immediately, will have single precision and will be suitable for some scientific applications such as seismic profiling. The other, known as K20, will have double precision and enhanced queuing and parallelism, but it will not be available until the last quarter of this year.

He pointed out that ‘with Kepler you can build a petaflop system in just ten racks of servers. Two years ago, Tokyo Tech built a petaflop machine with Fermi [the predecessor to Kepler] and it took them 42 racks.’ To build a machine of similar performance based on Intel’s Sandy Bridge processor would take about 100 racks of servers. ‘So Kepler is 10 times better than Sandy Bridge in terms of petaflops,’ he claimed. He also said that there would be a tenfold improvement in power consumption, with a 1 petaflop Kepler-based machine consuming just 400kW as opposed to around 3MW with Sandy Bridge.

‘A petaflop machine of this size means that every university in the world can put one in,’ he said. He estimated that it would cost less than $4M for a petaflop machine, whereas in the recent past people have spent $30M to $40M to get the same performance. ‘There are universities out there that consume 400kW with a 10 rack system but they only get 20 teraflops, so they have this outlay but they are getting a twentieth of what they could be getting.’
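Taking the quoted figures at face value, the ratios behind these claims can be checked with a few lines of arithmetic. The sketch below uses only numbers from the paragraphs above; the variable names are illustrative, and none of the figures has been verified independently. By this arithmetic the power comparison comes to 7.5 times and a 20 teraflop system is a fiftieth of a petaflop, so the round numbers in the quotes are best read as approximations.

```python
# Figures exactly as quoted in the article; nothing here is measured.
KEPLER_RACKS = 10
FERMI_RACKS = 42              # Tokyo Tech machine cited by Gupta
SANDY_BRIDGE_RACKS = 100      # "about 100 racks"

KEPLER_POWER_KW = 400
SANDY_BRIDGE_POWER_KW = 3000  # "around 3MW"

PETAFLOP_IN_TERAFLOPS = 1000
UNIVERSITY_SYSTEM_TERAFLOPS = 20


def ratio(larger, smaller):
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


rack_gain_vs_fermi = ratio(FERMI_RACKS, KEPLER_RACKS)
rack_gain_vs_sandy_bridge = ratio(SANDY_BRIDGE_RACKS, KEPLER_RACKS)
power_gain_vs_sandy_bridge = ratio(SANDY_BRIDGE_POWER_KW, KEPLER_POWER_KW)
throughput_gap = ratio(PETAFLOP_IN_TERAFLOPS, UNIVERSITY_SYSTEM_TERAFLOPS)

print(f"Racks, Fermi vs Kepler:        {rack_gain_vs_fermi:.1f}x")
print(f"Racks, Sandy Bridge vs Kepler: {rack_gain_vs_sandy_bridge:.1f}x")
print(f"Power, Sandy Bridge vs Kepler: {power_gain_vs_sandy_bridge:.1f}x")
print(f"Petaflop vs 20 teraflops:      {throughput_gap:.0f}x")
```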

But Gupta promised that Kepler was only one step along the road. Although, he said, ‘from my perspective, Kepler is a bigger shift than we have ever done before – much more revolutionary – there is so much innovation for us still to do. It’s a long road.’

This story originally appeared on HPC Projects. It appears here as part of a cross-publishing agreement with Scientific Computing World.

Also posted in Events, GPUs, GTC - GPU Technology Conference, Video | 1 Comment

Video: PGI’s Michael Wolfe on OpenACC & Dynamic Parallelism in Nvidia Kepler GPUs

In this video, Michael Wolfe from The Portland Group describes the advantages of OpenACC and new Nvidia Kepler GPU features including Dynamic Parallelism and Hyper-Q.

Also posted in Events, GPUs, HPC, HPC Software, NVIDIA GPU Technology Conference, Video | 2 Comments

Nvidia’s Kepler Pushes Parallelism up to Eleven

By Timothy Prickett Morgan

When Nvidia did a preview of its next-generation “Kepler” GPU chips back in March, the company’s top brass said that they were saving some of the goodies in the Kepler design for the big event at Nvidia’s GPU Technology Conference in San Jose, which runs this week. And true to its word, the Kepler GPUs do have some goodies that will make them considerably more useful for graphics and HPC compute workloads.

The two big innovations baked into the Kepler GPU are called Hyper-Q and Dynamic Parallelism, and they are integral to the company’s plans for the Kepler GPUs to have somewhere between three and four times the performance per watt compared to the prior generation of Fermi GPUs.

Die shot of the Nvidia Kepler GPU

The first architectural change that Nvidia made is a tradeoff between clock speed and core counts that all CPU and GPU makers are wrestling with every day. Power consumption rises much faster than clock speed, so reducing the clock speed a little can have a dramatic impact on the overall power consumption of a component.

And so concurrently with the shrink from the 40 nanometer processes used with the Fermi GPUs to the 28 nanometer processes used to etch the Keplers, Nvidia is cranking up the core counts and slowing down the clock speeds, increasing the parallelism and the overall performance of GPU while significantly lowering its power draw and heat dissipation.

There are two different Kepler GPUs in development. The Kepler1 chip, also known as GK104, is aimed at graphics cards and Tesla GPU coprocessors, where single-precision floating point math is what matters most.

Until now, Nvidia has not said much about the Kepler2 GPUs – also known as GK110 internally – except that they will be tuned for double-precision floating point math and will support more GDDR5 memory, will have ECC scrubbing on that memory, will have different packaging aimed at servers, and will cost more money than Tesla cards based on the Kepler1 units. A little more info on the Kepler2 GPUs was divulged today at the GTC 2012 event, thankfully.

Nvidia’s SMX architecture for the Kepler GPU

The Fermi GPU had 512 cores, organized into groups of 32 cores known as streaming multiprocessors, or SMs, with 64KB of L1 cache per SM and a 768KB L2 cache shared across the whole chip. This was the first time that Nvidia added cache memory to the cores and made them look a lot more like standard CPUs in terms of their memory hierarchy. A Fermi GPU had 16 of these SMs and either 3GB or 6GB of GDDR5 memory.

The initial Fermis only shipped with 448 cores activated in the top-end models, but as yields improved at Taiwan Semiconductor Manufacturing Corp on its 40 nanometer process, Nvidia was able to ship chips with all 512 cores running.

The Fermis burned between 225 watts and 250 watts in a discrete graphics card and Tesla coprocessor cards; they originally ran at 1.15GHz with the 448 core version and were boosted to 1.3GHz with the 512 core variant. The 512 core Fermi GPU could do 665 gigaflops of double-precision floating point math and 1.33 teraflops at single precision.
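The peak figures quoted for the 512 core Fermi follow from the core count and clock speed. The sketch below assumes each core retires one fused multiply-add (two floating point operations) per cycle and that double precision runs at half the single-precision rate; both are assumptions layered on the article's numbers, and the function name is illustrative.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Peak throughput in gigaflops: cores x clock (GHz) x flops per cycle.

    flops_per_cycle=2 assumes one fused multiply-add per core per cycle.
    """
    return cores * clock_ghz * flops_per_cycle


FERMI_CORES = 512
FERMI_CLOCK_GHZ = 1.3

fermi_sp_gflops = peak_gflops(FERMI_CORES, FERMI_CLOCK_GHZ)
# Assumption: double precision runs at half the single-precision rate.
fermi_dp_gflops = fermi_sp_gflops / 2

print(f"Single precision: {fermi_sp_gflops / 1000:.2f} teraflops")
print(f"Double precision: {fermi_dp_gflops:.0f} gigaflops")
```

Under these assumptions the results land on the article's 1.33 teraflops and roughly 665 gigaflops.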

With the Keplers, Nvidia is moving on to what it calls an SMX, or streaming multiprocessor extreme, architecture. With the Kepler1 chips, Nvidia is putting 192 cores into a streaming multiprocessor group with slightly modified CUDA cores. Eight of these SMX units are on a single GPU chip for a total of 1,536 cores.

The cores have a base speed of 1006MHz with a turbo boost speed of 1058MHz (no, that is not much of a boost), and even given the fact that the GPU has three times as many cores, dropping the clock speed by a third means it only burns 195 watts. It therefore offers much better performance per watt – about three times, according to Sumit Gupta, senior product manager of the Tesla line at Nvidia, who spoke to El Reg ahead of the GPU Technology Conference.
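The "about three times" performance-per-watt claim can be roughly reproduced from the numbers given so far. This is a back-of-the-envelope sketch, not a benchmark: it assumes two single-precision operations per core per cycle at base clock and uses the quoted board power figures, and all names are illustrative.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Peak single-precision gigaflops, assuming one FMA per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Kepler1 (GK104) figures from the article: 8 SMX units of 192 cores.
kepler_gflops = peak_gflops(cores=8 * 192, clock_ghz=1.006)
kepler_gflops_per_watt = kepler_gflops / 195

# Full 512 core Fermi at 1.3GHz, across the quoted 225 to 250 watt range.
fermi_gflops = peak_gflops(cores=512, clock_ghz=1.3)
fermi_best_gflops_per_watt = fermi_gflops / 225
fermi_worst_gflops_per_watt = fermi_gflops / 250

gain_low = kepler_gflops_per_watt / fermi_best_gflops_per_watt
gain_high = kepler_gflops_per_watt / fermi_worst_gflops_per_watt

print(f"Kepler1: {kepler_gflops_per_watt:.1f} gigaflops per watt")
print(f"Fermi:   {fermi_worst_gflops_per_watt:.1f} to "
      f"{fermi_best_gflops_per_watt:.1f} gigaflops per watt")
print(f"Gain:    {gain_low:.1f}x to {gain_high:.1f}x")
```

On these assumptions the gain falls a little below three, which is consistent with a rounded "about three times".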

The Kepler GPUs are not just about shrinking the cores and adding more of them running at a lower speed to a GPU to boost performance. That would probably not be enough to take on the exascale computing tasks that Nvidia is wrestling with as it positions its Tesla GPU coprocessors as the preferred compute engines for future supercomputers, even if this would probably be good enough to make graphics chips that could compete against whatever Advanced Micro Devices could come up with.

One new technology that is going to make the Keplers much better than the Fermis is called Hyper-Q, and as the name suggests, it creates a queue for message passing interface (MPI) tasks running on parallel and hybrid CPU-GPU clusters so multiple MPI tasks can be dispatched from the CPU to the GPU in parallel.

This is so obvious in hindsight that you might have already been thinking that this has already happened, but Gupta says that the Fermi GPUs could only handle one MPI task at a time.

Nvidia’s Hyper-Q feature for Kepler GPUs

The Kepler GPUs, by contrast, can have up to 32 distinct MPI tasks beamed to them from the CPU and dispatch them to different segments of the GPU to have them run on isolated chunks of the cores.

It is not clear what the granularity is on the Hyper-Q function, but it is probably no coincidence that there are eight SMX units with 192 cores, and it would not be surprising that Nvidia is allowing for 48 cores to run 32 different tasks at once, effectively partitioning an SMX into four units. Those 48 cores are 50 per cent larger than an SM block on a Fermi GPU, which had 32 cores that ran about 35 per cent faster. So the net performance on this SMX sub-block and the SM block would be more or less the same.
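The partitioning guess above can be worked through numerically. The sketch below follows the article's speculation that 32 tasks share the chip evenly; this is not a confirmed Nvidia design, and the variable names are illustrative. The clock ratio is computed from the base clocks quoted earlier (1.3GHz for Fermi, 1006MHz for Kepler1).

```python
SMX_UNITS = 8
CORES_PER_SMX = 192
HYPER_Q_TASKS = 32        # maximum concurrent MPI tasks on Kepler

FERMI_SM_CORES = 32
FERMI_CLOCK_GHZ = 1.3
KEPLER_CLOCK_GHZ = 1.006

total_cores = SMX_UNITS * CORES_PER_SMX
# Speculative: assume the 32 tasks divide the cores evenly.
cores_per_task = total_cores // HYPER_Q_TASKS
partitions_per_smx = CORES_PER_SMX // cores_per_task

size_ratio = cores_per_task / FERMI_SM_CORES        # Kepler sub-block vs Fermi SM
clock_ratio = FERMI_CLOCK_GHZ / KEPLER_CLOCK_GHZ    # Fermi clock advantage
relative_throughput = size_ratio / clock_ratio      # sub-block vs Fermi SM

print(f"Cores per task:      {cores_per_task}")
print(f"Partitions per SMX:  {partitions_per_smx}")
print(f"Size ratio:          {size_ratio:.2f}")
print(f"Clock ratio:         {clock_ratio:.2f}")
print(f"Relative throughput: {relative_throughput:.2f}")
```

With these inputs a 48 core sub-block comes out slightly ahead of a Fermi SM, which fits the "more or less the same" conclusion.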

Get to work, you lazy core

No matter how Nvidia is doing it, the important thing is that the CUDA cores are not going to be sitting around tapping their feet, waiting for MPI to send them work from the CPU. While seismic workloads can already stress out a GPU dispatching one MPI task to the GPU, there are many workloads that can submit four or eight MPI tasks, says Gupta, and on the current Fermi GPU coprocessors, the efficiency for sparse matrices or finite element analysis can look “really bad”.

On the DGEMM double precision matrix multiply portion of the Linpack Fortran benchmark test, Hyper-Q helps significantly. The DGEMM to peak ratio on the Fermi GPUs was at best around 65 per cent of peak theoretical performance, while on the Kepler GPUs it is in the range of 80 to 85 per cent.

On typical workloads, customers were seeing GPU utilization on the Fermis in the range of 25 to 50 per cent, but now customers can expect – thanks to Hyper-Q and depending of course on their code – efficiencies of between 70 and 90 per cent for any particular time slice.
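A toy model shows why more hardware queues can lift utilization. It is a deliberately simplified sketch: the per-task fraction is a made-up input, scheduling overheads are ignored, and the function is hypothetical rather than a description of how Hyper-Q is actually implemented. The only figures taken from the article are the queue limits of one task for Fermi and 32 for Kepler.

```python
def gpu_utilization(per_task_fraction, tasks_submitted, hardware_queues):
    """Fraction of the GPU kept busy in a simple concurrency model.

    per_task_fraction: share of the GPU that one MPI task can fill alone.
    tasks_submitted:   MPI tasks the CPU side wants to run at once.
    hardware_queues:   tasks the GPU can accept concurrently.
    """
    concurrent = min(tasks_submitted, hardware_queues)
    return min(1.0, per_task_fraction * concurrent)


FERMI_QUEUES = 1    # one MPI task at a time, per Gupta
KEPLER_QUEUES = 32  # up to 32 distinct MPI tasks

# Hypothetical workload: each task alone fills a fifth of the GPU.
PER_TASK_FRACTION = 0.2

fermi_util = gpu_utilization(PER_TASK_FRACTION, 4, FERMI_QUEUES)
kepler_util = gpu_utilization(PER_TASK_FRACTION, 4, KEPLER_QUEUES)

print(f"Fermi, four tasks submitted:  {fermi_util:.0%}")
print(f"Kepler, four tasks submitted: {kepler_util:.0%}")
```

In this model the gain comes entirely from running the four tasks side by side instead of one after another; real codes will see less clean results.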

Not only is the Kepler GPU better at juggling work that the CPU offloads to it than the Fermi chip was, but with the Dynamic Parallelism feature of the chip, the GPU can launch work for itself as it deals with nested loops, recursion, and nested calls to libraries.

“The GPU has become more autonomous,” says Gupta, “and this makes the GPU programming a lot easier. If you have to go back and forth to the CPU all the time to run routines, you lose many of the advantages of using a GPU in the first place.” So Dynamic Parallelism gets rid of that.

Nvidia’s Dynamic Parallelism for Kepler GPUs

The idea behind Dynamic Parallelism is not just to make the GPU more autonomous for its own sake, but to allow for the granularity of calculations to reflect the density of the data that is being generated for a simulation. While this may be a little tough to grasp conceptually, one picture makes it clear why Dynamic Parallelism is a very powerful addition to the GPU toolkit:

Variable granularity is what Dynamic Parallelism does for GPUs

The driving force behind Dynamic Parallelism in the Kepler GPUs is to allow for regions of simulation to be dynamically adjusted. If you do it too coarsely, your simulation yields crap results, and if you do it too finely, you get good results but it takes forever because you are doing calculations on regions of virtual space in the simulation where nothing interesting is happening.

The idea is to do coarser calculations where space is boring and finer calculations where lots of stuff is going on, and more importantly, to allow the GPU to make decisions about the granularity of calculations on the fly. The GPU reacts to the data, launching new threads to do finer-grained calculations where required.
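The variable-granularity idea can be illustrated with a small recursive refinement routine. This is a CPU-side analogy written in plain Python, not GPU code: on Kepler the splitting decision would be made by work running on the GPU itself, which launches the finer-grained work without returning to the CPU. The one-dimensional domain, the "busy" region and all function names are hypothetical.

```python
# Hypothetical region of the simulation where lots of stuff is going on.
BUSY_START, BUSY_END = 0.7, 0.8


def is_busy(x0, x1):
    """True if the cell [x0, x1) overlaps the busy region."""
    return x0 < BUSY_END and x1 > BUSY_START


def refine(x0, x1, max_depth, depth=0):
    """Split a cell in two wherever it is busy, up to max_depth levels.

    Returns the resulting cells as a list of (start, end) tuples.
    Quiet cells are left coarse; busy cells are subdivided further.
    """
    if depth >= max_depth or not is_busy(x0, x1):
        return [(x0, x1)]
    mid = (x0 + x1) / 2
    return (refine(x0, mid, max_depth, depth + 1)
            + refine(mid, x1, max_depth, depth + 1))


cells = refine(0.0, 1.0, max_depth=3)
uniform_cell_count = 2 ** 3  # cells needed at the finest level everywhere

for start, end in cells:
    print(f"[{start:.3f}, {end:.3f})  width {end - start:.3f}")
print(f"{len(cells)} adaptive cells versus {uniform_cell_count} uniform cells")
```

The quiet left half of the domain stays as one coarse cell, while the half containing the busy region is split down to the finest level allowed.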

Add it all up, and Gupta says that the Kepler GPUs will appeal to a much broader set of calculation and simulation workloads. “All of these people who were sitting on the fence will now move to GPUs,” declares Gupta.

Well, not so fast. They will once they can get their hands on some Tesla K20 coprocessors using the Kepler2 or GK110 GPUs. These will not ship until the fourth quarter of this year, and these will offer three times the double precision performance of the Fermi GPUs – that’s just under 2 teraflops with two GK110 GPUs on a card – and have the Hyper-Q and Dynamic Parallelism features activated.
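The "just under 2 teraflops" figure follows directly from the Fermi number quoted earlier. A minimal check, assuming the threefold claim applies to the 665 gigaflops double-precision peak of the 512 core Fermi (variable names are illustrative):

```python
FERMI_DP_GFLOPS = 665   # 512 core Fermi, as quoted earlier in the article
K20_DP_MULTIPLIER = 3   # "three times the double precision performance"

k20_dp_gflops = FERMI_DP_GFLOPS * K20_DP_MULTIPLIER

print(f"Estimated Tesla K20 double precision: "
      f"{k20_dp_gflops / 1000:.3f} teraflops")
```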

In the meantime, Nvidia is packaging up the Tesla K10 coprocessor card for servers, which puts two of the Kepler1 or GK104 GPUs on a single card and offers three times the single-precision math oomph of a top-end Tesla M2090 card using the full-on Fermi GPU.

Nvidia’s Tesla K10 GPU coprocessor for single-precision math

The Tesla K10 and K20 GPU coprocessors slide into PCI-Express 3.0 slots, which means that at this point in the server cycle, they only work with Intel’s Xeon E5 family of “Sandy Bridge” processors for two-socket and four-socket servers. No other server chip is supporting PCI-Express 3.0 slots at this time.

Old Tesla M2090 versus new Tesla K10

As you can see, the Tesla K10 can’t do much in terms of double-precision math, but at 4.58 teraflops per card and 320GB/sec of memory bandwidth (that’s with ECC turned off on the GDDR5 memory) feeding those 3,072 cores on the board from the two ranks of 4GB memory (one for each GPU) and 16GB/sec of bandwidth out to the PCI bus, there are plenty of customers doing seismic, signal, image, and life sciences workloads that only use single-precision math anyway. So the Tesla K10s will be fine.
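The per-card numbers can be broken down per GPU with some simple arithmetic. The sketch below assumes two single-precision operations per core per cycle, so the clock speed it derives is an inference from the quoted figures rather than a published specification. The M2090 figure reuses the 1.33 teraflops quoted earlier for the full 512 core Fermi, and the names are illustrative.

```python
K10_SP_TFLOPS = 4.58          # per card, from the article
K10_CORES = 3072              # per card, across both GPUs
K10_GPUS = 2
K10_BANDWIDTH_GB_S = 320      # per card, ECC off
M2090_SP_TFLOPS = 1.33        # full 512 core Fermi, quoted earlier

FLOPS_PER_CYCLE = 2           # assumption: one fused multiply-add per cycle

# Clock speed implied by the peak figure, in GHz.
implied_clock_ghz = K10_SP_TFLOPS * 1000 / (K10_CORES * FLOPS_PER_CYCLE)
bandwidth_per_gpu = K10_BANDWIDTH_GB_S / K10_GPUS
speedup_vs_m2090 = K10_SP_TFLOPS / M2090_SP_TFLOPS

print(f"Implied K10 core clock:  {implied_clock_ghz * 1000:.0f}MHz")
print(f"Memory bandwidth per GPU: {bandwidth_per_gpu:.0f}GB/sec")
print(f"Single precision vs M2090: {speedup_vs_m2090:.2f}x")
```

The implied clock is well below the 1006MHz base speed quoted for Kepler1 earlier, which would suggest the server card runs its GPUs slower, and the speedup works out a little above the "three times" mentioned.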

Those doing finite element analysis, computational fluid dynamics, various physics simulations, and financial calculations and simulations that are dependent on double-precision floating point math will have to wait for the Tesla K20 cards using the Kepler2 GPUs. Perhaps not patiently, but with AMD not really doing much with its FireStream GPU coprocessors and Intel not shipping its MIC parallel X86 coprocessors, waiting is the best and pretty much the only option. ®

This article originally appeared in The Register. It appears here in its entirety as part of a cross-publishing agreement.

Also posted in Events, GPUs, GTC - GPU Technology Conference, HPC | 1 Comment

GTC 2012 Livestream Keynote Today, May 15, 2012 10:30am PDT

Nvidia CEO Jen-Hsun Huang will keynote the GTC 2012 conference this morning at 10:30am PDT. You can watch the live streaming video here.

Do not miss the opening keynote, featuring Jen-Hsun Huang, CEO and Co-Founder of NVIDIA. Hear about what’s next in computing and graphics, and preview disruptive technologies and exciting demonstrations from across industries. Jen-Hsun co-founded NVIDIA in 1993 and has served since its inception as president, chief executive officer and a member of the board of directors.

Minimum requirements to watch the stream are 400kb downstream (equivalent to DSL) and the latest Flash Player.

Also posted in Events, GPUs, GTC - GPU Technology Conference, HPC, Video | Leave a comment

Caltech Accelerates Discovery with Panasas Parallel Storage

Today Panasas announced that Caltech’s Center for Advanced Computing Research (CACR) has installed the company’s ActiveStor 11 to deliver high performance parallel storage as part of its newly upgraded HPC facilities.

“We needed a high performance storage solution that was big enough and fast enough for our I/O demands, and that would not get in the way of our research. It was key to be able to take file system usability and customer support as a given,” said Sharon Brunett, senior scientist at CACR who was tasked with the overall Panasas ActiveStor selection and installation. “ActiveStor is an extremely reliable parallel storage platform. It has eliminated many of our file system administration and system management hassles, as well as user complaints about lackluster performance and application response times.”

CACR operates large-scale computing facilities and provides support services for numerous campus research groups that require reliable I/O, including the aeronautics, applied mathematics, astronomy, biology, engineering, geophysics, materials science, and physics departments. Read the Full Story.

Also posted in HPC, Storage | Leave a comment

SGI Boosts Big Data Performance with Sandy Bridge

This week SGI announced world record benchmark performance with full support for the newest Intel Xeon processor E5-2400 and E5-4600 product families. The E5-2400 is now the base processor in the SGI Hadoop Starter Kits and is available in the SGI Rackable product line for use in other applications.

“Big Data is characterized not just by its volume but also by its velocity and variety. Moreover, Big Data can be in either structured or unstructured forms. These dynamics give rise to a broad range of demands made on a computer system, especially for high performance and comprehensive analytics,” said SGI CTO Dr. Eng Lim Goh. “Our long design relationship with Intel and the incorporation of the more robust Intel Xeon E5 processors have enabled us to develop the next SGI coherent shared memory platform that scales up even higher, in compute, memory and IO, than our previous generation. The result is a system ideally suited to meet this broad spectrum of existing and emerging Big Data challenges.”

Read the Full Story.

Also posted in Compute, HPC, inside-BigData | Leave a comment

Interview: Cray Not Becoming a Software Company

In this podcast, Mike Bernhardt from The Exascale Report sits down with Cray CEO Peter Ungaro to discuss the company’s continuing mission as a systems company. As reported here, Cray recently announced that Intel has acquired Cray’s interconnect technology, a move that has puzzled some media pundits.

Contrary to misleading rumors – Cray continues its focus as a systems company. Intel and Cray should be applauded for a smart, strategic business decision. A number of publications have voiced their opinions that Cray is shifting its focus and strategic direction to software. Some public comments have even gone as far as stating that Cray is becoming a software company.

Read the transcript (PDF) * Download the MP3 * If Dropbox is blocked, download from this Google page.

Also posted in Compute, GPUs, HPC, Network, Podcast | Leave a comment

nCore Schedules Popular Multicore Programming Course for Houston

nCore Design has announced a Programming Workshop on the PGI Accelerator with OpenACC Directives in Houston, Texas, on June 11-12, 2012. Developed in collaboration with The Portland Group, the two-day interactive workshop provides students with in-depth, hands-on lectures and laboratory exercises.

“This is a comprehensive two-day workshop that thoroughly prepares students to be successful with OpenACC and PGI tools,” said Ian Lintault, Managing Director of nCore. “We are thrilled to be able to offer this program in close cooperation with The Portland Group and NVIDIA as the demand for GPU programming increases at a steady pace.”

Register now.

Also posted in GPUs, HPC, HPC Education and Training, HPC Software | Leave a comment

Podcast: Hot Interconnects Conference Seeks Your Papers on Datacenter, Virtualization, and Cloud Networking

In this audio podcast, Patrick Geoffray and Torsten Hoefler from the Hot Interconnects Conference lay out their final Call for Papers.

Conference themes include cross-cutting issues spanning computer systems, networking technologies, and communication protocols for high-performance interconnection networks. This conference is directed particularly at new and exciting technology and product innovations in these areas. Contributions should focus on real experimental systems, prototypes, or leading-edge products and their performance evaluation.

Submissions are due May 20 * Download the MP3 * Subscribe on iTunes * If Dropbox is blocked, download from this Google page.

Also posted in Events, Hot Interconnects, HPC, Network | Leave a comment

Gearing Up for GTC 2012

In the first of a series of live posts from GTC 2012 in San Jose, Dan Olds from Gabriel Consulting kicks off our special coverage of the GPU Technology Conference.

Next week’s GPU Technology Conference, organized by NVIDIA, promises to again be the best vendor event in the industry.


Instead of trotting out customers to attest to the vendor’s greatness for 2.5 minutes, NVIDIA focuses the content of GTC on what customers are doing, what challenges they’re facing, and how they’re innovating with GPUs. Non-NVIDIA researchers and practitioners actually lead sessions, and they get down to the nitty-gritty of their projects. There is a distinct (and refreshing) lack of marketing gloss.

The Tuesday keynote by Jen-Hsun Huang is not to be missed by anyone interested in the next big, big advance in hybrid computing. Read the Full Story.

Also posted in Events, GPUs, GTC - GPU Technology Conference, HPC | Leave a comment

Video: Appro Supercomputer Solutions

In this video, Steve Lyness from Appro presents: Appro Supercomputer Solutions.

Abstract:
To survive in an ever-changing global environment, creating and delivering innovative products and services are what give any business the competitive edge in today’s global markets. In this presentation, you will learn how Appro, a US based High Performance Computing company met the supercomputing requirements of the University Of Tsukuba Center Of Computational Sciences in Japan. Learn how reliability, availability, manageability and compatibility were essential for the successful 800TF hybrid supercomputing implementation. Learn best practices on improving data I/O performance and memory size limitations configured with Lustre™ File System to offer the best performance per dollar with excellent memory capacity per FLOP. Explore how the University of Tsukuba’s Appro Xtreme-X™ Supercomputer is accelerating large scale parallel code by combining CPU/GPU processing cluster configurations and how this implementation will be used as a pioneer for a competitive advantage for future exascale computing systems.

Recorded at the 2012 National HPCC Conference in Newport.

Also posted in Compute, Events, HPC, National HPCC Conference, Video | Leave a comment

Mellanox Expands Line of FDR 56Gb/s InfiniBand Switches

This week Mellanox expanded its line of end-to-end FDR 56Gb/s InfiniBand interconnect solutions with new 18-port, 108-port, 216-port, and 324-port non-blocking fixed and modular switches. Built with Mellanox’s 5th generation SwitchX InfiniBand technology, the switches are an ideal choice for building small to medium size clusters or for use as a core switch for large clusters.

“As servers are deployed with next generation processors and PCIe 3.0, data center managers have an increased need for bandwidth, performance and density in their interconnect solutions,” said David Barzilai, vice president of marketing at Mellanox Technologies. “These new switches were developed as high-speed smart solutions to meet customer demand, while providing unmatched scalability across storage, application and database servers.”

Read the Full Story.

Also posted in HPC, Network | Leave a comment

Nvidia Contributes CUDA Compiler To Open Source Community

In a move to greatly expand the number of programming languages that can take advantage of GPU acceleration, Nvidia today announced that the LLVM open source compiler now supports CUDA. The company has worked with LLVM developers to provide the CUDA compiler source code changes to the LLVM core and parallel thread execution backend. As a result, programmers can develop applications for GPU accelerators using a broader selection of programming languages, making GPU computing more accessible and pervasive than ever before.

“The code we provided to LLVM is based on proven, mainstream CUDA products, giving programmers the assurance of reliability and full compatibility with the hundreds of millions of NVIDIA GPUs installed in PCs and servers today,” said Ian Buck, general manager of GPU computing software at NVIDIA. “This is truly a game-changing milestone for GPU computing, giving researchers and programmers an incredible amount of flexibility and choice in programming languages and hardware architectures for their next-generation applications.”

LLVM supports a wide range of programming languages and front ends, including C/C++, Objective-C, Fortran, Ada, Haskell, Java bytecode, Python, Ruby, ActionScript, GLSL and Rust. It is also the compiler infrastructure NVIDIA uses for its CUDA C/C++ architecture, and it has been widely adopted by leading companies such as Apple, AMD and Adobe.

Read the Full Story. To download the latest version of the LLVM compiler with NVIDIA GPU support, visit the LLVM site.

Also posted in GPUs, HPC, HPC Software | Leave a comment


insideHPC.com is a production of insideHPC, LLC. © 2006-2011