Entries filed under “Events”

Announcements of upcoming events in HPC and reports from conferences, meetings, and workshops.

Hybrid Computing’s Radical Growth

 

Our GTC 2012 coverage continues as Dan Olds examines the growth of the CUDA environment from 150,000 downloads in 2007 to 1.5 million today:

“More importantly, there are 35 NVIDIA-fueled hybrid supercomputers on the Top500 list today. The NUDT Tianhe-1A system, with 14,300 CPUs and 7,100 NVIDIA GPUs, held down the top spot on the list in 2010. The upcoming Oak Ridge Titan system will sport more than 18,000 CPUs alongside 18,000 GPUs, and should become the fastest supercomputer in the world sometime this fall.”

Read the Full Story.

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Software | Leave a comment

New Whitepaper: Dynamic Parallelism in CUDA

The details on Dynamic Parallelism were hard to find after the new feature was introduced as part of the GTC 2012 keynote yesterday. Now Nvidia has followed up with a short whitepaper that describes how it works.

Dynamic Parallelism in CUDA is supported via an extension to the CUDA programming model that enables a CUDA kernel to create and synchronize new nested work. Basically, a child CUDA Kernel can be called from within a parent CUDA kernel and then optionally synchronize on the completion of that child CUDA Kernel. The parent CUDA kernel can consume the output produced from the child CUDA Kernel, all without CPU involvement.

Download the whitepaper (PDF).

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware, HPC Software | Leave a comment

Swimming in Sensors, Drowning in Data

Dan Olds from Gabriel Consulting Group shares his amazement at a GTC 2012 talk by MotionDSP.

“Cleaning up and enhancing video is a tall order, compute-wise… But I just saw a demo of that in a GTC12 session run by MotionDSP. Their specialty is processing video streams from mobile platforms (think drones and airplanes) on the fly. We’re talking full motion, 30 frames per second video streams that are enhanced, cleaned up, and highly analyzable in real time. The amount of processing they’re doing is incredible. Lighting is enhanced, edges are enhanced, jitter is taken out, and the on-screen metadata (time, location, speed, etc.) is masked… The effect is profound. In the demo, what was once just a vague gray ship (which seemed to be vibrating like a can in a paint shaker) was clarified so that you could easily see what kind of ship it was and also see two suspicious figures milling around on deck. To me, it looked like there were enough pixels to enhance the video even further – to the point where we could identify the figures.”

Read the Full Story.

 

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware, Visualization | Leave a comment

BOXX Demos Deskside Supers using Nvidia Maximus Platform

This week BOXX Technologies is demonstrating their 3DBOXX workstation computers featuring NVIDIA Maximus technology at GTC 2012. Maximus technology enables engineers to complete simulation or rendering plus visualization simultaneously on the same workstation.

“In the past, content creation that employed both visual design and physical simulation often resulted in these tasks occurring on different systems or at different times,” said Shoaib Mohammad, BOXX VP of Marketing and Business Development. “Our integration of NVIDIA Maximus technology enables users of Autodesk, NVIDIA iray, SolidWorks, CATIA, Bunkspeed and other professional applications to design and render simultaneously, resulting in a faster, seamless workflow essential to increasing productivity.”

Read the Full Story.

Also posted in Compute, GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware | Leave a comment

How Computer Games Help HPC

In this special guest feature, Tom Wilkie from Scientific Computing World clears his head from all the technical information gathered at yesterday’s GTC 2012 keynote.

The latest processor from Nvidia will lead to ‘the democratisation of computing happening in front of us,’ according to Jen-Hsun Huang, president and chief executive of the company.

He unveiled the new chip, known as ‘Kepler’, to an audience of nearly 3,000 scientists and engineers at Nvidia’s GPU Technology Conference in San Jose, California, on 15 May. It was, he said, more than three times as energy efficient as its predecessor.

Nvidia specialises in graphics processing units and is one of the major suppliers of computer graphics cards for PCs, but the technology is now widely used as an accelerator in high performance computers. Kepler was, he said, the most energy efficient GPU ever built and he expected it to advance high-performance computing, computer graphics and cloud computing. In HPC, he said, ‘We know that ultimate performance is limited by energy efficiency and at the chip architecture level we have had to design for energy efficiency and this is a huge step forward.’

Among the applications in HPC that he demonstrated was a massive simulation of the collision between our own galaxy, the Milky Way, and the nearby Andromeda galaxy – an event expected some three billion years or so into the future. The simulation involved a many-body problem of millions of gravitationally interacting stars – a highly intensive computational problem.

But according to Sumit Gupta, head of Nvidia’s Tesla high-performance computing business, supercomputing will be the beneficiary of the other applications for the Kepler chip – in gaming, virtualisation and cloud computing. It is because Nvidia has such a strong presence in these high-volume consumer markets that it is able to produce its processors so cheaply. And it is this aspect, according to Gupta, that is leading to the ‘democratisation of high performance computing’ proclaimed by Huang.

‘With the same GPU,’ Gupta said, ‘we can go into many different markets. Cloud gaming will be a huge market – we are able to leverage all of these high volume markets and get into HPC at a price point other people cannot.’

Nvidia is launching two versions of the processor. One, available almost immediately, will have single precision and will be suitable for some scientific applications such as seismic profiling. The other, known as K20, will have double precision and enhanced queuing and parallelism, but it will not be available until the last quarter of this year.

He pointed out that ‘with Kepler you can build a petaflop system in just ten racks of servers. Two years ago, Tokyo Tech built a petaflop machine with Fermi [the predecessor to Kepler] and it took them 42 racks.’ To build a machine of similar performance based on Intel’s Sandy Bridge processor would take about 100 racks of servers. ‘So Kepler is 10 times better than Sandy Bridge in terms of petaflops,’ he claimed. He also said that there would be a tenfold improvement in power consumption, with a 1 petaflop Kepler-based machine consuming just 400 kW as opposed to around 3 MW with Sandy Bridge.

‘A petaflop machine of this size means that every university in the world can put one in,’ he said. He estimated that it would cost less than $4M for a petaflop machine, whereas in the recent past people have spent $30M to $40M to get the same performance. ‘There are universities out there that consume 400kW with a 10 rack system but they only get 20 teraflops, so they have this outlay but they are getting a twentieth of what they could be getting.’
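The ratios behind these claims are easy to check. The short Python sketch below simply divides the figures quoted above; the variable names are illustrative, and the inputs are Gupta's round numbers rather than measurements.

```python
# Figures as quoted by Gupta for a 1 petaflop system (round numbers, not measurements).
racks = {"Kepler": 10, "Fermi (Tokyo Tech)": 42, "Sandy Bridge": 100}
power_kw = {"Kepler": 400, "Sandy Bridge": 3000}

# Rack count of each option relative to the Kepler system.
rack_ratio = {name: n / racks["Kepler"] for name, n in racks.items()}

# Power ratio: 3 MW versus 400 kW.
power_ratio = power_kw["Sandy Bridge"] / power_kw["Kepler"]

# The university example: 20 teraflops versus 1 petaflop (1,000 teraflops)
# for the same 400 kW and ten racks.
university_fraction = 20 / 1000

print(rack_ratio)
print(power_ratio)
print(university_fraction)
```

On these inputs the rack advantage over Sandy Bridge is the quoted factor of ten, while the power ratio comes out at 7.5 and the university example at 0.02, so the 'tenfold' and 'twentieth' in the quotes are probably best read as round figures.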

But Gupta promised that Kepler was only one step along the road. Although, he said, ‘from my perspective, Kepler is a bigger shift than we have ever done before – much more revolutionary – there is so much innovation for us still to do. It’s a long road.’

This story originally appeared on HPC Projects. It appears here as part of a cross-publishing agreement with Scientific Computing World.

Also posted in GPUs, GTC - GPU Technology Conference, HPC Hardware, Video | 1 Comment

Video: The Future is Parallel, and the Future of Parallel is Declarative

In this video, Simon Peyton Jones from Microsoft Research presents: The Future is Parallel, and the Future of Parallel is Declarative.

If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (i.e. functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet… functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll give a whirlwind tour of a whole range of ways of writing parallel programs in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.

Recorded at the 2011 YOW! Australia Software Developer Conference.

Also posted in HPC, HPC Software | Leave a comment

Video: PGI’s Michael Wolfe on OpenACC & Dynamic Parallelism in Nvidia Kepler GPUs

In this video, Michael Wolfe from The Portland Group describes the advantages of OpenACC and new Nvidia Kepler GPU features including Dynamic Parallelism and Hyper-Q.

Also posted in GPUs, HPC, HPC Hardware, HPC Software, NVIDIA GPU Technology Conference, Video | 2 Comments

Nvidia’s Kepler Pushes Parallelism up to Eleven

By Timothy Prickett Morgan

When Nvidia did a preview of its next-generation “Kepler” GPU chips back in March, the company’s top brass said that they were saving some of the goodies in the Kepler design for the big event at Nvidia’s GPU Technology Conference in San Jose, which runs this week. And true to its word, the Kepler GPUs do have some goodies that will make them considerably more useful for graphics and HPC compute workloads.

The two big innovations baked into the Kepler GPU are called Hyper-Q and Dynamic Parallelism, and they are integral to the company’s plans for the Kepler GPUs to have somewhere between three and four times the performance per watt compared to the prior generation of Fermi GPUs.

Die shot of the Nvidia Kepler GPU

The first architectural change that Nvidia made is a tradeoff between clock speed and core counts that all CPU and GPU makers are wrestling with every day. Power consumption rises much faster than linearly with clock speed, because higher clocks generally need higher voltages, so reducing the clock speed a little can have a dramatic impact on the overall power consumption of a component.
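A common first-order model makes the tradeoff concrete: dynamic power scales roughly as capacitance times voltage squared times frequency, and if supply voltage has to track clock frequency, power grows with about the cube of the clock. The Python sketch below uses that textbook approximation, not Nvidia's data; the function name and the assumption that voltage scales one-for-one with frequency are illustrative.

```python
def relative_dynamic_power(clock_scale, cores_scale=1.0):
    """Relative dynamic power under P ~ cores * V^2 * f, assuming V scales with f.

    Both arguments are ratios against a baseline chip (1.0 = unchanged).
    This is a textbook approximation that ignores leakage and process changes.
    """
    voltage_scale = clock_scale          # assumption: voltage tracks frequency
    return cores_scale * voltage_scale ** 2 * clock_scale


# Slowing the clock by 10 per cent cuts modelled power by about 27 per cent.
print(round(relative_dynamic_power(0.9), 3))         # 0.729

# Two-thirds of the clock with three times the cores: about twice the
# throughput (cores * clock) for under 90 per cent of the modelled power.
print(round(relative_dynamic_power(2 / 3, 3.0), 3))  # 0.889
```

Real chips also change process node and leakage between generations, so this is only a feel for why more, slower cores can win on performance per watt.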

And so concurrently with the shrink from the 40 nanometer processes used with the Fermi GPUs to the 28 nanometer processes used to etch the Keplers, Nvidia is cranking up the core counts and slowing down the clock speeds, increasing the parallelism and the overall performance of the GPU while significantly lowering its power draw and heat dissipation.

There are two different Kepler GPUs in development. The Kepler1 chip, also known as GK104, is aimed at graphics cards and Tesla GPU coprocessors, where single-precision floating point math is what matters most.

Until now, Nvidia has not said much about the Kepler2 GPUs – also known as GK110 internally – except that they will be tuned for double-precision floating point math and will support more GDDR5 memory, will have ECC scrubbing on that memory, will have different packaging aimed at servers, and will cost more money than Tesla cards based on the Kepler1 units. A little more info on the Kepler2 GPUs was divulged today at the GTC 2012 event, thankfully.

Nvidia’s SMX architecture for the Kepler GPU

The Fermi GPU had 512 cores, arranged in groups of 32 known as streaming multiprocessors, or SMs, with 64KB of L1 cache and shared memory per SM and a 768KB L2 cache shared across the chip. This was the first time that Nvidia added cache memory to the cores and made them look a lot more like standard CPUs in terms of their memory hierarchy. A Fermi GPU had 16 of these SMs and either 3GB or 6GB of GDDR5 memory.

The initial Fermis only shipped with 448 cores activated in the top-end models, but as yields improved at Taiwan Semiconductor Manufacturing Corp on its 40 nanometer process, Nvidia was able to ship chips with all 512 cores running.

The Fermis burned between 225 watts and 250 watts in a discrete graphics card and Tesla coprocessor cards; they originally ran at 1.15GHz with the 448 core version and were boosted to 1.3GHz with the 512 core variant. The 512 core Fermi GPU could do 665 gigaflops of double-precision floating point math and 1.33 teraflops at single precision.
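These peak figures follow from a simple product: cores times clock times floating point operations per core per clock. The Python sketch below assumes each core retires one fused multiply-add (two operations) per clock at single precision and that double precision runs at half that rate; both are assumptions stated for illustration rather than figures from the article.

```python
def peak_gflops(cores, clock_ghz, flops_per_clock=2.0):
    """Theoretical peak in gigaflops: cores * clock (GHz) * flops per core per clock."""
    return cores * clock_ghz * flops_per_clock


fermi_sp = peak_gflops(512, 1.3)        # single precision, one FMA per clock (assumed)
fermi_dp = peak_gflops(512, 1.3, 1.0)   # double precision at half rate (assumed)

print(round(fermi_sp))  # 1331, i.e. about 1.33 teraflops
print(round(fermi_dp))  # 666, in line with the quoted 665 gigaflops
```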

With the Keplers, Nvidia is moving on to what it calls an SMX, or streaming multiprocessor extreme, architecture. With the Kepler1 chips, Nvidia is putting 192 cores into a streaming multiprocessor group with slightly modified CUDA cores. Eight of these SMX units are on a single GPU chip for a total of 1,536 cores.

The cores have a base speed of 1006MHz with a turbo boost speed of 1058MHz (no, that is not much of a boost), and even given the fact that the GPU has three times as many cores, dropping the clock speed by a third means it only burns 195 watts. It therefore offers much better performance per watt – about three times, according to Sumit Gupta, senior product manager of the Tesla line at Nvidia, who spoke to El Reg ahead of the GPU Technology Conference.

The Kepler GPUs are not just about shrinking the cores and adding more of them running at a lower speed to a GPU to boost performance. That would probably not be enough to take on the exascale computing tasks that Nvidia is wrestling with as it positions its Tesla GPU coprocessors as the preferred compute engines for future supercomputers, even if this would probably be good enough to make graphics chips that could compete against whatever Advanced Micro Devices could come up with.

One new technology that is going to make the Keplers much better than the Fermis is called Hyper-Q, and as the name suggests, it creates a queue for message passing interface (MPI) tasks running on parallel and hybrid CPU-GPU clusters so multiple MPI tasks can be dispatched from the CPU to the GPU in parallel.

This is so obvious in hindsight that you might have already been thinking that this has already happened, but Gupta says that the Fermi GPUs could only handle one MPI task at a time.

Nvidia’s Hyper-Q feature for Kepler GPUs

The Kepler GPUs, by contrast, can have up to 32 distinct MPI tasks beamed to them from the CPU and dispatch them to different segments of the GPU to have them run on isolated chunks of the cores.
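Why more hardware queues help can be seen in a toy model. The Python sketch below is not a description of Nvidia's scheduler: it assumes every MPI task can fill only a fixed fraction of the GPU, that tasks all take the same time, and that the number running at once is capped by the queue count and by how many fit on the chip. The function and parameter names are hypothetical.

```python
import math


def makespan(num_tasks, gpu_fraction, task_time, hw_queues):
    """Toy model: each task occupies `gpu_fraction` of the GPU for `task_time`.

    Tasks run concurrently up to the number of hardware queues and the
    number that fit on the GPU at once. Returns (total_time, utilisation).
    """
    fit_on_gpu = max(1, math.floor(1.0 / gpu_fraction + 1e-9))
    concurrent = min(hw_queues, fit_on_gpu, num_tasks)
    waves = math.ceil(num_tasks / concurrent)
    total_time = waves * task_time
    utilisation = num_tasks * gpu_fraction * task_time / total_time
    return total_time, utilisation


# Eight MPI tasks, each able to fill only a quarter of the GPU.
print(makespan(8, 0.25, 1.0, hw_queues=1))    # (8.0, 0.25)
print(makespan(8, 0.25, 1.0, hw_queues=32))   # (2.0, 1.0)
```

With a single queue the tasks run one after another and three quarters of the modelled GPU sits idle; with 32 queues four tasks share the chip at a time and the same work finishes in a quarter of the time.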

It is not clear what the granularity is on the Hyper-Q function, but it is probably no coincidence that there are eight SMX units with 192 cores, and it would not be surprising that Nvidia is allowing for 48 cores to run 32 different tasks at once, effectively partitioning an SMX into four units. Those 48 cores are 50 per cent larger than an SM block on a Fermi GPU, which had 32 cores that ran about 35 per cent faster. So the net performance on this SMX sub-block and the SM block would be more or less the same.
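The arithmetic behind that guess can be written out. The Python sketch below assumes the even split speculated above (the real Hyper-Q granularity is not stated) and uses cores times relative clock as a crude throughput proxy; the variable names are illustrative.

```python
smx_units = 8
cores_per_smx = 192
hyper_q_tasks = 32

total_cores = smx_units * cores_per_smx               # 1,536 cores on the chip
cores_per_task = total_cores // hyper_q_tasks         # 48 cores if split evenly
partitions_per_smx = cores_per_smx // cores_per_task  # 4 partitions per SMX

fermi_sm_cores = 32
fermi_clock_advantage = 1.35   # "about 35 per cent faster", as quoted

# Throughput proxy: cores * relative clock.
kepler_block = cores_per_task * 1.0
fermi_block = fermi_sm_cores * fermi_clock_advantage

print(cores_per_task, partitions_per_smx)       # 48 4
print(round(kepler_block / fermi_block, 2))     # 1.11
```

On this proxy a 48-core slice of an SMX comes out about 11 per cent ahead of a Fermi SM, which fits the "more or less the same" verdict.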

Get to work, you lazy core

No matter how Nvidia is doing it, the important thing is that the CUDA cores are not going to be sitting around tapping their feet, waiting for MPI to send them work from the CPU. While seismic workloads can already stress out a GPU dispatching one MPI task to the GPU, there are many workloads that can submit four or eight MPI tasks, says Gupta, and on the current Fermi GPU coprocessors, the efficiency for sparse matrices or finite element analysis can look “really bad”.

On the DGEMM double precision matrix multiply portion of the Linpack Fortran benchmark test, Hyper-Q helps significantly. The DGEMM to peak ratio on the Fermi GPUs was at best around 65 per cent of peak theoretical performance, while on the Kepler GPUs it is in the range of 80 to 85 per cent.

On typical workloads, customers were seeing GPU utilization on the Fermis in the range of 25 to 50 per cent, but now customers can expect – thanks to Hyper-Q and depending of course on their code – efficiencies of between 70 and 90 per cent for any particular time slice.

Not only is the Kepler GPU better at juggling work that the CPU offloads to it than the Fermi chip was, but with the Dynamic Parallelism feature of the chip, the GPU can launch work for itself as it deals with nested loops, recursion, and nested calls to libraries.

“The GPU has become more autonomous,” says Gupta, “and this makes the GPU programming a lot easier. If you have to go back and forth to the CPU all the time to run routines, you lose many of the advantages of using a GPU in the first place.” So Dynamic Parallelism gets rid of that.

Nvidia’s Dynamic Parallelism for Kepler GPUs

The idea behind Dynamic Parallelism is not just to make the GPU more autonomous for its own sake, but to allow for the granularity of calculations to reflect the density of the data that is being generated for a simulation. While this may be a little tough to grasp conceptually, one picture makes it clear why Dynamic Parallelism is a very powerful addition to the GPU toolkit:

Variable granularity is what Dynamic Parallelism does for GPUs

The driving force behind Dynamic Parallelism in the Kepler GPUs is to allow for regions of simulation to be dynamically adjusted. If you do it too coarsely, your simulation yields crap results, and if you do it too finely, you get good results but it takes forever because you are doing calculations on regions of virtual space in the simulation where nothing interesting is happening.

The idea is to do coarser calculations where space is boring and finer calculations where lots of stuff is going on, and more importantly, to allow the GPU to make decisions about the granularity of calculations on the fly. The GPU reacts to the data, launching new threads to do finer-grained calculations where required.
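The idea can be sketched on the CPU with ordinary recursion. The Python below is only an analogy, not CUDA: each call stands in for a kernel that inspects its own region and, where the data varies a lot, launches finer-grained child work without going back to a coordinator. The `refine` function, the `threshold` parameter and the sample field are all hypothetical.

```python
def refine(field, lo, hi, threshold, min_width=1):
    """Return the list of (lo, hi) intervals chosen for a 1-D field.

    A region is split in two when its values vary by more than `threshold`,
    mimicking a parent kernel launching child kernels where the data is dense.
    """
    values = field[lo:hi]
    variation = max(values) - min(values)
    if variation <= threshold or hi - lo <= min_width:
        return [(lo, hi)]                       # boring region: keep it coarse
    mid = (lo + hi) // 2
    # Interesting region: recurse, the analogue of a nested kernel launch.
    return (refine(field, lo, mid, threshold, min_width)
            + refine(field, mid, hi, threshold, min_width))


# Flat on the left, busy on the right.
field = [0, 0, 0, 0, 0, 0, 0, 0, 1, 5, 2, 9, 0, 7, 3, 8]
cells = refine(field, 0, len(field), threshold=1)
print(cells)
```

The flat left half stays as one coarse cell, (0, 8), while the busy right half is split down to single elements, giving nine cells in total instead of sixteen uniform ones.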

Add it all up, and Gupta says that the Kepler GPUs will appeal to a much broader set of calculation and simulation workloads. “All of these people who were sitting on the fence will now move to GPUs,” declares Gupta.

Well, not so fast. They will once they can get their hands on some Tesla K20 coprocessors using the Kepler2 or GK110 GPUs. These will not ship until the fourth quarter of this year, and these will offer three times the double precision performance of the Fermi GPUs – that’s just under 2 teraflops with two GK110 GPUs on a card – and will have the Hyper-Q and Dynamic Parallelism features activated.

In the meantime, Nvidia is packaging up the Tesla K10 coprocessor card for servers, which puts two of the Kepler1 or GK104 GPUs on a single card and offers three times the single-precision math oomph of a top-end Tesla M2090 card using the full-on Fermi GPU.

Nvidia’s Tesla K10 GPU coprocessor for single-precision math

The Tesla K10 and K20 GPU coprocessors slide into PCI-Express 3.0 slots, which means that at this point in the server cycle, they only work with Intel’s Xeon E5 family of “Sandy Bridge” processors for two-socket and four-socket servers. No other server chip is supporting PCI-Express 3.0 slots at this time.

Old Tesla M2090 versus new Tesla K10

As you can see, the Tesla K10 can’t do much in terms of double-precision math, but at 4.58 teraflops per card and 320GB/sec of memory bandwidth (that’s with ECC turned off on the GDDR5 memory) feeding those 3,072 cores on the board from the two ranks of 4GB memory (one for each GPU) and 16GB/sec of bandwidth out to the PCI bus, there are plenty of customers doing seismic, signal, image, and life sciences workloads that only use single-precision math anyway. So the Tesla K10s will be fine.
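Dividing the per-card figures by the two GPUs gives a feel for each chip. The Python sketch below assumes two floating point operations (one fused multiply-add) per core per clock, which is an assumption for illustration; the clock it prints is merely implied by the quoted numbers and is not taken from a specification.

```python
card_sp_teraflops = 4.58      # per Tesla K10 card, as quoted
card_bandwidth_gbs = 320.0    # GB/sec per card with ECC off, as quoted
cores_on_card = 3072
gpus_per_card = 2
flops_per_core_per_clock = 2  # assumed: one fused multiply-add per clock

per_gpu_teraflops = card_sp_teraflops / gpus_per_card
per_gpu_bandwidth = card_bandwidth_gbs / gpus_per_card

# Clock implied by peak = cores * clock * flops per clock.
implied_clock_mhz = card_sp_teraflops * 1e6 / (cores_on_card * flops_per_core_per_clock)

# Memory bytes available per single-precision flop at peak.
bytes_per_flop = card_bandwidth_gbs / (card_sp_teraflops * 1000)

print(per_gpu_teraflops, per_gpu_bandwidth)   # 2.29 160.0
print(round(implied_clock_mhz))               # 745
print(round(bytes_per_flop, 3))               # 0.07
```

Under that assumption the cores on the K10 would be clocked well below the 1006MHz base speed mentioned earlier, and each flop at peak has about 0.07 bytes of memory bandwidth behind it.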

Those doing finite element analysis, computational fluid dynamics, various physics simulations, and financial calculations and simulations that are dependent on double-precision floating point math will have to wait for the Tesla K20 cards using the Kepler2 GPUs. Perhaps not patiently, but with AMD not really doing much with its FireStream GPU coprocessors and Intel not shipping its MIC parallel X86 coprocessors, waiting is the best and pretty much the only option. ®

This article originally appeared in The Register. It appears here in its entirety as part of a cross-publishing agreement.

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware | 1 Comment

GTC 2012 Livestream Keynote Today, May 15, 2012 10:30am PDT

Nvidia CEO Jen-Hsun Huang will keynote the GTC 2012 conference this morning at 10:30am PDT. You can watch the live streaming video here.

Do not miss the opening keynote, featuring Jen-Hsun Huang, CEO and Co-Founder of NVIDIA. Hear about what’s next in computing and graphics, and preview disruptive technologies and exciting demonstrations from across industries. Jen-Hsun co-founded NVIDIA in 1993 and has served since its inception as president, chief executive officer and a member of the board of directors.

Minimum requirements to watch the stream are a 400kb downstream connection (equivalent to DSL) and the latest Flash Player.

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware, Video | Leave a comment

Video: Lustre as a Data Acquisition File System at Diamond Light Source

In this video, Frederik Ferner from Diamond Light Source presents: Lustre as a Data Acquisition File System at Diamond Light Source. Recorded at LUG 2012 in Austin.

Note: Most of the videos from LUG 2012 are now posted at the OpenSFS site.

Also posted in HPC, HPC Software, LUG 2012, Video | Leave a comment

Nvidia Brings Eclipse to Nsight IDE

Today Nvidia announced their new Nsight, Eclipse Edition, an integrated development environment (IDE) for developing GPU accelerated applications on Linux- and Mac OS-based systems. With powerful visual profiling and debugging tools, the new IDE enables developers to write, debug and optimize the performance of GPU-accelerated applications within a familiar open source Eclipse framework.

“NVIDIA Nsight is the ultimate development platform for heterogeneous computing,” said Ian Buck, general manager of GPU computing software at NVIDIA. “Whether you’re a graphics or HPC developer, Nsight makes it easy to develop parallel code for GPUs and CPUs using your preferred IDE.”

The company also announced an updated version of NVIDIA Nsight, Visual Studio Edition for Microsoft Windows developers. Formerly known as NVIDIA Parallel Nsight, the new version adds a number of enhancements designed to ease parallel programming on GPU-based Windows systems. Read the Full Story.

Also posted in GTC - GPU Technology Conference, HPC, HPC Software | Leave a comment

Podcast: Hot Interconnects Conference Seeks Your Papers on Datacenter, Virtualization, and Cloud Networking

In this audio podcast, Patrick Geoffray and Torsten Hoefler from the Hot Interconnects Conference lay out their final Call for Papers.

Conference themes include cross-cutting issues spanning computer systems, networking technologies, and communication protocols for high-performance interconnection networks. This conference is directed particularly at new and exciting technology and product innovations in these areas. Contributions should focus on real experimental systems, prototypes, or leading-edge products and their performance evaluation.

Submissions are due May 20 * Download the MP3 * Subscribe on iTunes * If Dropbox is blocked, download from this Google page.

Also posted in Hot Interconnects, HPC, HPC Hardware, Network | Leave a comment

Gearing Up for GTC 2012

In the first of a series of live posts from GTC 2012 in San Jose, Dan Olds from Gabriel Consulting kicks off our special coverage of the GPU Technology Conference.

Next week’s GPU Technology Conference, organized by NVIDIA, promises to again be the best vendor event in the industry.


Instead of trotting out customers to attest to the vendor’s greatness for 2.5 minutes, NVIDIA focuses the content of GTC on what customers are doing, what challenges they’re facing, and how they’re innovating with GPUs. Non-NVIDIA researchers and practitioners actually lead sessions, and they get down to the nitty-gritty of their projects. There is a distinct (and refreshing) lack of marketing gloss.

The Tuesday keynote by Jen-Hsun Huang is not to be missed by anyone interested in the next big, big advance in hybrid computing. Read the Full Story.

Also posted in GPUs, GTC - GPU Technology Conference, HPC, HPC Hardware | Leave a comment

Video: Best Practices for Scalable Administration of Lustre

In this video, Blake Caldwell from ORNL presents: Best Practices for Scalable Administration of Lustre. Recorded at LUG 2012 in Austin.

Note: Most of the videos from LUG 2012 are now posted at the OpenSFS site.

Also posted in HPC, HPC Software, LUG 2012, Video | Leave a comment

Video: Appro Supercomputer Solutions

In this video, Steve Lyness from Appro presents: Appro Supercomputer Solutions.

Abstract:
To survive in an ever-changing global environment, creating and delivering innovative products and services are what give any business the competitive edge in today’s global markets. In this presentation, you will learn how Appro, a US-based High Performance Computing company, met the supercomputing requirements of the University of Tsukuba Center for Computational Sciences in Japan. Learn how reliability, availability, manageability and compatibility were essential for the successful 800TF hybrid supercomputing implementation. Learn best practices on improving data I/O performance and memory size limitations configured with Lustre™ File System to offer the best performance per dollar with excellent memory capacity per FLOP. Explore how the University of Tsukuba’s Appro Xtreme-X™ Supercomputer is accelerating large scale parallel code by combining CPU/GPU processing cluster configurations and how this implementation will be used as a pioneer for a competitive advantage for future exascale computing systems.

Recorded at the 2012 National HPCC Conference in Newport.

Also posted in Compute, HPC, HPC Hardware, National HPCC Conference, Video | Leave a comment


insideHPC.com is a production of insideHPC, LLC. © 2006-2011