Entries filed under “HPC Hardware”

Hardware news and announcements in technologies related to HPC.

Intel’s Future Haswell Processor to Feature Transactional Synchronization

Intel’s James Reinders writes that the company will be introducing new Transactional Synchronization Extensions (TSX) for the future 22 nm multicore processor code-named “Haswell”. In a nutshell, Intel TSX provides a set of instruction set extensions that allow programmers to specify regions of code for transactional synchronization.

With transactional synchronization, the hardware can determine dynamically whether threads need to serialize through lock-protected critical sections, and perform serialization only when required. This lets the processor expose and exploit concurrency that would otherwise be hidden due to dynamically unnecessary synchronization.
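TSX itself is a set of processor instructions, so it cannot be shown portably in a few lines. As a rough software analogy only (this is not Intel’s interface, and the class below is hypothetical), the Python sketch mimics the control flow the paragraph describes. Each thread does its work speculatively and checks whether another thread committed a conflicting update in the meantime. On a conflict it retries, and after a few failed attempts it falls back to ordinary serialization through the lock. Python has no hardware transactions, so the commit step here still takes the lock briefly. The sketch shows the logic, not the performance benefit.

```python
import threading


class OptimisticCounter:
    """Hypothetical software analogy of lock elision; this is not Intel TSX.

    Each update is computed speculatively, then validated against a version
    number. Conflicts trigger a retry, and repeated conflicts fall back to
    plain serialization through the lock.
    """

    def __init__(self, max_retries=3):
        self._lock = threading.Lock()
        self._version = 0        # bumped on every committed update
        self.value = 0
        self.max_retries = max_retries
        self.fallbacks = 0       # times the serialized fallback path ran

    def increment(self):
        for _ in range(self.max_retries):
            seen = self._version
            proposed = self.value + 1          # speculative work, lock not held
            with self._lock:                   # brief validate-and-commit step
                if self._version == seen:      # no conflicting commit happened
                    self.value = proposed
                    self._version += 1
                    return
            # A conflict was detected: discard the speculative result and retry.
        with self._lock:                       # fallback: ordinary serialization
            self.fallbacks += 1
            self.value += 1
            self._version += 1


def run(threads=4, per_thread=1000):
    counter = OptimisticCounter()

    def worker():
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter


if __name__ == "__main__":
    c = run()
    print(c.value, c.fallbacks)  # value is always threads * per_thread
```

Whatever the interleaving, no update is lost. Speculative results are thrown away whenever the version changed underneath them, which is the same guarantee the hardware provides by aborting a conflicting transaction.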

Read the Full Story or download the updated specifications.

Also posted in Compute, HPC, HPC Software

Video: ORNL – Advancing Research and Science through Supercomputing

In this video, Richard Graham from Oak Ridge National Laboratory presents: Advancing Research and Science through Supercomputing. Recorded at the HPC Advisory Council Israel Supercomputing Conference on Feb 7, 2012 in Tel Aviv.

Presentations will soon be available from the conference site.

Also posted in Compute, Events, HPC, HPC Advisory Council Workshop, Network, Video

Slidecast: Solarflare ApplicationOnload Engine for On-the-Fly Processing of Network Data

In this slidecast, Mike Smith from Solarflare describes the company’s ApplicationOnload Engine (AOE), a new platform that moves application processing into the network adapter for applications that rely on real-time, high-performance network data.

“Our new ApplicationOnload Engine is a new class of product that results directly from interaction with our end-user customers. Our engineers have worked closely with these customers to create a platform that leverages OpenOnload’s proven framework for creating a direct path from applications to the network, and incorporates on-the-fly processing of real-time network data,” said Russell Stern, CEO at Solarflare. “This solution provides not only the lowest latency and highest message rate network I/O performance, but achieves an unparalleled boost in application performance, all while maintaining a seamless, compatible interface with our existing server adapter products.”

Solarflare’s AOE combines a fully featured 10GbE server adapter with a state-of-the-art FPGA that provides a seamless, low-latency network interface to the host server and application processing. According to Smith, AOE is an open platform that utilizes applications developed by Solarflare, its customers, and third-party developers.

Read the Full Story * Download the MP3 * Subscribe on iTunes * If Dropbox is blocked, download from this Google page.

Also posted in HPC, HPC Software, Network, Video

AMD Doubles Down on Existing Opteron Server Sockets

By Timothy Prickett Morgan

As El Reg anticipated earlier this week, the new upper management at AMD has come to its senses and figured out that moving to a new core and two new sockets for its Opteron line in 2012 was not a particularly good idea for its own finances, or for those of the server makers it wants peddling Opteron-based iron. And so, that plan has been scrapped.

Instead, AMD is going to field new 32 nanometer processors based on the forthcoming “Piledriver” core design and jam them into the same G34 and C32 sockets, meaning that HP, Dell, Super Micro, IBM, Acer, and a handful of other box makers will not have to engineer new motherboards and systems.

AMD CEO Rory Read, formerly of IBM and Lenovo, spoke at the company’s analyst day in Silicon Gulch on Thursday and said that “proprietary control points” were breaking down and that AMD was chasing “inflection points” in the PC, tablet, and server spaces. He explained that AMD would bring its expertise in CPU and GPU design together to craft system-on-chip (SoC) products that will, presumably, also integrate network and other types of I/O directly on the chip.

“Shift happens, shift is good,” Read stated emphatically, and with a straight face, adding that AMD was being tweaked to become a “market driven company” and not second fiddle in an “unhealthy duopoly.” The task Read sees ahead for AMD is “about stepping out of the shadows and leading.”

But, according to Read and Lisa Su (a former semiconductor researcher at IBM and former CTO at Freescale Semiconductor, who was hired back in December to be senior vice president and general manager of the new Global Business Units), what AMD needs to do right now in servers is to step back, ramp up production of Opteron 4200 and 6200 processors, and rebuild and extend relationships with server makers as it plots out its future Opteron chips.

Sticking with the existing C32 sockets for the Opteron 4200s and the G34 sockets for the Opteron 6200s is just part of listening to the customer. It also gives AMD some engineering breathing time to come up with interesting, low-power Opteron platforms that are tailored specifically for hyperscale Web, big data, server virtualization, database, and similar workloads where AMD’s Opterons do well.

“Server is a great opportunity for us, and it is clear that our market share is not very high today,” conceded Su. But she also said that the “Bulldozer” core, with its different architecture, takes time to find its footing. Considering this, introducing new sockets right now was a bad idea technically and economically for both AMD and server makers. “At the end of the day, that wasn’t the right answer for our customers,” Su said.

Back in November 2010, two months before CEO Dirk Meyer was ousted, the plan was to crank up the Opteron 6200s to 20 cores using the new Piledriver core, an improved version of the current “Bulldozer” core used in the Opteron 4200 and 6200 server processors as well as a number of desktop chips.

The plan called for the “Sepang” processor to have up to ten Piledriver cores and plug into a new socket succeeding the C32 sockets, which are used to make servers with one or two sockets across a single memory space. The “Terramar” Opteron chip was the kicker to the current Opteron 6200 and would put two of these Sepang chips in a single package and scale it up to 20 cores per socket. Both of these chips were to be implemented in the 32 nanometer silicon-on-insulator (SOI) processes from fab partner GlobalFoundries.

A year later, with microservers taking off (at least in terms of marketing hype), AMD announced that it would chase microserver builders with a new single-socket Opteron 3000 chip, code-named “Zurich,” that plugs into the AM3+ socket. The Zurich chip is a variant of the Opteron 4200 with four or eight cores activated, one HyperTransport link, and – most importantly – availability in less expensive motherboards.

The Zurich chip, presumably to be called the Opteron 3200, was expected sometime in the first half of 2012 when AMD was talking about it last fall, but it is now going to be launched in the first quarter, as you can see in the roadmap below:

[Figure: AMD’s revised Opteron server roadmap]

For larger Opteron systems, AMD is taking a conservative approach. Rather than adding two more cores to the basic Opteron processor unit, the new “Seoul” processor keeps the core count at six or eight as the new Piledriver core is brought in. The DDR3 main memory stays the same – two channels per socket – as with the current Opteron 4200s, and the chips will not include any additional on-chip I/O, such as the PCI-Express 3.0 links that Intel is putting on its forthcoming “Sandy Bridge” family of Xeon E5 processors for machines with one, two, or four sockets.

The high-end “Abu Dhabi” Opterons will have 4, 8, 12, and 16 Piledriver cores, the same core counts as the Opteron 6200s that started shipping last summer, and will sport the same four channels of DDR3 memory per socket.
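Channel count sets each socket’s theoretical peak memory bandwidth. A back-of-the-envelope sketch follows. It assumes DDR3-1600 DIMMs on both platforms (an assumption, since the article does not specify memory speeds) and the standard 64-bit (8-byte) DDR3 channel. The function name is hypothetical.

```python
def peak_ddr3_gbs(channels, megatransfers_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth of one socket in GB/s (10**9 bytes/s).

    Each DDR3 channel is 64 bits (8 bytes) wide and moves one 8-byte word
    per transfer, so peak = channels * 8 bytes * transfer rate.
    """
    return channels * bytes_per_transfer * megatransfers_per_s * 1e6 / 1e9


# DDR3-1600 (1,600 MT/s) is an assumed speed for illustration only.
seoul = peak_ddr3_gbs(channels=2, megatransfers_per_s=1600)       # two channels
abu_dhabi = peak_ddr3_gbs(channels=4, megatransfers_per_s=1600)   # four channels
print(f"Seoul: {seoul:.1f} GB/s per socket, Abu Dhabi: {abu_dhabi:.1f} GB/s per socket")
```

At the same DIMM speed, doubling the channel count doubles the theoretical peak. Sustained bandwidth, such as what the STREAM benchmark measures, comes in below these figures.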

You’ll notice that AMD is not talking about how many HyperTransport links will be on these future Piledriver-based Opterons or what speed they will run at, so it is fair to conjecture that they will run at a faster rate: 8 GT/sec sounds reasonable to match the expected 25 per cent increase in raw performance that AMD was promising for Piledriver cores in desktop processors.
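Here is the arithmetic behind that conjecture. It assumes that today’s Opteron 6200 HyperTransport 3 links run at 6.4 GT/sec, which is our assumption and not something the article states:

```latex
6.4\,\text{GT/s} \times 1.25 = 8.0\,\text{GT/s}
```

A link that is 25 per cent faster would therefore land exactly on 8 GT/sec.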

AMD is also expecting to put out a kicker for the Opteron 3200, dubbed “Delhi” and offering four or eight Piledriver cores.

All of the new Opterons will be etched in GlobalFoundries’ 32 nanometer processes, just like the current ones are. On the desktop processor roadmaps that Su went over, the chips for 2012 and those for 2013 were clearly marked. Not so on the server chip roadmaps, but we placed a call to AMD and were told by a spokesman that all of the chips above will be coming out this year. The Abu Dhabi and Seoul Opterons are due towards the end of the year.

The big change, according to new AMD CTO Mark Papermaster, formerly of IBM, Apple, and Cisco Systems, was that AMD was shifting away from a design philosophy that focused on the performance of processor cores and adopted bleeding-edge process technology from GlobalFoundries or Taiwan Semiconductor Manufacturing Corp to try to compensate for the process lag that AMD (and everyone else) has with Intel.

This led to execution problems. More importantly, Papermaster said that the company’s current managers do not believe that the process technology node trumps integration of functions on an SoC and the “experience” that the user has using a device based on AMD silicon.

Su didn’t give out a lot of details on the future Piledriver cores, except to say that they would be able to execute more instructions per cycle and run at higher clock frequencies. Many had expected Bulldozer to do better on the clock speed front.

[Figure: AMD’s Opteron core roadmap]

Looking out further into the future, AMD is cooking up a third generation modular core called “Steamroller,” which would have a greater level of parallelism. This could mean a lot of different things, such as adding more threads or cores to the chip or adding more instruction units per core module. Su did not say, and it is likely that AMD is itself not quite sure what it means. And further out beyond that, AMD will crank out more performance in some unspecified way with a modular core design called “Excavator.”

It will be interesting to see what AMD integrates onto its server chips and how fast it can do it. In the meantime, Intel is going to make plenty of hay in the supercomputing market where there are workloads with heavy I/O demands because it can support PCI-Express 3.0 peripherals with the future Xeon E5 processors. It remains to be seen how much of an advantage this will be across the server market at large. ®

This article originally appeared in The Register. It appears here in its entirety as part of a cross-publishing agreement.

Also posted in Compute, HPC

Cray XE6m Midrange System Weighs in at $31K per Teraflop

While affordable petascale computing may be a ways off, this week Cray rolled out the Cray XE6m system, a midrange supercomputer that brings the petascale technologies being deployed at Blue Waters and Titan down to the rack level. With six blades and 48 sockets using the new Opteron 6200s, the Cray XE6m starts at $200K, or approximately $30,769 per teraflops.
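The price-per-teraflop figure is simple division, and working it backwards shows the peak performance it implies. The cross-check below assumes 16-core Opteron 6200 parts at 2.1 GHz, each core delivering 4 double-precision flops per cycle. These are assumptions, not Cray’s published configuration.

```python
def price_per_teraflop(price_usd, peak_tflops):
    """Dollars per peak teraflop."""
    return price_usd / peak_tflops


# Figures from the article: $200K entry price, about $30,769 per teraflop.
implied_peak_tflops = 200_000 / 30_769          # roughly 6.5 TF

# Independent cross-check under assumed (not article-stated) parts:
# 16-core Opteron 6200s at 2.1 GHz, 4 double-precision flops per core per cycle.
sockets, cores_per_socket, ghz, flops_per_cycle = 48, 16, 2.1, 4
estimated_peak_tflops = sockets * cores_per_socket * ghz * flops_per_cycle / 1000

print(round(implied_peak_tflops, 2))              # 6.5
print(round(estimated_peak_tflops, 2))            # 6.45
print(round(price_per_teraflop(200_000, 6.5)))    # 30769
```

Under those assumptions the two routes agree to within about one per cent, which suggests the quoted price is based on roughly 6.5 peak teraflops.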

Building on the reliability and scalability of the Cray XE6 supercomputer and using the same proven petascale technologies, the Cray XE6m system is optimized to support scalable application workloads in the midrange high performance computing (HPC) market, where applications require between 700 and 13,000 cores of processing power.

Read the Full Story.

Also posted in Compute, HPC

Video: EPFL Scientists Develop 3D Chips

EPFL scientists have developed a new generation of 3D computer chips that are stacked vertically rather than placed side by side. The technology may someday enable faster, higher-bandwidth processing.

EPFL scientists are among the leaders in the race to develop an industry-ready prototype of a 3D chip as well as a high-performance and reliable manufacturing method. The chip is composed of three or more processors that are stacked vertically and connected together, resulting in increased speed and multitasking, more memory and calculating power, better functionality and wireless connectivity.

Read the Full Story.

Also posted in Compute, Computing Research, HPC

SGI Takes InfiniteStorage to 2.37 PB Per Rack

This week SGI rolled out a new integrated server and storage platform with extremely high density. To provide up to 2.37 PB per rack, the SGI Modular InfiniteStorage platform uses an innovative chassis architecture based on modular drive bricks packed into 4U enclosures.

“The SGI Modular InfiniteStorage platform is designed to couple very dense storage and compute capabilities in an adaptable platform, to give cloud and other storage IT customers important new choices for tuning and growing the system to meet their specific requirements,” said Steve Conway, IDC research vice president for HPC. “The SGI Modular InfiniteStorage platform aims to allow IT managers to design customized solutions based on standards-based components.”

Read the Full Story.

Also posted in HPC, Storage

Video: AMD’s CTO Talks Heterogeneous Systems Architecture

In this video, AMD’s Joe Macri describes the company’s HSA architecture (formerly known as Fusion). Recorded at the 2012 DesignCon conference in Santa Clara.

“The architectural path for the future is clear,” Macri declared. That path will be paved with the programming patterns established on Symmetric Multi-Processor (SMP) systems migrating to the heterogeneous world. The architecture will be open, with published specifications and an open source execution software stack, and heterogeneous cores will be able to work together seamlessly in coherent memory, with low latency dispatch and no software fault lines.

A Tip of the Hat goes to Sylvie Barak at EE Times for pointing us to this video.

Also posted in Compute, GPUs, Video

Inaugural XSEDE12 Conference Issues Call for Participation

The XSEDE12 conference has issued its Call for Participation. The event will take place July 16-20 in Chicago.

The inaugural XSEDE12 conference will carry forward many of the successful elements of the TeraGrid conference series, while adding new features. XSEDE will showcase the discoveries, innovations, and achievements of those who use XSEDE resources and those who help build and support them. XSEDE12 also will create a forum for discussion of current needs and future plans among researchers, students, XSEDE staff, and NSF representatives.

Tutorial and Panel proposals are due April 13 and Paper and Poster submissions are due April 25.

Also posted in Events, HPC, Network

SeaMicro Packs 64 Quad-Core Xeons into 10U

Today SeaMicro got a lot of media attention with the launch of the “first fabric-based Intel Xeon micro server,” the SeaMicro SM10000-XE. While the company has been shipping Intel Atom-based servers for a while now, this unexpected move puts Sandy Bridge Xeons into the same highly dense form factor.

“Today we have announced the lowest-power, highest-density, highest-bandwidth Intel® Xeon®–based server ever built,” says Andrew Feldman, CEO of SeaMicro. “SeaMicro now brings the benefits of micro servers—efficiency and massive density—to small and larger-core workloads and to all parts of the scale out data center. Combining the SM10000 architecture with the Samsung Green DDR3 memory and Intel® Xeon® processors, SeaMicro now sets a new bar for energy efficient compute in the datacenter.”

So how was SeaMicro able to pull this off? Rachel King writes that it was a clever combination of partner technologies:

  • Intel’s Sandy Bridge architecture and Xeon processors
  • SeaMicro’s Freedom Fabric ASIC (optimized to work with large-core and small-core CPUs, shrinks the size of the motherboard to the size of a standard business card)
  • Samsung’s energy efficient Green DDR3 RAM (half the size of a standard memory module)

Before we get you too excited about HPC for this box, it is worth noting that the device has a shared-nothing architecture. But with the ability to support 1024 Xeon cores in a rack, the datacenter future is looking bright for SeaMicro.
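The rack-level core count follows from the headline figures. The calculation below assumes a standard 42U rack holding four 10U chassis. The rack size is our assumption.

```python
RACK_UNITS = 42           # assumed standard full-height rack
CHASSIS_UNITS = 10        # SM10000-XE height, per the headline
SOCKETS_PER_CHASSIS = 64  # 64 Xeons per chassis, per the headline
CORES_PER_SOCKET = 4      # quad-core Xeons

chassis_per_rack = RACK_UNITS // CHASSIS_UNITS                       # whole chassis only
cores_per_rack = chassis_per_rack * SOCKETS_PER_CHASSIS * CORES_PER_SOCKET
print(chassis_per_rack, cores_per_rack)   # 4 1024
```

Four chassis use 40U, and 4 × 64 × 4 reproduces the 1024-core figure.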

Read the Full Story.

Also posted in Compute

Slidecast: Nimbus E-Class Storage Launch Briefing

In this slidecast, Tom Isakovich from Nimbus Data Systems describes the company’s new high-availability E-Class Storage devices based on high-performance, high-density eMLC flash memory.

“The Nimbus E-Class sets a new standard for solid state storage scalability and operating cost economics,” stated Benjamin S. Woo, program vice president, worldwide storage systems at IDC. “Large enterprises and cloud providers must consider the significant infrastructure consolidation possible with all-flash storage systems. By providing both innovative hardware and comprehensive software, Nimbus is well-positioned to not only capitalize on the need for high-performance systems but also the significantly greater trend towards primary storage based exclusively on solid state technology.”

Read the Full Story.

Download the MP3 * Subscribe on iTunes * If Dropbox is blocked, download from this Google page.

Also posted in HPC, Storage, Video

Intel Brings Bigger Guns to AMD Server Chip War

By Timothy Prickett Morgan

Analysis

If you want to get into the server processor racket, here’s some advice: Don’t bring a knife to a gun fight. And when you whip out your guns, you better have a piece stashed in each of your boots, maybe another high-caliber rifle on your back, and a few knives while you are at it for price-cutting when the bullets run out.

With Intel getting ready to launch its “Sandy Bridge” Xeon E5 processors in March and revving up its 22 nanometer processes to eventually field “Ivy Bridge” kickers, Advanced Micro Devices is going to have to engineer some pretty impressive new Opteron server chips. It’ll have to cook up those chips pretty sharpish, in conjunction with its wafer-baking partners, if it hopes to hold the hard-fought ground it has attained in high performance computing and server virtualization, much less gain ground in the ongoing x86 server chip war.

Everybody loves an underdog and most people like to see a bully take one on the chin and go down to his knees. So a lot of companies were rooting for AMD as it was designing the Opteron processors and trying to build an ecosystem of server vendors who would peddle machines based on them in the early and middle 2000s.

Back in the early 2000s, Intel was trying to protect its high-end 64-bit Itanium server business and push its Xeon processors down into the 32-bit volume server space, and AMD brilliantly shot the gap between the Xeon and Itanium to create the 64-bit Opterons, eventually pushing its server market share as high as 25 per cent.

But it has been a long time since x86 server chip juggernaut Intel was hammered – SledgeHammered, to be specific – by longtime rival AMD with its 64-bit, low-power, multicore Opteron processors. Intel shifted to the Core microarchitecture, added 64-bit memory addressing and processing plus a slew of key features such as the QuickPath Interconnect to its Xeon processors, and hit back hard against the Opteron upstart. The “Nehalem” Xeon architecture announced in 2009 had everything that Opterons had, and when the Great Recession hit just in the wake of yet another Opteron delay, server makers put most of their effort into building Xeon war machines, not Opteron battlewagons, and AMD has been losing ground ever since.

Because server chip profits help pay the bills at Intel, AMD, IBM, Oracle, and Fujitsu, the loss of market share by AMD is one of the key reasons why CEO Dirk Meyer resigned in January 2011. In hindsight, we can also see that Meyer and the bulk of the management team that handles chip development and manufacturing have been replaced since new CEO Rory Read came aboard last July. AMD has a new CTO – Mark Papermaster, formerly of IBM, Apple, and Cisco Systems – and has replaced its former marketing, products, and operations bosses, and has tapped ex-Intel engineer Rajan Naik as senior vice president and chief strategy officer.

So, AMD is no doubt drawing up new war plans for the x86 server battlefield, but the company has not said much to date about its plans. Perhaps it will enlighten us during its Analyst Day this week. But we can conjecture about what AMD might do by looking at what Intel is about to do in the x86 racket.

A Sandy Bridge not too far

While Intel never publicly promised that the “Sandy Bridge-EP” Xeon E5 processors would launch last fall for shipments in the fourth quarter, the circumstantial evidence – and comments from motherboard and server makers like Super Micro – indicate that this was indeed the plan. But with AMD having its own issues shipping its “Interlagos” Opteron 6200 processors for two-socket and four-socket servers and its “Valencia” Opteron 4200s for single-socket and dual-socket machines, Intel did not have to rush to market. (The speculation is that a SAS controller bug similar to the one in the C200 chipset that delayed the launch of “Sandy Bridge-DT” E3 processors and various PC chips of similar design has been found in the “Patsburg” C600 chipset for the Xeon E5s. Intel has not confirmed this.) Frankly, with Intel turning in the best fourth quarter and fiscal year in its history, in terms of profits and revenues, as 2011 came to a close, despite a PC slowdown and whatever issues stalled the Xeon E5s, it is hard to argue that Intel made the wrong call.

Chip happens

Intel is just starting to talk to press and analysts under embargo this week about the forthcoming Xeon E5s, and it is no coincidence that it is doing so just ahead of AMD’s Analyst Day. (El Reg is reporting this to you from coach on a Delta flight to Portland, Oregon, ahead of a briefing by Intel from its Beaverton chip and server development labs.)

As El Reg exclusively disclosed last May, the plan with the Xeon E5s is to take what would have normally been a chip for general-purpose two-socket workhorses and split the line into multiple processor and chipset variants to address very precise market segments. This is, of course, what AMD did two years ago when it created two different two-socket server families: the Opteron 4100s – which could also scale down to single socket machines aimed at small, power-sensitive workloads – and the Opteron 6100s, which could scale up to four processor sockets.

Anything AMD can do, Intel can do. (The market decides if Intel can do it better, or at least well enough to allow IT managers to fall back on the “nobody ever got fired for buying Intel” insurance policy.)

Intel is actually cutting its server market into eight pieces with the Xeon E5 launch. That’s Itanium 9300s and Xeon 7500s and E7s at the high end (and eventually the “Sandy Bridge-EX” E8s). That’s two segments of the market that share chipsets and memory cards but have different motherboards and sockets, at least until Intel finally delivers the long-promised common Xeon-Itanium socket, which is rumored to be in the works. That could happen with the E8s, but it is far more likely to happen with the “Ivy Bridge-EX” Xeon E9s years hence. At the low end, there are the single-socket Xeon E3 and Atom processors, depending on how wimpy or brawny your workload is. That’s four addressable server segments in total.

The Xeon E5s will also span four different server types and will cover the middle and overlap with the high and low ends. The Xeon E5-2600, as the first of the “Romley” server platforms are expected to be called, will use the “EP” variant of the Xeon E5 chip that plugs into the new “Socket R” CPU socket. This socket is not compatible with the current Xeon 5500 and 5600 processors, but has all sorts of goodies, including two QPI links between the processors, support for unregistered, registered, and load-reduced (LR) DDR3 main memory, and integrated PCI-Express 3.0 controllers on the processor. This is the chip that Intel has presumably been shipping under NDA to selected supercomputer and hyperscale data center customers since last fall. This chip is clearly aimed at two-socket Opteron 6200 machines.

For two-socket machines that don’t need all of these capabilities, Intel is expected to roll out its “Sandy Bridge-EN” chips, rumored to be called the Xeon E5-2400s. These chips will plug into the new “Socket B2” socket and will sport only one QPI link between processors as well as fewer memory channels, fewer DIMMs per channel, and fewer PCI-Express 3.0 slots. This chip is fired directly at two-socket Opteron 4200 iron.

If the rumors are right, then Intel will also ship a variant of the Sandy Bridge-EP chip that will be able to span four processor sockets in a single system image. This chip is expected to be called the Xeon E5-4600 and is obviously targeting the four-socket Opteron 6200.

And finally, Intel will field a Xeon E5-1600 chip, aimed at single-socket servers and workstations and based on the Sandy Bridge-EN chip that will zero in on single-socket Opteron 4200 servers and whatever plans AMD has to revive its single-socket server biz with the Opteron 3000 series, which it said it was working on back in November. The first Opteron 3000 chip, code-named “Zurich” and presumably to be named the Opteron 3200 to be consistent with the 2012 series of Opteron processors, is basically a cut-down Opteron 4200 with six or eight cores that will plug into an AM3+ socket instead of a C32 socket.

In any event, Intel appears to be looking to chase the microserver segment with the Xeon E5-1600, just as AMD is looking to pursue it with the Opteron 4200 and 3200 chips. The word on the street is that the Xeon E5-1600 will plug into the Socket R socket, but it would make more sense for it to use the lower-cost Socket B2 socket.

Should all of this come to pass in 2012, it is safe to say that Intel has a weapon to match everything that AMD can throw at it, and then some. AMD has only one flavor of four-socket machine; Intel has three if you count Itanium. AMD has only two kinds of single-socket boxes it can bring into the field; Intel has three if you count Atom. AMD has two two-socket boxes; Intel has four if you count Itanium.

It’s Hammer time, again

It must have been such fun to run AMD when Intel’s server and PC chips were misaligned with the market needs. It must be daunting to come into work every day at AMD and see the lead in process technology, cash, clout, and chip and market coverage that Intel currently has not just over AMD, but over anyone who is making processors for anything larger than a smartphone or tablet.

AMD has been clever in a lot of ways to survive the Intel onslaught despite being behind in process technology. With the Opteron 4100s and 6100s, the company had to do its own full platforms – chipsets and processors – for the first time, which is a lot of change to manage all at once. Moreover, with the Opteron 6200s, AMD took its eight-way server architecture, beefed it up with more and faster HyperTransport links across the CPU sockets, and then double-stuffed six-core processors into a single socket and convinced the software vendors of the world that this was indeed a four-socket, rather than an eight-socket, machine. For systems and application software that is priced per socket, this little maneuver cuts software fees in half.
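Here is the licensing arithmetic. Take a hypothetical per-socket fee; the dollar figure below is invented purely for illustration. A four-package server whose packages each hold two dies then costs half as much to license when the vendor counts packages rather than dies.

```python
def license_cost(counted_sockets, fee_per_socket):
    """Total license bill for software priced per socket."""
    return counted_sockets * fee_per_socket


FEE_PER_SOCKET = 10_000   # hypothetical list price, in dollars
PACKAGES = 4              # four Opteron packages in the server
DIES_PER_PACKAGE = 2      # two processor dies share each package

billed_per_package = license_cost(PACKAGES, FEE_PER_SOCKET)
billed_per_die = license_cost(PACKAGES * DIES_PER_PACKAGE, FEE_PER_SOCKET)
print(billed_per_package, billed_per_die)   # 40000 80000
```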

AMD has also been winning the core count skirmish against Intel and positioning its two-core “Bulldozer” module used in the Opteron 4200s and 6200s as two strong physical threads against Intel’s weaker HyperThreaded cores. However, with a shared scheduler, on workloads that make heavy use of 256-bit floating point instructions, half of the 16 cores in an Opteron 6200 will often sit idle and the net effect is that the performance should be about the same as the forthcoming Xeon E5 with eight cores running 256-bit floating point. AMD has two stronger cores, but only if you want to do 128-bit math or integer work.
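Peak double-precision operations per clock make the comparison concrete. The per-unit figures below are assumptions about the two microarchitectures, not numbers from this article. Each Bulldozer module is taken to have two 128-bit fused multiply-add pipes shared by its two cores. Each Sandy Bridge core is taken to have one 256-bit add unit and one 256-bit multiply unit. Clock speeds are ignored.

```python
DP_BITS = 64  # one double-precision value


def dp_lanes(vector_bits):
    """Double-precision values that fit in one vector of the given width."""
    return vector_bits // DP_BITS


# Assumed Bulldozer module: two 128-bit FMA pipes shared by the module's two
# cores; a fused multiply-add counts as two floating-point operations per lane.
bulldozer_module = 2 * dp_lanes(128) * 2          # 8 DP flops per cycle
opteron_6200 = 8 * bulldozer_module               # 16 cores = 8 modules

# Assumed Sandy Bridge core: one 256-bit add plus one 256-bit multiply per cycle.
sandy_bridge_core = dp_lanes(256) + dp_lanes(256)  # 8 DP flops per cycle
xeon_e5 = 8 * sandy_bridge_core                    # eight cores

print(opteron_6200, xeon_e5)  # 64 64
```

Under these assumptions, a 256-bit instruction keeps a whole module's shared floating-point unit busy, so the 16-core Opteron behaves like eight floating-point engines. At equal clocks, both sockets peak at the same per-cycle figure.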

So what is AMD to do?

Go back to the drawing board and exploit whatever weaknesses it can find in Intel’s armor, just as always. Or, start a fight on a new battlefield where Intel is not going to be so strong.

Back in November 2010, two months before the management shakeup at AMD, the company said that its plan for this year was to bring out replacements for the C32 socket used for Opteron 4100 and 4200 processors and the G34 socket used with Opteron 6100 and 6200 processors.

The plan calls for the high-end Opterons, code-named “Terramar” and presumably called the Opteron 6300, to have 20 cores based on a next-generation Bulldozer core, code-named “Piledriver”. The low-end will get the “Sepang” Opteron 4300, a ten-core chip that is essentially what gets double-stuffed into a socket to make the Terramar chip package. Rumor has it that AMD will boost memory capacity with these forthcoming Opterons as well as support PCI-Express 3.0 peripherals. The Terramar and Sepang chips will be etched in the 32 nanometer processes used by GlobalFoundries, AMD’s spun-out former chip manufacturing operation.

Presumably there is a process shrink to 28 nanometers to boost clock speed and therefore single-threaded application performance of these Opteron 4300 and 6300 chips in the works, but AMD has not said yet and will no doubt lay out its plans at Analyst Day this week.

As was the case during the Great Recession, now would be a particularly bad time for AMD to force a socket transition onto its smaller band of server customers, and the new management at AMD must be looking pretty hard at that roadmap, wondering if they can change as little as possible now to buy time to do a lot more radical engineering for the future.

If I were running AMD, I would be looking very hard at that “Bobcat” core that is the alternative to Intel’s Atom and start thinking about servers, and also go back and look at the “Trinity” low-power Fusion chip, which is based on the Bulldozer cores.

When AMD was kicking Intel in the chips in the mid-2000s, Chipzilla relatively quickly (okay, it took years) shifted over to the Core laptop chip architecture for its PCs and servers and not only saved its chip business, but blunted the AMD attack. Intel has copied most of the ideas that made the Opteron better or different and is now using its wafer-baking process technology and its ability to set market prices to force AMD to compete mostly on lower price for roughly equivalent performance and features.

This is not an enviable position to be in for AMD, obviously. But there’s always the ARM option, and AMD could do something radical like buy Applied Micro or Calxeda and turn the x86 chip war into a two-front war for Intel to have to fight. ®

This article originally appeared in The Register. It appears here in its entirety as part of a cross-publishing agreement.

Also posted in Compute, HPC

New Whitepaper: Boost RAM Bandwidth by 20% with a Single Command

Colfax International has published a new whitepaper by Stanford’s Andrey Vladimirov entitled: Terabyte RAM Servers: Memory Bandwidth Benchmark and How to Boost RAM Bandwidth by 20% with a Single Command.

Colfax International produces servers capable of supporting up to 1 TB of RAM and up to 4 Intel Xeon CPUs. This paper reports the memory bandwidth benchmark of these servers obtained using the STREAM code. Our benchmark includes comprehensive statistical data: the mean, standard deviation, extrema and the distribution of bandwidth measurements. The distribution of measurements reveals several modes of RAM performance, including an above-average bandwidth mode. By default, the mode realized by any given benchmark depends on an unpredictable runtime pattern of thread and memory binding to the physical cores. The paper shows how to optimize memory traffic for bandwidth and consistently achieve the fastest mode. This is done by controlling the code’s thread affinity, and results in a bandwidth increase around 20% over the average unoptimized performance.
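The whitepaper’s code and its single command are not reproduced here. As a minimal sketch of the bookkeeping such a study involves, the Python below does three things. It converts a timed STREAM kernel into bandwidth using STREAM’s byte-counting convention: 8-byte doubles, with two arrays touched for copy and scale and three for add and triad. It summarizes repeated runs with the statistics the abstract lists. And it expresses an optimized result as a gain over the unoptimized mean. The function names are hypothetical.

```python
import statistics

# Bytes STREAM counts per array element for each kernel (8-byte doubles):
# copy a=b and scale a=q*b touch two arrays; add a=b+c and triad a=b+q*c touch three.
BYTES_PER_ELEMENT = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def bandwidth_gbs(kernel, n_elements, seconds):
    """Bandwidth in GB/s (10**9 bytes/s) for one timed kernel run."""
    return BYTES_PER_ELEMENT[kernel] * n_elements / seconds / 1e9


def summarize(samples):
    """Mean, sample standard deviation, and extrema of repeated measurements."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),   # needs at least two samples
        "min": min(samples),
        "max": max(samples),
    }


def gain_over_mean(samples, optimized):
    """Relative improvement of an optimized result over the unoptimized mean."""
    return optimized / statistics.mean(samples) - 1.0
```

A gain_over_mean result of 0.2 corresponds to the roughly 20 per cent improvement the paper reports. Looking at the full distribution of samples, rather than only the mean, is what exposes the separate performance modes the abstract describes.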

Download the whitepaper (PDF).

Also posted in Compute, Computing Research, HPC | Leave a comment

Interview: Nvidia Updates CUDA Platform to 4.1

This week Nvidia announced the latest update to its CUDA platform for parallel computing. To learn more, I caught up with Will Ramey, Nvidia’s Sr. Product Manager for GPU Computing.

insideHPC: When we talk about a new CUDA platform, are we talking about the CUDA Toolkit plus SDK? Does this new update have a version number?

Will Ramey: Yes, this release is a new version of the CUDA Toolkit and SDK code samples, as well as updated drivers. The version number for this release is 4.1.

insideHPC: Specifically, what components comprise the platform?

Will Ramey: There are 3 key components to this release (version 4.1):

  1. The CUDA Toolkit is a comprehensive development environment for C and C++ developers building GPU-accelerated applications. Version 4.1 of the CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing application performance. You’ll also find programming guides, user manuals, API reference, and other documentation to help programmers add GPU acceleration to their applications quickly. More info at: http://developer.nvidia.com/cuda-toolkit
  2. The CUDA Driver provides a system-level interface for CUDA applications to communicate with the GPUs, and is included in the NVIDIA drivers installer.
  3. NVIDIA also provides an SDK with over 100 GPU computing code samples, as well as white papers to help developers quickly add GPU acceleration to their applications. More info at: http://developer.nvidia.com/gpu-computing-sdk

Developers need to install the CUDA Toolkit to build CUDA applications, and the latest NVIDIA drivers so their applications can communicate with the GPUs in their system.  Developers can also choose to install the SDK code samples to learn from the large collection of examples.

To run CUDA applications, end-users only need to install the latest NVIDIA drivers.

insideHPC: What is new within the updated platform?

Will Ramey: In addition to the new LLVM-based compiler that delivers up to 10 percent faster performance, there are a number of significant new features in this release:

  • New & Improved “drop-in” acceleration with GPU-Accelerated Libraries
    • Over 1000 new image processing functions in the NPP library
    • New cuSPARSE tri-diagonal solver up to 10x faster than MKL on a 6-core CPU
    • New support in cuRAND for MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms
    • Bessel functions now supported in the CUDA standard Math library
    • Up to 2x faster sparse matrix-vector multiply using the ELL hybrid format
  • Enhanced & Redesigned Developer Tools
    • Redesigned Visual Profiler with automated performance analysis and expert guidance system
    • CUDA-GDB support for multi-context debugging and assert() in device code
    • CUDA-MEMCHECK now detects out-of-bounds access for memory allocated in device code
    • Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
    • Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls and instruction throughput
  • Advanced Programming Features
    • Access to 3D surfaces and cube maps from device code
    • Enhanced no-copy pinning of system memory: cudaHostRegister() alignment and size restrictions removed
    • Peer-to-peer communication between processes
    • nvidia-smi support for resetting a GPU without rebooting the system
  • New & Improved SDK Code Samples
    • simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
    • New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts
    • New samples showing how to implement the Horn-Schunck Method for optical flow, perform volume filtering, and read cube map textures
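The ELL hybrid format named in the library list above pairs two storage schemes. The ELL part holds up to K nonzeros per row in fixed-width arrays, padded where a row has fewer. Any extra entries in long rows go into a coordinate (COO) list. Most rows therefore get a regular, predictable access pattern. The pure-Python sketch below shows that layout and the resulting matrix-vector product on the CPU. The width K, the padding convention (column index -1) and the function names are illustrative assumptions, not the cuSPARSE implementation. On a GPU the ELL arrays are normally stored column-major so that threads handling consecutive rows read consecutive memory; the sketch keeps them as per-row lists for readability.

```python
def to_hyb(rows, k):
    """Split a sparse matrix, given as per-row lists of (col, value), into HYB parts.

    ELL part: k slots per row, padded with column -1 and value 0.0.
    COO part: (row, col, value) triples for entries beyond the first k of a row.
    """
    ell_cols = [[-1] * k for _ in rows]
    ell_vals = [[0.0] * k for _ in rows]
    coo = []
    for r, entries in enumerate(rows):
        for slot, (c, v) in enumerate(entries):
            if slot < k:
                ell_cols[r][slot] = c
                ell_vals[r][slot] = v
            else:
                coo.append((r, c, v))
    return ell_cols, ell_vals, coo


def hyb_spmv(ell_cols, ell_vals, coo, x):
    """Compute y = A x from the ELL and COO parts of a HYB matrix."""
    y = [0.0] * len(ell_cols)
    # ELL part: every row has the same number of slots, which is what lets a
    # GPU assign one thread per row with uniform work.
    for r, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        for c, v in zip(cols, vals):
            if c >= 0:  # skip padding slots
                y[r] += v * x[c]
    # COO part: the irregular leftovers, accumulated entry by entry.
    for r, c, v in coo:
        y[r] += v * x[c]
    return y


# 3x3 example: row 1 has three nonzeros, so with k = 2 one entry spills to COO.
A = [[(0, 4.0)], [(0, 1.0), (1, 2.0), (2, 3.0)], [(2, 5.0)]]
ell_cols, ell_vals, coo = to_hyb(A, k=2)
print(hyb_spmv(ell_cols, ell_vals, coo, [1.0, 1.0, 1.0]))  # [4.0, 6.0, 5.0]
```

K controls the trade-off between the two parts. A larger K wastes more padding on short rows, while a smaller K pushes more work into the less regular COO part.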

insideHPC: How do the new components ease code development?

Will Ramey: The new LLVM-based compiler compiles code faster than the old compiler, increasing developer productivity. As you might expect, the compile time saved varies by application, but we’ve seen some large applications compile more than 60 minutes faster than with the old compiler.

The NVIDIA Visual Profiler has been completely re-designed to streamline developers’ performance analysis workflow.  The new automated performance analysis feature quickly identifies bottlenecks and opportunities to improve application performance, and is integrated with the “Best Practices” documentation guiding developers through the process of optimizing their applications.  Developers can now achieve the full potential of GPU acceleration in their application with significantly less effort.

The new image & signal processing functions in NPP make it easier for more developers to accelerate more of their algorithms on the GPU.

The new tri-diagonal solver in cuSPARSE allows developers to just call the pre-optimized version in the library instead of having to write their own.
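Here is what "writing their own" would typically mean on a CPU: a tridiagonal system is usually solved there with the Thomas algorithm, a specialized Gaussian elimination that runs in O(n) time. The Python sketch below shows that serial algorithm, purely to illustrate what a library tri-diagonal solver computes. It is not how cuSPARSE solves the problem internally; GPU tridiagonal solvers generally use more parallel schemes such as cyclic reduction. The function name and argument layout are assumptions. The sketch also assumes a matrix that is safe to solve without pivoting, for example a diagonally dominant one.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side; all of length n.
    """
    n = len(b)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    # Forward sweep: eliminate the sub-diagonal row by row.
    c_prime[0] = c[0] / b[0]
    d_prime[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * c_prime[i - 1]
        c_prime[i] = c[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (d[i] - a[i] * d_prime[i - 1]) / denom
    # Back substitution from the last unknown upward.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# 1-D Poisson-style matrix [[2,-1,0],[-1,2,-1],[0,-1,2]]; the right-hand side
# was chosen so that the exact solution is [1, 2, 3].
x = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
print([round(v, 12) for v in x])  # [1.0, 2.0, 3.0]
```

The sequential dependency in both sweeps is why this algorithm maps poorly onto hundreds of GPU cores. That dependency is also much of the reason a pre-optimized library routine is valuable.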

insideHPC: How do the new components help speed developer code?

Will Ramey: The new LLVM-based compiler includes several new optimization techniques that allow the compiler to generate more efficient code.  This is another case where the performance improvement will vary depending on the application, but we’re seeing up to 10 percent performance improvement across a variety of applications.

Using the new RNGs in cuRAND, image & signal processing functions in NPP, tri-diagonal solver in cuSPARSE, etc. lets developers quickly adopt pre-optimized routines that take full advantage of the hundreds of cores on the GPU.
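The plain-Python sketch below illustrates one of the newly supported generators: L'Ecuyer's MRG32k3a combined multiple recursive generator. It uses the published constants and the customary seed of 12345 for all six state words. The aim is to show how the algorithm works, not to reproduce cuRAND's implementation with its seeding, skip-ahead and per-thread stream layout. The class name is an assumption. In cuRAND's host API this generator is selected with the CURAND_RNG_PSEUDO_MRG32K3A generator type.

```python
class MRG32k3a:
    """L'Ecuyer's MRG32k3a: two order-3 recurrences combined modulo m1."""

    M1 = 4294967087  # 2**32 - 209
    M2 = 4294944443  # 2**32 - 22853
    A12, A13N = 1403580, 810728
    A21, A23N = 527612, 1370589
    NORM = 1.0 / (M1 + 1)

    def __init__(self, seed=(12345,) * 6):
        # Each component keeps (x[n-3], x[n-2], x[n-1]).
        self.s1 = list(seed[:3])
        self.s2 = list(seed[3:])

    def next_uniform(self):
        """Return the next value in the open interval (0, 1)."""
        # Component 1: x1[n] = (a12 * x1[n-2] - a13n * x1[n-3]) mod m1
        p1 = (self.A12 * self.s1[1] - self.A13N * self.s1[0]) % self.M1
        self.s1 = [self.s1[1], self.s1[2], p1]
        # Component 2: x2[n] = (a21 * x2[n-1] - a23n * x2[n-3]) mod m2
        p2 = (self.A21 * self.s2[2] - self.A23N * self.s2[0]) % self.M2
        self.s2 = [self.s2[1], self.s2[2], p2]
        # Combine; a difference of 0 maps to m1 so the output is never exactly 0.
        z = (p1 - p2) % self.M1
        return (z if z > 0 else self.M1) * self.NORM


rng = MRG32k3a()
print([round(rng.next_uniform(), 6) for _ in range(3)])
```

Python's arbitrary-precision integers and non-negative `%` keep the sketch short. The reference C version instead works in double precision and adds the modulus back explicitly whenever an intermediate result goes negative.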

insideHPC: If I had the most current version of CUDA yesterday, what’s new that I can download today?

Will Ramey: Today you can download the new CUDA Toolkit, SDK code samples, and drivers, all available for Linux, MacOS and Windows.

Also posted in GPUs, HPC, HPC Software, Tools | Leave a comment

Agenda Published for Israel Supercomputing Conference

The HPC Advisory Council has published the agenda for the Israel Supercomputing Conference coming up on February 7 in Tel Aviv. Featuring speakers from AMD, Intel, IBM, NetOptics, Mellanox, ORNL, ScaleMP, and Tel Aviv University, the one-day event will cover advanced HPC topics from around the world.

I’m looking forward to attending this event. The HPC Advisory Council recently reached a milestone of over 300 institutional members.

Thanks to its long history of high-tech breakthroughs by innovative small companies, Israel is often referred to as “Startup Nation.” What better place could there be to launch the 2012 HPC Event Season?

Also posted in Events, HPC, HPC Advisory Council Workshop, Network | Leave a comment

insideHPC.com is a production of insideHPC, LLC. © 2006-2011