Entries filed under “HPCAnswers”

Archived postings from Chis Aycock’s blog at HPCAnswers.com. Preserved here for posterity.

How are FPGAs programmed?

As mentioned previously, the greatest hurdle to FPGA adoption is the developer’s perception of usability. Usually a computer engineer must design the hardware via a description language such as Verilog or VHDL. This process involves defining the transfer of data between registers, which is a distinct departure in the practices most software engineers use. A newer approach is to model the behaviour of the entire system via SystemC, which permits a higher level of abstraction. The C-based approach may make FPGAs more accessible by users.

An Oxford spinout known as Celoxica has a design suite called DK, which permits the user to write software in Handle-C. This language is a subset of C with extensions to describe parallelism. DK can generate VHDL or Verilog from the user’s Handle-C code, thereby making FPGAs just as useable as CPUs. These kinds of tools may come to be essential for further FPGA adoption.

Comments Off

Are GPUs the next wave in HPC?

AMD’s recent purchase of ATI was accompanied by an announcement that AMD will introduce “Fusion,” a combination CPU and GPU intended for general-purpose computing. This is on the heals of the work from PeakStream and RapidMind in the arena of stream programming, which attempts make software development on GPUs easier for non-graphics applications. It certainly appears that GPUs are leading the wave in vector processing, which is of course complimentary to multi-core architectures.

For a while it looked like FPGAs would be the major source of hardware acceleration. Indeed, AMD’s Torrenza initiative is very attractive to vendors like DRC, whose solutions permit Xilinix chips to communicate with the CPU via HyperTransport. However, programming here is different because designers must use a hardware-description language rather than merely port their existing applications. Such a constraint will put off a large number of potential customers. I believe that FPGAs will be relegated to hardware creators who want to test their designs; I do not see FPGAs as the future of HPC.

The Cell BE is another competitor in the accelerator space. IBM is using its own chip as a co-processor for the Roadrunner computer. Mercury, which makes a Cell-based 1U, has the MultiCore Plus SDK for programming these processors. I believe the Cell’s adoption in non-IBM systems (aside from gaming, etc) will be about as wide-spread as Itanium’s.

The only competitor left in this space is ClearSpeed. The Advance requires significantly less energy that vanilla GPUs and is a true vector processor (unlike SSE, etc). My only reservation here is that Advance is being produced by a start-up.

As much as I’d prefer to see someone like ClearSpeed succeed over GPU-based general-purpose computing, I’ve seen enough in this industry to understand commoditization, volume, and market penetration. I believe a more likely scenario is that CPU + GPU will indeed become standard in blade-based clusters aimed at technical computing applications. Perhaps Advance will have its own niche, but even Quadrics and Myricom are introducing Ethernet-based high-performance networks as part of their survival strategy. Maybe Advance can target Torrenza and Geneseo.

Comments Off

What is Marlet?

When computing with really large data sets, such as in the earth or life sciences, it is usually easier to pass the function rather than the data. Marlet is a work-flow language for distributed data analysis; it is based on the principles of functional programming and allows the user to operate while abstracting the underlying system. The user provides abstract functions that are converted to concrete functions at runtime when concrete data is available.

While similar in spirit to Google’s MapReduce, Martlet is more general in that it does not require a specific programming methodology. And despite the fact that the research into Martlet was originally geared towards grid computing, it is feasible that it could be applied to other interests in web services or possibly even large corporate data centers.

As a personal note, Martlet was created by Daniel Goodman, my old officemate at Oxford. He explains that the name for the project comes from the type of bird featured on the crest at Worcester College.

Comments Off

What is Geneseo?

Geneseo is the codename for an enhancement to PCI Express. Lead by Intel and IBM, this technology will make it easier for co-processor vendors to produce accelerator cards. Among other improvements, Geneseo includes new semantics that reduce the overhead of signaling and synchronization, which makes more efficient interaction between the application software and the accelerator chip.

Intel’s work on Geneseo appears to take aim at blunting AMD’s work on Torrenza, which allows accelerators to plug into AMD processor sockets and communicate with the CPU via HyperTransport. The concept behind Torrenza is already being used by DRC.

It is important to note that Geneseo and Torrenza represent two completely different means of aiding co-processors. The former is an enhancement to the bus architecture, whereas the latter leverages the direct connect architecture. AMD of course still has the HTX expansion slot, but this has not been as widely used as PCI Express. And Intel still has its work on CSI, which should debut any decade now.

Comments Off

What is stream programming?

A stream is an array whose elements can be operated on in parallel, similar to SIMD computing. In stream programming, data is gathered from memory into a stream, operated on in the stream, and then scattered from the stream back into memory. Memory latency is thus minimized by accessing the data in chunks, similar in effect to caching.

This style of computing was popularized by Stanford’s Merrimac project, which featured the Brook programming language (an extension to C, actually). At least two start-ups have come from this work: Stream Processors, which will develop signal processors, and PeakStream, which has just released a software engineering tool known as “Platform” intended to simplify development on co-processors.

PeakStream’s Platform is a combination virtual machine (no kernel modification) and library (a standard C++ API). The VM includes a scheduler that directs operations to the best runtime system, whether it be a CPU, a GPU, or even the Cell processor. The library should be easy for anyone familiar with products like MATLAB. Ultimately the goal with Platform is that technical computing customers will obtain much better performance in so-called “heterogeneous” systems.

In a way, the combination of PeakStream + GPU resembles ClearSpeed’s own approach, though Advance uses standard BLAS rather than a proprietary library. It is interesting to note that programming solutions for both co-processors and multicore CPUs have appeared recently. Intel is now offering their Threading Building Blocks library and Mercury has released their MultiCore Plus SDK. The only thing left is better vendor-supported tools for distributed-memory programming.

Update: To be fair to competitors, RapidMind is a commercial distribution of Sh. RapidMind / Sh is similar to PeakStream / Brook as both use stream programming to target CPU, GPU, and Cell. Pick whichever you prefer.

Comments Off

What are grand challenge problems?

Grand challenge problems refer to really difficult tasks that stretch the limits of cognitive ability. They are especially popular in high technology fields as working towards a solution often yields many new applications. Below are a list of some open problems, along with correlations to previously solved grand challenges.

Prove NP ≠ P

This is the holy grail of anyone working in theoretical computer science. Essentially the class P (set of all languages decidable by a deterministic Turing machine in polynomial time) is known to be a subclass of NP (set of all languages decidable by a nondeterministic Turing machine in polynomial time). But is it a proper subclass? A language L is said to be C-complete iff all languages within C reduce to L. This implies that if an NP-complete language were deterministically decidable in polynomial time, then all NP languages would be deterministically decidable in polynomial time. For example, the Traveling Salesman Problem would no longer require an exponential order of magnitude to compute.

It is generally believed today that no such NP-complete language exists. However, no one has actually proven this. This problem has gained so much attention that it is actually sponsored by the Clay Institute of Mathematics. The first person to solve this will win $1M, and probably get the Turing Award and Fields Medal as well.

While it does seem like an arduous task, other similarly difficult problems have been solved before. Perhaps the most famous open problem in mathematics was Fermat’s Last Theorem, which was eventually proved by Andrew Wiles. It only took 350 years.

Pass the Turing Test

The Turing test is that a computer’s responses in a text-based communications forum are indistinguishable from a human being’s. This is perhaps the holy grail of artificial intelligence researchers. The annual Loebner Prize recognizes the best advance towards this goal, though no one has won outright. This year’s award will in fact be announced the day of this blog posting.

This problem seems analogous to that of building a champion chess machine. This feat was accomplished by IBM through Deep Blue, which defeated world champion Garry Kasparov.

Put a Man on Mars

“They can put a man on the moon, but you still gotta eat your vegetables.” Sayings like this epitomize the sheer capability of the human race; we actually put a few members of our species on a different celestial body. Way to go NASA! The modern equivalent of this problem is to put a man on Mars.

Map the Human Proteome

With the complete map of the human genome, many began to wonder if it was possible to map the human proteome. There are many more proteins than protein-defining genes, so this is a much more difficult task.

Comments Off

What is Roadrunner?

Not to be confused with Time Warner’s broadband Internet access provider, Roadrunner will be the first supercomputer to use the Cell BE (see earlier post on the PS3 as a cluster node). The unique architecture calls for inclusion of Opteron processors as well, which means that the system will be able to run a broad range of applications. The Cell will be used in much the same way as a co-processor, thus rivaling competing vendors such as ClearSpeed and DRC Computer. The machine will be installed at Los Alamos in 2008.

Comments Off

What is software pipelining?

Most computer architectures today are able to execute a few instructions in parallel on a single core by having multiple control and arithmetic-logic units. The order of execution may be scheduled by the compiler (in a VLIW processor) or by the hardware itself (in a superscalar processor). Of course not all instructions can be executed in parallel simply because of dependencies.

Software pipelining, which really has nothing to do with hardware pipelining, is a loop optimization technique to make statements within an iteration independent of each other. The goal is to remove dependencies so that seemingly sequential instructions may be executed in parallel. Consider a vector computation:

z = 2x2 + y

Expressed in C, this action would be:

for ( i = 0; i < N; i++ ) {   xsquared = x[i] * x[i];   twoxsquared = xsquared << 1;   z[i] = twoxsquared + y[i]; }

The statements within the iteration must be executed sequentially because of the order of operation in arithmetic. However, each computation of z[i] is independent of the others. Therefore, we may structure the loop such that partial computations of various z values are performed sequentially. This will then allow the compiler or CPU to partially parallelize the resulting instructions.

We begin by unrolling the loop and arranging the iterations into a pipeline:

xs = x[i] * x[i];
txs = xs < < 1; xs = x[i+1] * x[i+1];
z[i] = txs + y[i]; txs = xs < < 1; xs = x[i+2] * x[i+2];
z[i+1] = txs + y[i+1]; txs = xs < < 1;
z[i+2] = txs + y[i+2];

Notice on the third row that we essentially have the original loop statements, but in reverse. This is where the instruction-level parallelism comes into play; these statements are independent of each other. This is our new loop. With the addition of some starter and finishing code, the end result is as follows:

xs = x[0] * x[0]; txs = xs < < 1; xs = x[1] * x[1];  for ( i = 0; i < N-2; i++ ) {   z[i] = txs + y[i];   txs = xs << 1;   xs = x[i+2] * x[i+2]; }  z[N-2] = txs + y[N-2]; txs = xs << 1; z[N-1] = txs + y[N-1];

These are not entirely independent because xs and txs are shared among iterations. However, this is better than before because the index computations may be manipulated in parallel. Furthermore, it allows for a sort of pre-fetching against array elements.

There is another caveat regarding this loop in that the array must have at least two elements. Should there be any possibilities to the contrary, the user must code a special case.

Software pipelining can really mangle code, as this example shows. Thus, for maintainability, such a tool should be one of the last optimizations a programmer considers. But for cases when the processor absolutely must be saturated on every core, removing intra-loop dependencies in this manner is one of the most effective techniques available.


Leave a comment

How do we improve locality of reference?

Parallel execution is the most common technique for speeding up large applications, but it alone is not sufficient for all cases. In particular, Amdahl’s Law states that the inherently sequential portion of a program is the limiting factor of speed improvements.

An additional tool for performance enhancement is to reduce the effective latency for fetching data. For example, the memory system could be made entirely from transistors (SRAM) rather than a combination of transistors and capacitors (DRAM). Such an architecture would be prohibitively expensive for most users, so many designs today call for a memory hierarchy featuring some SRAM and a lot of DRAM. This of course exploits the principle of locality in which data tends to be organized in arrays or accessed in loops.

Therefore, it is beneficial for a program to organize data to be accessed repeatedly and sequentially. Consider the follow matrix-matrix multiplication:

for ( i = 0; i < N; i++ )   for ( j = 0; j < N; j++ )     for ( k = 0; k < N; k++ )       C[i][j] += A[i][k] * B[k][j];

Languages like C have row-major ordering, in which the left-most index is interpreted first. Therefore, to improve locality of reference, the program should fix the row and vary the column. The variable i is used twice as a row index, k once as a row index, and j never as a row index. Therefore, the loop statement that iterates with i should be the outermost, the statement with k next, and j last. Thus, the above code should be transformed as follows:

for ( i = 0; i < N; i++ )   for ( k = 0; k < N; k++ )     for ( j = 0; j < N; j++ )       C[i][j] += A[i][k] * B[k][j];

All we've done here is swap the order of two loop statements. This operation is legal as addition is both commutative and associative. That is, the definition of the program is not changed when these two loop statements are swapped.

The performance gains from such a simple transformation can be tremendous as we have reduced the number of cache misses. The real beauty of this transformation, though, is that it requires no special hardware beyond what is already present on most machines. That is, this optimization can be performed on just about any platform.

As a general rule of thumb, locality is improved when the loop statements are structured such that immediate data accesses are all present within a cache block. Happy transformations!


Comments Off

Is RDMA really that bad?

The paper “A Critique of RDMA” presents a good argument against remote direct memory access in high-performance networks. An overview of the technology is present in an earlier post on the Virtual Interface Architecture (VIA).

The VIA model requires an enormous amount of overhead for the initial communication because the software layer must pin the pages and swap the permission keys. This may be permissible in applications that match with remote memory access (such as partitioned global address space languages) because of various caching techniques. Other applications, like remote procedure calls, do not map well to remote memory access and thus do not permit caching.

It should be noted though that QsNet does not have pinning as a kernel patch performs DMA transfers once the page has loaded into memory on its own. This method is not particularly portable as each version of the kernel must be modified slightly differently, which makes software maintenance difficult for IT managers. Also of note is that Blue Gene does not have virtual memory, and therefore ensures that pages are always in memory. Both IBM’s and Quadrics’s approaches are not commodity, however, and are therefore expensive.

Given that most software in networked systems will require explicit notification of message receipt, the remote memory access model is not particularly adequate, and thus the RDMA constructs are not particularly useful. (As a side note, partitioned global address space languages are best targeted for ccNUMA machines, though these are even more expensive.) The aforementioned paper basically states that network vendors would be better off pushing constructs for programmed IO, such as message matching on the NIC. This indeed appears to be the best approach of all as it maps well to Sockets.

Comments Off

What are some open-source tools for technical computing?

Because of MATLAB‘s sheer popularity, most open source tools mimic MathWorks’s proprietary language. Octave, which started as a teaching tool in a similar vein as MATLAB, is probably the most widely used. Its language is very compatible with MATLAB and the command-line front end can be customized to mimic MATLAB’s editing window.

Likewise, Scilab is another open source tool, though the language—while perhaps “inspired”—is different from MATLAB’s. However, Scilab does include Scicos, which is used for simulation and modeling. Scicos is unfortunately incompatible with Simulink.

As for statistics, many users prefer a dedicated package like S-PLUS to MATLAB’s matrix-oriented computation. S-PLUS itself is a commercial implementation of the S language. Similar to S is the R language and its open-source implementation.

In terms of symbolic computation like Maple and Mathematica, there are Axiom and Maxima. These packages are useful for solving integrals and the sort.

Thus there are numerous open-source, as well as commercial, packages available for technical computing. As is usually the case, the open-source programs tend to be slower and have fewer features, whereas the commercial applications tend to be fairly expensive.

Comments Off

What is Cray Hood?

“Hood” is Cray’s next-generation supercomputer, and is based on the XT3 architecture. It uses Opterons and an improved SeaStar network (torus topology). Details are pretty light at the moment, but it will be installed next year.

Comments Off

What is OpenFabrics?

OpenFabrics is the new name for OpenIB. The title change reflects the shift in supporting only InfiniBand to supporting both InfiniBand and iWARP. The organization has an “enterprise distribution” of a software stack for these networks. If nothing else, this will give iWARP a leg-up in its competition against Myri-10G.

Comments Off

Why did Microsoft release C#?

It may be tempting to write off the .NET strategy as merely a ploy to increase dependency on Microsoft’s platform, and indeed that may be a correct assessment. Given that Microsoft holds such dominance on the maturing desktop market, the folks in Redmond must be aiming to expand in the lucrative server sector. It appears that C# is a way to have more developers build software for Windows, and thus generate greater customer demand. In a way, this mirrors MS’s push into technical computing. The Linux / Unix community should take this as an incentive to create better programming tools.

Comments Off

Is government intervention necessary for broader HPC adoption?

The largest computing systems have historically required government support, as the market for these machines is too small to interest most vendors. Even today, the US government purchases the biggest machines itself and pushes select corporations through DARPA’s High Productivity Computer Systems (HPCS) initiative. But such intervention does not guarantee economic sustainability.

Indeed, commodization and ease-of-use has given greater access to technology than any government program. It seems more appropriate to supplement a low-cost baseline platform with co-processors and special software. For example, MATLAB and blade-based “personal supercomputers” can be augmented with Star-P and ClearSpeed. So government intervention here seems rather unnecessary.

As a side note, it appears that much of the worry on competitiveness and innovation stems from xenophobia. The Americans were spurred into action more by the success of the Japanese than by a genuine desire to do science. RIKEN’s MDGRAPE-3 may well be the world’s first petaflop computer. Congratulations are in order.

Comments Off

Video Archive

insideHPC.com is a production of insideHPC, LLC. © 2006-2013 Sitemap