Entries filed under “HPCAnswers”

Archived postings from Chris Aycock’s blog at HPCAnswers.com. Preserved here for posterity.

New Whitepaper: Introduction to Intel OpenCL Tools

 

A new whitepaper by Intel’s Vinay Awasthi describes the status of the company’s OpenCL implementation and available tools for developers using the Intel OpenCL SDK.

The Intel implementation is the only implementation at the moment that implements out of order queues. Intel’s implementation also allows multiple work-items per workgroup for CPUs. There is also preview support for device fission extension (not fully validated). We will cover the benefit of such options later in this whitepaper. With this implementation, you will also receive OpenCL offline compiler. This compiler will let you observe assembly instructions and intermediate representation (IR) of your OpenCL kernels instantly without having to plug them into a program or using any APIs to get IR. Developers can use this tool to also compile kernels for correctness.

Read the Full Story.

Also posted in HPC

HPC enables discovery of new blood pressure drug

Also at the new HPCwire today, news that U of Florida researchers used HPC to find a new drug that lowers blood pressure and prevents heart and kidney damage, at least in rats. More research in humans coming.

Researchers used one of the world’s most powerful supercomputers to process 140,000 prospective drug compounds in a matter of weeks. The computer predicted which molecules would be most likely to enhance the activity of ACE2, rotating them in thousands of different orientations to see how they would bind to certain pockets on the enzyme’s surface.

…After hitting on the “lead” compound, UF researchers then tested it in hypertensive rats that had developed fibrosis of the heart and kidney. The animals received the drug for two weeks. Tissue samples from treated animals revealed a significant decrease in fibrosis of the heart, kidney and blood vessels, said Ostrov, who described the findings as “striking and reproducible.”

And the good news on this one keeps coming, apparently:

…Early results also show the compound inhibits inflammation, which has significant implications for a number of human diseases, including autoimmune diseases such as type 1 diabetes and rheumatoid arthritis as well as other diseases involving fibrosis, such as Alzheimer’s, Ostrov said.

As the old SNL sketch goes, “it’s a non-dairy whipped topping AND a floor wax.” Anyway, good stuff.

Also posted in HPC

InsideTrack: Letter from SGI to LNXI customers

The letter linked below was sent from SGI to Linux Networx customers, and the InsideTrack secured a copy through its vast network of industry insiders.

The letter, from SGI Global Services VP Bob Pette, outlines what LNXI customers can expect. A few highlights of interest follow.

First, SGI’s take on what happened:

Today SGI announced the purchase of certain assets of Linux Networx, Inc. In conjunction with this transaction SGI has offered employment to a number of Linux Networx employees in the Engineering, Sales and Services areas and acquired LNXI’s spare parts inventory.

And then what it means for LNXI customers (note the honesty):

SGI did not acquire Linux Networx’s service contracts and as such, does not have a service contract in place with you. …During this period of transition, please continue to place your LNXI service requests as before by calling 1-800-459-7138 or online at http://support.linuxnetworx.com.

Very refreshing. Here is the whole letter as a PDF.

Also posted in InsideTrack

Article: File Systems for HPC Clusters

Jeffrey B. Layton has written yet another interesting article for Linux Magazine.  This time, he’s put together an overview for those interested in implementing a parallel file system on a cluster.

If you have an interest in pulling together your own cluster, or maybe you just want to understand more about cluster technology, it’s necessary to grok the differences between clusters and standard systems.

Read the full article here.  [free registration required]

Also posted in Enterprise HPC, HPC

Will SaaS work in HPC?

Software-as-a-Service is useful for products that require network connectivity, such as email and instant messaging; just witness the popularity of Gmail and Meebo. Among enterprise customers, ERP/CRM applications hold some degree of promise, like Salesforce.com. So then the question is whether technical computing customers could use SaaS.

The best example I can think of is in offloading hefty workloads to managed servers. This isn’t grid computing as traditionally known, but rather an application that can be called on-demand to perform a very specific task. I believe that in the future, it may be possible to farm heavy number-crunching from Excel or MATLAB to another company’s server on the fly. I’m just waiting for Microsoft to introduce “Excel Services Live.” You heard it here first.

What is the benefit of domain-specific languages?

For starters, domain-specific languages make users more productive than general-purpose languages and give them more flexibility than a simple GUI. Consider what SQL gives to database managers, or Excel to finance professionals. And while C++ has features like polymorphism and operator overloading that allow for “syntactic sugar” in mathematics libraries, most engineers will prefer MATLAB because, if for no other reason, it’s interactive.

But these languages have an added bonus that the HPC community should now take seriously: because they are limited, domain-specific languages are easier to optimize. While ACCELLERANT tries to parallelize any and all code, Star-P sticks to just matrix operations. After all, why should a non-programmer bother with stream computing, electronic systems-level design, and partitioned global address spaces when all he wants is to crunch numbers faster?

Before anyone decries dynamically typed languages for their perceived low performance, just keep in mind that many popular websites (massively distributed computing infrastructures) are actually programmed in ASP or PHP. Given proper optimization, a domain-specific language can be fast for both the computer and the user. So here’s to new languages for professionals.

What is the best way to keep up on HPC news?

Staying informed in this market can be difficult given our niche position. However, there are a few sources that anyone in this field should most definitely be familiar with.

First and foremost are the conferences, namely Supercomputing (SC). Held annually in the US, this monster get-together showcases all of the latest in research and development, and offers a number of tutorials on emerging technology. A week here is equivalent to a semester in grad school. A distant second in this category is the International Supercomputing Conference (ISC), held annually in Germany.

Among online sources, the best for original articles is HPCwire, for which I’ve written. As for news snippets, John E. West’s InsideHPC is a daily source. Coincidentally, John is also a regular contributor to HPCwire.

For the broader technology market, there are always Slashdot, Dzone, and The Register. These occasionally have articles that may be of interest to HPC practitioners.

Those are the major news and information sources. As mentioned before, a surprisingly bad source is Wikipedia. I had thought about an effort to create a “wikiHPC” that would act as an online Hennessy and Patterson, but then I realized that we already have Wikipedia and could probably just add to that. Grad students should feel free to copy and paste the factual background material from their theses into it.

What is Duff’s Device?

Duff’s Device is a loop-unrolling technique for C code that interleaves a switch statement with a do-while loop, using case fall-through to handle the leftover iterations when the count is not a multiple of the unrolling factor. The primary benefit of loop unrolling is to reduce branching, which is among the more expensive operations on a pipelined processor. While some looping is needed to keep the code small enough for the instruction cache, too much branching disrupts the pipeline and can hurt the memory hierarchy as well. Programmers who require extreme performance would do well to learn a number of best-practice loop optimizations. Duff’s Device is one of them.
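
As an illustration, here is a minimal sketch of a Duff’s Device-style copy in C. It is a sketch under stated assumptions, not a tuned implementation: the function name duff_copy is made up for this example, the unrolling factor of eight is the conventional choice, and the destination is an ordinary array (Tom Duff’s original wrote repeatedly to a single output register, so it did not advance the destination pointer).

```c
#include <stddef.h>

/* Copy `count` ints from `from` to `to`, unrolled eight ways.
 * The switch jumps into the middle of the do-while body so that the
 * first pass handles the count % 8 leftover elements; every later
 * pass copies a full group of eight. The name duff_copy is
 * hypothetical and exists only for this example. */
void duff_copy(int *to, const int *from, size_t count)
{
    if (count == 0) {
        return; /* the loop body below always copies at least one element */
    }

    size_t passes = (count + 7) / 8; /* trips through the loop, rounded up */

    switch (count % 8) {
    case 0: do { *to++ = *from++; /* fall through */
    case 7:      *to++ = *from++; /* fall through */
    case 6:      *to++ = *from++; /* fall through */
    case 5:      *to++ = *from++; /* fall through */
    case 4:      *to++ = *from++; /* fall through */
    case 3:      *to++ = *from++; /* fall through */
    case 2:      *to++ = *from++; /* fall through */
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

With a count of 10, for example, the switch enters at case 2, copies two elements, and then the loop makes one more pass that copies the remaining eight: one loop test per eight elements instead of one per element. Modern compilers often unroll simple loops on their own, so it is worth measuring before reaching for this by hand.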

What is Parallel Knoppix?

Have you ever been in a position where you needed to run an MPI application a few times, but not enough times to justify buying your own cluster? Do you have access to a few PCs, but can’t or don’t want to install any software such as Condor on them? Then maybe you could use Parallel Knoppix.

Parallel Knoppix is a bootable CD for running MPI applications on a network of workstations. It’s a Linux distribution that executes the common steps for determining hardware and configuring devices. As of this writing, there is no 64-bit version of it, though that may change in the future. The disc image can be downloaded from the project’s website, or may be purchased from LinuxCD.org.

What is Terracotta?

Terracotta is an open source distributed shared object facility for Java, which allows multithreaded applications to run on clusters with minimal changes. It works with existing application servers and other web platforms, which makes distributing application loads across multiple nodes (JVMs) straightforward. It performs thread synchronization and even thread migration transparently for the user.

In addition to the runtime facilities, Terracotta provides a declarative approach to clustered software. That is, the programmer merely annotates which data members are shared. Likewise, the user may specify which methods contain critical sections, thereby creating a monitor.

The system architecture relies on a central server that stores the state of shared objects. Client nodes (JVMs) receive updates for objects currently in memory; thus, any data transfers occur only at the object level. For fault tolerance, the server itself may be clustered with one live and others in standby.

The company behind Terracotta has an open source business model that sells support contracts for enterprise customers.

What is CPUShare?

CPUShare is a grid computing initiative that pays its participants for providing idle processing time. Unlike BOINC, the provider is selling his time rather than donating it. While there is no word on the actual revenue a seller could reasonably expect to earn, anyone considering this program should weigh the cost of electricity for running the software before picturing profits.

It seems like this system is aimed more at buyers, in that they can order CPU time without paying for a cluster. However, the buyer must port his code to CPUShare’s platform. Given the time and money required to use this system, a user may be better served by purchasing an accelerator and porting his software to that, especially since grid computing only works in scenarios where there is lots of computation and little need for synchronizing communication.

As a word of advice for sellers who are contemplating any shared computing program, please anticipate the wear and tear that can occur on the disk drive. One workaround for this is to create a RAM disk.

When will Ethernet be able to compete directly with InfiniBand’s latency?

I received this question in reference to an article from a few months ago. My paper was about functionality instead of mere performance, though my comments regarding RDMA-based overhead should hint at how poor InfiniBand is for some applications. Many of the benchmarks out there assume that the memory region is being reused and that the protection tags can be cached, which isn’t the case when there are numerous communication partners in the system.

As for 10 Gig E, vendors typically offload TCP onto the card, which takes care of most issues when communicating over the Internet Protocol. The real question is whether 10 Gig E can match InfiniBand for IP-based communication. I believe it already can.

It is certainly possible to tweak an IB app to run faster by using uDAPL in place of Sockets, provided there are few communication partners. Oracle RAC does this by restricting communication to selected pre-determined pairs; that is, there is no free-for-all that one typically finds in open client / server architectures.

Most customers would be served equally well with Ethernet. The reason I’m pushing that network is that it is much more of a commodity than InfiniBand. And indeed, we now see vendors pushing hybrid solutions, such as iWARP, Myri-10G, and QsTenG. That is, vendors with experience in high-performance computing are building on Ethernet and pushing it for enterprise markets, in addition to their traditional technical markets. The overall goal isn’t performance (though they certainly are achieving that) but rather price.

What is the difference between AMD’s Stream Processor and NVIDIA’s GeForce 8800? (Or, is Cray’s strategy the right one after all?)

AMD has announced a Stream Processor that comes from its recent acquisition of ATI. The processor is currently available on a PCI Express board and is provided with one gigabyte of dedicated memory. It also comes with the Close to Metal (CTM) interface for software developers. CTM is the target of stream programming platforms such as PeakStream and RapidMind, though its open nature allows it to be targeted by in-house developers.

The Stream Processor is different from the CUDA technology in the GeForce 8800 in that the latter has cooperating cores and can therefore run multithreaded applications without stream programming. That is, AMD’s approach is a vector processor—SIMD—whereas NVIDIA’s approach is a multithreaded processor—MIMD. (To be precise, a stream processor applies a “kernel” of related instructions stored in a cache, whereas a vector processor applies a single instruction stored in a register; for our discussion, the difference is minimal.) This SIMD vs. MIMD divide also appears when comparing ClearSpeed and the Cell BE.
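
To make the stream model a little more concrete, here is a minimal sketch in plain C. It is not CTM or CUDA code, and it runs sequentially on a CPU; the names kernel_fn, scale_and_offset, and stream_map are hypothetical and exist only for this illustration. The point is the shape of the computation: one small kernel applied independently to every element of an input stream.

```c
#include <stddef.h>

/* In this sketch, a "kernel" is a function applied to one stream
 * element at a time. The type name kernel_fn is hypothetical. */
typedef float (*kernel_fn)(float);

/* Hypothetical kernel: a short chain of related operations that
 * depends only on its own input element. */
float scale_and_offset(float x)
{
    return 2.0f * x + 1.0f;
}

/* Apply the kernel to every element of the input stream. No
 * iteration reads another iteration's result, so a stream processor
 * would be free to run many of them at once; this plain C loop just
 * runs them one after another. */
void stream_map(kernel_fn kernel, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = kernel(in[i]);
    }
}
```

A caller would use it as stream_map(scale_and_offset, input, output, n). Because the elements never communicate, the work maps naturally onto SIMD-style hardware. A multithreaded design with cooperating cores also allows threads to exchange intermediate results, which this model deliberately rules out.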

It is interesting to note that the offer of vector processors and multithreaded processors matches Cray’s adaptive supercomputing strategy. (Cray also offers FPGAs, which have been the focus of Celoxica and DRC.) And the CPU behind all of this is the x86; AMD’s offerings are currently being favored over Intel because of the direct connect architecture.

Cray might have the satisfaction of being right, but they still need to worry about market penetration before the smugness settles in. The other vendors have the benefit of commoditization, which is the exact force that removed Sun from being the leader in enterprise computing. Third-party OEMs have already announced the inclusion of the Stream Processor at Supercomputing this week. Can Cray keep up with that amount of volume?

One interesting side note I’d like to close with: while contemplating the SIMD and MIMD issues, I realized that the x86 vendors already have a watered-down version of both of these, namely SSE and multi-core architectures. It appears that Flynn’s taxonomy still rings true today; everyone is rushing to add these components to CPUs, either on-chip or alongside.

What is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s GPU architecture featured in the GeForce 8800. Positioning itself as a new means for general purpose computing with GPUs, CUDA provides 128 cooperating cores. Because the cores can communicate with each other, the GPU can run multithreaded applications without the need for stream computing. Along with this innovation, NVIDIA has released a software development kit that includes a standard C compiler as well as an optimized BLAS library. CUDA may indeed be the final piece needed to make GPUs the next wave in HPC.

How can we overcome bus saturation in multi-core systems?

Multi-core systems, in combination with specialized co-processors for hefty tasks, are hailed as the future of high-performance computing. In a bus-based architecture, the environment is an SMP in which all of the memory is accessible by all of the processors in the same amount of time. This setup works well for a few cores, but has tremendous trouble for the dozens of cores promised in the future. The resource contention in an SMP is not a new issue; the solution of yesterday is the same for today: NUMA.

In a NUMA architecture, each memory region is attached to a particular processor, so accesses to remote memory take longer than accesses to local memory. Of course this setup brings other headaches, such as cache coherence (which really needs to be performed directly in hardware for performance reasons) and data partitioning choices (so that most accesses are to local memory rather than remote). These downsides are usually accepted simply because NUMA is the only way to achieve scalability in systems with many processors, and now many cores.

This is a key difference between AMD’s and Intel’s respective strategies. AMD has embraced the NUMA architecture and is proceeding with HyperTransport. Intel may do something similar in the future, but for now is sticking with SMP over a shared front-side bus. Because of AMD’s approach, there are some startups that are creating Opteron computers that rely heavily on HyperTransport. (Fabric7, PANTA Systems, and Liquid Computing also share the fact that they embrace virtualization, which is another blog post altogether.)

So the answer for dealing with bus saturation is to not have a bus at all. That is, multi-core systems require a direct connect architecture. The original vision of InfiniBand was to achieve this, though the bloated spec and the delayed product launches quickly dashed the Trade Association’s plans for world domination. Perhaps HyperTransport and other less ambitious technologies will be the saviour for multi-core computers.

insideHPC.com is a production of insideHPC, LLC. © 2006-2011