Four paths to parallelism with Java

JDJ posted an article last week (thanks to Multicoreinfo.com for the pointer) on parallelism in Java:

Parallel programming in Java is becoming easier with tools such as the fork/join framework, Pervasive DataRush, Terracotta, and Hadoop. This article gives a high-level description of each approach, pointing you in the right direction to begin writing parallel applications of your own.
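To make the first of those concrete, here is a minimal sketch of the fork/join framework (`java.util.concurrent`): a divide-and-conquer parallel sum that splits an array until chunks are small, then sums them sequentially. The class names and the threshold value are illustrative choices, not anything prescribed by the article.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// A RecursiveTask that sums a slice of an array, splitting in half
// until the slice is below THRESHOLD, then summing sequentially.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // small enough: sum directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                          // schedule left half asynchronously
        return right.compute() + left.join(); // compute right half here, then join
    }
}

public class ForkJoinDemo {
    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // sum of 1..100000 = 5000050000
    }
}
```

The work-stealing pool handles load balancing among worker threads, which is exactly the kind of concurrency bookkeeping the article says these frameworks hide from you.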

I’ve written about the dataflow programming environment DataRush before (and podcasted), and most of you are probably familiar with Hadoop, the open source implementation inspired by Google’s MapReduce infrastructure. Terracotta was new to me, though.

So far, we’ve constrained ourselves to running on a single multicore machine. However, we live in a world of inexpensive hardware and networking. If we’re willing to accept more administrative complexity, we can make an attempt to gain horizontal scaling through a distributed solution. As your problems get larger, you can just throw more machines at it.

Terracotta is an open source solution for doing just that. It allows multiple JVMs, potentially on different machines, to cluster and behave as a single JVM. Not only does this provide more processing power, it also provides more memory. The best part – this is all transparent to the programmer. You can make a multi-threaded program into a clustered program without rewriting any code. Just specify which objects need to be shared across the cluster and things like change propagation and distributed locking are handled for you.

Terracotta does have some significant (to my mind) disadvantages though, especially for parallel newcomers — the kind of folks most likely to move into parallelism through Java.

Unlike the other solutions we discussed, Terracotta does not provide (by itself) any abstractions that hide concurrency. You still have to worry about threads and locking when writing code. Plus, since it’s easy to share objects, it’s also easy to naively introduce cluster-wide hotspots that kill performance.
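To illustrate the hotspot point, consider a plain multi-threaded counter like the sketch below (the names are mine, purely illustrative). Inside a single JVM the lock is cheap; but if that counter were declared a Terracotta shared root, every increment would acquire a cluster-wide lock and propagate a change to every node, serializing the whole cluster on one object.

```java
// Illustrative only: fine inside one JVM, but a cluster-wide hotspot if
// the counter is naively shared across a Terracotta cluster, since each
// increment would then take a distributed lock.
public class HotspotDemo {
    private static final Object LOCK = new Object();
    private static long requestCount = 0; // imagine this shared cluster-wide

    static void handleRequest() {
        synchronized (LOCK) {             // cheap locally, expensive clustered
            requestCount++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) handleRequest();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(requestCount); // 4 threads x 10000 = 40000
    }
}
```

The usual remedy is to partition or batch such state (for example, per-node counters merged periodically) so that no single object becomes the point every node contends on.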

The article closes with some specific application features to look for when selecting among these tools for your project.

Your job as a software engineer is to distill the fundamental nature of your application and choose the tool whose “sweet spot” most closely aligns with it. The high-level overview we’ve provided here will give you a start in your research.

I recommend the article, even if you are a dyed-in-the-wool MPI guy. It’s good to study other implementations, figure out what they are doing right that you aren’t, and pull that information into your own practice.

Comments

  1. While it’s true that familiarity with concurrent programming principles is needed to make full use of all of Terracotta’s developer-facing features, the extensive library of Terracotta Integration Modules (TIMs) for use with third-party technologies allows many people to make use of Terracotta *without* needing to know anything about concurrent programming.

    This can be seen to great effect in the high-scale reference web application we built to show how Terracotta is used in a real-world scenario. When you look at the code to examinator, you’ll find very little concurrency-aware code. All of the concurrency is handled inside the various TIMs used by the application (e.g., Spring Webflow, Spring MVC, Spring Security, …).

    You can see a full list of available TIMs here:

    http://terracotta.org/web/display/orgsite/Integration+Guides

    Like Terracotta itself, all of these TIMs are open source and free for use in production.

  2. John West says:

    Orion: thanks. Just curious, do you have any pointers to or experiences with using Terracotta in a scientific programming example?
