DataRush posts 2 TB per hour on MalStone B

I thought this might be of interest to at least some of you: Pervasive Software announced last week that its flagship DataRush product posted a rate of 2 TB/hr on MalStone B (a stylized benchmark for data-intensive computing; Robert Grossman describes it here) using a 32-core Intel Xeon 7550 server. MalStone B10 has 10 billion records, which equates to just under 1 terabyte of data (fixed-size, 100-byte records).

“These results provide powerful validation of the ability of Pervasive DataRush to scale massively and consume all available cores as commercially available core counts increase,” said Ray Newmark, vice president of sales and marketing for Pervasive DataRush. “This kind of performance is a beacon for organizations struggling with complex or large data who want to harness the power of multicore. We enable users to process large amounts of data to obtain actionable information faster and more cost-effectively than other technologies.”

Pervasive DataRush ran the 10-billion-row benchmark on an Intel server with a 64-bit JVM 6 installed on 64-bit Windows 2008. The Pervasive DataRush runtime of 31.5 minutes was 26 times faster than the same test in a published benchmark using Hadoop on a 20-node cluster. Not only did Pervasive DataRush achieve superior performance, the application showed excellent scalability from two to thirty-two cores. This level of performance and scalability allows organizations to leverage the most appropriate hardware for the performance desired.
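As a rough sanity check, the headline rate can be recomputed from the figures quoted above: 10 billion records of 100 bytes each, processed in 31.5 minutes. The sketch below assumes decimal terabytes (10^12 bytes); in binary units the same data set is about 0.91 TiB, which may be what "just under 1 terabyte" refers to. The implied Hadoop runtime is only an inference from the quoted 26x figure, not a number taken from the published benchmark.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: "TB" means decimal terabytes (1e12 bytes).

RECORDS = 10_000_000_000   # MalStone B10: 10 billion records
RECORD_BYTES = 100         # fixed record size
RUNTIME_MIN = 31.5         # reported DataRush runtime, in minutes
SPEEDUP = 26               # reported speedup over the 20-node Hadoop run


def dataset_bytes(records: int, record_bytes: int) -> int:
    """Total size of the input data in bytes."""
    return records * record_bytes


def throughput_tb_per_hour(total_bytes: int, minutes: float) -> float:
    """Throughput in decimal terabytes per hour."""
    return (total_bytes / 1e12) / (minutes / 60)


total = dataset_bytes(RECORDS, RECORD_BYTES)
rate = throughput_tb_per_hour(total, RUNTIME_MIN)
tebibytes = total / 2**40
# Inferred, not reported: runtime of the Hadoop run implied by the 26x claim.
implied_hadoop_hours = RUNTIME_MIN * SPEEDUP / 60

print(f"data set size:        {total / 1e12:.2f} TB ({tebibytes:.2f} TiB)")
print(f"throughput:           {rate:.2f} TB/hr")
print(f"implied Hadoop time:  {implied_hadoop_hours:.2f} hours")
```

The result comes out at roughly 1.9 TB/hr, which rounds to the 2 TB/hr in the headline.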

We’ve written about Pervasive before; you can find an in-depth piece here.

I didn’t really have a point of reference for the significance of this result, so I got in touch with the company. Here’s what they had to say:

The run-time we are publishing (31.5 minutes) is faster than many of the other published runtimes (maybe faster than all, we’d need to double check on the latest published). One of the main precepts of DataRush is our ability to process large amounts of data, hence our focus on the throughput rate, not just the wall clock time.

The 2 Terabyte/hour rate is excellent for this benchmark. This is especially so given that we are comparing ourselves against other runs that were made using clusters. This shows that a single machine can handle jobs that once were only considered feasible on cluster configurations. Again, one of our focuses with DataRush is to run on commodity hardware. We configured the system using the RAID card that came with the box using terabyte drives we bought at a local electronics store. We used a RAID-0 configuration.