Big Data and the Evolving HPC Cluster

Over at Scientific Computing, Gord Sissons from IBM writes that Big Data is driving the same kind of revolution in HPC architectures that we saw 20 years ago with the rise of clustered systems over monolithic vector systems.

HPC is changing again, and the catalyst this time around is Big Data. As storage becomes more cost-effective and we acquire the means to electronically gather more data faster than ever before, data architectures are being reconsidered once again. What happened to compute over two decades ago is happening today with storage. Consider that Facebook generates approximately 60 terabytes of log data per day. We become numb to large numbers like this, and it is easy to forget just how much data that is. Reading 1 terabyte from a single disk at 50 megabytes per second takes approximately 6 hours; at that rate, reading 60 terabytes would take roughly 15 days. When confronted with these volumes of data, the only path forward is to harness distributed approaches and rely on parallelism.

Leaders like Google and Yahoo did exactly this out of necessity, building the Google File System and, in turn, inspiring the creation of Hadoop. Hadoop relies on clusters of storage-dense nodes to store vast data volumes reliably and economically, while leveraging a programming framework that enables fast parallel processing of that data. One of the key ideas behind Hadoop MapReduce is that it is more efficient to send compute tasks to where the data already resides than to attempt to move the data across networks.
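Hadoop's actual MapReduce framework is Java-based and far richer than this, but as a rough illustration of the pattern described above, here is a minimal word-count style sketch in Python. The map/shuffle/reduce functions and the sample log blocks are illustrative assumptions, not Hadoop's API; the point is that the map step runs against locally stored blocks, and only small intermediate key/value pairs would need to cross the network.

```python
from collections import defaultdict

# Minimal sketch of the MapReduce pattern behind Hadoop-style word count.
# On a real cluster, map_phase() would run on the nodes that already hold
# each data block (data locality); only the compact intermediate pairs
# travel to the reducers.

def map_phase(block):
    """Emit (word, 1) pairs for one locally stored block of log text."""
    for line in block:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Two hypothetical blocks, standing in for HDFS blocks on two nodes.
blocks = [["error disk full", "error timeout"],
          ["timeout retry", "error"]]

pairs = [pair for block in blocks for pair in map_phase(block)]
print(reduce_phase(shuffle(pairs)))  # e.g. {'error': 3, 'timeout': 2, ...}
```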

Read the Full Story.