In the late 1980s, genomic sequencing began to shift from a purely wet-lab discipline to a computationally intensive science; by the end of the 1990s this trend was in full swing. Applying computer science and high performance computing (HPC) to these biological problems became the normal mode of operation for many molecular biologists.
Most of these researchers did not have a history of working with expensive supercomputers; typically they relied on commodity Linux workstations and small HPC clusters to meet their sequencing needs.
The problem is that for many genomic workflows, this kind of distributed computing model was, and still is, a poor fit. When thousands of cores are spread across the nodes of a very large system, each node with its own unshared memory, node-to-node communication becomes an issue. The limited amount of RAM available to each CPU can also become a bottleneck. Both issues can be addressed with globally shared memory.
This is the third article in a six-part series on the trends and technologies impacting the application of HPC in the life sciences.
Solutions such as hypercube architectures are being implemented to improve the design of large clusters. Also, given the terabytes of data generated by today's Next Generation Sequencing (NGS) machines, and the need to sequence and analyze genomic data quickly and cost-effectively, parallel processing is essential. However, much of the sequencing code in use today is not parallelized, or is embarrassingly parallel at best. Very little code optimization has occurred in NGS applications over the past fifteen years, certainly compared with computational codes in other research disciplines.
Even if the code is modernized, conventional distributed Linux HPC clusters are difficult to scale to keep up with the torrents of genomics data being processed. Simply throwing more cores at the problem is not the answer. Consider, for example, the computational requirements of processing an entire genomic dataset of some 10TB. A standard Linux cluster or a cloud solution such as Amazon EC2 cannot fit a 10TB dataset into memory. Even a terabyte of memory on a fat node is a luxury today, and the task (and cost) of transferring that amount of data over the Internet to remote storage remains a major challenge.
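The arithmetic behind that claim is simple. Using illustrative node sizes (a 128GB commodity node and a 1TB fat node, neither tied to any specific product):

```python
# Back-of-envelope capacity check for the 10 TB example above.
# Node RAM figures are illustrative assumptions, not product specs.
dataset_gb = 10 * 1024       # 10 TB expressed in GB
cluster_node_ram_gb = 128    # a typical commodity cluster node
fat_node_ram_gb = 1024       # a generous "fat node"

nodes_needed = dataset_gb / cluster_node_ram_gb
fat_nodes_needed = dataset_gb / fat_node_ram_gb
print(nodes_needed)      # 80.0 commodity nodes just to hold the raw data
print(fat_nodes_needed)  # 10.0 fat nodes, before any analysis working space
```

Even in the best case the data must be scattered across many separate memories, which reintroduces the communication overheads discussed earlier.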
The growing capabilities and complexity of today's NGS tools are generating large, unpredictable flows of data that threaten to swamp conventional server and storage solutions. In fact, many current genomics problems are simply beyond the reach of these architectures. In many instances, code written for distributed commodity Linux clusters uses algorithms that cannot handle the massive datasets that have become the norm with the latest NGS instruments.
Also, as NGS systems and applications continue to evolve rapidly, storage is a key consideration. The size and complexity of the datasets these systems produce, sometimes running to petabytes, continue to place heavy demands on the capacity and throughput of the storage infrastructure, whether it is part of an HPC cluster or a supercomputer.
Fortunately, there is a solution. It is under these challenging circumstances that an architectural approach like SGI's NUMA (non-uniform memory access) design and a large shared-memory machine such as the SGI UV system really come into their own.
Check back next week for the next article, 'Removing Bottlenecks in Genomic Processing Time.' If you prefer, you can click here to download the insideHPC Guide to HPC in Life Sciences as a PDF, courtesy of SGI and Intel.