The SGI UV system allows computational biologists to take a different, far more effective approach to dealing with the huge datasets generated by contemporary Next Generation Sequencing (NGS) and analytic applications. Traditionally they have relied on Hadoop-style solutions: break the problem up into small chunks, send the chunks out across a distributed architecture for processing, and then reassemble the results. This approach has limited utility in genomics. Scientists and computational biologists are more likely to achieve breakthrough insights when looking at the entire dataset at once. This is not about finding a needle in a haystack – it’s about looking at the entire haystack in order to define new needles. And that can only happen if you can hold the entire haystack in immediately accessible memory, which is what a coherent shared memory (CSM) architecture allows you to do.
Another reason to use the SGI UV for genomics is that, although it is made up of modular chassis, the system looks to users and IT staff like one giant machine with 32 PCI slots available in a single partition. Physically, the SGI UV is architected as a blade server, and each blade has its own PCI slot. Intel Xeon Phi coprocessor cards can be inserted in these slots.
For example, a 5,000-core Xeon cluster that also contains another 2,500 Intel Xeon Phi cores looks like a single memory footprint. The result is a very high-end, scalable machine; no other system on the market comes close to providing the total aggregate RAM and RAM-per-core access that the SGI UV does.
This is the 5th of 6 articles on the trends and technologies impacting the application of HPC in Life Sciences.
SGI’s UV systems are complemented by a variety of capabilities including the free SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics. HTC helps maximize throughput on HPC systems – a major issue when thousands of jobs are running, each carrying its own load of management overhead.
HTC is a wrapper that presorts the data so that the largest and longest jobs run first and the smallest jobs last. The jobs are then distributed in a load-balanced way across the system, without the overhead associated with a third-party scheduler such as LSF or PBS.
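The largest-first idea described above can be illustrated with a short sketch. This is not SGI's HTC code; it is a generic "longest job first" greedy assignment, written under the assumption that each job's cost can be estimated up front (for example, from input size). The function name `schedule_largest_first`, the job labels, and the cost figures are all hypothetical.

```python
import heapq


def schedule_largest_first(job_sizes, n_workers):
    """Assign jobs to workers, largest first, each to the least-loaded worker.

    job_sizes: dict mapping a job name to its estimated cost (assumed known).
    Returns a dict mapping worker index to the ordered list of job names.
    """
    # Min-heap of (current load, worker index); the least-loaded worker pops first.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}

    # Presort so the largest jobs are placed first and the smallest last.
    ordered = sorted(job_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for job, size in ordered:
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + size, w))
    return assignment


# Hypothetical job costs, in arbitrary units.
jobs = {"a": 90, "b": 70, "c": 40, "d": 30, "e": 20, "f": 10}
plan = schedule_largest_first(jobs, 2)
loads = {w: sum(jobs[j] for j in names) for w, names in plan.items()}
print(plan)
print(loads)
```

Placing the big jobs first leaves the small ones to fill in the gaps at the end, which is why the two workers in this toy example finish with equal total load. A real wrapper would also have to launch and monitor the jobs, which this sketch leaves out.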
Solving the Storage Problem
In addition to scheduling issues, storage is a major component of effectively dealing with the massive amounts of structured and unstructured data typical of genomics and life sciences applications, including the huge datasets flowing from NGS machines operating on a 24/7 basis.
To meet this challenge, SGI has developed ArcFiniti, a disk-based active archive solution. Based on the Intel Xeon E5 processor family, this is an integrated hardware and software platform designed specifically to handle large amounts of unstructured file-based data, which constitutes the bulk of the data being generated by the constantly evolving, advanced sequencing machines.
The solution includes patented SGI technology that significantly reduces power consumption and ensures data integrity. ArcFiniti is available in five different configurations, ranging from 156TB to 1.4PB of usable storage in a single rack before compression. Not only does this density result in significant infrastructure savings, but it also allows users to quickly and easily access the archived data.
Plug and Play
One of the SGI UV’s features much appreciated by IT departments is its simple deployment. The system arrives at its destination fully configured and ready to plug in. Unless the customer has specified special customization, the system can be running in minutes.
And, because the SGI UV is based on industry standards – Intel processors, Linux OS, and standard I/O and management interfaces – the system is easily integrated into a heterogeneous datacenter environment.
Check back next week for the article on ‘HPC Architectures for Life Sciences.’ If you prefer, you can download the insideHPC Guide to HPC in Life Sciences as a PDF, courtesy of SGI and Intel.