SpaRC: Scalable Sequence Clustering using Apache Spark


In this deck from the Stanford HPC Conference, Zhong Wang from the DOE Joint Genome Institute at Lawrence Berkeley National Laboratory presents: SpaRC: Scalable Sequence Clustering using Apache Spark.

“Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes. Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC) that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near linear scalability with input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large scale sequence data analysis problems.”
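SpaRC itself is implemented on Apache Spark, but the core idea described above — grouping reads that originate from the same molecule — can be illustrated in miniature. The sketch below is an assumption-laden toy, not SpaRC's actual algorithm or code: it clusters reads that share k-mers (exact substrings of length k) using a union-find structure, with a small k and toy reads chosen purely for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield all overlapping k-mers of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def cluster_reads(reads, k=4):
    """Group reads into clusters of shared k-mers (illustrative toy,
    not SpaRC's implementation)."""
    # Union-find over read indices, with path halving.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Index each k-mer by the first read seen with it; any later read
    # sharing that k-mer is merged into the same cluster.
    index = {}
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            if km in index:
                union(rid, index[km])
            else:
                index[km] = rid

    clusters = defaultdict(list)
    for rid in range(len(reads)):
        clusters[find(rid)].append(rid)
    return list(clusters.values())

# Toy reads: the first two overlap, as do the last two.
reads = ["ACGTACGT", "TACGTTTT", "GGGGCCCC", "CCCCAAAA"]
print(cluster_reads(reads, k=4))  # -> [[0, 1], [2, 3]]
```

A distributed version of this pattern would shuffle (k-mer, read-id) pairs across nodes and compute connected components on the resulting read graph, which is where an engine like Spark earns its keep at the 100–1000 GB scale the abstract describes.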

The software is available under the BSD license.

Zhong Wang, PhD is a Group Leader at the DOE Joint Genome Institute at Lawrence Berkeley National Laboratory. Research in the Wang group focuses on developing new bioinformatics solutions to support JGI in adopting new genomics technologies, scaling up genomics analysis to peta-bases, and enabling data-driven science. The group comprises computational biologists, bioinformaticians, computer scientists, and biostatisticians. Through collaborations with internal and external scientists, the Wang group has developed customized data analyses and software to enable “grand-scale” scientific projects. These solutions include large-scale genome variant analyses, next-generation transcriptome de novo assembly, metagenome and metatranscriptome assembly, and a Hadoop-based sequence analysis framework. For example, the group was responsible for the tera-base scale data analysis in the cow rumen and sheep rumen metagenome projects.

Standing at the intersection of rapidly developing genome and computational technologies, the Wang group will continue in the coming years to explore new technologies (NanoPore, GPU, cloud, etc.) and their applications to solving DOE’s most challenging problems.

See more talks in the Stanford HPC Conference Video Gallery

Check out our insideHPC Events Calendar