IU to Develop Data Analysis Tools with NSF Grant


The NSF has awarded $5 million to a team of Indiana University Bloomington computer scientists working to improve how researchers across the sciences harness big data to solve problems.

Led by IU Distinguished Professor of Computer Science and Informatics Geoffrey Fox, the team will address one of the leading challenges in tackling some of the world's most pressing issues in science: the ability to analyze and compute large amounts of data. Scientists, hampered by the ever-increasing volume, variety and velocity of data, stand to benefit from a project that will design, develop and implement building blocks enabling a fundamental improvement in the ability to support data-intensive analysis on a wide range of cyberinfrastructure.

“Many scientific problems depend on the ability to analyze and compute large amounts of data, but this analysis often does not scale well,” Fox said. “Our project will integrate features of traditional high-performance computing, such as scientific libraries and communication and resource management middleware, with a rich set of capabilities already found in the commercial big data ecosystem.”

Working with Fox out of IU's Digital Science Center as co-investigators will be two assistant professors in the IU Bloomington School of Informatics and Computing, Judy Qiu and David Crandall. Collaborating with the IU team will be supporting researchers from the University of Arizona, Emory University, the University of Kansas, Rutgers University, Virginia Tech and the University of Utah.

Specifically, the five-year project will address major data challenges in seven research communities: biomolecular simulations, network and computational social science, epidemiology, computer vision, spatial geographical information systems, remote sensing for polar science, and pathology informatics.

“The project libraries created with this funding will have the same beneficial impact on data analytics that other scientific libraries have had for supercomputer simulations,” Fox said. “And they will be implemented to be scalable and interoperable across a range of computing systems, including clouds, clusters and supercomputers.”

He compared the new scientific libraries to existing parallel solution applications like PETSc — the Portable, Extensible Toolkit for Scientific Computation — which is used for solving partial differential equations and supports the Message Passing Interface, a standardized and portable message-passing system that can run on a variety of parallel computers. Another successful scientific library is ScaLAPACK, a free software library of numerical linear algebra routines for distributed-memory parallel computers.

Called MIDAS — Middleware for Data-Intensive Analytics and Science — the new software will enable scalable applications that combine the capabilities of high-performance computing systems with the rich functionality of open-source storage and large-scale processing frameworks, such as Apache Hadoop, that run on commodity clusters.

“Our innovative architecture integrates key features of open-source cloud computing software with supercomputing technology,” Fox said. “And our outreach involves data analytics as a service that includes training and curricula set up in a MOOC, or massive open online course.”

Fox has already been recognized internationally for his free Big Data Applications and Analytics MOOC, which Computerworld called one of “10 great MOOCs for techies” earlier this year. Self-paced and divided into 12 sections, the MOOC investigates big data processing and analytics through parallel cloud computing. It has most recently been made part of the new Master of Science in Data Science, which Fox was instrumental in developing for the School of Informatics and Computing.

Crandall said computer vision and machine learning have made significant algorithmic advances over the past few years, but that the academic community doesn't necessarily have the expertise to scale those advances to big data problems.

“The high-performance systems community has that expertise but doesn't necessarily understand our algorithms,” he said. “Our goal is to create software abstractions to help connect these communities, with applications in different scientific fields, letting us collaborate and use other communities' tools without having to understand all of their details.”

The team will also engage other scientists and educators through annual workshops and activities at discipline-specific meetings, both to gather requirements for the new software and to collect feedback on it. The grant will also allow for student outreach, including at Elizabeth City State University (N.C.), where Fox and Qiu have already served as mentors while establishing computer clusters that partner in polar science with IU and others. The grant will allow students at Minority-Serving Institutions, like those at Elizabeth City, to participate in new summer research experiences.

“I'm excited to contribute to the technology of building scalable analysis platforms,” Qiu said. “My work aims to promote robustness, efficiency and interoperability in running large-scale algorithms on national supercomputing centers and commercial clouds.”

Of the 17 awards totaling $31 million, IU's was one of only two proposals accepted as early implementation projects, owing to its existing maturity and robustness, according to Irene Qualters, division director for advanced cyberinfrastructure at NSF.

“Developed through extensive community input and vetting, NSF has an ambitious vision and strategy for advancing scientific discovery through data, and this vision requires a collaborative national data infrastructure that is aligned to research priorities, efficient, highly interoperable, and anticipates emerging data policies,” she said. “This project tests a critical component in a future data ecosystem in conjunction with a research community of users.”

This latest announcement comes on the heels of two other NSF grants Fox received in August, again serving as principal investigator. The first, worth $475,000, will develop new theoretical tools and computational services for analyzing large data sets from experiments in hadron spectroscopy; the second, for $900,000, will develop a Rapid Python Deep Learning Infrastructure, described as an artificial intelligence approach to solving big data problems.
