TACC Podcast Looks at the Challenges of Computational Reproducibility


In this TACC Podcast, Dan Stanzione and Doug James from the Texas Advanced Computing Center discuss the thorny issue of reproducibility in HPC.

Trust, but verify. The well-known proverb speaks to the heart of the scientific method, which builds on the results of others but requires that data be collected in a way that can be repeated with the same results. Scientific reproducibility extends beyond just recreating the conditions of a physical experiment. The computational analysis of data also factors into the reproducibility equation.

“Computational reproducibility is a subset of the broader and even harder topic of scientific reproducibility,” said Dan Stanzione, TACC’s executive director. “If we can’t get the exact same answer bit-for-bit, then what’s close enough? What’s a scientifically valid way to represent that? That’s what we’re after.”

Computational reproducibility can be difficult to achieve. Even working with the same data, one analysis might yield small differences from another. Stanzione explained that a computer’s hardware and software can change considerably over time, in part because of routine updates such as security patches. Changes to scientific software libraries, operating system components, and the underlying hardware can all slightly alter results.

“Reproducibility means many things to many people, because it is in fact many things, and it has many aspects,” said Doug James, former deputy director for High Performance Computing at TACC. James cited a definition for reproducibility by Lorena Barba, associate professor of mechanical and aerospace engineering at George Washington University. “She describes reproducibility as conducting your research as if someone might want to do it again. That means traceability, automation, and transparency. It means the ability to survive inspection by one’s peers, to give them the confidence that if they needed or wanted to do this again, they could,” James said.

Researchers can control the software configuration of their own workstations, but not that of shared systems at TACC and other supercomputing centers. James explained, however, that TACC has developed tools that let researchers control their software environments. One example is the Lmod module system, written and maintained by TACC’s manager of HPC Software Tools, Robert McLay.

“The module system has commands that offer insight into what you can load and have loaded. It allows you to save, preserve, and quickly recover your favorite collections of software so that you can come to the table tomorrow with the same software that you had today. And in particular, it’s designed to make managing and controlling the software environment easy and repeatable for the individual user. That’s the kind of thing that TACC does in the supercomputing environment to promote and enhance reproducibility,” James said.
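For readers unfamiliar with Lmod, a typical session looks something like the sketch below. The module names and versions are illustrative only, not a prescription for any particular TACC system:

    $ module load gcc/9.1.0 mvapich2/2.3   # load a compiler and MPI stack
    $ module list                          # show what is currently loaded
    $ module save myproject                # save this collection under a name
    $ module restore myproject             # later: recover the same collection

Saving and restoring named collections is what makes the environment repeatable from one login session to the next.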

Other tools developed at TACC to enhance reproducibility include XALT, which tracks the software packages, libraries, and versions used to execute essentially any job, workflow, or command on the system; and TACC Stats, which leverages XALT metadata to gauge how efficiently software uses system resources.

TACC also enhances reproducibility through the expanded use of containers, added Stanzione. “Containers are a very lightweight technology for virtualization. We can store not only the code that you used, but also the environment around it, the operating system, and the libraries. That way we can go back and get the same software environment and store the whole thing,” he said. Docker, an open source container platform, is one of the main tools used for this purpose.
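As a concrete illustration of the idea, a minimal Dockerfile like the sketch below pins the operating system, the library stack, and the analysis code into a single image; the base image, packages, and script name are hypothetical examples, not TACC’s actual setup:

    # Pin the operating system version
    FROM ubuntu:18.04
    # Pin the library stack (illustrative package names)
    RUN apt-get update && apt-get install -y python3 python3-numpy
    # Include the analysis code itself (hypothetical script)
    COPY analysis.py /opt/analysis.py
    # The command a container run will execute
    CMD ["python3", "/opt/analysis.py"]

Because the image records each of these layers, re-running it later reproduces the environment the original analysis used.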

Another opportunity for TACC to enhance reproducibility comes from the increasing use of science gateways and web services, which provide a portal to TACC systems without requiring users to build and compile their own code. Software versions, workflows, and other metadata can be stored for later use.

“If we can preserve the data and then build a container to preserve the software environment, we have a lot of pieces that help make science more reproducible,” Stanzione said. “Our strategy is to keep pushing those technologies forward and expose our users to the best practices for enhancing computational reproducibility.”

Download the MP3


Comments

  1. There are two main sources of irreproducibility in HPC: floating-point arithmetic (floats) lacks the associative property, and math libraries are not required to be correctly rounded. The lack of associative addition particularly corrupts linear algebra libraries parallelized across multiple processors, because distributing the work changes the order in which values are added.

    Both are failures in the design and specification of floats, and both are eliminated with posit arithmetic. Posit arithmetic supports perfectly associative addition, and all math library functions are mandated to be correctly rounded in the Draft Posit Standard, so posit-based computation is just as reproducible and portable as integer or fixed-point computation, something we sorely need in HPC.
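As a quick illustration of the non-associativity described in the comment above, here is a minimal Python sketch (not TACC code) showing that summing the same values in a different order can change the result:

    import random

    # Floating-point addition is not associative: grouping changes the result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)   # 0.6000000000000001
    print(a + (b + c))   # 0.6

    # A parallel reduction effectively reorders additions, so the same data
    # can yield slightly different sums; shuffling simulates that reordering.
    values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    total_before = sum(values)
    random.shuffle(values)
    total_after = sum(values)
    print(total_before == total_after)      # frequently False
    print(abs(total_before - total_after))  # small but nonzero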