Should all academic software be released as open source by default? In this special guest feature, Neil Chue Hong (Software Sustainability Institute), Simon Hettrick (Software Sustainability Institute), Andrew Jones (@hpcnotes & NAG), and Daniel S. Katz (University of Chicago & Argonne National Laboratory) discuss the role of open source software in publicly funded research.
In their recent paper, Krylov et al. [1] state that the goal of the research community is to advance “what is good for scientific discovery.” We wholeheartedly agree. We also welcome the debate on the role of open source in research, begun by Gezelter [2], in which Krylov was participating. However, we have several concerns with Krylov’s arguments and reasoning on the best way to advance scientific discovery with respect to research software. Gezelter raises the question of whether it should be standard practice for software developed by publicly funded researchers to be released under an open-source license. Krylov responds that research software should be developed by professional software developers and sold to researchers.
We advocate that software developed with public funds should be released as open-source by default (supporting Gezelter’s position). However, we also support Krylov’s call for the involvement of professional software developers where appropriate, and support Krylov’s argument that researchers should be encouraged to use existing software where possible. We acknowledge many of Krylov’s arguments for the benefits of professionally written and supported software.
Our first major concern with Krylov’s paper is its focus on arguing against an open-source mandate on software developed by publicly funded researchers. To the knowledge of the authors, no such mandate exists. It appears that Krylov is pre-emptively arguing against the establishment of such a mandate, or even against it becoming “standard practice” in academia. There is a significant difference between a recommendation of releasing as open-source by default (which we firmly support) and a mandate that all research software must be open source (which we don’t support, because it hinders the flexibility that scientific discovery needs).
Our second major concern is Krylov’s assumption that the research community could rely entirely on software purchased from professional software developers. We agree with this approach whenever it is feasible. However, by concentrating on large-scale quantum chemistry software, Krylov overlooks the diversity of software used in research. A significant amount of research software is at a smaller scale: from few-line scripts to short programs. Although it is of fundamental importance to research, this small-scale software is typically used by only a handful of researchers. There are many benefits in employing professionals to develop research software but, since so much research software is not commercially viable, the vast majority of it will continue to be developed by researchers for their own use. We do advocate researchers engaging with professional software developers as far as appropriate when developing their own software.
Our desire is to maximise the benefit of software by making it open by default, allowing researchers other than the developers to read, understand, modify, and use it in their own research. This does not preclude commercial licensing where it is both feasible and the best way of maximising the software’s benefit. We believe this is also the central message of Gezelter.
In addition to these two fundamental issues with Krylov, we would like to respond to some of the individual points raised.
“The term ‘open source’ is ubiquitous but its meaning is ambiguous. Some codes are ‘free’ but are not open,(13)”
[1]: 2752, right-hand column, paragraph 3
This is simply incorrect. The Open Source Initiative provides a widely accepted definition of open-source software: http://opensource.org/definition. Although there is always some debate over the finer points of open source, all definitions agree that open-source software must be open – whether it is free or not is irrelevant.
“Software from academia is often developed with an emphasis on ideas rather than implementation, fed by the need for timely peer-reviewed journal publications that provide ongoing grant support and future jobs for graduate students. To bring new ideas to the production level, with software that is accessible to (and useful for) the broader scientific community, contributions from expert programmers are required. These technical tasks usually cannot—and generally should not—be conducted by graduate students or postdocs, who should instead be focused on science and innovation.”
[1]: 2751, right-hand column, paragraph 2
We agree that there is an over-reliance on graduate students and postdocs to develop software, especially untrained ones, but there are expert programmers in academia. The real problem is that the current academic career structure makes it difficult to hire an expert programmer, so workarounds have been developed where these people are hidden, often in postdoc roles. Expert programmers could be based in academia or industry; we just need the research funding rules to allow, even encourage, either the support of properly funded research software engineer (RSE) positions or research grants to fund non-academic sources of such expertise. We should also encourage graduate students and others in academia to acquire software skills so that they can contribute to software projects and collaborate more effectively with professional software engineers.
“Software is not data, and simply because it is feasible to put software on the Internet does not imply that it should be posted.”
[1]: 2751, right-hand column, paragraph 4
As argued by Gezelter, reproducibility is best served by transparency and availability. If software is required to validate results, it must be made available. There is also a wider issue to this debate. Our research is funded in the main by taxes, which could equally well be spent on health, education, or any number of other vital services. It is our moral duty to ensure that research funding serves the public good, either through advancing scientific discovery or by creating revenue for the wider economy. In a number of countries, there is a growing expectation that this is best achieved by allowing the public to view, use, and benefit from all the results of funding. This is as true of software as it is of data.
“Occasionally the open-source model is touted on the grounds that one can use the source code to learn about the underlying algorithms, but this hardly seems relevant if the methods and algorithms are published in the scientific literature. Source code itself rarely constitutes enjoyable reading, and using source code to learn about an algorithm is a last resort forced by poorly written scientific papers. Better peer review is a more desirable solution.”
[1]: 2752, right-hand column, paragraph 4
Publishing only the algorithm would appear to provide transparency without the need to open the source code – but it is a flawed argument. The algorithm itself is only part of the description of the method used to perform the research. The practical implementation of that algorithm (i.e. the software) is also required to have a complete description of the method – and thus enable proper review and reproducibility by others. Access to the source code also enables an understanding of how to implement the algorithm optimally, or how to implement it in a system containing other algorithms.
“Nevertheless, the software itself is a product, not a scientific finding, more akin to, say, an NMR spectrometer—a sophisticated instrument—than to the spectra produced by that instrument.”
[1]: 2752, left-hand column, paragraph 1
We generally do not credit people for scientific findings; we publish papers, which are products, that show why these scientific findings are believable, and then credit people for the publication. Under this rationale we should be looking to publish not only methods or algorithms but also the code that implements them. A software paper should be viewed as an advertisement for the software, in the same way that a research paper is an advertisement for the research.
“Unlike the development of, for example, a smart-phone app, where the code base is small (3) and a relatively large community can easily write extensions and add-ons, production of scientific software involves the curation of millions of lines of source code. The complexity of this code demands long-term user and developer support to maintain its integrity and performance while keeping up with new computer architectures, fixing bugs, and adding features.”
In response to this point, we quote Mike Olson of Cloudera: “[T]here’s been a stunning and irreversible trend in enterprise infrastructure. If you’re operating a data center, you’re almost certainly using an open source operating system, database, middleware and other plumbing. No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form.” In fact, the approach to most “platform” style software is to open source it to make it easier to maintain. There are many examples of complex, open-source codes (Linux, Apache Spark, R, PETSc, LAPACK and Moodle, to name but a few) that not only maintain their “integrity and performance” but survive and deliver effectively.
“Gezelter acknowledges the cost of maintaining scientific software and suggests alternative models to defray these costs including selling support, consulting, or an interface, all the while making the source code available for free.(2) These suggestions strike us as naïve, something akin to giving away automobiles but charging for the mechanic who services them.”
[1]: 2752, left-hand column, paragraph 4
This model works well for other industries, so there is no reason to judge it as naïve within the field of research software. For example, mobile phones are often free but rely on a subscription plan to recoup costs, and printers are often sold at below cost with the expectation that the user will later overpay for the ink cartridges. Likewise, open-source software producers can both make their code open for inspection and charge a subscription licensing fee for an Enterprise Edition, the model used by GitLab; or sell premium features based on an open-source platform that is “given away”, like Continuum Analytics’ Anaconda scientific and data analysis suite.
The problem is that research funding is perceived, possibly accurately, to favour the (potentially inefficient) creation of new software in preference to paying for the use of established software (whether via support, commercial licences or other means). In this matter we agree with Krylov and argue that it is often more cost- and time-effective to spend research funding on established software where it is available and fit for purpose, rather than write new software.
This brings us to a final point on which we fully agree with Krylov: software is not free. The cost is either manifest, as with proprietary software, or hidden in the time that must be invested in either developing one’s own software or configuring open-source software. As Krylov states:
“The creation of scientific software is a labor-intensive process, and its support and curation even more so. How do we pay for these labor costs? The answer is clear in the case of commercial software, where license fees are used to defray the costs of development and support. In this model, users buy the software that fits their research needs and affords them the highest productivity… software that offers a competitive advantage is a sensible investment of research funds.”
[1]: 2752, left-hand column, paragraph 3
Even what appears to be expensive proprietary software is often significantly less expensive in reality than the cost of employing someone to develop software with similar functionality. If a researcher’s needs are fulfilled by non-free software (whether open or closed source), then there is normally a solid case for purchasing that software rather than developing an alternative.
Conclusion
The debate between Gezelter and Krylov covers two related but distinct issues: (i) should software developed by publicly funded researchers be released as open source; and (ii) should researchers develop their own software or use existing software?
None of the authors of this response believe that they possess “a rigid, mindless focus on an open-source mantra” [1], but nor are we hard-line proponents of proprietary software. Such black-and-white thinking is rarely appropriate in the nebulous world of research. Instead, we believe that publicly funded research software should be released as open source by default – with researchers allowed to implement a closed-source approach if they can demonstrate a clear benefit to the research community.
We believe that researchers should use existing software (open source or proprietary) where possible, because it is an inefficient use of research time and funds to reinvent software. However, we are realists. We understand that finding appropriate existing code and using it, especially if it is necessary to adapt it for new purposes, can be difficult. Thus, while we believe that by default researchers should use existing software, we also recognize there are many situations where developing new software can be justified as the more efficient approach.
Ultimately, we must accept that research is best served through using a combination of open-source and proprietary software, through developing new software and through the use of existing software. This approach allows the research community to focus on what is optimal for scientific discovery: the one point on which everyone in this debate agrees.
Acknowledgements
Some work by Katz was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.
References
- [1] What Is the Price of Open-Source Software? Anna I. Krylov, John M. Herbert, Filipp Furche, Martin Head-Gordon, Peter J. Knowles, Roland Lindh, Frederick R. Manby, Peter Pulay, Chris-Kriton Skylaris, and Hans-Joachim Werner. The Journal of Physical Chemistry Letters 2015 6 (14), 2751–2754. DOI: 10.1021/acs.jpclett.5b01258 – http://pubs.acs.org/doi/full/10.1021/acs.jpclett.5b01258
- [2] Open Source and Open Data Should Be Standard Practices. J. Daniel Gezelter. The Journal of Physical Chemistry Letters 2015 6 (7), 1168–1169. DOI: 10.1021/acs.jpclett.5b00285
I agree with most of what you write, and in particular with the need of distinguishing openness (anyone can look at the source code) from the economic model for software development. I consider openness to be essential for science. Yes, the methods implemented should also be documented in human-readable form, but when you run into inexplicable behavior with some piece of software, you need to be able to look at the source code. Bugs are, unfortunately, the rule rather than the exception.
I would like to add one argument in favor of Open Source software in research: the possibility for someone else to build on it, adding new computational methods. If all scientific software were proprietary, such developments would be so costly as to be out of reach of small research groups.
The point about open source software is collaboration; in other words, to avoid reinventing the wheel. So it addresses the complaint that researchers developing software are wasting resources. Even better, collaborators tend to improve it, to the benefit of a community including the original author.
Software is meant to help research efforts. Presumably outside the scope of its consideration is the software development process itself. I’m of the opinion that research must remain focused and true to its experimental goals without spilling into endless debates about the merits of software’s source.