Encouraging the open publication of software code: learning from the Open Data Institute Certificate model.
Posted on 16 March 2017
By Alice Harpole (University of Southampton), Danny Wong (Royal College of Anaesthetists) and Eilis Hannon (University of Exeter), Software Sustainability Institute Fellows
There has been a collective push in recent years to make all empirical research data openly available, and this is often a requirement where the work has been funded by taxpayers. One reason is to improve the overall quality of research and to remove barriers to replicating, reproducing or building on existing findings, with the by-product of promoting a more collaborative style of working. Beyond making the data available, it is important to make it user-friendly by providing clear documentation of exactly what it is and how it was generated, processed and analysed.

In many situations, however, the key contribution of a piece of research is not simply the underlying data but the software used to produce the findings or conclusions: for example, where a new methodology is proposed, or where the research is based not on experimental data but on simulations. Openly sharing software is as critical here as sharing raw data is for experimental studies. What’s more, there are likely many projects where the data and the software are equally important, and while there is an expectation to provide the data, this currently does not extend to novel analytical pipelines. In this blog post, we propose a system similar to that established by the Open Data Institute to encourage and support the release of data, but in this case for custom software.
Motivation
At present, there is little incentive for researchers to release their code, and few guidelines for how code should be published or what criteria it should satisfy. However, there is an increasing appetite for code to be published in some form, and for researchers to be open about sharing their code, in order to encourage reproducible science. Software can be released at various levels, from publishing the code itself in a peer-reviewed fashion, down to making code snippets available on a researcher’s blog or website, with a whole range of possibilities in between.
Researchers have differing tasks and motivations, and software contributions often take a back seat to getting papers published because they do not directly contribute to the metrics that universities and research groups rely on to measure outreach or success and to obtain funding. While we recognise that published code supplements the methodology of any peer-reviewed publication, in the current research environment there needs to be either a culture change (as there has been with open data) or a reward system that acknowledges research code.
Recognising that different disciplines produce code with different goals and applications, we propose a set of standards with a tiered approach to how code should be shared and published. This may be analogous to the Open Data Institute’s Gold, Silver and Bronze level Open Data Certificates.
Rationale
Depending on the subject matter, there may be different expectations of a researcher when it comes to publishing their code, and different journals may require different levels of code disclosure. We (the authors) have had varying experiences of sharing research data and software, and of using software published by other researchers, which have informed the proposal here.
We propose that the Gold standard of sharing code, which fulfils the highest level of good practice, would incorporate documentation, testing, clear version control, proven functionality and a separate peer-reviewed publication of the software itself in a journal such as the Journal of Open Source Software or the Journal of Open Research Software.
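As a concrete illustration of what "documentation and testing" could mean in practice, the hypothetical Python sketch below pairs a documented function with automated pytest-style tests. This is our own invented example, not part of the ODI certificates or any journal's criteria; the function name and data are made up.

```python
# Hypothetical sketch of Gold-level practice: a documented function
# accompanied by automated tests (pytest style). Run with `pytest`.
import pytest


def mean_expression(values):
    """Return the arithmetic mean of a sequence of expression values.

    Raises ValueError on empty input so that failures are explicit
    rather than silent.
    """
    if not values:
        raise ValueError("cannot average an empty sequence")
    return sum(values) / len(values)


def test_mean_of_known_values():
    assert mean_expression([1.0, 2.0, 3.0]) == 2.0


def test_empty_input_raises():
    with pytest.raises(ValueError):
        mean_expression([])
```

A Gold-level release would keep code like this under public version control, with the test suite run on every change.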
Following this, the Silver standard would be achieved by sharing code in a publicly available version-controlled repository, such as one hosted on GitHub, GitLab or Bitbucket. Both the code and its documentation (perhaps hosted on Read the Docs, GitHub Pages, GitLab Pages or a project wiki) could then be viewed and downloaded for reuse by other researchers, for testing by peer reviewers, or for post-publication review by the public.
Lastly, the Bronze standard would be met where the code is supplied as text or script files uploaded as supplementary material alongside the journal publication, or hosted on the researcher’s personal website, so that it is readily accessible to the reader.
Such an approach would have many benefits beyond direct citations of the software: research groups would be able to evidence the quality of their software and analyses to funding bodies, and researchers applying for jobs could point to the quality of the software they develop.
Case Studies
For this to be a success, it is imperative that researchers understand why open software is important. Without this, no researcher will make the time for it, or prioritise writing rigorously tested code over publishing papers. We do not anticipate that the Gold standard will be the goal for all research software; rather, the idea behind a tiered scheme is to encourage all researchers to aim to go one better each time, gradually improving the quality of the software produced by researchers worldwide.
Here the authors present examples from their own work where they have previously shared code, to illustrate how the Gold, Silver and Bronze standards might work in practice:
Gold
Alice published a paper in the Journal of Open Source Software last year describing a piece of software that models burning in extreme fluids. To be accepted by the journal, the code had to satisfy a set of criteria relating to its sustainability. We believe these criteria would be a good basis on which to model the Gold standard. The paper’s senior author documented the process of submitting software to the journal in a blog post.
Silver
Eilis’ experience of publishing software involved collating a series of analysis scripts from a study integrating genetic and epigenetic data into a GitHub repository, which was given a DOI via Zenodo so that it could be referenced in the journal article reporting the results. The purpose here was not to release a new software package, but to document precisely the analysis routine used to generate the published results.
Bronze
Danny published a paper evaluating the baseline trauma health services at a Major Trauma Centre before the construction of a helicopter landing pad. The manuscript for this paper was written in R Markdown. He described the process of writing the manuscript on his blog and hosted the manuscript’s source code on his website for readers to access.
Take home messages
Currently, there is a lack of incentive and consistency when it comes to publishing custom computer code, yet making code public would improve research practice by encouraging reproducibility. We envisage that providing a set of benchmarks will encourage researchers to adopt better code-publishing practices. Our case studies describe a spectrum of experiences with publishing code that demonstrate the concepts we are describing. We propose that the Institute incorporate the Gold, Silver and Bronze standards suggested here into its guidance for publishing scientific code, and we hope it supports our proposal, as these benchmarks would encourage academic journals to hold code referenced in their articles to a higher standard.