Best practices for scientific software
Posted on 29 November 2017
By Damien Irving, climate scientist
This post was originally published on the Dr Climate blog.
Code written by a research scientist typically lies somewhere on a continuum ranging from “scientific code” that was simply hacked together for individual use (e.g. to produce a figure for a journal paper) to “scientific software” that has been formally packaged and released for use by the wider community.
I’ve written at length (e.g. how to write a reproducible paper) about the best practices that apply to the scientific code end of the spectrum, so in this post I wanted to turn my attention to scientific software. In other words, what’s involved in turning scientific code into something that anyone can use?
My attempt at answering this question is based on my experiences as an Associate Editor with the Journal of Open Research Software. I’m focusing on Python since (a) most new scientific software in the weather/ocean/climate sciences is written in that language, and (b) it’s the language I’m most familiar with.
Hosting
First off, you’ll need to create a repository on a site like GitHub or Bitbucket to host your (version controlled) software. As well as providing the means to make your code available to the community, these sites have features that help with things like community discussion and software release management. One of the first things you’ll need to include in your repository is a software licence. Jake VanderPlas has an excellent post on why you need a licence and how to pick one.
Packaging/installation
If you want people to use your software, you need to make it as easy as possible for them to install it. In Python, this means packaging the code in such a way that it can be made available via the Python Package Index (PyPI). If your code and all the libraries it depends on are written purely in Python, then this is all you need to do. People will simply be able to “pip install” your software from the command line.
If your software has non-Python dependencies (e.g. netCDF libraries), then it’s a good idea to make sure that it can also be installed via conda. Using recipes that developers (i.e. you, in this case) submit to conda-forge, this popular package manager installs software and all its dependencies at once. I’ve talked extensively about conda in a previous post.
Documentation
While it might seem like the documentation pages for your favourite Python libraries were painstakingly typed by hand, they were almost certainly created using software that automatically takes all the information from the docstrings in your code and formats them nicely for display on the web. In most cases, people use Sphinx to generate the documentation and Read the Docs to publish it (here’s a nice description of that whole process).
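To make the docstring idea concrete, here is a minimal sketch of the kind of function a documentation tool can work from. The function name celsius_to_kelvin is hypothetical, and the NumPy-style docstring layout is just one common convention (as far as I know, Sphinx needs an extension such as sphinx.ext.napoleon enabled to render it nicely). The last lines use the standard library's inspect module only to show that the docstring is attached to the function object, which is where automated tools read it from.

```python
import inspect


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temp_c : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in kelvin.

    Raises
    ------
    ValueError
        If the temperature is below absolute zero.
    """
    temp_k = temp_c + 273.15
    if temp_k < 0:
        raise ValueError("temperature is below absolute zero")
    return temp_k


# The docstring travels with the function object, so a tool can
# extract it without anyone retyping the documentation by hand.
summary_line = inspect.getdoc(celsius_to_kelvin).splitlines()[0]
print(summary_line)
```

The point is that the description, parameters and return value are written once, next to the code, and the web pages are generated from there.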
Assistance
To support users, software projects typically combine two channels: issues submitted on their GitHub/Bitbucket page (for technical questions that may require a change to the code), and platforms like Google Groups and/or Gitter (a chat client provided by GitLab) for more general questions about how to use the software.
The bonus of GitHub issues, Google Groups and Gitter is that anyone can view the questions and answers, not just the lead developers of the software. This means that other people from the community can chime in with answers (reducing your workload), and it also cuts down on how often you get the same question from different people.
Testing
If you want users (and your future self) to trust that your code actually works, you’ll need to develop a suite of tests using one of the many testing libraries available in Python. You can then use a platform like Travis CI to automatically run those tests each time you change your code, to make sure you haven’t broken anything. Many people add a little code coverage badge to the README file in their code repository using Coveralls, to indicate how much of the code is covered by the tests.
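As a rough illustration of what a small test suite can look like, here is a sketch written in the style that pytest (one of the many testing libraries available) would discover. The anomaly function and the test names are hypothetical, made up for this example. The tests use only plain assert statements, so the file runs without pytest installed.

```python
import math


def anomaly(values, baseline):
    """Return each value minus the mean of the baseline period."""
    if not baseline:
        raise ValueError("baseline must contain at least one value")
    baseline_mean = sum(baseline) / len(baseline)
    return [value - baseline_mean for value in values]


def test_anomaly_subtracts_baseline_mean():
    # Baseline mean is 15.0, so the anomalies are 0.0 and 1.0
    result = anomaly([15.0, 16.0], baseline=[14.0, 16.0])
    assert result == [0.0, 1.0]


def test_anomaly_of_baseline_sums_to_zero():
    # Anomalies relative to a series' own mean should sum to zero
    # (allowing for floating point rounding)
    baseline = [14.2, 15.1, 13.9]
    total = sum(anomaly(baseline, baseline))
    assert math.isclose(total, 0.0, abs_tol=1e-12)


def test_anomaly_rejects_empty_baseline():
    # An empty baseline has no mean, so the function should refuse it
    try:
        anomaly([15.0], baseline=[])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty baseline")
```

If this were saved in a file named something like test_anomaly.py, running pytest in that directory should pick up the three test_ functions automatically. A continuous integration service would run that same command on every change.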
Academic publishing
To make sure you get the academic credit you deserve for the hard work associated with releasing and maintaining scientific software, it’s important to publish an academic article about your software so that people can cite it in the methods sections of their papers. If there isn’t an existing journal dedicated to the type of software you’ve written (e.g. Geoscientific Model Development), then the Journal of Open Research Software or Journal of Open Source Software are good options.
This is obviously a very broad overview of what’s involved in packaging and releasing scientific software. Depending on where you sit on the scientific code/scientific software spectrum, not all of the things listed above will be necessary. For instance, if you’re writing code that only needs to be used by a group of five people working on the same computer system, hosting on GitHub, testing using Travis CI and the use of GitHub issues and Gitter for discussion might be useful, but perhaps not packaging with PyPI or a journal paper with the Journal of Open Research Software.
A great resource for more detailed advice is the Software Sustainability Institute’s online guides. It’s also worth checking out the gold standards in the weather/ocean/climate space. In terms of individual researchers releasing their own software, this would be the eofs and windspharm packages from Andrew Dawson. Packages like MetPy (UCAR/Unidata), Py-ART (ARM Climate Research Facility) and Iris/Cartopy (Met Office) are good examples of what can be achieved with some institutional support.