How to cite and describe software

By Mike Jackson.

Researchers face significant challenges when trying to understand, reproduce or reuse research in which software has played an integral part. In this green paper, I give examples of the problems that can arise when reproducing someone else's research, and propose some practical approaches to resolving, or at least reducing, them. I also look at the important distinction between describing the software that was used, and citing it.

1 Can I get a copy of the software that was used?

For many years I worked on a research project called OGSA-DAI (a novel framework for distributed data management), and I recently came across a paper in which OGSA-DAI forms a key component. Like any researcher who would want to reproduce the research, I wanted to know which version of the software had been used in the paper. This required some difficult detective work.

The authors had cited an OGSA-DAI paper that should have meant they were using a version of the software between OGSA-DAI 1 and 6. Later in their paper, the authors mentioned a component that was specific to OGSA-DAI versions 2.5 to 6. However, the authors then talked of another component and a toolkit, which was only available with a completely different version of the software. Without my highly detailed knowledge of the OGSA-DAI project, it would have been impossible to determine what software was used.

Even if a researcher had determined which version of OGSA-DAI to use, they would have found that the version they needed is no longer readily available, and that the available releases are fundamentally different from the one they needed.

The authors provided a list of the software they used (Globus Toolkit, OGSA-DAI, Java SDK, Apache Tomcat, Apache ANT, MySQL and Oracle) and, with the notable exception of OGSA-DAI, gave version numbers for all of these. However, they did not describe how to obtain any of the software, and this is another barrier that a researcher must overcome to reproduce the research.

And the final barrier? There is no information on how to access the code developed by the authors. Without this, reproducing the research is impossible.

2 What software should we describe?

When writing a research paper in which software has played a role, how do you decide which software to describe? Do you mention everything you used - even down to Microsoft Word and Xemacs? The answer depends on the focus of the research. If you were researching popular text editors, then you probably would mention Word and Xemacs. If you were comparing the performance of two bioinformatics applications, you would not.

A good guideline is "did the software play a critical part in my research?" or "did the software provide something novel?". For example, in the paper mentioned in section 1 above, the critical components for replicating the authors' work are the specific version of OGSA-DAI they used, and, possibly, the databases they used. The other packages listed (Globus Toolkit, Java, Apache ANT and Apache Tomcat) are all needed to use OGSA-DAI, but don't directly contribute to the novelty of their work.

Examples of areas where the software used directly affects the results, and so needs to be described, include numerical modelling or simulations, usability evaluations, performance evaluations of algorithms (where the evaluations are done using implementations of the algorithms), and research using software that does some form of automated analysis (e.g. image analysis or optical character recognition).

For some software, the decision may have already been made for you. Certain software publishers have licensing terms and conditions that require the use of the software to be acknowledged or cited in the references or bibliography of any publication of results produced using that software. For example, SAS mandate a citation for users of their commercial statistical analysis software. The HSL Mathematical Software Library, a collection of FORTRAN codes for large scale scientific computation that have been under development for almost 50 years, enforces a similar requirement.

3 Why can't we just cite an associated paper?

It is common, when describing software, to cite a journal or conference paper associated with the software in the references or bibliography, rather than to reference the software itself. The OGSA-DAI paper above is an example. There are major problems with this approach.

Not all research software has an associated paper that can be cited by a paper's author. This can arise for a number of reasons. Software can be released at any time, whereas papers are subject to timescales arising from peer review. In addition, what is perceived as "novelty" may affect the chances of publication, whereas there is no such inhibitor for releasing software.

Sometimes there will be no associated paper because not all software used in research, or produced by research projects, is developed by researchers. It may be produced by developers on behalf of researchers and these developers may have no interest in publishing papers. As a colleague of mine, Ally Hume, once remarked, our project not producing papers was "a success, as it meant we were focused on developing the software" (see also Ilan Todorov's comments in "Is the work of scientific software engineers recognised in academia?").

As we saw in the OGSA-DAI example, even if a paper does exist, there may be inconsistency between the version of the software, as described in the paper, the version that was actually used, and the version currently available. The paper may be correct at the time of writing, but by the time of publication, or reading, that version may no longer be available because it has been deprecated, or the project that produced it has since completed and its outputs are no longer available (I mention this for completeness – ensuring long-term availability of research software is a major area and warrants an article in itself).

Related to this is the distinction between algorithms and implementations. Some paper authors may only be interested in the algorithm implemented by the software, others in the implementation itself, but this may not always be clear from the paper.

4 How would we cite software?

If citing a paper associated with the software is not enough, then why not cite the software itself in the references or bibliography? Software citation is an evolving area. A web search for "how to cite software" and its variants shows that this is a popular question. The answers can broadly be classed into citation formats recommended by journals, citation formats recommended/mandated by software providers and, most contentious, the view that software is not a citable output.

4.1 Recommendations from publishers

The American Psychological Association (APA), whose style guidelines are often used for social sciences publications, dominate the results of web searches on citing software. As one needs to purchase the official style guide, many resources are available that provide free examples. For example, Purdue University's Online Writing Lab provide the following example for software that is available for purchase

Ludwig, T. (2002). PsychInquiry [computer software]. New York: Worth.

For software available online, their example is

Hayes, B., Tesar, B., & Zuraw, K. (2003). OTSoft: Optimality Theory Software (Version 2.1) [Software]. Available from http://www.linguistics.ucla.edu/people/hayes/otsoft.

Purdue also provide examples of the Modern Language Association of America (MLA) style guidelines, often adopted for liberal arts and humanities. An interesting aspect of the MLA guidelines is that URLs are discouraged due to their tendency to break. Instead, they recommend the use of URLs only if the reader could not use the title, author and date with a search engine to find a publication (or, by extension, software).

The Journal of Statistical Computation and Simulation publishes computer-dependent research into statistics. Their style guide has two examples of citations for software

T.G. Golda, P.D. Hough, and G. Gay, APPSPACK (Asynchronous parallel pattern search package); software available at http://software.sandia.gov/appspack.

MultiSimplex 2.0. Grabitech Solutions AB, Sundvall, Sweden, 2000; software available at http://www.multisimplex.com.

The Journal of Statistical Software publishes research into statistical software and algorithms. Their style guide also has a specific entry for citing software. They recommend the use of whatever format is suggested, or required, by the software provider, which I'll discuss shortly. In the absence of this, the journal recommends a BibTeX entry of form

@Manual{SAS-STAT,
author = {{\proglang{SAS} Institute Inc.}},
title = {\proglang{SAS/STAT} Software, Version~9.1},
year = {2003},
address = {Cary, NC},
url = {http://www.sas.com/}
}

4.2 Recommendations from software providers

As alluded to by the Journal of Statistical Software guidelines, software providers, especially those writing software within a research environment, may provide a recommended, or required, citation format for any researcher who uses the software within their research. This ensures that the provider's contribution to the research is acknowledged. For example, the authors of the R open source statistical programming language and environment provide a BibTeX entry in their FAQ that can be used if citing R

@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Development Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = 2011,
note = {{ISBN} 3-900051-07-0},
url = {http://www.R-project.org}
} 

Some providers go further and make citation a part of the licence, a legally-binding condition of use of the software. For example, SAS's mandated citation is of form

The [output/code/data analysis] for this paper was generated using [SAS/STAT] software, Version [9.1] of the SAS System for [Unix]. Copyright © [year of copyright] SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

HSL require the following citation to be included in any publication describing results achieved using an HSL library

HSL(2011). A collection of Fortran codes for large scale scientific computation. http://www.hsl.rl.ac.uk.

The developers of the Phylogenomenclature (P-Genome) bioinformatics software provide a citation on their download page for both their software and an associated publication, and recommend that both be cited

Sarkar IN, Planet PJ, Thornton J, DeSalle R, Figurski DH. Phylogenomenclature. Available at: http://www.dbmi.columbia.edu/~ins7001/research/CAOS/P-Gnome/PGdownload.html.

Sarkar IN, Thornton J, Planet PJ, Figurski DH, Schierwater B, DeSalle R. An Automated Phylogenetic Key for Classifying Homeoboxes. Molecular Phylogenetics and Evolution 2002 Sep; 24(3):388

4.3 Software is not a citable output

There is a view that software is not a citable research output, which poses problems for authors, especially when faced with software which requires citation as a condition of use. For example, in latex-tricks a poster states that "Strictly speaking you are not citing anything published, but referencing a software". The poster recommends referring to software in footnotes not in the references. A poster to the superuser question and answer site likewise recommends footnotes, explaining that "editors simply don't expect you to cite a piece of code".

On the r-help e-mail list a poster describes how their use of the R citation format was knocked back by a publisher who objects to the occurrence of both an ISBN and a URL. The publisher requests either a name and location if it is a book, or, if it is a web site, removing it from the references and including it as an in-text citation. The publishers explain that they consider web-based data not available in a recognised data store to be unpublished. In response, another poster makes the valid point that R is not a book, document or data file but software and expresses the hope that "a publisher should always allow a possibility to properly cite software used in a scientific article." Another contributor to this discussion comments that "Statistical software, like ground-penetrating radar or a magnetometer..., is a tool...it really falls within the scope of methods. It is reasonable in that case to cite the URL in the text."

In a GIS forum discussion on citing the ArcGIS software, one poster goes further, arguing that no analysis tools should be cited in the references since they are not "authoritative sources", that it is the person who used the tool who is an "authoritative source", though tools should, at least, be described. I disagree with this view. If I read a paper that describes a novel algorithm and base my research upon this, I'd be required to cite the paper. Now, supposing the authors of the algorithm had not published it as a paper, but published it as an implementation? It has the same novelty and is making the same contribution to my research, so why should it not be cited as such?

Another complication is that authors, reviewers or publishers may simply be unaware of or misunderstand the requirements that software licences and other conditions of use can place upon their users when it comes to citation. For example, a poster to the R users forum describes his anger at the failure of another researcher to cite his R packages within a paper, despite being required by the licence to do so. The researcher informs the developer that "The referee of this paper advised me that it is not needed to cite 'R' because it is a kind of free software different from the commercial package such as SAS." This is of concern since much research software is released as open source, often in response to requirements from funders to promote access to research outputs.

5 What are the limitations of existing citation formats?

Even if a journal accepts software citations and a software publisher recommends them, these still may lack information required to understand the research done. For example, R has an ISBN but, as a user has observed, this can refer to any of their versions. As another example, HSL consists of a large number of discrete sub-routines so using the recommended citation in itself does not provide enough information to a reader. As a result, authors who have used HSL usually supplement the citation with in-text descriptions of the packages used. For example the following paper used HSL

Gelabert, M.M, Llorenç, S., Sánchez, D, and López, R. (2010) Multichannel effects in Rashba quantum wires. Phys. Rev. B 81, 165317, April 2010. doi:10.1103/PhysRevB.81.165317.

The authors state in their appendix that their "resulting sparse linear problem is then solved using routine ME48" of HSL (the HSL catalog describes this as a "sparse unsymmetric system: driver for conventional direct method").

As another example of problems with citations, the Phylogenomenclature (P-Genome) citation has a broken link and omits the version number.

From the previous section we can see there are a number of elements common to these citations including date, author, product name, version and URL. However, even this level of detail may not be adequate. For example, the software used might not be a release with a version number but a check-out from a source code repository, in which case, the version could be described in terms of a combination of a repository URL, check-out date, branch or tag name, or a revision number. The specifics would depend in part on the revision control system underlying the repository, as, for example, CVS uses file-specific version numbers, whereas Subversion uses repository-wide version numbers. Some software may not even have either a source code repository or a version number. In this case, such information as the download location and date become more important. In contrast, some software may be used as a service (e.g. a web service or RESTful end-point) rather than being downloaded and installed on a researcher's desktop. Finally, the software may not be accessible online but only via an e-mail to its author in which case the access date and contact details of the author become important.

6 How do we describe and cite software?

So, how should we describe software that has contributed to our research in our papers? Computational Statistics and Data Analysis author instructions provide a general guideline that

"Papers reporting results based on computations should provide enough information so that readers can evaluate the quality of the results, as well as descriptions of pseudo-random-number generators, numerical algorithms, computer(s), programming language(s), and major software components that were used."

Likewise, the SAS citation, which includes software name, company, location, date, version number, and platform, is proposed specifically to provide "the level of detail generally required by scientific journals in order to assure that the data can be replicated". They recommend that if the author did not do the analysis they get this information from the party that did.

6.1 Software can be granular

The fact that some software may consist of discrete well-defined components imposes additional requirements when describing software. As we saw with HSL, not only is a researcher's use of HSL significant, the names of the sub-routines within HSL that were used are also required, to deliver an accurate description of the research undertaken.

This is analogous to the situation faced by those who publish and consume data as part of their research. In response, the Digital Curation Centre (DCC) have produced a guide on data citation and linking that discusses these problems and provides advice

Ball, A. & Duke, M. (2011). "How to Cite Datasets and Link to Publications?". DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides/cite-datasets.

(The above is their recommended citation!)

As the authors describe, data can be hierarchically structured into various elements including databases, tables, columns, rows, directories, files, records, features, data points, collections, documents and so on. Each of these elements may or may not have a specific identifier associated with them, depending upon how the data producer has published them. The DCC guide authors recommend that authors should cite data sources at the finest level of granularity that was adopted when an identifier, or for our purposes, a citation, was assigned. This can be supplemented in the text with the information to find any specific subset of the data that was used in the research. As researchers using HSL show, this advice can be readily adopted to describe the use of research software where it is appropriate and applicable to do so.

6.2 DOIs are persistent so if one exists, use it

A DOI, or digital object identifier, is a persistent identifier for a research output. When entered into a DOI resolver service, it takes us to meta-data about the output, including how to get the output. This may be a URL, or it may just be an e-mail address or telephone number. DOIs are now becoming a standard means of identifying journal papers. For example, the DOI of the OGSA-DAI paper is 10.1007/978-3-642-11842-5_35 and typing this into a resolver service at http://dx.doi.org or http://www.crossref.org takes us to a SpringerLink page from which the paper can then be downloaded. The advantage of DOIs is that they separate the description of an output from its location, allowing the output to move over time. As the DCC guide explains, DOIs are becoming increasingly prevalent as a way of identifying data sources. DataCite, who advocate and support data publishing and citation for research, also recommend the use of DOIs. Both DCC and DataCite recommend that DOIs be cited in their URL form, providing a resolver service URL prefix, for example http://dx.doi.org/NNNN, as opposed to using them in their citation form, doi:NNNN.
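Converting between the two forms is mechanical. As a minimal sketch in Python, the hypothetical helper doi_to_url below (an illustrative name, not part of any DOI library) strips an optional "doi:" label and prepends the resolver prefix mentioned above. It assumes the http://dx.doi.org/ resolver and does not check that the DOI actually resolves.

```python
RESOLVER_PREFIX = "http://dx.doi.org/"  # resolver service named in the text


def doi_to_url(doi: str, resolver: str = RESOLVER_PREFIX) -> str:
    """Return the URL form of a DOI given in citation form or bare form.

    Hypothetical helper for illustration only.
    """
    doi = doi.strip()
    # Drop the "doi:" label used in the citation form, if present.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # DOI names begin with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return resolver + doi


print(doi_to_url("doi:10.1007/978-3-642-11842-5_35"))
# prints http://dx.doi.org/10.1007/978-3-642-11842-5_35
```

Keeping the resolver as a parameter means the same DOI can be re-rendered if the recommended resolver address changes, which is the separation of identifier and location that DOIs are meant to provide.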

6.3 URLs are transient, but better than nothing

As we saw, the MLA guidelines advise against using URLs due to the transient nature of many web artefacts. However, if the software has a URL then I'd recommend listing that URL, despite the MLA's concerns, along with enough information for the reader to find the software via a search engine. If the URL is stable then the user can just follow the link, which is quicker than doing a web search (and there can always be a delay before material newly-added to the web is available via a search engine). If the URL is broken then I'd trust the reader to do the web search anyway. As mentioned earlier, the link in the P-Genome recommended citation is broken, but pasting the entire citation into a search engine reveals the software's current page.

6.4 Cite any software that you view as having contributed to your research

I'd recommend that, in the first case, you should always add in a citation in the references or bibliography for the software you want to describe. If there is an associated paper for the software then, by all means, cite that too, but, remember, this is no replacement for citing the software itself. Why do I recommend citation of the software? It sends a message that you view that software as having made a significant contribution to your research and that you want to acknowledge its authors. If a reviewer asks you to remove software references, then you can explain that in your view these are valid research outputs, they were essential to the research you are describing, and you'd be unhappy not attributing a fellow researcher's work. If your citation is part of a software licence's terms and conditions then you should make it clear that you're legally bound to include the citation. If, for whatever reason, you are not allowed to explicitly cite your software then you should still describe it in the body of your paper.

How do you cite the software? This very much depends upon the style guidelines recommended for your paper. If there is a recommended citation then use it. If there is no recommended citation from the software publishers, then I'd suggest that your citations contain the following information, inspired by both the examples presented earlier and the examples in DataCite's guide on "why cite data?".

Software purchased off-the-shelf:

ProductName. Version. ReleaseDate. Publisher. Location.

SuperScience. 1.2. December 2012. ResearchSoftware. Edinburgh, UK.

Software downloaded from the web:

ProductName. Version. ReleaseDate. Publisher. Location. DOIorURL. DownloadDate.

OGSA-DAI REST. 4.2.1. December 2012. OGSA-DAI Project. http://sourceforge.net/projects/ogsa-dai. 27/04/2012.

UltimateFFT. 2.4. December 2012. Fred Bloggs, EPCC, The University of Edinburgh, UK. http://www.epcc.ed.ac.uk/ultimate-fft. 27/04/2012.

C implementation of Wu's color quantizer. 2. 1991. Xiaolin Wu, Department of Electrical & Computer Engineering, McMaster University, Hamilton, Ontario. http://www.ece.mcmaster.ca/~xwu/cq.c. 27/04/2012.

Software checked-out from a public repository:

ProductName. Publisher. URL. CheckoutDate. RepositorySpecificCheckoutInformation.

OGSA-DAI REST. OGSA-DAI Project. http://sourceforge.net/projects/ogsa-dai. 27/04/2012. Check-out: ogsa-dai/branck/ogsadai4.1/, revision 1657.

Software provided by a researcher:

ProductName. Author. Location. ContactDetails. ReceivedDate.

BestFFTroutine ever file. Fred Bloggs, EPCC, The University of Edinburgh, UK. Fred [dot] bloggs [at] epcc [dot] ed [dot] ac [dot] uk. 27/04/2012.
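The four templates above differ only in which fields they list and in what order, so they can be generated from structured information. The following Python sketch shows one way this might be done. The names TEMPLATES and format_citation, and the field keys such as doi_or_url, are hypothetical choices made for this illustration. Fields that are not supplied are skipped, as in the OGSA-DAI examples above, which give no location.

```python
# Field order for each way of obtaining software, following the templates above.
# The dictionary keys and field names are hypothetical, chosen for illustration.
TEMPLATES = {
    "purchased": ["product", "version", "release_date", "publisher", "location"],
    "downloaded": ["product", "version", "release_date", "publisher",
                   "location", "doi_or_url", "download_date"],
    "repository": ["product", "publisher", "url", "checkout_date",
                   "checkout_info"],
    "researcher": ["product", "author", "location", "contact",
                   "received_date"],
}


def format_citation(kind: str, **fields: str) -> str:
    """Join the supplied fields in template order, each ending in a full stop."""
    order = TEMPLATES[kind]
    unknown = sorted(set(fields) - set(order))
    if unknown:
        raise ValueError(f"fields not in the {kind!r} template: {unknown}")
    # Keep only non-empty fields; strip a trailing full stop so it is not doubled.
    kept = [fields[name].strip().rstrip(".")
            for name in order
            if fields.get(name, "").strip()]
    return ". ".join(kept) + "."


print(format_citation(
    "purchased",
    product="SuperScience",
    version="1.2",
    release_date="December 2012",
    publisher="ResearchSoftware",
    location="Edinburgh, UK",
))
# prints SuperScience. 1.2. December 2012. ResearchSoftware. Edinburgh, UK.
```

Holding the details as named fields rather than as a finished string makes it easier to re-render the same information in whatever style a particular journal requires.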

6.5 Provide additional details in the body of your paper

Whether or not you can cite the software in the references or bibliography, remember that you may need to provide additional details, if applicable. This information may include: operating system, specific packages, sub-routines, queries, files, libraries, scripts, service end-points, configurations, parameters or workflows. You may also want to provide details of other software that you found helpful or used, but which did not contribute to the novelty of your research (in the OGSA-DAI paper this includes Apache ANT and Tomcat, for example).

This information should be described in the body of your paper, in the methods section, footnotes, acknowledgements or appendices. We saw a simple example of this from the HSL paper earlier. As another example, in an r-help thread, a poster, Achim Zeileis, states that he uses a "Computational details" section in his papers to list the R packages he has used and their versions. An example of this is in his paper

Zeileis, A., Kleiber, C. and Jackman, S. (2008) Regression Models for Count Data in R. Journal of Statistical Software. 27(8), pp1-25. ISSN 1548-7660. http://www.jstatsoft.org/v27/i08.

The "Computational details" section is

The results in this paper were obtained using R 2.7.0 with the packages MASS 7.2-42, pscl 0.95, sandwich 2.1-0, car 1.2-8, lmtest 0.9-21. R itself and all packages used are available from CRAN at http://CRAN.R-project.org/.
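A statement like this can be generated rather than typed by hand, which reduces the risk of reporting the wrong version. The sketch below shows how an analogous sentence might be produced for research done in Python rather than R. It is an illustration under that assumption, not the method used in the paper above, and computational_details is a hypothetical function name. It relies on the standard library's importlib.metadata to look up the versions of installed packages.

```python
import platform
from importlib import metadata


def computational_details(packages):
    """Build a one-sentence summary of interpreter and package versions.

    Hypothetical helper; `packages` is a list of installed distribution names.
    """
    described = []
    for name in packages:
        try:
            described.append(f"{name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report the gap rather than guessing a version.
            described.append(f"{name} (version unknown)")
    sentence = ("The results in this paper were obtained using Python "
                + platform.python_version())
    if described:
        sentence += " with the packages " + ", ".join(described)
    return sentence + "."


# The output depends on what is installed on the machine running the script.
print(computational_details(["numpy", "scipy"]))
```

Running this at the time the results are produced, and keeping the output with them, records the versions actually used rather than those installed when the paper is later written up.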

6.6 It does not matter where it is so long as it is there!

Ultimately though, it does not matter where in the paper the information is, or how it is distributed throughout the paper, so long as the paper as a whole provides enough information to allow your readers to understand what you did, to replicate and validate what you did, and to take on and use your research.

7 Top ten tips to describe the software you used in your research

  1. Describe any software that played a critical part in, or contributed something unique to, your research. Do this in enough detail for a peer to be able to understand what you did, repeat and validate what you did, and reuse your research.
  2. There are many options for describing the software you have used: footnotes, acknowledgements, methods sections, and appendices.
  3. Be aware that a licence may place you under an obligation to attribute the use of software in your publication.
  4. Cite papers that describe software as a complement to, not a replacement for, citing the software itself.
  5. In the first draft of a paper, always put software citations in references or bibliographies.
  6. Be prepared to debate with reviewers why you have cited the software: you want to acknowledge the contribution of the software's authors and the value of software as a legitimate research output.
  7. Inform reviewers if you are legally obliged to cite the software because of a clause in the software's licence.
  8. If a reviewer disagrees with a formal software citation, you can still make a general reference to the software in the paper.
  9. Recommended citations may not have enough information to accurately describe the software that was used - you may need to add more detail yourself.
  10. If the software has a DOI (digital object identifier) use it to cite the software. If the software has its own website, use the website's URL for the citation.

8 Conclusion

Describing the contribution of software to research closely relates to a number of other issues around the role of software in research. These include publishing research software in a persistent and citable way, ensuring the availability of research software (and data, online services and other artefacts) for the long-term, promoting the recognition of software as a valuable research output in its own right, and ensuring that the developers of research software have their contributions recognised and rewarded. These are concerns which affect not only those using research software, but those who develop or modify research software, those who release research software, paper reviewers, programme committees, publishers and funders. All these agencies have a part to play and the Software Sustainability Institute is working with many different individuals and groups to explore and resolve these concerns. See, for example, "Publish or be damned? An alternative impact manifesto for research software".

This paper summarises my views on how researchers can describe the software that contributed to their research, and give the authors of that software due acknowledgement. I hope this can encourage and contribute to debate as to what should constitute best practice in describing how software contributes to a body of research. Please feel free to let me know what you think is wrong, what you think is missing and your suggestions as to how to improve this advice. Now you know what I think, tell me what you think!