A standard format for CITATION files
Posted on 12 December 2017
By Stephan Druskat, Humboldt-Universität zu Berlin, Radovan Bast, University of Tromsø, Neil Chue Hong, Software Sustainability Institute, University of Edinburgh, Olexandr Konovalov, University of St Andrews, Andrew Rowley, University of Manchester, and Raniere Silva, Software Sustainability Institute, University of Manchester
The citation of research software serves several purposes: most importantly attribution and credit, but also the provision of impact metrics for funding proposals, job interviews, etc. Stringent software citation practices, as proposed by Smith et al. [1], therefore include the citation of a software version itself, rather than a paper about the software. Direct software citation also enables reproducibility of research results, as the exact version can be retrieved from the citation. Unique digital object identifiers (DOIs) for software versions can already be reserved via providers such as Zenodo or figshare, but disseminating (and finding) citation information for software is still difficult and handled very differently across projects. In most cases, it would be useful to have the information provided directly in the repository containing the software, as citation information would then be updated along with the software. For this, Robin Wilson has suggested the use of CITATION files, but practices for providing citation information still vary widely.
How do people provide citation information now?
Currently, researchers and research software engineers inform their users of the preferred citation method in many different ways, such as, but not limited to:
- the project website (e.g. SciPy, GAP, Taverna, Cytoscape, Avogadro, Chemistry Development Kit (CDK))
- the project wiki (e.g. SageMath)
- a software/package index/archive like The Comprehensive R Archive Network (CRAN) and the Python Package Index (PyPI) (e.g. ggplot2)
- the project README file (e.g. RAVEN)
- a project CITATION file as mentioned in a previous post
- the return of a function call (e.g. GNU Octave, Bioconductor)
- the project issue/bug tracker (e.g. BioJS)
- a DOI provided by Zenodo, figshare, or others (e.g. Chemistry Development Kit (CDK))
- the program output file (e.g. Dalton/LSDalton programs)
In many cases, however, the project does not provide any instructions on how the authors prefer to be cited at all.
In addition to the different places where users need to look when searching for a project’s citation information, projects also have divergent recommendations. Some projects ask users to cite only the software, others to cite the software and a specific paper. Others provide a list of papers to cite, and yet others ask researchers to cite only the articles related to the part of the project they use, such as specific packages, program arguments, or configurations.
In order to resolve this, we propose a standard mechanism: software projects should hold a CITATION file in their source code repository. This will allow the file to evolve with the rest of the source code. The file could also be delivered with distributions of the software. Additionally, the CITATION file should be in a standardised format (e.g. the Citation File Format (CFF)) to simplify its re-use, as Stephan Druskat has argued [2].
Such a standardised CITATION file must contain at least enough information to allow users of the software to cite it, and in particular to cite the version that they are using.
Requirements for a standardised CITATION file format
The format should be machine-readable to enable simple re-use of the contained information, but should also be human-readable, as the file may be included in a software distribution, and end users should be able to easily extract the citation information from it.
It should also be human-writable in order to support on-the-fly creation of CITATION files by software authors. This means that the file structure should be accessible and relatively simple.
In short, standardised CITATION files should be easy to read, write and edit.
Furthermore, they will have to provide support for the use cases attested in the software citation principles [1]. In order to do so, a number of software-specific properties will have to be included.
The format should supply a reference type for software. Ideally, the format would supply more fine-grained types, such as software source code, software executable, software container, and virtual machine (image).
The specific version of the software that is being referenced should be cited whenever possible, to enable reproducibility. The version number is especially important in cases where authors have not registered a DOI for their software version.
Ideally, the authors will register a DOI for a specific version of a software release. However, if no DOI exists, a combination of the source code repository URL and a version control commit hash/revision number should be supplied. Additionally, if the source code of the software is not publicly available, the URL of the artifact repository should be supplied.
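The fallback order described here (DOI first, then repository URL plus commit, then artifact repository URL) can be sketched in a few lines of Python. This is only an illustration: the `SoftwareVersion` record, its field names, and the `locator` function are hypothetical and are not taken from any citation format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoftwareVersion:
    """Hypothetical record; the field names are illustrative only."""
    name: str
    version: str
    doi: Optional[str] = None
    repository_url: Optional[str] = None
    commit: Optional[str] = None
    artifact_url: Optional[str] = None


def locator(sw: SoftwareVersion) -> str:
    """Return the most precise locator recorded for a software version."""
    # A version-specific DOI is the preferred way to point at the software.
    if sw.doi:
        return f"https://doi.org/{sw.doi}"
    # Without a DOI, repository URL and commit hash together pin the version.
    if sw.repository_url and sw.commit:
        return f"{sw.repository_url} (commit {sw.commit})"
    # If the source is not public, fall back to the artifact repository.
    if sw.artifact_url:
        return sw.artifact_url
    raise ValueError(f"no locator recorded for {sw.name} {sw.version}")
```

A tool reading a CITATION file could apply a rule of this kind to decide which identifier to print in a generated reference.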
Especially in the case of commercial or unreleased software, there may be no available DOI for citation. In this case it may be feasible to supply contact information for the author of the software, or a serial or product number. For the latter, the format should have a way of recording generic numbers.
The format would have to define a relatively fine-grained set of date records for download and release dates, which the user may have to fall back on should there be no DOI available.
The format should support all scenarios for authorship. In the context of software, this could mean thousands of authors, a group or project as author, and a mixture of person and group authors. The format should supply a set of roles for authors that caters for the different ways in which software can be authored. Therefore, roles such as main author, programmer, tester, reporter (of bugs/issues), contributor (of software patches), technical writer (of documentation), administrator (of a software system), researcher (informing a software system) and others should be defined. These may also be useful for other cited works to give a more detailed picture of authorship.
In order to enable unambiguous attribution of authorship it should be possible to link a person to a unique identifier, such as an ORCID.
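As a rough sketch of what these authorship requirements imply for a data model, the following Python code mixes person and group authors, attaches roles, and checks the check digit of an ORCID iD (ORCID uses the ISO 7064 MOD 11-2 algorithm). The class names, field names, and the shortened role list are assumptions made for illustration, not part of any proposed format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union


class Role(Enum):
    # Shortened, illustrative subset of the roles listed in the text.
    MAIN_AUTHOR = "main author"
    PROGRAMMER = "programmer"
    TESTER = "tester"
    TECHNICAL_WRITER = "technical writer"


@dataclass
class Person:
    family_name: str
    given_name: str
    roles: List[Role] = field(default_factory=list)
    orcid: Optional[str] = None


@dataclass
class Group:
    name: str
    roles: List[Role] = field(default_factory=list)


Author = Union[Person, Group]


def orcid_checksum_ok(orcid: str) -> bool:
    """Check the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    chars = orcid.replace("-", "")
    body = chars[:15]
    if len(chars) != 16 or not (body.isascii() and body.isdigit()):
        return False
    total = 0
    for ch in body:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return chars[15] == ("X" if check == 10 else str(check))


def display_name(author: Author) -> str:
    """Format a person as 'Family, Given' and a group by its name."""
    if isinstance(author, Person):
        return f"{author.family_name}, {author.given_name}"
    return author.name
```

Validating the check digit only catches typing errors; it does not confirm that the identifier belongs to the named person.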
While it is not strictly necessary for the purposes of citation for a paper or similar, if feasible, the format should supply a way to record the programming, markup, and data languages used in the software. In a future scenario, this information could be used in other metadata records for the software.
Although a standardised CITATION file format is arguably a compromise between plain text CITATION files and a more ideal state (such as widespread use of transitive credit as described by Katz and Smith [3]), it should be the best possible compromise. Support for specifying the dependencies of a piece of software would be a step in that direction.
Apart from these criteria, the citation format should be all-purpose, able to record different kinds of references beyond software, simply for the reason that software authors may want to provide different sources for citation. In this respect, the format should focus on software citation, but have the ability to function as a general citation format such as BibTeX or RIS, and should thus provide the standard set of reference information items that is also covered by these two formats.
Compatibility with other formats, for example those used by repositories such as Zenodo or figshare, would be ideal. However, the implementation of best practices for software citation should not be sacrificed in order to achieve this compatibility.
Finally, in order to allow Unicode characters in references, the format should be encoded in UTF-8.
We recommend the use of the Citation File Format (CFF) [4], which is being developed specifically to fulfill these requirements.
Support for the use of formatted CITATION files
If we want formatted CITATION files to be adopted and used by the research community, their use must be as easy as writing a README or a LICENSE file. To simplify and support the adoption, use, and interoperability with existing standards and services we recommend establishing a number of tools to create, validate, convert, and read CITATION files in the Citation File Format (CFF) [4].
We believe it would be very useful to offer a lightweight website that asks the right questions (authors, name, version, DOI, etc.) and generates a CITATION file in the recommended format that can be copy-pasted into the software repository. The same website could validate the format of a pasted CITATION file and convert CITATION files from similar formats to the recommended format, in much the same way as Google Translate converts between languages. Libraries for different programming languages should also be offered to read CITATION files, so that other codes and projects can build upon them.
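To give an idea of how small such tooling could be, here is a minimal Python sketch of a validator and a converter to BibTeX. It assumes the citation information has already been read into a plain dictionary; the keys (`title`, `version`, `authors`, `year`, `doi`, `url`), the required-key set, and both function names are hypothetical and do not reflect the CFF specification.

```python
from typing import Any, Dict, List

# Hypothetical minimum set of keys a record must provide.
REQUIRED_KEYS = ("title", "version", "authors")


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    return [f"missing key: {key}" for key in REQUIRED_KEYS if not record.get(key)]


def to_bibtex(record: Dict[str, Any], key: str = "software") -> str:
    """Render a record as a BibTeX @misc entry (no escaping of special characters)."""
    problems = validate(record)
    if problems:
        raise ValueError("; ".join(problems))
    fields = {
        # BibTeX separates multiple authors with the word "and".
        "author": " and ".join(record["authors"]),
        # The extra braces preserve the capitalisation of the title.
        "title": "{" + record["title"] + "}",
        "note": f"Version {record['version']}",
    }
    for optional in ("year", "doi", "url"):
        if record.get(optional):
            fields[optional] = str(record[optional])
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    example = {
        "title": "Example Tool",
        "version": "1.2.0",
        "authors": ["Doe, Jane", "Roe, Richard"],
        "year": 2017,
    }
    print(to_bibtex(example, key="example_tool"))
```

The version is placed in the `note` field here because classic BibTeX styles have no dedicated version field for `@misc`; a real converter would need to decide how to map software-specific properties onto each target format.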
Usage of CITATION files in the future
We think that CITATION files with a standard format have a lot of re-use potential.
Source code as well as binary repositories can use them to display citation information for credit in a much more prominent manner than just serving the file itself. As with license or contribution information, provider sites like GitHub or GitLab can use CITATION files to highlight citation information in their GUIs, not only at the top (i.e., repository) level, but also down to package/directory level in the source tree, for all levels for which a CITATION file is provided. This would enable fine-grained citation information for cases where, for example, a single package or source code file implements an algorithm that has been described in a paper which should be cited when using the implementation.
Also, other actors can re-use CITATION files to generate linked data for transitive credit information (Katz and Smith [3]), other citation formats for software (such as those provided for R packages), and metadata for software, such as that using the CodeMeta schema [5].
References
[1] Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. (2016) Software citation principles. PeerJ Computer Science 2:e86. https://doi.org/10.7717/peerj-cs.86
[2] Druskat, Stephan (2017): Track 2 Lightning Talk: Should CITATION files be standardized?. figshare. https://doi.org/10.6084/m9.figshare.3827058
[3] Katz, DS & Smith, AM. (2015) Transitive Credit and JSON-LD. Journal of Open Research Software. 3(1), p.e7. http://doi.org/10.5334/jors.by
[4] Druskat, Stephan. (2017, October 6). Citation File Format (CFF). 0.9-RC1. Zenodo. http://doi.org/10.5281/zenodo.1003150
[5] Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Dan Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. doi:10.5063/schema/codemeta-2.0