Software Metadata Creation and Curation

Posted by s.aragon on 2 July 2019 - 9:42am

Screenshot of British Library webpage

By Emily Bell, Radu Gheorghiu, Patricia Herterich, Daniel Hobley, and Sarah Stewart.

This post is part of the CW19 speed blog posts series.

All attendees of the Software Sustainability Institute Collaboration Workshop 2019 are users or developers of research software, but may not recognise that the production and use of research software demands effective curation and attention to the metadata. We spent a breakout session thinking about where the community is in terms of effective curation of software and its metadata, what the problems still are, and where we can see opportunities for better practice in the future. This speed blog represents some of our thoughts.

Why is this issue important?

The maintenance of clear metadata is a key issue for software sustainability. If software is not archived properly, it can’t be found – by researchers, users, or learners.[1] Work is often duplicated simply because researchers aren’t aware of what has already been done, and as a multi-year software project comes to the end of its funding, its outputs and learning opportunities can be lost. Few libraries, for example, even have searchable databases of the software projects that use their own collections. Moreover, even if basic principles of archiving are followed and the existence of that software is recorded, the same issues can arise if there is no standardised metadata describing how that software actually functions.

The creation of metadata across memory institutions (e.g. libraries and archives) is fuelled by the work of individuals; even with clear metadata schemas and document type definitions (DTDs), the institutions our teams work with all rely on cataloguers, sometimes with specific training and sometimes not. As a result, inconsistencies creep in.

Sometimes, the same field is filled in differently by different people. Sometimes, deliberate decisions are made to add further fields to represent the idiosyncrasies of a particular item, and sometimes the (perfectly sensible) rationale behind such decision-making isn’t recorded, making it incredibly difficult for future researchers to work with the data. As time passes, institutions forget why decisions were made and new ones corrupt the schema. Together, these issues lead to the same metadata standards being applied differently across collections.

The introduction of new schemas means that older entries become invisible, and variation across collections makes the question of interoperability an impossible one to address. Wikidata and Trove present one possible answer to this question in the form of a ‘citizen science’ approach to crowdsourced metadata, relying on the sheer volume of contributors to eventually assert ‘good’ standards, but the issue of human error and variation remains central to the question of metadata creation and curation of software. Metadata documentation is also not always freely available or easily discoverable, but resources such as FAIRsharing are beginning to address this.

Metadata interoperability

We mainly discussed two points: firstly, metadata creation and the issues around large parts of metadata not being interoperable; and secondly, issues around the curation of software going forward. At the moment, metadata is often created by humans (especially for historic collections of material) using a variety of standards that often aren’t interoperable. This makes mining this metadata for research purposes difficult.

For curating software and the metadata describing it, there are currently tools that can be used to archive software (e.g. Zenodo or Figshare through GitHub integration), but those services are not curated, so the level of metadata describing the software varies. Zenodo and Figshare both offer the potential of the DataCite metadata schema, but many users fill in only the minimum required fields and do not provide additional structured information.
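To make the gap between a minimal and a richer record concrete, here is a small Python sketch that reports which fields a deposit leaves empty. The mandatory and recommended property names follow the DataCite Metadata Schema 4.x as we understand it; the flat dictionary layout, the sample deposit, and the function name `completeness_report` are assumptions for illustration, not the format that Zenodo or Figshare actually expose.

```python
# Property names follow the DataCite Metadata Schema 4.x; the flat dict
# layout of a record is an assumption made for this sketch.
MANDATORY = ["identifier", "creators", "titles", "publisher",
             "publicationYear", "resourceType"]
RECOMMENDED = ["subjects", "contributors", "dates",
               "relatedIdentifiers", "descriptions", "geoLocations"]


def completeness_report(record):
    """Return the mandatory and recommended fields that are missing or empty."""
    def missing(fields):
        return [name for name in fields if not record.get(name)]

    return {
        "missing_mandatory": missing(MANDATORY),
        "missing_recommended": missing(RECOMMENDED),
    }


# A hypothetical minimal deposit: valid, but thin.
minimal_deposit = {
    "identifier": "10.1234/example",  # placeholder DOI, not a real record
    "creators": ["A. Researcher"],
    "titles": ["Example model"],
    "publisher": "Example repository",
    "publicationYear": 2019,
    "resourceType": "Software",
}

report = completeness_report(minimal_deposit)
print(report["missing_mandatory"])
print(report["missing_recommended"])
```

A deposit like this one passes the mandatory check while leaving every recommended field empty, which is the situation described above.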

Other services offer curation (e.g. the Community Surface Dynamics Modelling System) but are discipline-specific and have limited resources. Protégé enables the creation of OWL ontologies to address the question of interoperability, but the same inconsistency makes it very difficult to create such an ontology.[2] There is a key question as to who should do this work and where such a position should sit: is this a central curation role? Would having journals curate related software limit the software that is archived to only academic projects and/or published work? And, perhaps most fundamentally, where would the funding for this kind of sustainability project come from?

Curating the Past

We also talked a little about differing needs regarding the curation of new software as it is created and incorporated into existing metadata structures, versus how we go about modifying existing (likely imperfect) metadata describing existing software to fit new schemas. The latter is clearly necessary to move towards comprehensive databases, but for the reasons already discussed, creates a risk of the “broken telephone effect” (or “Stille Post”) where incorrect information is created in the act of copying, and then gets propagated forwards. This effect is probably unavoidable to some extent, but is minimised by reducing the frequency of translations (i.e. by having robust, trustworthy, hopefully future-proof metadata schemas at every stage, to increase their lifetimes), and also by forbidding customisable fields as much as possible.
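One way to read “forbidding customisable fields” is as a validation step that rejects any field outside an agreed list. The sketch below assumes a hypothetical and deliberately tiny schema (`ALLOWED_FIELDS`) and a hypothetical `validate` helper; a real schema would be far larger and would be agreed by the community rather than hard-coded.

```python
# Hypothetical agreed schema: field name -> expected Python type.
ALLOWED_FIELDS = {
    "title": str,
    "version": str,
    "licence": str,
    "dependencies": list,
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, value in record.items():
        if name not in ALLOWED_FIELDS:
            # Ad hoc fields are rejected rather than silently stored.
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, ALLOWED_FIELDS[name]):
            problems.append(f"wrong type for field: {name}")
    for name in ALLOWED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
    return problems


record = {
    "title": "Example model",
    "version": "1.0.0",
    "licence": "MIT",
    "dependencies": ["numpy"],
    "cataloguer_note": "added locally",  # an ad hoc field
}
print(validate(record))
```

Rejecting the extra field outright, rather than storing it quietly, pushes the reason for wanting it into a discussion about the schema itself, where it is more likely to be recorded.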

The former seems the easier need to address in some ways, but still requires methods for capturing the metadata about the software and entering it into databases for the first time. We cautiously anticipated that current drives in the software development community towards model interoperability will make the actual capture of the metadata more straightforward in the future. This may already be happening, with tools such as Grobid allowing automatic extraction of metadata from PDFs, and tools like the Community Surface Dynamics Modelling System’s Basic Model Interface permitting automated stripping of model input/output data from compatible software.
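As a rough sketch of what that kind of automated capture could look like, the Python below asks a model to describe itself and assembles the answers into a metadata record. The method names (`get_component_name`, `get_input_var_names`, `get_output_var_names`, `get_var_units`, `get_time_units`) are taken from the Basic Model Interface specification; the `ToyModel` class, its variable names and units, and the `harvest_metadata` function are invented for illustration and are not part of any CSDMS tool.

```python
class ToyModel:
    """Hypothetical stand-in for a model exposing a few BMI-style methods."""

    def get_component_name(self):
        return "Toy diffusion model"

    def get_input_var_names(self):
        return ("surface__elevation",)  # invented variable name

    def get_output_var_names(self):
        return ("surface__elevation", "sediment__flux")

    def get_var_units(self, name):
        units = {"surface__elevation": "m", "sediment__flux": "m2 s-1"}
        return units[name]

    def get_time_units(self):
        return "s"


def harvest_metadata(model):
    """Collect descriptive metadata by querying the model itself."""
    names = sorted(set(model.get_input_var_names())
                   | set(model.get_output_var_names()))
    return {
        "name": model.get_component_name(),
        "inputs": list(model.get_input_var_names()),
        "outputs": list(model.get_output_var_names()),
        "units": {name: model.get_var_units(name) for name in names},
        "time_units": model.get_time_units(),
    }


metadata = harvest_metadata(ToyModel())
print(metadata["outputs"])
```

Because the record is generated from the software rather than typed in by a cataloguer, it is at least consistent with what the code reports about itself.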

Improving the future: recommendations

The following recommendations might provide a basis for best practices for metadata creation for software:

Libraries need to develop acquisition policies and workflows for software.
Librarians or curators working with software require training and documentation to learn to identify and describe software using metadata standards that are interoperable (e.g. machine-readable, linked data).
There should be a set metadata framework for software so that there are no non-interoperable fields, or so that any that remain are clearly described and classified.
A standard framework should be agreed upon. This could be a combination of several current standards (e.g. the DataCite metadata schema) with fields required for software use and interoperability (e.g. dependencies and operating environments, which are often included in a README file).
Use of text and/or data mining and/or an open API to extract key metadata fields from legacy records for eventual enrichment, recognising that it wouldn’t necessarily be a good use of resources to fully reformat archiving that has already taken place.
Use of persistent identifiers (such as DOIs) for versioning.
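
As a rough illustration of the text-mining recommendation, the Python sketch below pulls a version number and a DOI out of a free-text legacy description using regular expressions. The sample record, the patterns, and the `extract_fields` helper are all hypothetical; real legacy records would need far more forgiving rules, and the results would still need human review.

```python
import re

# Simplified patterns, assumed for illustration only.
VERSION_RE = re.compile(r"\bv(?:ersion)?\.?\s*(\d+(?:\.\d+)+)", re.IGNORECASE)
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s,;]+)")


def extract_fields(text):
    """Return the fields that could be recovered; absent ones are left out."""
    fields = {}
    version = VERSION_RE.search(text)
    if version:
        fields["version"] = version.group(1)
    doi = DOI_RE.search(text)
    if doi:
        # Drop a sentence-ending full stop that is not part of the DOI.
        fields["doi"] = doi.group(1).rstrip(".")
    return fields


# Invented legacy note; the DOI is a placeholder, not a real record.
legacy_note = "Landscape model, Version 2.1.3, archived at doi:10.1234/abcd.5678."
print(extract_fields(legacy_note))
```

Only the fields that can be recovered are returned, so the original record stays untouched and the extracted values can be added alongside it as enrichment.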

[1] For exploration of the role of non-authoritative metadata in making digital objects accessible to learners, see Mimi M. Recker and David A. Wiley, “A Non-authoritative Educational Metadata Ontology for Filtering and Recommending Learning Objects,” Interactive Learning Environments 9:3 (2001), 255-271, DOI: 10.1076/ilee.9.3.255.3568.

[2] For a discussion of ways of creating an OWL ontology visualisation and some of the challenges, see Simone Kriglstein, “OWL Ontology Visualization: Graphical Representations of Properties on the Instance Level,” 2010 14th International Conference Information Visualisation, DOI: 10.1109/IV.2010.23.