What are the formats, tools and techniques for harvesting metadata from software repositories?

Author(s)

Stephan Druskat

SSI fellow

Daniel Garijo

Mark Turner

Alex Henderson

Posted on 20 May 2021

Posted by j.laird on 20 May 2021 - 1:39pm

Harvester in a field. Image from Hinrich, CC BY-SA 2.0 DE.

By Daniel Garijo, Stephan Druskat, Mark Turner and Alex Henderson.

This blog post is part of our Collaborations Workshop 2021 speed blog series.

Software metadata describes software components by specifying their main features, domain, limitations, license, authors, or usage instructions. It is crucial for finding, understanding, comparing, and assessing the quality of different tools, but so far very few repositories provide these metadata in a machine-readable format.

In this post we summarise the discussions held at Collaborations Workshop 2021 (CW21) in order to: 1) highlight issues with current practices, 2) point to tools that may help improve the situation, and 3) summarise incentives for the research community to adopt best practices for software metadata.

Is software really findable?

While academic publishing has established methods for recording basic structured metadata (e.g. creation, title, keywords), code repositories follow no common practice so far. Researchers often provide most of these metadata in their documentation or README files, which makes them difficult for automated harvesters to consume. When registering software in a registry, researchers have to describe it by hand.

Barriers and tools for creating software metadata

The scientific community has started to develop common vocabularies for software metadata. CodeMeta is a Schema.org extension designed to capture basic software metadata, acting as a “Rosetta Stone” between dozens of software metadata vocabularies used by different software registries. CodeMeta also has tools for guiding researchers when creating their software metadata. However, there are still a number of issues with current practice:

Generating metadata by hand has little appeal to researchers. If they have already spent a significant amount of time creating highly detailed documentation, why spend additional time representing this information in a machine-readable manner? Tools for automated metadata extraction, such as SOMEF, present an opportunity to harvest metadata from README files automatically. However, automated approaches are prone to small errors and may require additional curation.
Software covers a wide range of domains, which may require specialised metadata. These are outside the scope of CodeMeta, but are usually critical to finding similar software for a specialised task. How can software in specialised domains be described consistently?
Software registries usually rely on curation by domain experts, who review individual entries manually. How can we ensure that researchers provide correct metadata without having to resort to additional curators? How can we incentivise and support librarians, as the original metadata experts, to bring software within their scope?
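To make the first issue more concrete, here is a minimal sketch of what harvesting metadata from a README could look like. It is not how SOMEF works and it does not use any SOMEF or CodeMeta tooling: it is a handful of regular expressions, written under the assumptions that the README is Markdown, that the first heading is the software name, that the first paragraph is the description, and that the license appears on a line such as "License: MIT". The function name harvest_readme and the sample README are hypothetical. The output reuses the CodeMeta 2.0 context and a few of its terms (name, description, license, codeRepository).

```python
import json
import re

# Context URL published for CodeMeta 2.0.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Hypothetical README used only for illustration.
SAMPLE_README = """# harvest-demo

A hypothetical tool for counting tractors.

## License

License: MIT

Source: https://github.com/example/harvest-demo
"""


def harvest_readme(readme_text: str) -> dict:
    """Naively extract a few CodeMeta-style fields from Markdown README text."""
    record = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}

    # Assumption: the first level-1 Markdown heading is the software name.
    title = re.search(r"^#\s+(.+)$", readme_text, flags=re.MULTILINE)
    if title:
        record["name"] = title.group(1).strip()

    # Assumption: the first non-heading paragraph is the description.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", readme_text) if p.strip()]
    for paragraph in paragraphs:
        if not paragraph.startswith("#"):
            record["description"] = " ".join(paragraph.split())
            break

    # Assumption: the license is named on its own line, e.g. "License: MIT".
    # The raw string is kept as found; a curated record would more likely
    # point to a license URL instead.
    license_match = re.search(
        r"^licen[sc]e:\s*(.+)$", readme_text, flags=re.MULTILINE | re.IGNORECASE
    )
    if license_match:
        record["license"] = license_match.group(1).strip()

    # Assumption: the first GitHub or GitLab URL is the code repository.
    repo = re.search(r"https?://(?:github|gitlab)\.com/[\w.-]+/[\w.-]+", readme_text)
    if repo:
        record["codeRepository"] = repo.group(0)

    return record


if __name__ == "__main__":
    print(json.dumps(harvest_readme(SAMPLE_README), indent=2))
```

Even on this tiny example the heuristics are fragile: a README without a "License:" line, or with a badge before the first paragraph, would yield missing or wrong fields. That is the kind of small error that makes additional curation necessary.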

Moving forward: How to incentivise RSEs?

We believe that a community discussion of the following measures will help improve software metadata capture:

Metadata roles and credit: Public institutions should acknowledge software and its metadata as first-class citizens in research. Similar to the role of librarians in institutional repositories, a role for software curation should be created to help maintain and describe software created by RSEs associated with a public institution. Like publications, software and its metadata should be recognised in the CVs of those who create them.
Seals of metadata quality: Automated tools may help to assess how well described a software component is, and provide templates to describe it better. These tools can help guide users to adopt best practices for software metadata.
Software metadata impact studies: We need to better quantify the long-term effect of having proper metadata on finding and reusing research software, in order to better motivate RSEs to create and maintain such metadata.
Software education curricula: Good practices for software metadata should be taught at universities and internalised by early career researchers and RSEs as an integral part of the job.
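As an illustration of the "seals of metadata quality" idea above, the following sketch scores a CodeMeta-style record by how many of a small set of fields are filled in. Everything about the scoring is an assumption made for this example: the list of recommended fields, the thresholds, the seal names and the function name assess_metadata are hypothetical and do not come from CodeMeta or from any existing tool.

```python
# Hypothetical checklist: which fields count as "recommended" is an
# assumption for this sketch, not a rule defined by CodeMeta.
RECOMMENDED_FIELDS = (
    "name",
    "description",
    "license",
    "author",
    "codeRepository",
    "version",
    "keywords",
)


def assess_metadata(record: dict) -> dict:
    """Score a metadata record by the share of recommended fields it fills in."""
    # A field counts as present only if it has a non-empty value.
    present = [field for field in RECOMMENDED_FIELDS if record.get(field)]
    missing = [field for field in RECOMMENDED_FIELDS if field not in present]
    score = len(present) / len(RECOMMENDED_FIELDS)

    # Hypothetical thresholds for the seal levels.
    if score >= 0.8:
        seal = "gold"
    elif score >= 0.5:
        seal = "silver"
    else:
        seal = "bronze"

    return {"score": round(score, 2), "seal": seal, "missing": missing}


# Hypothetical record with four of the seven recommended fields filled in.
EXAMPLE_RECORD = {
    "@type": "SoftwareSourceCode",
    "name": "harvest-demo",
    "description": "A hypothetical tool for counting tractors.",
    "license": "https://spdx.org/licenses/MIT",
    "codeRepository": "https://github.com/example/harvest-demo",
    "keywords": [],  # empty values are treated as missing
}

if __name__ == "__main__":
    report = assess_metadata(EXAMPLE_RECORD)
    print(f"Seal: {report['seal']} (score {report['score']})")
    print("Missing fields:", ", ".join(report["missing"]))
```

Returning the list of missing fields alongside the seal is a deliberate choice: it turns the assessment into a to-do list for the author, which is closer to the template-style guidance described above than a bare score would be.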

If you are interested in following up on this discussion, join the FAIR for Research Software working group (FAIR4RS) or the Consortium for Scientific Software Registries and Repositories, where we are working towards realising some of these practices.