Abandon hope, we are archiving things we can’t use: considerations for documenting complex objects

Posted by s.aragon on 29 November 2018 - 9:29am
[Image: Bunker, courtesy of Pixabay]

By Daina Bouquin, Christopher Ball, Anna-Lena Lamprecht, Catherine Jones, and Tyler J. Skluzacek

This post is part of the WSSSPE6.1 speed blog posts series.

Containers, virtual machines, Jupyter notebooks, web applications, and data visualisations that run in a browser are all examples of complex digital objects made up of multiple components. Each component may have unique dependencies (hardware, software, external datasets, etc.), different “authors”, different expected functionality, and even a different license. The inherent complexity of composite objects like these is not currently addressed in contexts where metadata standards for data and software are being discussed. As a result, archives are accepting into their repositories digital objects that cannot be cited, reused, or in many cases even opened.

So, abandon all hope: everything is going to break and we will not be able to find what we need… Actually, this is the current state of digital preservation in many contexts. Rather than despair, we can at least outline the considerations we need to take into account when we discuss both what is still needed to address these issues and where there are already models and projects helping to address them.

Different metadata for different needs

If we want to have any chance of defining metadata for digital objects like those described above, we must acknowledge that their complexity creates a spectrum of possible requirements. First, though, one needs to assess what the goal of documenting the “object” even is: do we want to open and execute something? Do we just want to be able to inspect it? Do we just want to find it? Do we want to cite it? Which parts do we want to cite? What do we want the pieces to link to? Is this a standalone object or a supplement to something else?

Metadata to address each of these questions would require different elements. Different disciplines and communities will have different values to employ in deciding what rises to the level of “open and execute”, but what minimal metadata can we recommend to people who take it upon themselves to go to that level? And what minimal metadata can we expect to require when we are not looking to go that far? For instance, if you wanted to open and execute software that requires external datasets or APIs, you would need a toy example plus additional information and bundled dependencies to actually do so. If you just wanted to cite a composite object without distinguishing between its components, your requirements would be much lower.
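One way to picture this spectrum is as a mapping from documentation goals to minimal metadata requirements. The sketch below is purely illustrative: the goal names and field names are hypothetical examples chosen for this post, not a proposed standard.

```python
# Illustrative sketch: minimal metadata fields per documentation goal.
# All goal names and field names are hypothetical, not a proposed standard.

GOAL_METADATA = {
    "find": ["title", "description", "keywords", "identifier"],
    "cite": ["title", "authors", "identifier", "version", "publication_date"],
    "inspect": ["title", "authors", "identifier", "file_formats", "license"],
    "execute": [
        "title", "authors", "identifier", "license",
        "runtime_dependencies", "external_datasets", "toy_example",
    ],
}

def required_fields(goals):
    """Return the union of minimal fields needed to satisfy a set of goals."""
    fields = set()
    for goal in goals:
        fields.update(GOAL_METADATA[goal])
    return sorted(fields)
```

The point of the sketch is that “execute” subsumes far more than “cite”: `required_fields(["cite", "execute"])` pulls in dependency and dataset information that a citation-only use case never needs.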

So, there are a lot of questions communities will need to ask themselves. At any rate, there will be common considerations that need to be assessed independently from the discipline in question. Below we have outlined some of those considerations and currently unresolved questions:

  • Different “levels” of metadata:

    • Documenting each component on its own: “composite object” implies that there are multiple pieces. Each piece needs to be distinguishable from the others if you want to be able to cite or version each piece, or to share the pieces independently of one another.

      • E.g. Jupyter notebooks are made up of an .ipynb file, a kernel, and a browser application used to render the notebook. The .ipynb file could also be exported as HTML, a PDF, or many other formats, and those pieces could then be put into a Docker container or VM. Different metadata elements would be needed to document each of these components, depending on the expected function (e.g. just cite it, or run the whole thing).

    • Composite whole: much like the idea of a “concept” DOI, the composite object should be something that can be talked about as a whole and cited as a supplement to something else. Ideally, the “concept” should be linked to metadata about each component.

      • E.g. a Jupyter notebook that is a supplement to a research paper (Starry - https://arxiv.org/abs/1810.06559)

      • E.g. a VM containing the desktop application, toy example datasets, and the dependencies needed to reproduce a finding

    • If the object is a standalone object (not a supplement to something):

      • The amount of metadata you gather depends on whether the expectation is that a person should be able to run, inspect, find, and/or cite the object and any of its versions.

    • If the object is a supplement to another object:

      • The composite object needs its own identifier, separate from the object it is supplementing. If you want to be able to version different pieces of the composite object, or the object as a whole, that object should be able to change at a rate different from that of the thing it supplements.
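The “composite whole” and per-component levels above could be sketched as linked records. The following is a hypothetical sketch using schema.org-style properties (`identifier`, `hasPart`, `isPartOf`), which CodeMeta builds on; the DOIs, names, and version numbers are placeholders, not real identifiers or a formal standard.

```python
import json

# Hypothetical "concept"-style record for a composite object.
# Property names follow schema.org conventions (identifier, hasPart,
# isPartOf); all identifiers below are placeholders.

notebook_component = {
    "@type": "SoftwareSourceCode",
    "identifier": "doi:10.xxxx/notebook-component",  # placeholder DOI
    "name": "analysis.ipynb",
    "programmingLanguage": "Python",
}

kernel_component = {
    "@type": "SoftwareApplication",
    "identifier": "doi:10.xxxx/kernel-component",  # placeholder DOI
    "name": "IPython kernel",
}

composite = {
    "@type": "CreativeWork",
    "identifier": "doi:10.xxxx/composite-concept",  # placeholder DOI
    "name": "Notebook supplement to a research paper",
    # The composite supplements a paper but is versioned independently,
    # so it can change at its own rate:
    "isPartOf": {"identifier": "doi:10.xxxx/paper"},  # placeholder DOI
    "version": "1.2.0",
    # Each component keeps its own identifier and can be cited on its own:
    "hasPart": [notebook_component, kernel_component],
}

print(json.dumps(composite, indent=2))
```

Because each component carries its own identifier, a repository could resolve, cite, or version the .ipynb file and the kernel independently while still treating the composite record as a single citable whole.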

We are not starting from scratch

There are projects under development that can be leveraged or augmented to start addressing some of this complexity. Projects like CodeMeta and the Citation File Format (CFF) could be used to start documenting individual components of composite objects, ensuring that people who build containers or other types of complex objects can be credited for the work they contribute. Tools like ReproZip could be used to generate more comprehensive documentation and reproducible workflows; ReproZip packages could be treated as composite objects in their own right, or as supplements to others.

The ability to link files like these together, or to distinguish between components (e.g. a codemeta.json file for a VM versus a codemeta.json file for a piece of code linked to a paper), will likely require these formats to be extended, augmented, or crosswalked to other standard formats. Getting people to create either of these files, though, is another matter entirely.
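To make the idea of a crosswalk concrete, here is a minimal sketch mapping a handful of CITATION.cff keys to CodeMeta-style properties. It covers only a few fields; a real crosswalk would need to handle the full vocabularies of both formats, and the example input values are invented.

```python
# Minimal crosswalk sketch: Citation File Format (CFF) keys to
# CodeMeta-style properties. Only a few fields are mapped; a real
# crosswalk must cover both vocabularies in full.

def cff_to_codemeta(cff):
    """Map a parsed CITATION.cff dict to a CodeMeta-style dict."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": cff.get("title"),
        "version": cff.get("version"),
        "identifier": cff.get("doi"),
        "author": [
            {
                "@type": "Person",
                "givenName": person.get("given-names"),
                "familyName": person.get("family-names"),
            }
            for person in cff.get("authors", [])
        ],
    }

# Invented example input (the DOI is a placeholder):
example_cff = {
    "title": "My analysis tool",
    "version": "0.3.1",
    "doi": "10.xxxx/example",
    "authors": [{"given-names": "Ada", "family-names": "Lovelace"}],
}

codemeta = cff_to_codemeta(example_cff)
```

Even this toy mapping shows where the friction lies: CFF and CodeMeta split author names with different key names, so any automated crosswalk has to encode those correspondences explicitly.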

Implementation is still about people

People do not generally document their work to the degree needed to adequately “archive” the fully reproducible execution of the kinds of complicated objects we are focusing on. That said, you can always ask, and you can work on developing tools and protocols that are easily implemented at various stages of the software development and research process. Discipline-specific minimal requirements and expectations need to be robustly discussed and debated. In the scholarly landscape, publishers and funders also need to contribute: without knowing how much funders value the software they fund, disciplinary communities will not be able to adequately assess the degree to which they should be documenting and assigning metadata to what they make.


Want to discuss this post with us? Send us an email or contact us on Twitter @SoftwareSaved.