How do we improve data management in machine learning?

Posted by j.laird on 6 July 2021 - 10:00am

Image: Lego, by Rick Mason via Unsplash

By Jonathan Frawley (Durham University), Emily Lewis (UCL/UKAEA), Stephen Haddad (Met Office Informatics Lab) and Carlos Martinez (Netherlands eScience Center).

This blog post is part of our Collaborations Workshop 2021 speed blog series.

A large part of machine learning (ML), and data science in general, is organising and preparing data so that it is ready for analysis. Though crucial, many report this as the least interesting and most difficult part of the machine learning pipeline. Time spent wrangling a dataset could be used to refine models or test new algorithms. Instead, researchers spend large amounts of precious time wondering which data format to use or even where to host their dataset. There are many resources available for the later stages of the ML process, but comparatively few for these early questions of data engineering.

A data scientist might ask: what is the best method for handling my empty data and NaN values? In the best case, it might be that the ML model handles empty data values fine and no preprocessing is necessary. In the worst case it may not train at all! Would it be appropriate to replace those null values with zero? Or would it be better to impute those values? Or drop those features completely?
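To make those three options concrete, here is a minimal sketch in plain Python. The readings and the helper names (fill_with_zero, impute_with_mean, drop_missing) are invented for this illustration; in a real project a library such as pandas or scikit-learn would usually do this work.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_with_zero(column):
    """Replace every missing entry with 0.0."""
    return [0.0 if is_missing(v) else v for v in column]


def impute_with_mean(column):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in column]


def drop_missing(column):
    """Discard missing entries entirely."""
    return [v for v in column if not is_missing(v)]


# Hypothetical temperature readings with two gaps (one NaN, one None).
readings = [12.0, float("nan"), 15.0, None, 18.0]

print(fill_with_zero(readings))    # gaps become 0.0, which lowers the average
print(impute_with_mean(readings))  # gaps become the mean of the observed values
print(drop_missing(readings))      # fewer entries, but no invented values
```

None of the three is right in general: zero-filling distorts the distribution, mean imputation hides the fact that a value was missing, and dropping entries throws away data. Which trade-off is acceptable depends on the problem.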

This can start to feel like a rabbit hole of branching questions with unclear and context-dependent answers. Best practices are highly problem specific, often gained from informal experience and not well documented. They also draw on information from many different domains such as statistics, programming and algorithm design.

A key challenge is to improve access to this knowledge and enhance data management for all researchers. What are these best practices and how do we share them most effectively?

In this post we discuss several common problems, suggest individual approaches and propose the RSE Toolkit as a possible solution for sharing knowledge, along with Zenodo for sharing data.

Common problems

Although the specific details of issues faced by data scientists are as diverse as can be, there are certain similarities across common problem types. The following is a broad categorisation of challenges encountered. Most data scientists working in ML should recognise some of them - and if you have never faced these issues, please share your secret with us!

Missing data -- incomplete observations, missing columns, gaps in time series, you name it. Incoming data is never as complete or perfect as our ML algorithms would like it to be. Sometimes you can interpolate, sometimes you can deduce, sometimes you need a crystal ball to figure out what those missing values are.
Data wrangling -- transforming data from what it is to what it should be is also a common challenge. For a human capturing data, “1.0”, “one” and 1 are all the same, but if you want to feed them to an ML algorithm, you had better make sure they all have the same format.
Sharing datasets -- after all the effort required to prepare a dataset, you sometimes want to share that with your colleagues. And although arguably there are many options for sharing large volumes of data, it is difficult to decide the most efficient way of doing so.
Working with sensitive data -- alternatively you do not always want to share your data! Sometimes (particularly when working with personal or medical information), it’s important to make sure you are not accidentally sharing any sensitive data and safeguard access.
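
The data wrangling example above (“1.0”, “one” and 1) can be sketched in a few lines of Python. The WORD_TO_NUMBER lookup and the to_float helper are assumptions made for this illustration, not part of any library, and a real dataset will need its own rules.

```python
# Hypothetical lookup for spelled-out numbers; extend it as the data requires.
WORD_TO_NUMBER = {"zero": 0.0, "one": 1.0, "two": 2.0, "three": 3.0}


def to_float(raw):
    """Convert a raw entry to a float, or return None if it cannot be parsed."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; do not silently read True as 1.0.
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    if isinstance(raw, str):
        text = raw.strip().lower()
        if text in WORD_TO_NUMBER:
            return WORD_TO_NUMBER[text]
        try:
            return float(text)
        except ValueError:
            return None
    return None


# Hypothetical entries as a human might have typed them.
raw_entries = ["1.0", "one", 1, " 2 ", "n/a"]
cleaned = [to_float(entry) for entry in raw_entries]
print(cleaned)
```

Entries that cannot be parsed come back as None rather than raising an error, so they can be handled afterwards as ordinary missing data.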

Individually tailored solutions

Ultimately researchers are dealing with data specific to their particular problem. No tool, library or code is going to generalise perfectly to all cases. However most workflows can be decomposed into a series of standard problems with known solutions. Drawing out general patterns contributes to a set of best practices a researcher can use to inform their own pipeline.

Returning to the common problem of missing data: irrespective of the domain we can identify the specific type of data present. Among other things it might be categorical or numerical, bounded floating point or fixed, or even a set of pixel images. Depending on type and context, we can then decide appropriate approaches for performing data in-fill.

For example, numerical data lends itself to imputation, which in turn allows models that cannot handle NaN values to be used later in the pipeline. Knowing how to snap together these modular, interchangeable “Lego bricks” of generic data science enables researchers to build a solution for their specific problem.
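
One way to picture these bricks is as small functions selected by data type. The sketch below is only an illustration under simple assumptions: the table is a dict of lists, gaps are marked with None, and the names fill_numerical, fill_categorical, BRICKS and fill_table are invented for the example. Median and most-common-label are just two of many possible in-fill choices.

```python
from collections import Counter
from statistics import median


def fill_numerical(column):
    """Fill gaps (None) in a numerical column with the median of observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in column]


def fill_categorical(column):
    """Fill gaps (None) in a categorical column with the most common label."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


# Each "brick" is keyed by the kind of data it handles.
BRICKS = {"numerical": fill_numerical, "categorical": fill_categorical}


def fill_table(table, column_kinds):
    """Apply the matching brick to every column of a dict-of-lists table."""
    return {
        name: BRICKS[column_kinds[name]](values)
        for name, values in table.items()
    }


# Hypothetical two-column dataset with one gap in each column.
table = {
    "height_cm": [150.0, None, 170.0, 180.0],
    "colour": ["red", "blue", None, "red"],
}
kinds = {"height_cm": "numerical", "colour": "categorical"}

filled = fill_table(table, kinds)
print(filled)
```

Supporting a new kind of data, such as images, means adding one more function to BRICKS; the rest of the pipeline stays the same.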

Lego sets usually come with an instruction booklet. The machine learning equivalent would be a manual of standard data science approaches with examples of how to combine them. Such generalised decision making frameworks would be helpful in many aspects of data management.

Integration with RSE Toolkit

The RSE Toolkit project aims to create a curated list of resources to support RSEs in creating robust and sustainable code. This is the ideal platform to host our proposed data science manual. Our goal is to compile best practices and knowledge to incorporate guidelines for machine learning projects into the RSE Toolkit. This will reduce duplication of effort among RSEs and researchers working in data science. We hope this will be of use to the community and welcome contributions.

Data sharing infrastructure issues

Sharing datasets requires dependable infrastructure. The MNIST database of handwritten digits is a widely used resource for machine learning examples. PyTorch, one of the most commonly used Python libraries for machine learning, mentions http://yann.lecun.com/exdb/mnist as the official source to download this dataset. This is a potential problem because the data is hosted on the website of Professor Yann LeCun. Although this is where the dataset has historically been found, nothing guarantees that the website will remain available in the long term, so the sustainability of the dataset is not guaranteed either. This in turn translates into a sustainability risk for PyTorch examples.
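
Code that depends on a single download location can reduce this risk by trying several sources in turn and checking each download against a known checksum. The following is a rough sketch of that idea using only the Python standard library. The function names and the in-memory demo are assumptions made for illustration; this is not how PyTorch or any other library actually fetches MNIST.

```python
import hashlib
import urllib.request


def download(url):
    """Fetch the raw bytes at url (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_verified(urls, expected_sha256, fetch=download):
    """Try each URL in order; return the first payload whose SHA-256 matches."""
    errors = []
    for url in urls:
        try:
            payload = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append(f"{url}: {exc}")
            continue
        if hashlib.sha256(payload).hexdigest() == expected_sha256:
            return payload
        errors.append(f"{url}: checksum mismatch")
    raise RuntimeError("no source succeeded:\n" + "\n".join(errors))


# Offline demo: a fake fetch function stands in for the network.
# The "mirror" names are placeholders, not real download locations.
demo_data = b"pretend this is a dataset archive"
demo_sha256 = hashlib.sha256(demo_data).hexdigest()


def fake_fetch(url):
    if url == "primary-mirror":
        raise OSError("host unreachable")
    return demo_data


result = fetch_verified(["primary-mirror", "backup-mirror"], demo_sha256, fetch=fake_fetch)
print(result == demo_data)
```

The checksum matters as much as the fallback: without it, a mirror that serves a truncated or altered file would go unnoticed.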

Zenodo offers a solution for this issue. By default Zenodo has a 50 GB limit per record, although higher quotas can be requested and granted on a case-by-case basis. Additionally, Zenodo is hosted by CERN and promises to retain records for the lifetime of the laboratory, which currently has an experimental programme defined for at least the next 20 years.

Summary

By openly sharing our expertise on data management, we can help researchers spend more time on research and less time wrangling data. Zenodo offers a long-term location for research data, which will allow researchers in the future to more easily reproduce results.