Life sciences data needs to be FAIR

Posted by s.aragon on 30 January 2018 - 9:59am

By Justin Clark-Casey, Research Software Engineer, Department of Genetics, University of Cambridge.

InterMine is an open-source life sciences data integration platform created at the University of Cambridge in the UK. It takes biological data from many sources and combines it into a unified whole, adding visualisations and search facilities that aim to give scientists and engineers insights that can accelerate research and the growth of the bioeconomy. Over the next three years, through a grant from the UK Government’s BBSRC funding agency, the InterMine team is going to be making sure that the data it makes available is as findable, accessible, interoperable and reusable (“FAIR”, for short) as possible. Keep reading to find out why we believe that making the world of data a “FAIRer” place is so important.

The Oxford Dictionaries define data as the “facts and statistics collected together for reference or analysis”. We’re gathering ever more of it every year, and it’s no wonder that The Economist, for one, calls it “a driver of growth and change”, creating “new infrastructure, new businesses, new monopolies, new politics and—crucially—new economics”.

The life sciences and biomedical domains are no different. High-throughput technologies such as next-generation DNA sequencing have blossomed in recent decades, kick-started by initiatives such as the Human Genome Project and driven forward by their immense value in the lab and the clinic. As the name “high-throughput” suggests, these techniques are generating more data than ever before. As of the end of 2015, the European Bioinformatics Institute (whose tagline is “The home for big data in biology”) was storing 75 petabytes’ worth of data, a quarter of the amount that Facebook was storing in the preceding year.

But this is just the beginning. By 2025, genomics data acquisition, storage, distribution and analysis requirements are predicted to match or exceed the requirements of astronomy, YouTube or Twitter. This includes not only the types of data we already collect, such as human genome sequences, but also data coming from the fast expansion of fields such as proteomics (the large-scale study of proteins), phenomics (the systematic study of physical and biochemical traits) and metagenomics (the study of genetic material gathered directly from the environment, instead of from selected organisms). On top of that, pile biomedical data from traditional and new sources such as the Internet of Things (think of the personal activity tracker on your wrist) and mobile-phone health apps. Data mountains are becoming entire data countries, entire continents, entire planets.

Yet this landscape of data is critical if we are to address present and future challenges in science and medicine. We need it for cancer research, for drug development, for tackling the iniquities of aging. We need it to monitor and take care of the environment, and produce goods and services more cleanly and cheaply through biology.

A data landscape is no good, however, if you can’t navigate it, if you can’t find within it what you need for your work or life. Modern search engines are good for questions of daily existence (“What does insulin do?”, “What are the best migraine medicines?”) but harder to use for scientific and medical research questions. Analysing and ranking the relevance of terse and technical scientific language is difficult; the same gene name can refer to different genes from different organisms; and searching for an antibiotic can bring up an unwanted mixture of information about the antibiotic itself and about experiments where it was merely used as a tool for studying something else.

You also can’t be sure that a search engine has access to the most relevant data. Sometimes this is for very good reasons—for instance, when the data is about human beings you want to be sure that no-one’s privacy is compromised, which can mean restricting access to authorised researchers and anonymising records. But at other times data interfaces are simply ill-designed for computers to access. If we are going to run machine learning and other big data technologies over our enormous datasets we need to make them easy for machines to access.

Suppose you can find and access the data that you want. Can you understand it? Can you interpret the text of a paper or the column headers of a spreadsheet to know how that information integrates with all the other data that you have? In other words, are the datasets interoperable with each other? If you’re human, then you integrate data all the time. Let’s say you’re a researcher studying the effects of exercise on breast cancer. You have some NHS data tables for patients with some rows labelled “BRCA1”, the official symbol for a breast cancer gene. With custom software, you can link this with exercise levels from activity tracker data. What’s more, you know that “BRCA1” is sometimes called “breast cancer 1”, and that the standard Medical Subject Headings (MeSH) identifier for exercise is “D015444”, so you can use these terms to find existing relevant medical studies. Tying disparate data sources together like this is essential for moving research forward. But it’s still really hard for machines to do, despite advances in machine learning.
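To make the integration step a little more concrete, here is a minimal Python sketch of the kind of linking a researcher does in their head. Everything in it is an assumption made for illustration: the synonym table, the field names (`patient_id`, `gene_label`, `daily_steps`) and the sample records are invented, and real patient data would of course need far more careful handling.

```python
# Hypothetical synonym table: maps lower-cased labels to an official gene symbol.
GENE_SYNONYMS = {
    "brca1": "BRCA1",
    "breast cancer 1": "BRCA1",
}


def canonical_gene(label):
    """Return the official symbol for a label, or None if it is not recognised."""
    return GENE_SYNONYMS.get(label.strip().lower())


def link_records(patient_rows, activity_by_patient):
    """Join patient rows to activity-tracker data on a shared patient identifier."""
    linked = []
    for row in patient_rows:
        gene = canonical_gene(row["gene_label"])
        steps = activity_by_patient.get(row["patient_id"])
        if gene is None or steps is None:
            continue  # skip rows we cannot interpret or cannot match
        linked.append(
            {"patient_id": row["patient_id"], "gene": gene, "daily_steps": steps}
        )
    return linked


# Invented example data, not real records.
patients = [
    {"patient_id": "p1", "gene_label": "BRCA1"},
    {"patient_id": "p2", "gene_label": "Breast Cancer 1"},
    {"patient_id": "p3", "gene_label": "unknown gene"},
]
activity = {"p1": 8200, "p2": 4100}

linked = link_records(patients, activity)
```

The human knowledge lives in the synonym table. A machine can only do this join if someone has published that mapping in a form it can read, which is the interoperability problem in miniature.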

Okay, so let’s assume you’ve found, accessed and mashed interoperable data together, ready for use in settings such as research or product development. Do you have the rights to reuse everything that you’ve integrated? This is a question of ever more importance as data increases in value, is drawn from ever more sources and may be subject to terms that restrict republishing. You don’t want to spend years on a project or product only to find that you need to spend unexpected money on data licensing fees or, even worse, that you simply can’t trace the license associated with some data, which prevents you from using it at all.

None of these problems are entirely new. But as the data landscape we’re building grows ever broader and richer, they’re coming increasingly to the fore, both for ourselves and for the machines we rely upon to handle our data.

How can we respond? That’s what the FAIR initiative is about: a coordinated and sustained effort to build standards, guidelines and software to meet the coming data challenges. As you might guess from the initials, it’s aimed at making data maximally Findable, Accessible, Interoperable and Reusable, all of the areas that we discussed above. It’s an effort drawing in many organisations and individuals from science, industry, government and beyond.

So what are the proposed solutions? FAIR data is a broad and evolving idea, and there are any number of proposed practices and technologies to promote it. To give some examples, one aspect of findability is better search for specific scientific data, as we touched on before. This means better and more structured web-accessible descriptions of datasets, so that both biology-specific tools and general search engines like Google can make data of interest easier to locate. An aspect of accessibility is making data easier to retrieve. For instance, if we want to find out what a particular biological database knows about a protein, we want to go to a simple URL rather than entering details in some custom webpage. Interoperability means using common data formats and structures created by organisations such as the World Wide Web Consortium (W3C) rather than coming up with our own. And reusability means adding license information to all downloaded data, in a format that machines can understand.
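As a rough illustration of what a structured, machine-readable dataset description with license information might look like, here is a small Python sketch that builds one as JSON-LD using the schema.org `Dataset` vocabulary. This is only one possible approach, chosen here as an assumption and not something this post commits InterMine to. The helper `describe_dataset`, the dataset name and the `data.example.org` URL are all made up for the example.

```python
import json


def describe_dataset(name, description, url, license_url):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,              # a simple, stable URL for the data
        "license": license_url,  # machine-readable reuse terms
    }


# Invented values, used only to show the shape of the description.
metadata = describe_dataset(
    name="Example protein annotations",
    description="Invented dataset used only to illustrate the idea.",
    url="https://data.example.org/protein/P12345",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# Serialise to JSON-LD text that could be embedded in a web page or a download.
json_ld = json.dumps(metadata, indent=2)
```

A description like this touches three of the four letters at once: search engines can index it (findable), the `url` points straight at the data (accessible), and the `license` field tells a program whether it may reuse what it finds (reusable).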

If you’re interested in following InterMine's journey to make it more FAIR, or in collaborating on code or ideas to make the world of data a FAIRer place in general, then please visit InterMine's blog or tweet at @intermineorg. And on my personal blog I’m always happy to discuss FAIR, open science, and the growing knowledge graph ecosystem that surrounds it.