Data sustainability: a polar perspective
Posted on 8 October 2012
By Kathryn Rose, Agent and postdoctoral researcher, British Antarctic Survey.
The Polar Regions represent some of the most remote and harsh environments on Earth, and yet they have always drawn our attention, forming an international arena for science, exploration and discovery. The importance of these regions as drivers of global environmental change has become astoundingly evident. There has been a growing need to ensure the sustainability of the data produced from these remarkable international efforts. Access to data sets will become even more critical as economic and political interest continues to grow in the Polar Regions.
At the end of the nineteenth century it was obvious to scientists that insights into these unique regions would be best gained through collaborative expeditions and synchronous long-term observations. From this concept the first International Polar Year (IPY) was born (1882-1883). Each successive IPY has seen a marked increase in the quantity and quality of data generated from the Polar Regions.
In May 2012, an international conference held in Montreal, Canada, marked the close of the fourth IPY (2007-2008). The conference emphasised the need to share, as effectively as possible, the significant quantity of data collected during the IPY. What do we mean by data sustainability? It can relate to all aspects of the scientific process, from careful, standardised data collection through to collaborative data sharing and archiving. Importantly though, to be sustainable, we require a fully accessible long-term archive of data and an associated, fully integrated software system that can be used, or developed, to help encourage the everyday scientist to share and document their findings. The overarching driver behind this is to try to ensure that all IPY endeavours are exploited to their fullest, in terms of both their scientific merit and their relevance to the public and policy makers, a factor that is even more pertinent as we face today’s global economic downturn. The two main organisations that oversee polar science, the International Arctic Science Committee (IASC) and the Scientific Committee on Antarctic Research (SCAR), are working to this end. For Antarctic scientists, the free exchange of data is also an integral part of the Antarctic Treaty that upholds the culture of science, discovery and knowledge advancement at the heart of the Antarctic continent.
Long-term data sustainability requires a good data-management infrastructure. For Antarctic data management, the breadth of research disciplines actively pursued south of 60 degrees can, in itself, be very challenging. Data sets can vary enormously in terms of the way they’re collected (instrumentation and software used), their size, format and organisation. In addition, research projects often result from significant international collaboration, so that they may become strongly multidisciplinary and potentially disparate. Problems may also arise if the requirements for data dissemination and management vary between countries. Ideally we require a globally accessible framework in which data can be handled in a consistent manner. So what data management strategies do we already have in place?
Examples include the Antarctic Master Directory, which represents a high-level database that is part of the Global Change Master Directory. The latter was set up to reference what work has been done and where, in order to increase collaboration and prevent repetition. This is particularly useful in areas, such as the Polar Regions, where many international scientific organisations work. During the IPY, an IPY Data Policy was established and applied to all IPY-funded projects, requiring that datasets be made publicly available. An IPY Data and Information Service (IPYDIS) was also launched in order to ensure the long-term preservation of scientific data by developing archives (complete with metadata) and data centres. The aim of IPYDIS has been to create a fully integrated data-sharing system that will catalogue all IPY data in globally and freely accessible archives. Additionally, many more localised, discipline-specific data repositories also exist. For the Antarctic, SCAR has established SC-ADM (the SCAR – Antarctic Data Management group), designed to coordinate and manage Antarctic data sets.
In the last decade, the polar science community has taken large steps to develop mechanisms by which scientists can more readily exchange information. A notable example is SCAR’s Marine Biodiversity Information Network (SCAR-MarBIN). Due to the importance of the Southern Ocean to ecological systems, a web portal was established to compile and manage data on marine biodiversity across this vast area. This significant project provides the first complete register of Antarctic marine species. The system allows species to be visualised and mapped, and information can be browsed and downloaded. As a result, scientists can highlight key areas that are undergoing change, or that need further investigation or indeed protection. So whilst the benefits of such data sharing seem obvious, and there appears to be a growing number of online resources available for data management, what are the attitudes towards data-sharing systems and are they actually used?
In an informal survey, scientists were asked not only if they wanted open access to data sets but also if they were currently sharing their data. The answer was typically yes to the former, but often no to the latter. This throws up several questions on how and when people share their data, whether there are suitable means available for people to share their data and whether or not this process should be made compulsory. It seems that data preservation and access are often considered to be afterthoughts and regularly left with data managers to deal with. There is a growing awareness that further consideration for data management and sustainability needs to be made during data collection and even at the grant writing stage.
There are several ways in which communities are trying to enhance data sharing. In one example, new data management software linked to website interfaces is being developed to try to persuade the everyday scientist to engage in data management during the course of a project. As a front runner, the Antarctic Biodiversity Information Facility (ANTABIF) is a data repository that standardises, cleans, and adds metadata to datasets. To encourage scientists to make their data available, ANTABIF has developed software with a semi-automatic paper-producing system. The simple online software interface is designed to be akin to EndNote. The user enters a description of the data in a project and the processing steps used, and the system organises a paper according to the completed metadata fields. The paper produced can be edited before it’s submitted for peer review, but it is then available online with its data. It is hoped that such open access websites will make datasets more readily accessible, well advertised and thus cited. The aim is to encourage collaborations, prevent study repetition and generally allow people to use good datasets to the full. At the moment this tool is just for biodiversity data, but the format could be transferred to other research themes to make the whole system of sharing information, science and ideas easier.
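To make the idea of a paper organised from completed metadata fields more concrete, here is a minimal Python sketch. It is not ANTABIF's software: the `DatasetMetadata` class, its field names, the section headings and the `draft_data_paper` function are all hypothetical, chosen only to illustrate how a draft might be assembled from whichever fields a scientist has filled in.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    title: str
    authors: list[str]
    description: str = ""
    processing_steps: list[str] = field(default_factory=list)
    repository_url: str = ""


def draft_data_paper(meta: DatasetMetadata) -> str:
    """Arrange the completed metadata fields into a plain-text draft.

    Empty fields are skipped, so the draft only contains sections
    for which the scientist has actually supplied information.
    """
    sections = [meta.title, "Authors: " + ", ".join(meta.authors)]
    if meta.description:
        sections.append("Data description\n" + meta.description)
    if meta.processing_steps:
        # Number the processing steps in the order they were entered.
        steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(meta.processing_steps, 1)
        )
        sections.append("Methods\n" + steps)
    if meta.repository_url:
        sections.append("Data availability\n" + meta.repository_url)
    return "\n\n".join(sections)
```

In a sketch like this, the scientist only ever fills in the record; the ordering and headings of the draft come from the function, which is one plausible reading of "semi-automatic". The resulting text would still need editing before peer review, as described above.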
In other areas, such as genetics, publications will not accept manuscripts unless the data (genetic code) discussed has already been submitted to an open access data repository (e.g. GenBank as described in this Nature article). Some funding organisations, such as the USA’s National Science Foundation, require that data be made available within a stated time following collection. These approaches highlight that there needs to be a balance between free and open access to data and metadata, and protecting the rights of investigators. As the tools for data management and data sharing become more readily available and user-friendly, it seems likely that uptake will increase. A combined carrot and stick approach may be the most successful in driving this process, and may ultimately result in greater sustainability of polar datasets.