Software and research: the Institute's Blog

Docker helps biofuels research

By Scott Edmunds, Executive Editor at GigaScience.

With greater awareness of the difficulties in making scientific research more reproducible, numerous technical fixes have been suggested to move publishing away from static and often irreproducible papers - which have changed little since the 17th century - towards reproducible digital objects that better fit 21st-century technology. New research in the Open Access journal GigaScience demonstrates one such approach: publishing open data and code in containerised form using Docker. It also helps scientists tackle another scourge of the 21st century, climate change, through a better understanding of biofuel production.
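To make the containerised approach concrete, the sketch below shows how a reader might re-run an analysis published alongside a paper, using the Docker SDK for Python (the image name and command here are hypothetical placeholders, not those from the GigaScience study):

    # A minimal sketch of re-running a published, containerised analysis.
    # The image name and command are hypothetical placeholders.
    import docker

    client = docker.from_env()
    logs = client.containers.run(
        "example/biogas-metagenomics:1.0",  # hypothetical published image
        command="make reproduce",           # hypothetical analysis entry point
        remove=True,                        # clean up the container afterwards
    )
    print(logs.decode())

Because the container pins the code, its dependencies and the runtime environment, the same command should produce the same results on any machine with Docker installed; the command-line equivalent is a single docker run invocation.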

One of the most promising areas in biofuel development is biogas, which has huge potential as a renewable and clean source of energy. Biogas is methane produced through the anaerobic digestion (fermentation) of organic matter such as agricultural or food waste. Detailed knowledge of how the fermentation process works is key to optimising it. However, the vast majority of the microbes involved remain unknown and cannot be cultivated in laboratories.

Making code citable with Zenodo and GitHub

By Megan Potter & Tim Smith, CERN.

For Open Science, it is important to cite the software you use in your research, as has been mentioned in previous articles on this blog. In particular, you should cite any software that made a significant or unique impact on your work. Modern research relies heavily on computerised data analysis, and we should elevate its standing to a core research activity, with data and software as prime research artefacts. Steps must be taken to preserve and cite software in a sustainable, identifiable and simple way. This is where digital repositories like Zenodo can help.

Best practice for citing a digital resource like code is to refer to a digital object identifier (DOI) for it whenever possible. This is because DOIs are persistent identifiers that can only be obtained from an agency that commits to maintaining a reliable level of consistency in, and preservation of, the resource. As a digital repository, Zenodo registers DOIs for all submissions through DataCite and preserves these submissions on the safe and trusted foundation of CERN's data centre, alongside the biggest scientific dataset in the world, the LHC's 100PB Big Data store. This means that code preserved in Zenodo will be accessible for years to come, and its DOIs will function as perpetual links to the resources. DOI-based citations remain valuable because they are future-proofed against URL or even protocol changes, through resolvers such as doi.org, which currently direct to URLs. DOIs also help discoverability tools, like search engines and indexing services, to track software usage across different citations, which in turn elevates the reputation of the programmer.
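To see what this buys you in practice, the short Python sketch below follows a DOI through the doi.org resolver to whatever landing page it currently points at (the DOI shown is a hypothetical Zenodo identifier, not a real record):

    # Resolve a DOI via doi.org: the persistent identifier redirects to the
    # record's current landing page, so citations survive URL changes.
    # The DOI below is a hypothetical example.
    import requests

    doi = "10.5281/zenodo.123456"
    response = requests.get("https://doi.org/" + doi,
                            allow_redirects=True, timeout=30)
    print(response.url)          # the landing page the DOI resolves to today
    print(response.status_code)  # 200 if the record is reachable

If Zenodo ever reorganises its URLs, the DOI stays the same and the resolver simply redirects to the new location.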

Is your research software correct?

By Mike Croucher, Fellow of the Software Sustainability Institute. This article was originally posted on his blog, www.walkingrandomly.com.

You’ve written a computer program in your favourite language as part of your research and are getting some great-looking results.

The results could change everything! Perhaps they'll influence world economics, increase understanding of multidrug resistance, improve health and well-being for the population of entire countries, or help with the analysis of brain MRI scans.

Thanks to you and your research, the world will be a better place. Life is wonderful; this is why you went into research. 

It’s just a shame that you’re completely wrong but don’t yet know it.
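How might that happen? Here is a toy Python example (our illustration, not Mike's) of the kind of silent mistake that produces plausible-looking but wrong numbers:

    # A toy example of how code can silently give wrong but plausible results:
    # NumPy's std defaults to the population formula (ddof=0), not the sample
    # formula that many statistical analyses assume.
    import numpy as np

    data = np.array([1.2, 1.9, 2.1, 2.8])
    print(np.std(data))          # population standard deviation (ddof=0)
    print(np.std(data, ddof=1))  # sample standard deviation (ddof=1)

Both numbers look entirely reasonable; only one is the quantity your analysis probably intended.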

EclipseCon France 2015

By Boris Adryan, Fellow of the Software Sustainability Institute

The Eclipse Foundation is an open-source community and home to tools that primarily revolve around software development. Its best-known project is the Eclipse IDE, which has become the standard integrated development environment for Java programmers. Eclipse is a broad church, however, and provides a platform for collaborative projects ranging from software (writing) to interest groups (talking).

At the Software Sustainability Institute's "Selection Day 2014" I met Tracy Miranda from Kichwa Coders, a small software consultancy that works closely with DAWNScience, an open-source data science project developed at the Diamond Light Source by my co-Fellow Mark Basham and others. Tracy is not only a superb developer, but also heads the relatively new Eclipse Science interest group. So far Eclipse Science has only a few projects in its portfolio (for example DAWNScience), but the contributors face many of the same challenges that the Software Sustainability Institute has recognised. The Eclipse Foundation therefore asked me to wear both my academic hat and my Institute Fellow's hat, and invited me to give a talk on "Better Software, Better Research" and the mission of the Software Sustainability Institute. The slides of my talk are online, and in the following I'm going to summarise a number of insights I gained at the conference.

Institute support for EPSRC RSE Fellows and RSE Community

RSE AGM and Hackday

By Neil Chue Hong, Director.

The Institute is delighted by the large number and breadth of applications to EPSRC's pilot Research Software Engineer Fellowship scheme, which represents a potential step change in the way that research software development is undertaken in the UK. As one of the original proponents of the term 'Research Software Engineer', and a constant supporter of the RSE career path and community, we will continue to help the community achieve its goals. In particular, we are committed to working with the RSE community and successful RSE Fellows in a number of ways, including continued support for the UK RSE Association and helping to establish a network for RSE Group Leaders with Sheffield, Manchester, Southampton and UCL.

Data Carpentry goes to the Netherlands

By Aleksandra Pawlik, Training Lead.

Last week the Institute helped to run a Data Carpentry hackathon and workshop at the University of Utrecht in the Netherlands. Both events were part of an ELIXIR Pilot Project aiming to develop Data and Software Carpentry training across the ELIXIR Nodes. The project is coordinated by ELIXIR UK, and a number of other Nodes are partnering up, including ELIXIR Netherlands, ELIXIR Finland and ELIXIR Switzerland.

The hackathon consisted of two days during which the participants, representing ten ELIXIR Nodes, worked on Data Carpentry training materials. Day one started with an introduction to the Data and Software Carpentry teaching model, followed by a review of, and discussion on, the existing materials. The participants suggested possible improvements to the existing materials and new topics to be developed. The overall theme of the hackathon was genomics, so the participants could base their work on the existing content for teaching genomics in Data Carpentry. Eventually three groups were formed:

  • Group 1, which worked on creating training materials on using ELIXIR Cloud resources.
  • Group 2, which worked on a decision tree for using cloud computing.
  • Group 3, which worked on different aspects of understanding how to use one's data for genomics. In particular, the group worked on describing file formats, file manipulation, pipeline integration, post-assembly de novo RNA transcriptome analysis, handling BLAST annotation output, and verifying data.

Revealing the magic of field theory algebra

By Paul Graham, EPCC and Software Sustainability Institute.

We have a new project working with Dr Kasper Peeters of Durham University and his software, Cadabra: a computer algebra system which can perform symbolic algebraic computations in classical and quantum field theory. In contrast to other software packages, Cadabra was written with this specific application area in mind, and it addresses points where more general-purpose systems are unsuitable or require excessive amounts of additional programming to solve the problems at hand.

Cadabra has extensive functionality for tensor computer algebra: tensor polynomial simplification including multi-term symmetries, fermions and anti-commuting variables, Clifford algebras and Fierz transformations, implicit coordinate dependence, multiple index types, and more. The input format is a subset of TeX, and it supports both a command-line and a graphical interface.
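To give a flavour of what "multi-term symmetries" means, consider the standard example of the Riemann tensor (a textbook illustration of ours, not taken from the Cadabra documentation). Its mono-term symmetries relate one term to another, while the cyclic (first Bianchi) identity only holds across a sum of terms:

    \begin{align*}
    R_{abcd} &= -R_{bacd} = -R_{abdc} = R_{cdab} && \text{(mono-term symmetries)} \\
    0 &= R_{abcd} + R_{acdb} + R_{adbc} && \text{(cyclic identity: multi-term)}
    \end{align*}

A simplifier that only canonicalises the indices of each term separately can exploit the first line, but proving that an expression vanishes by the second line requires relating several terms at once - exactly the multi-term machinery that Cadabra provides.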

The ultimate brain-dump: unifying Neuroscience software development

By Dr Robyn Grant, Lecturer in Comparative Physiology and Behaviour at Manchester Metropolitan University.

In the neurosciences, we produce tons of data in many different forms – images, electrophysiology and video, to name but a few. If we are really going to work together to answer big-picture questions about the brain, then the different data types need to start interacting and building on information from each other. I understand this; however, it is quite complex in practice, and it raises questions about how best to specify data types, annotations and formats so that researchers can develop software and hardware that interface efficiently.

A first step towards a unifying concept for data, software and modelling is the Human Brain Project (HBP). This is a European funding initiative that has received a lot of criticism globally for its big-picture thinking and focus on human brain experiments. At the British Neuroscience Association meeting 2015 this spring, I attended a session on the HBP, interested to see what they might say.

The Light Source Fantastic: a bright future for DAWN

By Steve Crouch, Research Software Group Leader, talking with Matt Gerring, Senior Software Developer at Diamond Light Source and Mark Basham, Software Sustainability Institute Fellow and Senior Software Scientist at Diamond Light Source.

This article is part of our series: Breaking Software Barriers, in which we investigate how our Research Software Group has helped projects improve their research software. If you would like help with your software, let us know.

Building a vibrant user and developer community around research software is often a challenge. But managing a large, successful community collaboration that is looking to grow presents its own challenges. The DAWN software supports a community of scientists who analyse and visualise experimental data from the Diamond Light Source. An assessment by the Institute has helped the team to not only attract new users and developers, but also increase DAWN’s standing within the Eclipse community.

The Diamond Light Source is the UK's national synchrotron facility, based at the Harwell Campus in Oxfordshire. It accelerates electrons to near light speed, causing them to give off light that is 10 billion times brighter than the sun. Over 3000 scientists have used this light to study all kinds of matter, including new medicines and disease treatments, structural stresses in aircraft components, and fragments of ancient paintings, to name but a few.

How to avoid having to retract your genomics analysis

By Yannick Wurm, Lecturer in Bioinformatics, Queen Mary University of London.

Biology is a data science

The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high-performance computing.

This is exciting and empowering - in particular for small teams working on emerging model organisms that lacked genomic resources. But with great power comes great responsibility... and risks that things could go wrong.

These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. This is because:

  • Biologist researchers generally learn data handling skills ad hoc and have little opportunity to gain knowledge of best practices.
  • Biologist principal investigators - having never themselves handled huge datasets - have difficulty critically evaluating the data and approaches.
  • New data are often messy, with no standard analysis approach; even so-called standard analysis methodologies generally remain young or approximate.
  • Analyses intended to pull out interesting patterns (e.g. genome scans for positive selection, GO/gene set enrichment analyses) will also enrich for mistakes or biases in the data.
  • Data generation protocols are immature and include hidden biases, leading to confounding factors (when the things you are comparing differ not only in the trait of interest but also in how they were prepared) or pseudoreplication (when a single independent measurement is treated as multiple independent measurements), as the toy sketch below illustrates.
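To make the pseudoreplication trap concrete, here is a toy Python sketch (our illustration, not from the original post): six biological samples per group are each measured ten times, and treating all sixty technical measurements as independent observations makes the group comparison look far more certain than the six real samples warrant:

    # Toy illustration of pseudoreplication: technical replicates treated as
    # independent observations inflate confidence. All numbers are made up.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_samples, n_tech = 6, 10
    # Biological variation dominates; the true group difference is small.
    means_a = rng.normal(0.0, 1.0, n_samples)
    means_b = rng.normal(0.3, 1.0, n_samples)
    # Each biological sample is measured n_tech times with small technical noise.
    a = np.repeat(means_a, n_tech) + rng.normal(0, 0.1, n_samples * n_tech)
    b = np.repeat(means_b, n_tech) + rng.normal(0, 0.1, n_samples * n_tech)

    # Wrong: all 60 measurements per group treated as independent observations.
    print(stats.ttest_ind(a, b).pvalue)
    # Better: average the replicates so each biological sample counts once.
    print(stats.ttest_ind(a.reshape(-1, n_tech).mean(axis=1),
                          b.reshape(-1, n_tech).mean(axis=1)).pvalue)

The naive test's standard error shrinks with the 60 technical measurements rather than the 6 independent samples, so its p-value is misleadingly small.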