Making biology compute with CGAT

Posted by a.hay on 25 February 2015 - 2:00pm

By Alexander Hay, the Institute’s Policy & Communications Consultant, talking with Andreas Hegar, CGAT.

This article is part of our series: Breaking Software Barriers, in which Alexander Hay investigates how our Research Software Group has helped projects improve their research software. If you would like help with your software, let us know.

The life sciences often suffer from a shortage of programming skills. This isn’t always a problem – you don’t need to know how to code in order to gauge the diurnal eating habits of squirrels, for example – but it does become an issue when you need to work with large datasets.

This is a growing problem. Next-generation sequencing techniques produce vastly more data than ever before, and more people are needed to handle and analyse it properly. Many life scientists do not need these skills or, at least, have not needed them until recently. The most sensible solution, then, is to train biologists in these skills.

Enter CGAT

One solution to this problem is Computational Genomics: Analysis and Training, or CGAT, based at the MRC Functional Genomics Unit at the University of Oxford. It is run by a core staff of five, including Technical Director Dr Andreas Hegar, and was founded in response to the simple fact that many biologists lack the essential skills needed to work with data.

CGAT’s approach is to train the biologists directly. Fellows are chosen from a highly competitive field of applicants – 70 to 80 per position in some cases – and are selected for their keenness to learn. The CGAT team then works out the extent of each fellow’s computing skills and draws up a training plan tailored to that student.

The plan includes introductory courses in shell scripting and in programming languages such as Python. Fellows are then set specific tasks, such as writing code for the CGAT database and workflows. Afterwards they are assigned to a diverse range of projects, from cancer research to transcription regulation, which they manage and run with CGAT’s partners while the CGAT team keeps an eye on progress from the sidelines.

So far this has been a success. The first batch of CGAT fellows have completed their stints and returned to their original fields of expertise, now equipped with software skills they can use to improve their career prospects. So why did CGAT need the Institute’s help?

“Anyone for a portable CGAT?”

“One thing we needed help with was our software development procedures”, Andreas explains. However, “we also have issues with the external dependencies of our code. This makes it very hard to port software from one environment to another.”

While this might sound easy for seasoned coders, “it is a very large collection of scripts that rely on between 40 to 50 external tools.”

To complicate matters further, Andreas adds, “all of these tools have different dependencies as well and they need system libraries to be installed and compiled. They need data to be indexed and located in certain locations.”

This led to the main challenge facing CGAT – how could its fellows keep using the CGAT environment after their fellowships ended? In other words, what was the best way to make it all portable? Andreas contacted the Institute and asked for its help.

“Who analyses the analysts?”

Interestingly, the Institute was most helpful in demonstrating how effective CGAT’s approach already was. “We were already following a lot of best practice and it was very good to get this confirmation,” Andreas says. “We're not computer scientists and don't come from a software development background - we're all scientific programmers - so it was good to get some confirmation that what we do is on the right track.”

Still, there was room for improvement. For starters, the Institute helped CGAT improve its website and wider web presence, and encouraged a move into social media, particularly Twitter, which has helped the team recruit new members. CGAT also made greater use of GitHub’s more advanced features, such as Milestones, which allowed it to make its processes much more open than before.

“So, previously, we would all sit in one room and it's very easy to resolve issues at that level,” Andreas explains, “but we also now put up more on GitHub as a record and to make sure that progress is measured and it is all tracked properly.”

The Institute did spot one main problem, though: CGAT’s software required more skill to install than it did to operate afterwards. “It highlighted another area we could have done better”, Andreas admits, adding that the team plans to remedy this by providing more installation options and by piggybacking on a US-based stack to which CGAT is currently negotiating access.

The next step for Andreas is to look at the instruction itself. “I'd be interested to hear whether the training we do is something that builds good scientific software designers, so it would be good for the Institute to see if our training is heading in the right direction to building sustainable software.”

If you'd like free help to assess or improve your software, why not submit an application to the Institute's Open Call?