
An introduction to CGAT

Author(s)

Andreas Heger

Posted on 5 February 2015


By Andreas Heger, CGAT Technical Director.

Today, biologists have access to high-throughput measurement techniques that can assay many variables or entities at the same time. One striking example has been the advent of massively parallel sequencing techniques in the form of next-generation sequencing (NGS).

Just a decade ago, sequencing the human genome took more than ten years and cost billions of pounds. Today, a researcher can send material off to a sequencing service and expect the equivalent of several human genomes' worth of data within a few weeks, and for not much more than the cost of a typical experiment. Unfortunately, few biologists are trained to deal with the data handling and statistical issues raised by such large data sets.

CGAT is a training programme funded by the Medical Research Council. Our aim is to take biologists with little or no computational experience and provide them with the skills to analyse large and complex data sets. As a result, our graduates are able to combine both worlds: they can formulate biologically relevant hypotheses and experimental designs, and can also analyse and interpret the resulting data sets in the context of the biological question.

This training takes place in a research setting. Our background is in comparative and functional genomics, and we work on biomedically relevant projects with scientists across the UK. While we do not concentrate on tool development, we have nonetheless developed many custom scripts and analysis pipelines with automated workflows. Our code has been developed over the last ten years and has undergone multiple additions and changes; during this time, for example, we moved from Subversion to Mercurial to Git.

Our main challenge has been to implement the change from a one-developer, one-user model to one with multiple developers and multiple users. The difficulty has been in achieving consistency, for example in code quality and style, command line usage and parameterisation, especially as our contributors are at various stages in their training. To help with this, we wrote extensive documentation and style guides, but we also made use of GitHub and extensive testing. We use regression tests to make sure code continues to work after changes have been made, and we have also written tests for code style and tests that enforce consistent naming of command line options. Travis CI has helped us monitor continuously for failing tests.
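A minimal sketch of such a naming-consistency test, assuming a scripts/ directory and a lowercase-with-hyphens convention for long options (both hypothetical here, not CGAT's actual test code), might look like this in Python:

    import glob
    import re
    import unittest


    class TestOptionNaming(unittest.TestCase):
        """Check that long command line options follow one naming convention."""

        # convention assumed for this sketch: lowercase words joined by hyphens
        OPTION_NAME = re.compile(r"--[a-z]+(-[a-z]+)*")
        # find long options declared as string literals, e.g. "--force-output"
        OPTION_DECLARATION = re.compile(r"""["'](--[\w-]+)["']""")

        def test_long_options_follow_convention(self):
            for script in glob.glob("scripts/*.py"):  # hypothetical layout
                with open(script) as infile:
                    code = infile.read()
                for option in self.OPTION_DECLARATION.findall(code):
                    self.assertTrue(
                        self.OPTION_NAME.fullmatch(option),
                        f"{script}: option {option} is not lowercase-with-hyphens")


    if __name__ == "__main__":
        unittest.main()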

Because our trainees move on, we have faced the challenge of making our code base portable so that they can take it with them. While the code itself is mostly Python and C, and can thus in theory be installed widely, in practice the size of the code base means that there are numerous dependencies on third-party packages that need to be satisfied. Furthermore, there are dependencies on the computational environment, such as data locations or computer cluster configuration. The challenge is to make sure that all these dependencies are fully parameterisable, and to ensure that installing our code does not become a barrier to use.
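A minimal sketch of that kind of parameterisation, assuming a per-user configuration file with hypothetical section and key names (not CGAT's actual configuration scheme), might look like this:

    import configparser
    import os

    # hypothetical per-user configuration file keeping site-specific settings
    # (data locations, cluster options) out of the code itself
    config = configparser.ConfigParser()
    config.read(os.path.expanduser("~/.cgat.ini"))

    # fall back to defaults if the file or individual keys are absent
    genome_dir = config.get("data", "genome_dir", fallback="/data/genomes")
    cluster_queue = config.get("cluster", "queue", fallback="all.q")

    print(f"genomes: {genome_dir}, cluster queue: {cluster_queue}")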

To facilitate uptake, we have therefore implemented multiple installation options, ranging from installation from source to Docker containers and full images. We are also examining solutions such as CernVM-FS to reduce the overhead of setting up third-party tools and data sets.
