Cold Spring Harbor Laboratory/Wellcome Trust conference on Genome Informatics

Cold Spring Harbor, New York, US. 2-5 November 2011.

Report by Colin Semple, MRC Human Genetics Unit, Edinburgh.

Report

The original movie Halloween (1978) went from a very modest beginning to a bloated franchise spanning two decades. A minor industry in its own right, it spewed poor acting, seasonal gore and shoddy merchandise on to the screaming teenage populace. Few tricks, and certainly no treats. The Genome Informatics meeting, traditionally held in autumn in rural Long Island, New York, has spawned a similar number of sequels and also had fairly humble origins. Although I should emphasise that the focus is on the interface between computing and high-throughput genomics, not mass stabbings.

This year we stumbled through the pumpkin-littered campus to witness sessions on genomic software and database development, and hear the screams of those disappearing under the deluge of huge, novel datasets. The distance travelled by computational genomics since the first of these meetings in 2001 was examined by one of the organisers, who announced that the number of participants had swelled to an all-time record of 300. (Equally sobering was the news that we had achieved a record proportion of female participants: a quarter.) And herein lies the biggest problem for bioinformatics in 2011: on the one hand, biologists have unparalleled opportunities to produce enormous sequencing datasets and, on the other hand, the people active in extracting useful information from these data are (still) a rather rare breed. These demographics are now feeding software development, with the aim of providing computational environments to enable data analysis by bioinformatics newbies; these are usually web-based, icon- or menu-driven interfaces to command-line executables.

The grand old man (and in bioinformatics it appears unlikely to be a woman) of web platforms for genomic data analysis is Galaxy, which has been around since 2005, and has gradually blossomed into a large, open-source project with an active developer community and thousands of users around the world. This year we were introduced to Galaxy Cloudman, a way of using Galaxy on the Amazon compute cloud, in an entirely non-free fashion. Just remember to use somebody else's credit card. Still, in principle this kind of development allows users to quickly set up an entire compute cluster for genomic analyses with little effort and zero overheads for hardware headaches. Of course, many of us deal with data that are, at least temporarily, confidential, and uploading them all to a US data center is frowned upon, which means locally installed instances of Galaxy will continue to be popular. And Galaxy is rapidly acquiring a diverse field of competitors.

Presentations included BioBrowser, an extension to the Firefox web browser, allowing data analysis for local or remote datasets. The elegant EpiExplorer provides a Galaxy-like interface to genome-scale analyses that uses Google search tricks under the hood. These interfaces and analysis systems are also becoming underpinned by a diversity of data warehousing and querying engines, exemplified at this meeting by the open-source InterMine project. As many of us have more data and users than we know what to do with, these are very welcome developments: in an ocean of data everyone must be able to swim, even if it's the doggy paddle.

The word cloud generated (courtesy of Wordle) from the text of all abstracts submitted to this meeting displays the prominence of genome sequencing data and analysis. (I've added 500 instances of the word pumpkin as an internal scale.) The large consortia currently releasing huge datasets documenting the human genome's oddities and intricacies (ENCODE, NIH Roadmap Epigenome, BLUEPRINT, etc.) were very much in evidence, as people presented their first glimpses of novel biology, but these large projects are already looking dated. The recent development of desktop-sequencing machines, capable of generating huge datasets in every lab, suggests the genomic data flood will intensify over the next few years as biomedical science comes to grips with cheap, ubiquitous, high-throughput sequencing. The final word should go to Circos, a software package for data visualisation and graph drawing, which has done more to inject some beauty into genomics than anything else. The multi-coloured circles and arcs of Circos plots burst out of faded PowerPoint slides, dangled from posters and even leaked onto the cover of the conference program. It is far from new (released in 2009) but it is free, easy to use and, amazingly, is written entirely in Perl. Spooky.