Bioinformatics needs to change, says Dr Thomas Connor, and for that to happen we need to look to the cloud. Here he tells us about CLIMB, the world’s largest HPC system for microbial bioinformatics research.
I’m an academic researcher, interested in answering questions that will help us to better understand, track and treat disease caused by bacterial pathogens. To do this we use a variety of big data approaches that interrogate genome sequence data. When I took up my first academic post, I came face to face with three key issues that most bioinformaticians the world over must deal with:
• Dealing with a lack of suitable local infrastructure for performing bioinformatics
• Overcoming poor tools for rapidly sharing research data
• Having to locally reinvent the wheel in response to a lack of portability and documentation for many bioinformatics software packages
Over the past few years I have had to spend far more time dealing with these challenges than I expected before arriving, time that necessarily takes me away from my research. The reasons for these issues are relatively clear to me: many pre-existing HPC resources in the UK are not well suited to microbial bioinformatics workloads, because they have been designed for performance, often with chemistry or physics use cases in mind.
Often, these HPC systems seem to have been designed around a stable set of packages that are tuned for performance and are run over and over again. Many of these packages are also developed to scale over multiple cores and multiple nodes as required. In comparison, bioinformatics software often evolves rapidly and organically, and there are many similar (but different) tools for doing any particular job. This means that when we develop software, it is often not with an eye on best practice or portability; instead, we develop software as quickly as possible in order to get our results as soon as we can, on the system that we have available. As a result, documentation is often poor, and the software is hard-coded for a local environment that another researcher would have to reproduce in order to reuse it.
This lack of portability of software and poor documentation is a major bugbear for many bioinformatics researchers, and has the very real potential to bring about a reproducibility crisis in computational biology. Most of these issues come down to the fact that the various elements – data, software and infrastructure – are rarely found together in one place. This then creates a vicious circle where bioinformaticians have to develop a local environment that brings these elements together, perpetuating the problem. Believing that cloud technologies provide an obvious mechanism for overcoming this key issue, a group of collaborators from the University of Birmingham, Cardiff University, Swansea University, and the University of Warwick developed a proposal to build a national shared infrastructure for the wider microbial bioinformatics community in the UK.
The idea that one could provide a single environment that can be used for researcher training and for research activity is a powerful one. If that environment is accessible anywhere, and brings together storage and compute capacity that is designed for the workloads that researchers need, then the system could be a real game changer. That idea (a one-stop bioinformatics infrastructure shop for researchers in our field) was ultimately the starting point for what is now known as the Cloud Infrastructure for Microbial Bioinformatics (CLIMB).
Luckily, the UK Medical Research Council also thought that national, shared infrastructures hold huge potential for scientific research – and in 2014 we were awarded over £8M to realise our vision. Following procurement in late 2014, the kit was delivered, installed and online by March 2015, with the “service” being refined over the summer of 2015.
The system itself has been explicitly designed with the needs of microbiologists in mind. We have ensured that our system will support the sort of workloads that we would like to run, but which are often not possible on, or are severely limited by, our available local infrastructure. We knew from experience that large, rich, complex biological datasets need more RAM than you’d normally see in an HPC system. We originally wanted machines at the four sites to have at least 256GB of RAM each, but in the end we were able to take advantage of low prices on 16GB DIMMs to upgrade that, so that each site has 21 four-socket servers with 512GB of RAM each and 3 eight-socket servers with 3TB of RAM each, supplied by Lenovo.
This sort of capacity is a real game-changer for us and for anyone who uses the system. Having large numbers of high-RAM servers on tap gives bioinformaticians the ability to analyse data sets that simply couldn’t be handled on local HPC systems or lab servers.
At the moment, each site has an identical system, configured and installed by the HPC integrator OCF. The systems are currently in an early-adopter stage, with CLIMB already being accessed by over 50 researchers from around the UK and the Birmingham node seeing 100% utilisation. The next step in the CLIMB project is to complete the process of connecting up all four sites using OpenStack Kilo, making it a true multi-site cloud system with single sign-on for academic users via Shibboleth.
Individually, bacterial datasets might seem small compared to human genomes. A human genome is approximately 1000x the size of an E. coli genome. However, unlike many human studies that examine small numbers of genomes simultaneously, microbial researchers are now routinely comparing hundreds or thousands of genomes as part of their work. This creates three distinct strains on any HPC system.
Firstly, these large datasets require a lot of space. Uncompressed, the sequence reads for a bacterial genome can be over 3GB in size, so a 500-genome dataset will require around 1.5TB of storage before any work is undertaken. A bioinformatician may be working on several similar datasets simultaneously, meaning that a single researcher could easily require over 10TB of storage at any one time.
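The storage arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are hypothetical, the 3GB-per-genome figure is the rough value quoted above, and decimal units (1TB = 1000GB) are assumed.

```python
import math

GB_PER_TB = 1000  # assumption: decimal units, 1TB = 1000GB


def dataset_storage_tb(n_genomes, gb_per_genome=3.0):
    """Rough storage for the uncompressed reads of one dataset, in TB."""
    return n_genomes * gb_per_genome / GB_PER_TB


def datasets_to_exceed(limit_tb, tb_per_dataset):
    """Smallest number of equally sized datasets whose total exceeds limit_tb."""
    return math.floor(limit_tb / tb_per_dataset) + 1


per_dataset = dataset_storage_tb(500)
print(per_dataset)                          # 1.5 (TB for a 500-genome dataset)
print(datasets_to_exceed(10, per_dataset))  # 7 datasets push one researcher past 10TB
```

On these assumptions, seven datasets of that size are enough to take a single researcher past 10TB, before counting any intermediate or result files.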
Secondly, when we are dealing with hundreds of genomes, we need to perform hundreds of individual analyses first in order to extract genomic information from our short read data. These embarrassingly parallel workloads potentially encompass thousands of jobs, each running on a single core. To complicate matters, these jobs often require anywhere between 4 and 12GB of RAM per job, a requirement that is out of step with many HPC systems that have been designed to have 3-4GB of RAM per core.
Finally, once we have performed our initial analyses, we then often have to undertake complex analyses of the data produced. These analyses are often memory intensive, in some cases requiring over 512GB of RAM. These facets of my workloads mean that I am almost certainly the sort of user that traditional HPC administrators dread: simultaneously stressing the scheduler, using up huge amounts of space, running poorly coded, unoptimised software and occasionally needing vast quantities of RAM for single long-running jobs. Collectively these are the issues that we set out to solve.
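To illustrate why the RAM-per-core ratio matters, here is a minimal sketch of how many single-core jobs can run at once on one node. The node size (16 cores with 4GB of RAM per core) and the function name are hypothetical assumptions chosen for illustration, not the specification of any particular system.

```python
def concurrent_jobs(cores, node_ram_gb, ram_per_job_gb):
    """Single-core jobs that fit on a node, limited by cores or by RAM."""
    return min(cores, node_ram_gb // ram_per_job_gb)


# Hypothetical node: 16 cores at 4GB of RAM per core (64GB in total).
CORES = 16
RAM_PER_CORE_GB = 4
node_ram_gb = CORES * RAM_PER_CORE_GB

for ram_per_job_gb in (4, 12):
    running = concurrent_jobs(CORES, node_ram_gb, ram_per_job_gb)
    idle = CORES - running
    print(f"{ram_per_job_gb}GB jobs: {running} running, {idle} cores idle")
# 4GB jobs: 16 running, 0 cores idle
# 12GB jobs: 5 running, 11 cores idle
```

On this hypothetical node, jobs at the top of the 4-12GB range run out of memory long before they run out of cores, leaving most of the node unused.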
Working with OCF, we have combined our large and huge memory machines with two tiers of storage: each of the four sites has 500TB of IBM Spectrum Scale (previously known as GPFS) locally, with a further 3PB of replicated Ceph object storage, supported by Red Hat, on hardware provided by Dell. The result is a system that overcomes many of the classic problems faced by microbial bioinformaticians. Because of the cutting-edge nature of the infrastructure itself, close contact with the vendors has been critical in delivering the system. Access to advanced HPC technology alone is not sufficient; the contacts at OCF and its partners IBM, Lenovo and Red Hat have been critical in building what is now quite a stable platform that is already helping progress research into bacterial pathogens.
While CLIMB is at the leading edge in terms of the IT that underpins the system, it is important to remember that this is an infrastructure to empower research, ranging from understanding how bacterial disease moves within a hospital and reconstructing how bacterial diseases spread around the world, right through to identifying targets for vaccine development or for new anti-microbial drugs.
One of my first uses of the system has been to undertake work examining S. flexneri, the most frequent cause of bacterial dysentery, a disease that affects 165 million people around the world each year. This is a disease that is predominantly found in low-income countries, mostly in children under the age of 5. It is, as one might expect, a key vaccine target, but most of the work that has been undertaken on this organism has been based upon approaches that provide limited resolution, precluding efforts to accurately characterise and track the organism over time. As part of our work we assembled a global collection of 350 samples, and examined these using genome sequencing. We found evidence that some of the current vaccine targets for this pathogen are unlikely to work, and that rather than focusing on vaccine development, the provision of clean drinking water and good sanitation may provide better results based upon the natural history of this pathogen.
This work demonstrates the potential of having a large, dedicated infrastructure to support bioinformatics research. CLIMB enabled me to take a dataset of global reach, for a disease of global importance, and use genome sequencing and bioinformatics to help understand that disease, opening up new avenues for treatment.
Finally, whilst having a system such as CLIMB gives us a definite benefit, the reason we have built CLIMB is not entirely selfish. We believe that the way that we train bioinformaticians and the way that we do bioinformatics need to change. We believe that as a community we need to get better at sharing our data and software. We believe that cloud technologies provide a key mechanism by which data and software can be easily shared, and we believe that CLIMB can and will provide a key infrastructure to enable this to happen.
Dr Thomas Connor, Senior Lecturer in Population genomics of bacterial pathogens, bioinformatics and translational microbiology at Cardiff University