High-performance computing in the search for the Tree of Life

Lead Research Organisation: University of St Andrews
Department Name: Biology

Abstract

Phylogeny, in the upper limit the entire Tree of Life, is a source of fascination in itself. Ever since we have known species share common ancestors, we have sought to discover the pattern of relationships among species. The pattern of relationships is typically represented as a bifurcating tree, whose tips represent extant species and internal nodes represent hypothetical ancestors. Until the late 20th Century, phylogenetic research was speculative, due to the limited quantity of data morphology can provide and problems of interpretation. Two technological developments in the 20th Century revolutionised the field. One was the availability of computers, allowing the replacement of subjective reasoning in phylogeny reconstruction with algorithmic and statistical approaches. The other was the availability of DNA sequence, allowing a step-change in the quality and quantity of data upon which to base phylogeny reconstruction.

Computers and DNA sequence have had a transformative effect on phylogeny reconstruction. However, after decades of stability, DNA sequencing technology is changing rapidly in the 21st century, and the resulting Big Data is placing a strain on computational phylogeny. In the 1990s, a PhD student might sequence a single gene from each of 20 species, and the insights would be novel and astounding. Now, the student might sequence the entire genome of these same species, with enormously greater opportunities for discovery and certainty. With continuing developments in DNA sequencing technology, within a decade such a student will be able to sequence the entire genomes of several thousand species. Finally, after almost two centuries of intellectual contributions on the tree of life, the tree in its entirety is almost within our grasp - with respect to input data at least.

The bottleneck has shifted from DNA sequence availability to the algorithms and software able to extract phylogenetic relationships from these data.

To find the optimal phylogeny based on DNA sequences, ideally one would evaluate all possible tree topologies. However, the number of topologies increases factorially with the number of extant species, and for 54 species there are more possible unrooted bifurcating topologies than atoms in the universe. The only way forward is a combined high performance computing (HPC) and advanced algorithmic approach. This will search the range of solutions in an effective and rapid manner, and will maximise the potential of increasingly available large computing resources.
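To give a sense of the scale involved, the sketch below counts unrooted bifurcating topologies using the standard result that n labelled tips admit (2n-5)!! such trees, the product of the odd numbers from 3 to 2n-5. The function name is illustrative, and the figure of 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate assumed here for comparison, not a value taken from this project.

```python
def unrooted_topologies(n_tips: int) -> int:
    """Count unrooted bifurcating trees on n_tips labelled tips: (2n-5)!!."""
    if n_tips < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 tips")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (empty product when n_tips == 3).
    for factor in range(3, 2 * n_tips - 4, 2):
        count *= factor
    return count


# Assumed order-of-magnitude estimate, used only for comparison.
ATOMS_IN_UNIVERSE_ESTIMATE = 10 ** 80

if __name__ == "__main__":
    for n in (4, 5, 10, 20, 54):
        print(n, unrooted_topologies(n))
    print(unrooted_topologies(54) > ATOMS_IN_UNIVERSE_ESTIMATE)
```

Python's arbitrary-precision integers keep the count exact even when it far exceeds the range of a 64-bit integer.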

LVB is software for phylogeny reconstruction that uses the advanced simulated annealing heuristic. This allows a good (realistic) result to be obtained by examining only a small fraction of possible tree topologies. We seek to:

- implement LVB on parallel computer systems, including all current desktop computers and larger HPC systems, typically found in research institutes, universities and industry;

- integrate LVB with other phylogeny reconstruction software, so that initial results obtained rapidly by LVB may be further refined by a range of plausible criteria;

- make LVB available to run through a Web server, which will record aspects of the user's input as well as the quality of results, so that in future the software can adapt in light of this knowledge to give better results;

- apply LVB to a published, challenging data set of 73,060 eukaryotic species, to obtain comparative performance data and the best possible phylogeny for this major branch of the Tree of Life.
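The simulated annealing heuristic referred to above can be illustrated with a generic sketch. This is not LVB's implementation: the function names, the geometric cooling schedule, the parameter values and the toy integer problem are all assumptions chosen for illustration. In a phylogenetic setting, the state would be a tree topology, the cost a measure of fit to the data, and the neighbour function a small rearrangement of the tree.

```python
import math
import random


def simulated_annealing(initial, cost, neighbour, t_start=10.0, t_end=1e-3,
                        cooling=0.95, steps_per_temp=50, seed=0):
    """Generic simulated annealing; returns the best state seen and its cost."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbour(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept a worse state with
            # probability exp(-delta / temperature), which shrinks as we cool.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling  # geometric cooling (an assumed schedule)
    return best, best_cost


# Toy problem (hypothetical): find the integer minimising (x - 7)^2.
def toy_cost(x):
    return (x - 7) ** 2


def toy_neighbour(x, rng):
    return x + rng.choice((-1, 1))


best, best_cost = simulated_annealing(50, toy_cost, toy_neighbour)
print(best, best_cost)
```

Occasionally accepting a worse state is what lets the search escape local optima early on, while the falling temperature makes it increasingly selective.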

Planned Impact

The primary impact of the proposed research will be on those wishing to reconstruct phylogenies. Phylogeny is increasingly a central tool in:

- bioinformatics;

- ecology;

- comparative genomics;

- the study of drug resistance in viruses, bacteria and cancers;

- forensic applications (reconstructing pathways of disease transmission);

- biological and biomedical education, particularly (but not only) at the university level.

Cutting across these subjects, practitioners will be in:

- industry (research institutes);

- private enterprise, including the pharmaceuticals industry;

- academia;

- schools.

The St Andrews Co-I will maintain direct contact with relevant bodies involved in biodiversity, particularly the Royal Botanic Garden Edinburgh and the Royal Botanic Gardens, Kew. Additionally, the St Andrews Co-I chairs the Advisory Group for the NERC Environmental Omics Synthesis Centre (of which the PI is also a member) and is a PI on an STFC Futures Network grant to explore common purpose between STFC research capacity and application to environmental 'omics. The Advisory Group includes representatives from industry with interests in this area. All reasonable use will be made of these contacts to ensure dissemination of the work.

A second, important impact of the proposed project will be on the field of optimisation in general, spanning a vast range of applications including:

- scheduling, for example the transport industry;

- optimal use of resources to minimise waste in industry (e.g. when cutting resources into smaller pieces of varying size);

- electrical engineering (e.g. where one seeks to minimise total path length within a system);

- bioinformatics (e.g. gene network reconstruction).

Particularly for this latter impact, release of our software under an Open Source licence (BSD) will maximise benefits within both industry and academia. For example, a user may adapt the software either for in-house use or for re-release under the terms of the licence, which imposes only very minimal conditions and allows both free and commercial re-use.

The STFC Co-I is an active member of the STFC Hartree Centre, which has a remit of delivering High Performance Computing solutions to UK industry. The improved code from the proposed research will be available to the Hartree Centre under the terms of the Open Source licence. This will increase the impact of the proposed research, as well as increasing the offering of the Hartree Centre for environmental customers and/or for optimisation problems.

One challenge is to reach the potential academic beneficiaries. We will employ a range of techniques spanning peer-reviewed and social media, local and global audiences, and both expert and non-expert users. Specifically, we will:

- implement software and methods resulting from this project within phylogenetics analyses performed by the Bioinformatics unit at the University of St Andrews, reaching an estimated five major research groups within 12 months;

- publish an Open Access paper on our work, reaching thousands of readers within the first year of publication;

- disseminate announcements about software availability to heavily subscribed email lists (e.g. Ecolog-L, EvolDir, Taxacom), newsgroups (e.g. bionet.announce, sci.bio.announce) and Twitter, reaching approximately 20,000 subscribers;

- make software freely available via our own Web site, via the social coding repository GitHub, and as a standard package for Debian Linux (and derived distributions such as Ubuntu);

- provide a Web server with user-friendly interface, for users to run analyses free of charge (and from which we will gather data for machine learning, to further improve phylogenetic analyses in the future);

- submit abstracts on our work to relevant conferences (e.g. the prestigious Highlights Track of ISMB).

Publications


Strobl MA (2016) On simulated annealing phase transitions in phylogeny reconstruction. Molecular Phylogenetics and Evolution
 
Description Phylogeny reconstruction software, LVB, has been shifted to a free software licence; it has been made faster; it has been made to work on very large supercomputer systems and on a publicly available Web server.
Exploitation Route Reconstruction of large phylogenies. Parallel approaches to optimisation problems in general.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL http://eggg.st-andrews.ac.uk/lvb
 
Title Use of MapReduce paradigm for finding duplicate trees 
Description This is a computer algorithm for determining whether a proposed phylogenetic tree has a duplicate amongst an existing reference set of thousands of trees. The test tree and the reference trees are decomposed into sets of partitions. The MapReduce paradigm is used to collate trees containing a given partition onto the same compute node, where they can be counted. The method is suitable for very large phylogenetic reconstruction problems, where such tree comparison steps can be distributed over a compute cluster. Such parallelisation of the problem is required to achieve a convenient runtime.
Type Of Material Computer model/algorithm 
Provided To Others? No  
Impact The method has been used to improve the efficiency of the phylogenetic software LVB, for particular datasets where many equivalent trees are generated. 
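The duplicate-tree search described in this record can be imitated in a single process as a rough sketch. This is not the LVB implementation: the function names, the representation of each tree as a set of canonical partitions, and the five-taxon example are assumptions made for illustration. On a cluster, the grouping step would place all trees sharing a partition on the same compute node, where here a dictionary stands in for that step.

```python
from collections import defaultdict


def canonical_split(side, all_taxa):
    """Represent a partition by the side that excludes the first taxon."""
    side = frozenset(side)
    anchor = min(all_taxa)
    return side if anchor not in side else frozenset(all_taxa) - side


def map_phase(tree_id, splits):
    """Map: emit one (partition, tree_id) pair per partition of a tree."""
    for split in splits:
        yield split, tree_id


def shuffle(pairs):
    """Shuffle: collate the ids of all trees containing the same partition."""
    grouped = defaultdict(list)
    for split, tree_id in pairs:
        grouped[split].append(tree_id)
    return grouped


def find_duplicates(test_splits, reference_trees):
    """Reduce: count shared partitions per reference tree, keep full matches."""
    pairs = (pair for tree_id, splits in reference_trees.items()
             for pair in map_phase(tree_id, splits))
    grouped = shuffle(pairs)
    shared = defaultdict(int)
    for split in test_splits:
        for tree_id in grouped.get(split, ()):
            shared[tree_id] += 1
    # A duplicate shares every partition and has no extra partitions.
    return sorted(tree_id for tree_id, n in shared.items()
                  if n == len(test_splits) == len(reference_trees[tree_id]))


# Hypothetical example with five taxa; each tree has two internal partitions.
taxa = "ABCDE"
reference_trees = {
    "t1": {canonical_split("AB", taxa), canonical_split("DE", taxa)},
    "t2": {canonical_split("AC", taxa), canonical_split("DE", taxa)},
}
# The same tree as t1, with each partition written from the opposite side.
test_tree = {canonical_split("CDE", taxa), canonical_split("ABC", taxa)}
print(find_duplicates(test_tree, reference_trees))
```

Writing every partition in one canonical form is what makes the comparison independent of how a tree happens to be drawn or rooted.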
 
Description Hartree Centre - IBM research collaboration 
Organisation IBM
Country United States of America 
Sector Private 
PI Contribution The Hartree Centre - IBM collaboration covers a wide range of research areas. I am leading research topics in genomics and molecular modelling. My research team is bringing expertise built up from several previous projects. Collaborative projects include phylogeny of food-borne pathogens; this builds on work on LVB under an STFC-funded Global Challenge grant.
Collaborator Contribution IBM have an active programme of work in Food Safety and are bringing that expertise to the project. They are also contributing particular research software and software products.
Impact None yet.
Start Year 2015
 
Title LVB 
Description LVB uses an innovative simulated annealing heuristic search to reconstruct phylogeny. During the current period, LVB's features and speed have been improved, including parallelisation through multithreading; its source code has been imported to the social coding site GitHub; and it has been moved to a standard Open Source licence. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact As a result of this project, four new releases of LVB were made during 2015. 
URL http://eggg.st-andrews.ac.uk/lvb
 
Title LVB Web Server 
Description LVB may be run online by any member of the public, involving no specific software installation. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact With the user's permission, details of the analysis are retained, providing data for further optimization of analyses by machine learning.
URL http://lvb.st-andrews.ac.uk/lvb
 
Description School (Forfar Academy) visit to University of St Andrews 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact The event provided preliminary work for funding applications, for example to the STFC Public Engagement Large Awards Scheme submitted in November 2015 (PI, Dr Daniel Barker). The event was covered in a blog at the University of St Andrews, in the Angus Council internal schools' newsletter (Angus Education, Issue 28, pp. 10-11) and the school magazine; and was publicised by the PI via Twitter and on the Raspberry Pi Forum. The event contributed data for an Open-Access publication in the peer-reviewed education research literature (Barker et al 2015, International Journal of STEM Education 2, 17).
Year(s) Of Engagement Activity 2014
URL http://synergy.st-andrews.ac.uk/biooutreach/2014/08/25/forfar-academy-pi