EnteroBase: A Powerful, User-Friendly Online Resource for Analyzing and Visualizing Genomic Variation within Escherichia coli and Salmonella enterica

Lead Research Organisation: University of Warwick
Department Name: Warwick Medical School


It is hard to think of two organisms that are more important to scientists, policy makers and the public than E. coli and S. enterica. Both have been studied extensively in the laboratory as models of how bacterial cells function, behave and evolve. However, both are also important causes of human and animal INFECTION and are seldom out of the news, particularly given their propensity to cause outbreaks. The E. coli outbreak that hit Germany in 2011, with >4,000 cases and >50 deaths, amply illustrates the power of these organisms to devastate even a wealthy advanced society. In 2013, Salmonella gained media coverage in England when >200 people fell ill after a spice festival in Newcastle.

It is important to recognise that no single strain can capture the essence of either species. Instead, what we see in nature is a riotous profusion of diversity. For example, some strains of E. coli live harmlessly in our bowels, while others cause diarrhoea, urinary tract infection or even bloodstream infection. Two E. coli strains may differ by 1/3 of their genetic make-up (genome). Both Salmonella and E. coli undergo relentless evolution, including spread of ANTIBIOTIC RESISTANCE. The huge diversity already present, twinned with ongoing evolution and spread of new lineages, creates tremendous problems for microbiologists, other scientists and policy makers in recognising and classifying strain types. Yet such classification into well-defined, scientifically robust populations is essential before scientific, clinical or even political conclusions can be generalised across sub-types or species.

Fortunately, we have been presented with an exciting new opportunity to capture and analyse within-species diversity in bacteria in the form of HIGH-THROUGHPUT SEQUENCING, a set of innovative technologies that make bacterial genome sequencing (a process of capturing all the DNA sequences within the cell) easier, cheaper and quicker than ever before. However, this sudden availability of new data creates a fresh challenge, the DRINKING-FROM-A-FIRE-HOSE problem: how to store, visualise and analyse all the new data on genomic diversity generated by this technology. In addition, while expert bioinformaticians can use command-line tools to analyse genomes, lab-based bacteriologists depend on the creation of new user-friendly web-based resources if they are not to miss out on this opportunity.

To address this problem we will create a new, powerful but user-friendly online database called ENTEROBASE, which will act as a one-stop shop for anyone interested in analysing and visualising genetic diversity in E. coli and Salmonella. EnteroBase will incorporate ENTEROTOOLS, a set of modular, open-source, web-based tools compatible with data formats and standards from both current and future sequencing technologies. Together, these two resources will allow bacteriologists who work in the laboratory and lack high-level computer skills to perform incisive and sophisticated computer-based analyses of bacterial DNA sequence data. Users will be able to upload and analyse their own data, as well as exploit the cumulative knowledge of the microbiology community, not just to look at global patterns of diversity within these species but also to perform speedy, near-real-time analyses of ongoing or recent outbreaks.

Principal investigator Achtman has spearheaded efforts to replace outdated 19th- and 20th-century approaches to the typing and classification of these bacteria with more modern approaches; co-investigator Pallen has applied innovative approaches to analyse the German E. coli outbreak. Both will bring to this project 1000s of users of previous similar, well-established but less powerful databases. This project will also help maintain and enhance the UK skills base and make our country the destination of choice for the brightest and best scientists.

Technical Summary

EnteroBase will present a scalable, structured, curated database containing data from 100,000s of genomes and their temporal and geographic metadata from ourselves, our users and public databases. It will support analyses ranging from 7-gene multi-locus sequence typing (MLST) to whole genomes. EnteroBase databases will only include high-quality sequences from E. coli and S. enterica, but EnteroTools will also support analyses of genomic data from other bacterial groups.
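
As a rough illustration of what 7-gene MLST involves, the sketch below assigns a sequence type (ST) from an allelic profile. The locus names are those of a commonly used seven-gene E. coli scheme; the allele numbers, the ST labels and the function name `assign_st` are invented for illustration and are not taken from EnteroBase.

```python
from typing import Dict, Optional, Tuple

# Seven housekeeping loci of a 7-gene MLST scheme for E. coli.
LOCI = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")

# Toy lookup table: allelic profile -> sequence type (hypothetical values).
ST_TABLE: Dict[Tuple[int, ...], str] = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 1, 2, 1, 1, 1, 1): "ST-B",
}


def assign_st(alleles: Dict[str, int]) -> Optional[str]:
    """Return the ST for a complete allelic profile, or None if it is novel."""
    missing = [locus for locus in LOCI if locus not in alleles]
    if missing:
        raise ValueError(f"missing loci: {', '.join(missing)}")
    # Order the allele numbers by locus so the profile is comparable.
    profile = tuple(alleles[locus] for locus in LOCI)
    return ST_TABLE.get(profile)
```

A profile that is absent from the table returns `None`, which in a real scheme would correspond to a novel combination awaiting a new ST designation.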

The public interface to EnteroBase will be a customised instance of Galaxy, which is a powerful, but flexible, web-based sequence analysis and workflow management system. Initially, we will adopt Galaxy's existing graphical user interface and existing tools in order to port basic components from our xBASE and MLST facilities.

Subsequently, we will enhance EnteroBase's capabilities with EnteroTools, a set of open-source, user-friendly Galaxy tools compatible with both current and future data formats. We will incorporate other resources, such as MEGA and BIGSdb, include links to access specialised external databases for identifying repetitive and mobile elements, and encourage crowd-sourcing of novel solutions by letting users publish their workflows. EnteroTools will allow users to:

->upload and analyse sequence reads, assemble and annotate genomes and align whole genomes or genes.

->visualise relationships between bacterial genotypes; drill down to genotype clusters; perform population genetics and real-time epidemiological analyses.

->evaluate and visualise the contributions of SNPs, indels, transpositions, recombination and selection, as well as details of changes in the core and accessory genomes.

->access processed data easily in the context of associated metadata, including bidirectional links between metadata in the genomic and MLST databases, thus providing a facility for scanning the metadata from genetically related isolates that share MLST or rMLST alleles.
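
The last capability, scanning metadata from isolates that share alleles, can be pictured with a minimal sketch. It assumes each isolate is stored as an allelic profile plus a metadata dictionary; the `Isolate` type, the function names and the threshold parameter are hypothetical and do not describe the EnteroBase implementation.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Isolate(NamedTuple):
    name: str
    profile: Tuple[int, ...]   # allele numbers, one per locus
    metadata: Dict[str, str]   # e.g. country, year, source


def shared_alleles(a: Sequence[int], b: Sequence[int]) -> int:
    """Count loci at which two profiles carry the same allele number."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x == y for x, y in zip(a, b))


def related_isolates(query: Sequence[int],
                     isolates: List[Isolate],
                     min_shared: int) -> List[Isolate]:
    """Return isolates sharing at least min_shared alleles with the query."""
    return [iso for iso in isolates
            if shared_alleles(query, iso.profile) >= min_shared]
```

The metadata of the returned isolates (for example country and year of isolation) can then be inspected side by side; the choice of `min_shared` controls how loosely "genetically related" is defined.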

Planned Impact

The proposed project will benefit anyone in the UK or overseas academic sector with an interest in E. COLI OR SALMONELLA AS PATHOGENS OR MODEL ORGANISMS (including those interested in systems biology or synthetic biology), or with interests in bacterial genome evolution or population genetics or epidemiology. More generally, the resource we create here will be of interest to ANYONE INTERESTED IN EXPLOITING COMPARATIVE SEQUENCE DATA from any bacterial species.

We anticipate bringing across 1000s of users of our existing MLST and xBASE facilities to this new resource.

The proposed project will benefit anyone within the commercial private sector who is interested in developing NEW DRUGS, VACCINES OR DIAGNOSTIC TESTS for E. coli or Salmonella. Industrial users could benefit from using EnteroBase to explore genotypic (and by implication phenotypic) diversity within these species when evaluating novel vaccine or drug targets. EnteroBase will allow users to explore how ANTIMICROBIAL RESISTANCE EVOLVES AND SPREADS within these species. Similarly, companies that sell sequencing technologies stand to benefit from exploitation of and demand for high-throughput sequence data (both Solexa and Oxford Nanopore sequencing were developed within the UK, with benefits to our economy).

The delineation of epidemic or highly pathogenic lineages is of KEY INTEREST TO POLICY MAKERS, whether addressing FOOD SECURITY, FOOD SAFETY, HUMAN HEALTHCARE, HEALTH AND SAFETY AT WORK OR BIOTERRORISM (note that certain E. coli and Salmonella lineages are even defined within the UK's Anti-terrorism, Crime and Security Act 2001).

EnteroBase will also assist in increasing the effectiveness of public services and policy by facilitating analyses that will GROUND POLICY DECISIONS IN A SOLID UNDERSTANDING of bacterial evolution, epidemiology, population genetics and taxonomy. The UK food industry needs detailed knowledge about the diversity and sources of Salmonella infection, such as the Agona outbreak that spread to the UK via products from an Irish food producer. Achtman has been at the forefront of efforts to replace classification of these bacteria by serovar with a MORE RATIONAL AND DISCRIMINATORY SYSTEM OF CLASSIFICATION; these efforts are likely to lead to changes in international regulations governing Salmonella and E. coli infections in animals that affect the human food chain. Our analyses already influence the policies of organisations such as the ECDC (Stockholm), which coordinates European efforts to stop outbreaks of salmonellosis and listeriosis.

EnteroBase will help microbiologists, bioinformaticians, epidemiologists and population geneticists to integrate bacterial genomics with epidemiological disease patterns and to elucidate genetic relationships between S. enterica and E. coli from domestic animals and human patients, with IMPACTS ON DISEASE PREVENTION, MANAGEMENT OF INFECTION AND QUALITY OF LIFE. Obvious beneficiaries within the public sector include those employed in the HEALTH SERVICES, including the NHS and Public Health England, who will gain an improved understanding of the links between population biology, taxonomy and diagnosis/prognosis for these species.

The proposed resource will also enhance the UK'S REPUTATION AS A CENTRE OF EXCELLENCE, attracting highly skilled students, academics and collaborators from foreign countries. The research and professional skills in bioinformatics gained by staff working on the project will help ADDRESS THE NATIONAL SKILLS SHORTAGE in this area; similarly, the training provided more widely as part of the project will help improve bioinformatics and genomics skills among UK bacteriologists.


Achtman M (2016) How old are bacterial pathogens? in Proceedings. Biological sciences
Achtman M (2014) Distinct genealogies for plasmids and chromosome. in PLoS genetics
Cui Y (2014) Genetic variations of live attenuated plague vaccine strains (Yersinia pestis EV76 lineage) during laboratory passages in different countries. in Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases
Reuter S (2014) Parallel independent evolution of pathogenicity within the genus Yersinia. in Proceedings of the National Academy of Sciences of the United States of America
Zhou Z (2014) Transient Darwinian selection in Salmonella enterica serovar Paratyphi A during 450 years of global spread of enteric fever. in Proceedings of the National Academy of Sciences of the United States of America
Description The development of EnteroBase has been funded since 1 August 2014, and the EnteroBase website was opened for use by the general public in December 2015.
Exploitation Route EnteroBase provides assembled genomes and associated metadata on strain properties from all public short read archives for four bacterial genera. It provides the same service for microbiologists who upload short reads. EnteroBase also provides nomenclature for genotypes according to MLST, rMLST and CRISPR. It also includes all data previously presented by http://mlst.warwick.ac.uk
Sectors Healthcare
URL http://enterobase.warwick.ac.uk/
Description Investigator Award
Amount £1,944,236 (GBP)
Funding ID 202792/Z/16/Z 
Organisation The Wellcome Trust Ltd 
Sector Charity/Non Profit
Country United Kingdom of Great Britain & Northern Ireland (UK)
Start 09/2016 
End 08/2021
Title EnteroBase 
Description EnteroBase assembles 10,000s of genomes from public short read archives, as well as from short reads uploaded by its users, and associates them with metadata on the bacterial strain. It provides a user-friendly web browser interface for examining data as well as a computer-friendly API for high-throughput data access and uploading. EnteroBase contains assemblies from all publicly available short read archives for Escherichia and Shigella, Salmonella, Yersinia and Moraxella catarrhalis. Genotypes are called automatically from the genomes for MLST, rMLST and CRISPR. EnteroBase uses state-of-the-art methods for assembling genomes and calling genotypes, thus providing a unique service for non-IT specialists.
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact An initial overview of the genetic diversity of four species causing enteric diseases in humans and animals 
URL http://enterobase.warwick.ac.uk
Description Applied Maths 
Organisation Applied Maths
Country Belgium, Kingdom of 
Sector Private 
PI Contribution Committed to developing an API to EnteroBase that can be used by customers of Applied Maths using Bionumerics as well as by the general audience of bioinformaticians.
Collaborator Contribution Free multiuser license to Bionumerics 7.6 for development purposes within the context of developing EnteroBase. Commitment to developing a functional Bionumerics plug-in which will access the EnteroBase API.
Impact No outputs yet
Start Year 2014
Description Univ of Oxford 
Organisation University of Oxford
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Plan to link rMLST of Salmonella enterica and Escherichia coli with BIGSdb
Collaborator Contribution Plan to provide a direct link to the BIGSdb allele server for automated uploads and downloads of rMLST alleles for EnteroBase
Impact None yet
Start Year 2014
Title MGplacer 
Description Assigns metagenomic data onto a phylogeny based on 1000s of isolates.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None 
URL https://sourceforge.net/projects/mgplacer/