Building a global metagenomics portal ('MGportal') to handle next-generation sequencing data and associated metadata

Lead Research Organisation: University of Oxford
Department Name: Oxford e-Research Centre

Abstract

While genomes represent the full genetic (DNA) complement of a single organism, metagenomes represent the DNA of an entire community of organisms. These organisms might be free-living in the environment, or be found on the skin or in the gut of a human being or other species. Microbial organisms play a major role in our everyday health and well-being, which is not surprising when you consider that the number of microbial cells in or on an average human body actually exceeds the number of human cells! Microbes play a similarly important role in the environment; different types of organisms live under different conditions (including extreme habitats, such as the run-off from acid mines or the depths of the oceans). Understanding how these organisms have adapted to their various living conditions will lead to a better understanding of how changes in the environment will affect biodiversity in the future. It may also lead to the discovery of entirely new species or novel proteins which could have utility as antibiotics or other drugs. Combined with other types of 'omic data, metagenomes hold the promise of unparalleled insights into fundamental questions across a range of fields including evolution, ecology, environmental biology, health and medicine. To fully exploit the promise of these data we need both scientific innovation and community agreement on how to provide appropriate stewardship of these resources for the benefit of all. Significant numbers of metagenomics projects have been awarded grants by international funding bodies. Whilst all of these projects have specific, scientifically interesting aims, they mostly exist in isolation, with little or no cross-referencing to other metagenomic or genomic datasets. Our intention is to leverage existing infrastructure to deliver a world-class metagenomics resource with unique utility for UK-based metagenomics researchers.
This resource, MGportal, will utilise user-friendly interfaces, state-of-the-art algorithms and the EBI's unique position as a hub of biological information to measurably enhance the value of these researchers' data. It will be built in close collaboration with the Genomic Standards Consortium (GSC). MGportal will consist of software tools to enable metagenomics researchers to upload their data to the raw nucleotide sequence archives, data analysis pipelines to predict which potential genes are present in the data and what their functions are, plus a web interface which will display these data and results in a way that is easy to browse and query. We will hold training courses and a workshop to gain input from the scientific community about the portal. It is hoped that MGportal will eventually allow researchers to understand the results of their metagenomics experiments and to see how those results compare with the outcomes of other studies.

Technical Summary

While genomes represent the full genetic (DNA) complement of a single organism, metagenomes represent the DNA of an entire community of organisms. Interest in improved sampling of diverse environments (e.g. hosts/gut, plants, soil, etc.) combined with advances in the development and application of ultra-high-throughput sequencing methodologies is set to vastly accelerate the pace at which new metagenomes are generated. Combined with other types of 'omic data, metagenomes hold the promise of unparalleled insights into fundamental questions across a range of fields including evolution, ecology, environmental biology, health and medicine. To fully exploit the promise of these data we need both scientific innovation and community agreement on how to provide appropriate stewardship of these resources for the benefit of all. In this three-year collaborative project we aim to build an international data resource and portal for metagenomic data at the European Bioinformatics Institute. This portal will manage the submission, storage, dissemination and mining of metagenomic data from data providers across the world. The portal will focus on the capture of rich contextual information (metadata), working in close collaboration with the Genomic Standards Consortium (GSC), an international working body creating and implementing standards to describe genomes, metagenomes and marker gene sequences. Further, the collaborative use of the ISA Infrastructure software suite for metadata capture will enable the capture and sharing of standards-compliant data and integration with a range of other data types. The resulting MGportal will be a major new resource at the EBI. The combined MGportal team will engage in a range of community-building activities, including hosting workshops and training activities that both educate data submitters and users and ensure that the portal develops in line with community needs.

Planned Impact

The full impact of this work is described in the impact statement of the lead institute, the EBI. Here we elaborate on the specific impact of the work to be completed in this project under the auspices of the Genomic Standards Consortium and the ISA Infrastructure project. The primary impact of the proposed tight collaboration between these groups and the EBI is the increased level of community involvement in the creation of resources that serve community needs. This is a pioneering aspect of this proposal.
Community-level consensus. This project will help to continue funding these key grass-roots activities, thus strengthening them and their ability to give a voice to the wider scientific community on issues of data stewardship, standardization and sharing. Specifically, this project will directly fund core activities with the GSC (i.e. through Peter Sterk's role as Secretary of the GSC) and, most importantly, provide funds to implement GSC-recommended standards at the international level. This is a key step on the path towards international adoption of standards that will underpin future data sharing. It will also ensure the usage of a premier example of standards-compliant tools in the creation of this portal. The ISA Infrastructure, already funded by the BBSRC in the past BBR round, is a complete suite of tools for capturing and disseminating standards-compliant metadata. Its use in this project paves the way for universal sharing of metadata about samples and data types, as this work will increase the chances that other projects will adopt this shared approach.
Data sharing. The adoption of these community-defined approaches is also in direct support of the strong BBSRC data sharing policy. Putting this standards-compliant infrastructure into place will ensure compliance with the policy of making data freely available in reusable form.
Policy makers. The production of more richly annotated bioinvestigations will improve the evidence base for policy makers by providing greater interpretability of experimental context, simplifying the job of data integration and study comparison. More detail for those forming policy on biological and biomedical issues should produce better decisions.
Journals. The current trend shows that, like funders, journals increasingly require, firstly, that researchers make more of their data public, for example by submitting it to public repositories, and, secondly, that they begin to comply with community-defined standards. However, 'non-compliance' may be difficult to overcome: experimental metadata are still normally sparse in publications and in the supplementary data that sometimes accompany them, limiting data accessibility and utility. This is because of the lack of (i) reviewer time and expertise, as reviewers are not trained to check compliance; (ii) awareness of the existence of appropriate reporting standards; (iii) access to freely available tools implementing standards; and (iv) adequate data management resources at the local and community levels. Greater automation of the reporting processes is required. The only feasible solution is better annotation and education at source (i.e., by providing data producers with a straightforward way in which to use community annotation standards), assisted by some form of automated content validation. Through this collaboration we will disseminate this best practice by building compliance with standards into MGportal.
Outreach. The high-profile nature of this project (a major new database/portal at the EBI) will help to spread the word about the importance of standards in the community. Finally, the planned workshops and interactions with the existing GSC and ISA communities will succeed in engaging a larger proportion of bench scientists in efforts to provide the best possible stewardship of our collective data assets.
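As a rough illustration of the "automated content validation" referred to above, the following minimal Python sketch checks one sample's metadata against a checklist of required fields. It is not the portal's or the ISA tools' implementation: the field names, the `REQUIRED_FIELDS` table and the `validate_sample` function are all hypothetical and are not taken from any published GSC checklist.

```python
# Hypothetical checklist: these field names and types are illustrative
# assumptions, not an actual GSC or ISA specification.
REQUIRED_FIELDS = {
    "sample_name": str,
    "collection_date": str,
    "latitude": float,
    "longitude": float,
    "environment": str,
}


def validate_sample(record):
    """Return a list of human-readable problems found in one sample record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing required field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Range checks only run when the coordinate is present as a float.
    lat = record.get("latitude")
    if isinstance(lat, float) and not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range (-90 to 90)")
    lon = record.get("longitude")
    if isinstance(lon, float) and not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range (-180 to 180)")
    return problems
```

A submission tool could run a check of this kind before upload and return the list of problems to the data producer, so that gaps are fixed at source rather than discovered by a reviewer.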

Publications


Baker NA (2013) Standardizing data. in Nature nanotechnology

Bandrowski A (2016) The Ontology for Biomedical Investigations. in PloS one

Brandizi M (2012) graph2tab, a library to convert experimental workflow graphs into tabular formats. in Bioinformatics (Oxford, England)

Chervitz SA (2011) Data standards for Omics data: the basis of data sharing and reuse. in Methods in molecular biology (Clifton, N.J.)

Field D. (2011) Data standards in Scientist

Gaudet P (2011) Towards BioDBcore: a community-defined information specification for biological databases. in Database : the journal of biological databases and curation

 
Description We have contributed to the development of a public repository for metagenomics data at the EBI; specifically, we have refined a set of tools to help researchers to collect, annotate and submit their datasets to this repository.
Exploitation Route This is a public data deposition service and the tools are freely available to researchers for their continued use in managing and sharing their metagenomics datasets.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Pharmaceuticals and Medical Biotechnology
URL https://www.ebi.ac.uk/metagenomics
 
Description The portal is maturing and currently serves as a key community portal for this data type. Its use will continue to increase the effectiveness of data sharing and reuse.
First Year Of Impact 2013
Sector Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Pharmaceuticals and Medical Biotechnology
Impact Types Cultural
 
Description Advised NPG Scientific Data journal on its data policy
Geographic Reach Multiple continents/international 
Policy Influence Type Membership of a guideline committee
Impact Increase in data publication and sharing via public, community-approved repositories
URL http://www.nature.com/sdata/data-policies
 
Description EC H2020 - INFRADEV-3-2015 - ELIXIR EXCELERATE
Amount € 240,000 (EUR)
Organisation European Commission (EC) 
Department Horizon 2020
Sector Public
Country European Union (EU)
Start 09/2015 
End 08/2019
 
Title BioSharing 
Description Registry of standards and databases linked to data policies by funders and journals. 
Type Of Material Improvements to research infrastructure 
Year Produced 2011 
Provided To Others? Yes  
Impact Launched in 2011, the BioSharing portal (https://biosharing.org) of interrelated standards, databases, and policies has 53,741 users and is a resource of the ELIXIR UK Node and the ELIXIR Interoperability Platform. It is endorsed by a community of 68 organizations, including publishers (it is embedded in the data policies of 600 Springer Nature journals, and also of PloS, EMBO Press, BMJ, F1000Research, BioMedCentral, Oxford University Press and Wellcome Trust Open Research), standardization groups, and research data management support initiatives and libraries (such as those at JISC and at Stanford, Cambridge and Oxford Universities).
URL http://biosharing.org/
 
Title ISA tools 
Description Tools to collect, annotate, store, share and publish datasets 
Type Of Material Improvements to research infrastructure 
Year Produced 2010 
Provided To Others? Yes  
Impact Running since 2007, the open-source ISA software suite for metadata reporting has a user base ranging from hundreds to thousands of users from diverse domains (http://isa-tools.org), and is a resource of the ELIXIR UK Node. It is currently embedded in 27 public resources (institute-based, project/consortium-based or global repositories, including some based at the EBI and in the USA, Japan, China and Australia), supports two data-driven journals (Springer Nature Scientific Data, Oxford University Press GigaScience), and complements 9 internal data platforms (also at the FDA National Center for Toxicological Research and Janssen R&D); see http://www.isacommons.org. The extension of the ISA metadata representation format for nanotechnology applications became a formal ASTM standard in 2013.
URL http://www.isa-tools.org
 
Description ELIXIR UK Node 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC)
Department Earlham Institute
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC)
Department Rothamsted Research
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation Edinburgh Genomics
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Private 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation Heriot-Watt University
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation Imperial College London (ICL)
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation Newcastle University
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University College London (UCL)
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Birmingham
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Cambridge
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Dundee
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Edinburgh
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Liverpool
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Description ELIXIR UK Node 
Organisation University of Manchester
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution Help create the ELIXIR UK Node
Collaborator Contribution Contribute to the creation of the ELIXIR UK Node
Impact Creation of a virtual entity that represents UK strengths in bioinformatics and provides a route for UK bioinformatics resources to participate in, and benefit from, ELIXIR. The Node is currently being formalized.
Start Year 2012
 
Title ISA tools 
Description Software suite to collect, annotate, store, share and publish datasets 
Type Of Technology Software 
Year Produced 2010 
Impact Growing number of users, as listed at http://isacommons.org, and of co-developers who have contributed, and are still contributing, to collaborative enhancements.
URL https://github.com/ISA-tools