Federating access to wheat data services for efficient genome-specific marker design

Lead Research Organisation: The Genome Analysis Centre
Department Name: Research Faculty


Wheat is the most widely grown crop worldwide and provides 20% of the calories consumed by the growing human population. It is estimated that the average person will consume the grain of 50 wheat plants per day (https://www.jic.ac.uk/calculations/), and to help meet this demand the UK exports 15-20% (~2m tonnes) of its yearly crop to over 20 countries worldwide [1], as well as providing for the UK market. Over the last decade, research in breeding programmes has delivered large improvements in key traits such as yield and the ability to grow in tough conditions, both of which support viability on the world market. However, it is strongly predicted that rapid climate change, newly emerging wheat diseases, and reliance on a small set of wheat varieties will greatly challenge modern-day agriculture and food production.

The availability of information about wheat genomes and the differences between them (variation) is leading to a breakthrough in wheat research. Current services that share this information give researchers the ability to find regions of interest that match their research goals, and to understand and exploit characteristics of these regions for improving the crop. Such information can then be used in breeding programmes to design genetic markers for traits of interest, akin to marking Points of Interest on a map or navigation system. Once these markers have been designed, robotic platforms can use them to screen thousands of wheat lines a day for matches, which in turn indicate how a plant may perform in breeding experiments under different conditions.

Tools and resources that harness the power of breeding data and analysis packages, openly available to academia and industry alike, are key to accelerating wheat breeding programmes in the coming years. Many web-based databases and information services exist for housing and exposing wheat data. However, the stages leading up to screening the wheat lines involve intensive and laborious manual processes, and neither the availability of this information nor the way it is represented is consistent, which makes it difficult for researchers and breeders to use it effectively. Users must submit information at each step to multiple online or local analysis tools, run multiple queries and analyses, and manually process the results in desktop computer applications so that they can be fed into the next tool in the workflow.

Our project will remove these manual steps by developing software to automate the required interactions with commonly used online wheat data resources. We will build software tools that automatically connect each wheat data service in turn to form a workflow, understanding and processing the data produced by one service to provide the input data for the next. This will free up valuable researcher time and, by removing the need for human intervention and for manual management of potentially complex data files, will result in a more robust and reproducible workflow.
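The idea of chaining services, where the output of one step becomes the input of the next, can be sketched in a few lines of Python. This is a minimal illustration only, not the project's software: the step names, the record fields (`gene`, `region`, `variants`, `markers`) and the three stub functions are hypothetical stand-ins for remote wheat data services, and no real service API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A record is a plain dictionary passed from one step to the next.
Record = Dict[str, object]


@dataclass
class Step:
    """One stage of the workflow: a name plus a function over the record."""
    name: str
    run: Callable[[Record], Record]


def run_workflow(steps: List[Step], initial: Record) -> Record:
    """Run each step in order, merging its output into the shared record."""
    record = dict(initial)  # copy so the caller's input is not mutated
    for step in steps:
        record.update(step.run(record))
    return record


# Hypothetical stubs standing in for calls to remote data services.
def lookup_region(record: Record) -> Record:
    return {"region": f"{record['gene']}:1-100"}


def fetch_variants(record: Record) -> Record:
    return {"variants": [f"{record['region']}:SNP@42"]}


def design_markers(record: Record) -> Record:
    return {"markers": [f"marker_for_{v}" for v in record["variants"]]}


steps = [
    Step("lookup_region", lookup_region),
    Step("fetch_variants", fetch_variants),
    Step("design_markers", design_markers),
]
result = run_workflow(steps, {"gene": "example_gene"})
print(result["markers"])
```

Because every step reads from and writes to the same record, no manual reformatting is needed between stages, and the full record doubles as a log of what each step produced.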

Technical Summary

Research datasets generated from non-conventional model organisms are rarely as well curated and accessible as those generated as part of the Human Genome Project, ENCODE, and Ensembl. Most research data is housed either in public repositories, which aim to archive data for long periods of time so that researchers can download their own copies (e.g. EMBL-EBI European Nucleotide Archive), or in institutional repositories and databases, which have specific points of access and do not typically offer data integration services. It is therefore common practice for researchers to use a multitude of services, often linking them together by manually converting and formatting the data in intermediate steps. This "context switching" between services is a bottleneck for research data sharing, subsequent reuse of data in analyses, and scientific reproducibility in general. The burden is even greater for researchers who do not have a computational background and rely on bioinformaticians and/or specialist tools to assist them in their investigations.

In the current "big data" multi-disciplinary research environment, access to information stored in single standalone databases is not sufficient to undertake the integrative aspects of modern computational analysis. Efforts such as Ensembl Plants have made significant inroads into providing a system that allows comparative analysis across multiple plant species, and CerealsDB aims to expose a large amount of informative wheat variation data freely and openly. However, these systems are not able to intercommunicate easily, with users often having to manually undertake multiple analysis steps. The integrative workflow proposed in this project will provide the necessary infrastructure to connect and query multiple resources of genomic information in order to make the process of marker-assisted primer design in wheat faster, more efficient, and more comprehensive than is currently available.

Planned Impact

Academic and Commercial Impacts:
Within academic institutes, researchers are well accustomed to finding and using the latest resources, datasets and methods that are disseminated through traditional publication routes. As such, open and efficient access to these elements is crucial to promote uptake, realise impact, and continue to foster the global effort in wheat genomics. However, not all research groups have access to the computational expertise needed to carry out and streamline what can be complex pipelines of data retrieval, conversion, analysis and exploitation. Open, reusable workflows packaged in easy-to-use, well-designed web-based solutions are a vital part of the researcher's toolbox, helping to make the best use of researchers' time and to promote scientific reproducibility.

Within breeding companies, the number of highly trained researchers who keep abreast of the newest wheat genomic resources is limited. Typically, it is these researchers and their teams who need to dedicate a large proportion of their working time to painstakingly navigating the available resources to design just one marker. Providing breeding companies with automated open access workflows to carry out these manual steps will increase efficiency and therefore improve productivity in breeding new varieties. In addition, many beneficial traits in wheat are introduced from other cultivars or wild wheat relatives. The ability to generate markers that tag the gene of interest as precisely as possible and in a reproducible manner will help to reduce introgressed regions in newly bred varieties, and therefore a larger number of improved varieties will be released onto the market. This has a direct and tangible impact on both the drive towards food security and the public perception of wheat breeding efforts.

Societal impacts:
The wheat community has been reluctant to adopt open data conventions and widespread data sharing. This is understandable, given the direct applications of the translational aspects of wheat genomics research to breeding programmes. However, it is becoming clear that the field of food sustainability and security requires openness and transparency in order to enhance the public perception of crop improvement. The objectives of this proposal will not only make researchers' lives easier through the application of bioinformatics techniques, but will also highlight to the public the increasing need for computational infrastructures to improve the efficiency of crop research and to handle the increasingly large datasets it produces. In this way, this project will contribute to the movement towards a more open and inclusive approach to wheat data sharing.

Drs Davey and Krasileva are committed to rapid dissemination of fundamental research into crop improvement, alongside open source computer software development, to multi-disciplinary beneficiaries. This is evidenced by Dr Davey's membership of the Open Bioinformatics Foundation (OBF) and both investigators' existing collaborations in functional wheat genomics, crop data infrastructure development, large-scale analytical platforms, and data sharing projects. Similarly, TGAC's body of freely available bioinformatics tools and resources reflects the commitment of the institute to furthering life science research through open science and computational excellence.

