Automated building of carbohydrate molecules using X-ray crystallography data

Lead Research Organisation: University of York
Department Name: Chemistry


Carbohydrate molecules are an essential part of the living world, making up the sugars in our food, the fibres in clothes and the cell walls of green plants. The interaction of carbohydrates with other biological molecules, and in particular proteins, is an important part of many biological processes. Understanding this interaction is important for understanding how cells work and interact with one another, as well as being important to diverse bio-technologies such as the breaking down of fibrous landfill waste and the development of biological washing powders. Many proteins secreted by higher organisms have carbohydrates built directly into their structure and those incorporated into the cell membrane contribute to cell-cell recognition.

X-ray crystallography - essentially an extremely powerful microscope - allows us to see the atoms in the 3D structures of biological macromolecules such as proteins. This knowledge is vital to an understanding of how such molecules carry out their tasks. Protein-sugar complexes and glycoproteins can be studied using crystallography, but while the protein can often be seen fairly clearly in the resulting 3D structure, the sugar is frequently blurry because carbohydrates tend to be flexible. Interpreting the magnified image in terms of atoms and bonds can be time-consuming, and the results somewhat subjective.

The aim of this project is to provide an automated method for performing this interpretation. While automation is of value in that it frees up researcher time to concentrate on the scientific problems, a more important benefit is that it allows many possible interpretations of the magnified electron density image to be explored. In difficult cases this larger starting set of models is more likely to contain the correct answer than a single model built by a crystallographer. The different models can then be ranked to choose the best one.

The structure of the protein is easily built by known methods, leaving a 'blob' of unaccounted-for electron density into which the sugar must be placed. An initial set of possible structures for the sugar will be determined using existing web-based software and the Protein Data Bank (PDB). Dr. Cowtan at the University of York has previously written computer software which successfully identifies the sugar rings in nucleic acids (including DNA which carries the genetic information) from their electron density alone. The 'fingerprint' technique involves the identification of shapes which are always present when the sugar is present. This approach will be modified to identify sugar rings in the carbohydrates. Each possible ring shape will be tested against the X-ray result, and neighbouring rings will be linked together. This will give a large pool of possible structures which can be ranked by automatic scoring methods based on the 3D X-ray maps and the plausibility of the chemistry.
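The ranking of a pool of possible structures can be illustrated with a minimal Python sketch. It assumes each candidate model already carries two numbers: a density-fit score (higher is better) and a chemistry penalty (lower is better). The `Candidate` class, its field names and the `weight` parameter are hypothetical and are not taken from the project's software.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One possible sugar model (hypothetical container for illustration)."""
    name: str
    density_fit: float        # agreement with the X-ray map, higher is better
    chemistry_penalty: float  # implausible geometry, lower is better


def rank_candidates(candidates, weight=0.5):
    """Return candidates sorted best-first by a combined score.

    The combined score is density_fit - weight * chemistry_penalty.
    The weight is an assumed tuning parameter, not a published value.
    """
    return sorted(
        candidates,
        key=lambda c: c.density_fit - weight * c.chemistry_penalty,
        reverse=True,
    )


# Made-up numbers, used only to show the call.
pool = [
    Candidate("model_a", density_fit=0.80, chemistry_penalty=0.5),
    Candidate("model_b", density_fit=0.75, chemistry_penalty=0.1),
    Candidate("model_c", density_fit=0.60, chemistry_penalty=0.0),
]
ranked = rank_candidates(pool)
```

With a non-zero weight, a model that fits the map slightly less well but has more plausible chemistry can outrank the best-fitting model, which is the point of combining the two criteria.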

The resulting methods will be applied to two problems: the building of (1) carbohydrate molecules (such as enzyme substrates) crystallised in complex with proteins and (2) the sugars which are an integral part of glycoproteins. The software will be implemented in the ubiquitous 'Coot' graphical model building software, and will thus be made available to the whole user community, both academic and commercial.

The result will be a more reliable and more objective interpretation of sugars in 3D structures of macromolecules from living organisms, which in turn will enable greater understanding of the roles of these sugars in essential biological processes.

Technical Summary

The project is made up of four steps.

1. Devise a test suite of datasets which are representative of the problems faced in building sugars in macromolecular structures, and a library of well refined carbohydrate structures. These will be mined from the PDB and assembled into a curated library. The test suite will be augmented by structures previously determined at YSBL.

2. Explore the conformational space of a given carbohydrate. An initial model will be obtained from the library above, from online resources such as the GLYCAN database, or from the supplied description. Local conformational variability around any position in the molecule will then be explored using a library of disaccharide fragments, or if necessary a grid search of ring conformations and linkage torsions.

3. The model will be fitted into the electron density by first finding a start position, either by using a YSBL-developed tool to detect the electron density fingerprint of pyranose rings (for carbohydrate complexes) or from the location of the glycosylated residue (for glycoproteins). The conformational search will then be used iteratively to add successive units from the start position, using a look-ahead search.

4. A non-interactive version of this software will be used to generate an ensemble of solutions. These will be forwarded to refinement in the 'refmac' software and evaluated on the basis of the X-ray data using difference density maps, and stereochemistry using the MolProbity score. Scoring functions for filtering the list of conformations before the slow refinement step will be examined.
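Steps 2 and 3 above can be sketched in Python as a grid of linkage torsions combined with a look-ahead extension of the chain. This is an illustrative sketch only: the function names, the 60-degree grid step and the scoring callback are assumptions, and the toy score below stands in for a real fit to electron density.

```python
from itertools import product


def torsion_grid(step_deg=60):
    """All (phi, psi) linkage torsion pairs on a regular grid, in degrees."""
    angles = range(-180, 180, step_deg)
    return list(product(angles, angles))


def best_continuation(chain, candidates, score, depth):
    """Best total score reachable by adding `depth` more units to `chain`."""
    if depth == 0:
        return 0.0
    return max(
        score(chain, c) + best_continuation(chain + [c], candidates, score, depth - 1)
        for c in candidates
    )


def build_chain(n_units, candidates, score, lookahead=1):
    """Add units one at a time, choosing each by its own score plus the
    best score reachable within `lookahead` further units."""
    chain = []
    while len(chain) < n_units:
        # Never look past the end of the chain.
        depth = min(lookahead, n_units - len(chain) - 1)
        best = max(
            candidates,
            key=lambda c: score(chain, c)
            + best_continuation(chain + [c], candidates, score, depth),
        )
        chain.append(best)
    return chain


# Toy scoring function (hypothetical): rewards closeness to known target
# torsions at each position and ignores angle wrap-around for simplicity.
TARGETS = [(-60, 120), (60, -120), (-180, 0)]


def toy_score(chain, candidate):
    phi_t, psi_t = TARGETS[len(chain)]
    return -(abs(candidate[0] - phi_t) + abs(candidate[1] - psi_t))


built = build_chain(len(TARGETS), torsion_grid(60), toy_score, lookahead=1)
```

The cost grows as the number of candidates raised to the power of the look-ahead depth plus one, which is one reason a cheap filtering score before the slow refinement step is attractive.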

Graphical user interfaces will be developed for the widely used and freely distributed 'Coot' software for building macromolecular structures, and for the CCP4 suite.

Planned Impact

The structures of mammalian proteins, which may be glycosylated, are increasingly used by the pharmaceutical industry. There is also pressure to ensure that therapeutic compounds adopt the correct glycoforms. Finally, the development of biofuels is driving interest in cellulose-digesting enzymes. As a result there is significant industrial interest in carbohydrates.

Industrial users in all of these fields depend on accurate structural studies including carbohydrate chains, and some are involved in performing such studies. Consumers of the structural studies will draw more accurate conclusions if the source data is also more accurate. Companies performing such studies will see direct benefits from automation in the form of reduced labour and reduced error rates. This is of particular relevance to pharmaceutical applications, given that FDA regulations now require the extensive characterization of the glycoform profiles of therapeutic glycoproteins.

The CCP4 suite already has significant commercial impact (~100 commercial licensees paying £9500pa). YSBL has played a significant role in this impact, with two YSBL-originated developments (the 'refmac' and 'coot' software) being the most-used tools in their field. We will actively engage this user base. This will be achieved as follows:

CCP4 developers, including the York group, conduct an annual meeting with structural biologists at GSK to guide future developments. Plans and developments will be presented at this meeting. There are occasional visits to other customers.

Users from two further commercial licensees responded positively to an initial message placed on the CCP4 bulletin board asking if there was a demand for carbohydrate building software. These users will be contacted before the grant begins to clarify their use cases, and site visits will be arranged to ensure that the software can address their needs.

The full list of CCP4 commercial licensees will be contacted and offered the chance to provide input on the project.

From the responses from all these groups, a small group of user champions will be identified. These will receive test versions of the software as it is developed to ensure user feedback. They will be invited to attend a progress and steering meeting around the middle of the grant.

Once the software has achieved a basic level of usability, the developer will participate in CCP4 site visits to interested commercial users to demonstrate and train staff in the use of the software.

This represents a greater level of engagement with commercial users than has been normal in previous CCP4 projects, which is appropriate given the commercial relevance of this project.
In addition to these direct links with commercial users, the PDRA will attend CCP4 workshops to teach users how to apply the software to their problems. These workshops have proven to be vital in bringing together users and developers and ensuring that the software addresses real-world problems, and, most importantly, will be aimed at both academic and commercial laboratories.

The resulting software will be added to the COOT package (for the interactive tools) and to the CCP4 suite (for non-interactive tools). Both packages are in use world-wide and are available on Windows, Linux and Mac OS platforms, providing a direct distribution channel to the vast majority of macromolecular crystallographers. Both are updated with major version releases roughly every year. COOT provides nightly releases to ensure users can access the latest developments; the CCP4 suite is on track to do the same in the near future. As a result, once the software has been added to these packages, it will be available to most of the user community within months.


Agirre J (2016) Three-dimensional structures of two heavily N-glycosylated Aspergillus sp. family GH3 β-D-glucosidases. in Acta crystallographica. Section D, Structural biology

Agirre J (2015) Carbohydrate anomalies in the PDB. in Nature chemical biology

Agirre J (2015) Privateer: software for the conformational validation of carbohydrate structures. in Nature structural & molecular biology

Agirre J (2017) Strategies for carbohydrate model building, refinement and validation. in Acta crystallographica. Section D, Structural biology

Agirre J (2017) Carbohydrate structure: the rocky road to automation. in Current opinion in structural biology

Hudson KL (2015) Carbohydrate-Aromatic Interactions in Proteins. in Journal of the American Chemical Society

Agirre J (2014) Validation of carbohydrate structures: not just nomenclature. in International Union of Crystallography, world meeting.

McNicholas S (2017) Glycoblocks: a schematic three-dimensional representation for glycans and their interactions. in Acta crystallographica. Section D, Structural biology

Nnamchi CI (2016) Structural and spectroscopic characterisation of a heme peroxidase from sorghum. in Journal of biological inorganic chemistry : JBIC : a publication of the Society of Biological Inorganic Chemistry

Description The methods we have developed synthesise descriptions (fingerprints) of each cyclic carbohydrate's structure and environment using experimental data provided by public sources (wwPDB and PDB_REDO). The results therefore depend strongly on the quality of the data fed into the synthesis process. To ensure that only chemically correct data were processed, state-of-the-art validation software has been developed.

This validation software, which has been used to analyse the aforementioned databanks, has revealed a number of important issues that affect the quality of carbohydrate models. We have established the cause of most of these unexpected problems and reported them to the respective authors of relevant model building and refinement programs. As a result, future versions of these programs will produce more realistic models. Also, staff at the wwPDB have expressed an interest in integrating our software into their existing validation pipeline. In addition, the Collaborative Computational Project 4 (CCP4) has included it in their current release (7.0).

We have also generalised and automated the fingerprinting technique. This will allow our software to detect not only sugars, but theoretically any kind of ligand.
Exploitation Route Our software may be readily extended by others. It has a modular design, which has facilitated its incorporation into other programs such as the widely used Coot and CCP4mg packages.
Sectors Pharmaceuticals and Medical Biotechnology
Description Determining and validating sugar conformation is crucial for designing inhibitors of carbohydrate-active enzymes, such as glycosylhydrolases - key actors in biomass processing - and glycosyltransferases, which are frequent targets for drugs that inhibit bacterial cell-wall formation. The automated detection and building of glycan models will also save industrial users a considerable amount of time: N-glycosylation is the single most frequent protein modification in eukaryotic cells, which are of critical interest to the pharmaceutical industry. The Privateer software is now part of the standard structure deposition pathway at the Worldwide Protein Data Bank (wwPDB), and will therefore be applied as a matter of course to all new carbohydrate-containing structures, which in turn are of considerable interest to the pharmaceutical industry.
First Year Of Impact 2015
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Economic
Title Privateer software for validation of carbohydrate structures 
Description Software to validate the carbohydrate structures in atomic models in the PDB. The software currently automatically detects nomenclature errors which are widespread, and also analyses ring puckering to detect high energy conformations. Alpha release of software to other developers 
Type Of Technology software 
Year Produced 2014 
Impact No actual Impacts realised to date 
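The ring puckering analysis mentioned in the description above can be illustrated with a minimal Python sketch. The record does not state which puckering descriptor the software uses; Cremer-Pople coordinates are assumed here because they are a commonly used choice for six-membered rings. The function and helper names are hypothetical, and the test rings are idealised geometries rather than real sugar coordinates.

```python
import math


def cremer_pople_six(ring):
    """Cremer-Pople puckering parameters for a six-membered ring.

    `ring` is a list of six (x, y, z) tuples in ring order.
    Returns (Q, theta_deg, phi_deg). phi is ill-defined for a pure chair.
    """
    n = 6
    if len(ring) != n:
        raise ValueError("expected exactly six ring atoms")

    # Move the origin to the geometric centre of the ring.
    centre = [sum(p[k] for p in ring) / n for k in range(3)]
    pts = [[p[k] - centre[k] for k in range(3)] for p in ring]

    # Mean-plane normal from the two Cremer-Pople reference vectors.
    r1 = [sum(p[k] * math.sin(2 * math.pi * j / n) for j, p in enumerate(pts))
          for k in range(3)]
    r2 = [sum(p[k] * math.cos(2 * math.pi * j / n) for j, p in enumerate(pts))
          for k in range(3)]
    normal = [
        r1[1] * r2[2] - r1[2] * r2[1],
        r1[2] * r2[0] - r1[0] * r2[2],
        r1[0] * r2[1] - r1[1] * r2[0],
    ]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]

    # Out-of-plane displacement of each atom.
    z = [sum(p[k] * normal[k] for k in range(3)) for p in pts]

    a = math.sqrt(2 / n) * sum(zj * math.cos(4 * math.pi * j / n)
                               for j, zj in enumerate(z))
    b = -math.sqrt(2 / n) * sum(zj * math.sin(4 * math.pi * j / n)
                                for j, zj in enumerate(z))
    q2 = math.hypot(a, b)
    q3 = sum((-1) ** j * zj for j, zj in enumerate(z)) / math.sqrt(n)

    total_q = math.hypot(q2, q3)
    theta = math.degrees(math.atan2(q2, q3))
    phi = math.degrees(math.atan2(b, a)) % 360.0
    return total_q, theta, phi


def ideal_ring(z_offsets, radius=1.4):
    """Regular hexagon with the given out-of-plane offsets (toy geometry)."""
    return [(radius * math.cos(math.pi * j / 3),
             radius * math.sin(math.pi * j / 3), dz)
            for j, dz in enumerate(z_offsets)]


chair = ideal_ring([0.25, -0.25, 0.25, -0.25, 0.25, -0.25])
boat = ideal_ring([0.5, -0.25, -0.25, 0.5, -0.25, -0.25])
```

In these coordinates a chair lies at theta near 0 or 180 degrees, while boats and twist-boats lie near 90 degrees. For pyranoses the chairs are usually the low-energy forms, so a ring far from either pole is a candidate for closer inspection.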
Title privateer-validate version 6.5: Software for the validation of carbohydrate structures 
Description Privateer-validate checks the conformation of carbohydrate ligands and glycans in macromolecular structures, identifying chemically relevant nomenclature and stereochemical errors. 
Type Of Technology Software 
Year Produced 2015 
Impact The software has been adopted by the Worldwide Protein Data Bank (wwPDB) for the validation of all new structures deposited in the wwPDB. 
Title privateer-validate version 7.0: Software for the validation of carbohydrate structures 
Description Privateer-validate checks the conformation of carbohydrate ligands and glycans in macromolecular structures, identifying chemically relevant nomenclature and stereochemical errors. The latest version includes methods to generate restraint dictionaries for the refinement of carbohydrates, to aid the correct solution of new structures and the remediation of existing structures. 
Type Of Technology Software 
Year Produced 2016 
Impact The software has been adopted by the Worldwide Protein Data Bank (wwPDB) for the validation of all new structures deposited in the wwPDB. 
Description DLS-CCP4 Data Collection and Structure Solution Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact DLS-CCP4 Data Collection and Structure Solution Workshop
Delivered a 45-minute lecture on carbohydrate building and validation.
Year(s) Of Engagement Activity 2015
Description Press Release: Sugar molecules lose their 'Cinderella' status. Now a team from the York Structural Biology Laboratory (YSBL) in the Department of Chemistry at the University of York has produced user-friendly software called Privateer that enables scientists to analyse and study the three-dimensional structure of carbohydrates, facilitating their exploitation in academic and modern medicine. The work is published in Nature Structural & Molecular Biology. 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Reported by six news outlets. Altmetric page:
Year(s) Of Engagement Activity 2015