Joining the dots: from data to insight

Lead Research Organisation: University of Southampton
Department Name: School of Mathematics


The relentless growth in the amount, variety, availability, and rate of change of data has profoundly transformed essentially all aspects of human life. The Big Data revolution has created a paradox: while we create and collect more data than ever before, it is not always easy to unlock the information it contains. To turn the easy availability of data into a major scientific and economic advantage, it is imperative that we create analytic tools equal to the challenge presented by the complexity of modern data.
In recent years, breakthroughs in topological data analysis and machine learning have paved the way for significant progress towards creating efficient and reliable tools to extract information from data.

Our proposal has been designed to address the scope of the call as follows.
To 'convert the vast amounts of data produced into understandable, actionable information' we will create a powerful fusion of machine learning, statistics, and topological data analysis. Combining statistical insight and the computational power of machine learning with the flexibility, scalability, and visualisation tools of topology will allow a significant reduction in the complexity of the data under study. The results will be output in the form best suited to the intended application or scientific problem at hand. In this way, we will create a seamless pathway from data analysis to implementation, which will allow us to control every step of the process. In particular, the intended end user will be able to query the results of the analysis to extract the information relevant to them. In summary, our work will provide tools to extract information from complex data sets to support user investigations or decisions.

It is now well established that a main challenge of Big Data is how 'to efficiently and intelligently extract knowledge from heterogeneous, distributed data while retaining the context necessary for its interpretation'. This will be addressed first of all by developing techniques for dealing with heterogeneous data. A main strength of topology is its ability to identify simple components in complex systems. It can also provide guiding principles on how to combine elements to create a model of a complex system, and numerical techniques to control the overall shape of the resulting model so that it fits the original constraints. We will use the particular strengths of machine learning, statistics and topology to identify the main properties of data, which will then be combined to provide an overall analysis of the data. For example, a collection of text documents can be analysed using machine learning techniques to create a graph which captures similarities between documents in a topological way. This is an efficient way to classify a corpus of documents according to a desired set of keywords. An important part of our investigation will be to develop robust techniques of data fusion, which matter in many applications. One of our main applications will address the problem of creating a set of descriptors to diagnose and treat asthma. There are five main pathways for clinical diagnosis of asthma, each supported by data. To create a coherent picture of the disease we need to understand how to combine the information contained in these separate data sets to create the so-called 'asthma handprint', which is a major challenge in this part of medicine.
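The document-graph example above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's method: the choice of TF-IDF weights, cosine similarity, and a fixed threshold is ours, and the function names (`tfidf_vectors`, `similarity_graph`, `connected_components`) are hypothetical. The connected components of the resulting graph are its simplest topological summary and give a crude grouping of the corpus.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(docs, threshold=0.3):
    """Join documents i and j by an edge when their similarity >= threshold.

    The threshold value is an assumption made for this sketch.
    """
    vectors = tfidf_vectors(docs)
    graph = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph):
    """Return the connected components of the graph as sorted lists of nodes."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # Toy corpus invented for illustration only.
    corpus = [
        "asthma lung airway inflammation",
        "asthma airway inflammation treatment",
        "molecule solubility chemistry",
        "solubility molecule prediction",
    ]
    print(connected_components(similarity_graph(corpus)))
```

Varying the threshold and tracking how components merge is the idea behind persistence in dimension zero. A realistic pipeline would replace the whitespace tokeniser and the fixed threshold with something better suited to the corpus.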

Every novel methodology of data analysis has to prove that its 'techniques are realistic, compatible and scalable with real-world services and hardware systems'. The best way to do that is to engage from the outset with challenging applications, and to ensure that theoretical and modelling solutions fit the intended applications well. We offer a unique synergy between theory and modelling as well as world-class facilities in medicine and chemistry which will provide a strict test for our ideas and results.

Planned Impact

It is difficult to think of an area of human activity that has not been profoundly changed by the relentless flow of data. Large, complex, heterogeneous data sets are now ubiquitous, and the lack of robust, powerful tools capable of dealing with such data is now a serious obstacle to progress.

This proposal is firmly focused on the long-term impact of the proposed research. It is clear to us that the only ideas that will stand the test of time will be those that have been robustly tested on challenging problems emerging from key areas of application. Our proposal has been designed to ensure that our work creates significant impact within academia and far beyond.

It is our ambition to create a seamless pathway from theory to applications that will ensure a lasting and substantial impact of the proposed work. Within academia, we will communicate our results through research papers, conference talks, invited talks, web sites and blogs. To ensure that our methods are realistic, scalable, and useful, we will concentrate on specific real-world problems. We will use our extensive network of scientific and business connections to reach out to potential end users and to identify opportunities for implementing our findings. Our results will be of direct importance in medicine, and are likely to lead to implementations of new procedures or algorithms.

The importance of big data in everyday life in the UK and globally is well documented, and is an important reason behind this call. In creating this proposal we were guided by the long-term needs of the sciences involved, as well as those of broader society. This work will lead to significant results that will be demonstrated in important areas of application where our contribution will have impact well beyond the academic community. This proposal is ambitious and internationally competitive and can establish UK science as a leader in this area. We bring together a number of the key disciplines in EPSRC's portfolio (mathematics, statistics and applied probability, computer science, and chemistry) and will make significant contributions to each. Furthermore, the proposal addresses societal challenges by tackling key problems in medicine and is likely to have an impact on personalised health care. This work has the potential to contribute to the UK economy through possible implementations of the best algorithmic results. Finally, this proposal fits very well with research supported by the EPSRC and other RCUK councils.

We have set aside funds within our budget to be used specifically on activities likely to strengthen the impact of this proposal. They will be used to organise concentrated workshops that will bring together top data scientists, and to arrange small-scale meetings with members of the UBIOPRED consortium and medical practitioners. We will be proactive in investigating new routes to implementation of the most promising results. In this we will be supported by the vast network of thriving collaborations with business and industry that exists at the University of Southampton.


Smyth, C (2016) The shape of the Zooniverse
Description 1. We have implemented a novel technique based on persistent homology to analyse CT scans of lungs of patients with COPD. The results are very promising as they demonstrate a correlation between topological characteristics derived from the CT scans and clinical information about the patients.

2. We have created a pipeline for computing persistence for a large class of molecules and demonstrated their relevance for detecting solubility.

3. We have completed a study of cyclic homology of crossed product finite type algebras and proved a detailed formula that expresses that homology in terms of the orbifold cohomology of the underlying orbifold.
Exploitation Route We are developing clinical and diagnostic pathways to help clinicians working with COPD patients.
Sectors Chemicals,Healthcare,Government, Democracy and Justice,Pharmaceuticals and Medical Biotechnology
Description We are developing a new programme to enable patients to monitor the status of their asthma and similar conditions and to communicate their data to their GPs. This project is in the early stages of development and this description will be updated in later submissions.
First Year Of Impact 2017
Sector Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types Societal
Description Applied Algebraic Topology (LMS) 
Organisation Queen Mary University of London (QMUL)
Department School of Mathematical Sciences
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution This is a collaborative research network established to support work in applied algebraic topology.
Collaborator Contribution We have jointly organised a number of research meetings.
Impact We have jointly organised seven research meetings to date.
Start Year 2014
Description Cafe Scientifique talk 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact I was invited to give a popular talk based on my current research. The title was "Measuring the world: From Pythagoras to Big Data".
Year(s) Of Engagement Activity 2016
Description STEM for Britain 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact STEM for Britain is a national poster competition for early career scientists, mathematicians and engineers. My team submitted a poster describing our early findings on using persistent homology to classify CT scans of lungs. We were very pleased that the poster was selected for the final, and it was presented on 13 March 2017 at Westminster. The poster was presented by Dr Francisco Belchi-Guillamon and it attracted a lot of interest from the participating audience. The final was very well attended.
Year(s) Of Engagement Activity 2017
Description Southampton Science and Engineering Festival 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact The Southampton Science and Engineering Festival (SOTSEF) is organised annually by the University of Southampton. The 2017 edition was the fifteenth time the festival took place, and it was the largest event in the history of SOTSEF. It was timed to coincide with the British Science Festival. The PI was invited to give a talk on "Measuring the World: from Pythagoras to Big Data". The open day is attended by thousands of participants from across the region, including school children, prospective and current students, parents and interested members of the general public. The lecture filled the lecture theatre to capacity (about 300).
Year(s) Of Engagement Activity 2017