Towards visually-driven speech enhancement for cognitively-inspired multi-modal hearing-aid devices (AV-COGHEAR)

Lead Research Organisation: University of Stirling
Department Name: Computing Science and Mathematics

Abstract

Current commercial hearing aids use a number of sophisticated enhancement techniques to try to improve the quality of speech signals. However, today's best aids fail to work well in many everyday situations. In particular, they fail in busy social situations where there are many competing speech sources, and they fail if the speaker is too far from the listener and their speech is swamped by noise. We have identified an opportunity to solve this problem by building hearing aids that can 'see'.

This ambitious project aims to develop a new generation of hearing aid technology that extracts speech from noise by using a camera to see what the talker is saying. The wearer of the device will be able to focus their hearing on a target talker and the device will filter out competing sound. This ability, which is beyond that of current technology, has the potential to improve the quality of life of the millions suffering from hearing loss (over 10m in the UK alone).

Our approach is consistent with normal hearing. Listeners naturally combine information from both their ears and eyes: we use our eyes to help us hear. When listening to speech, the eyes follow the movements of the face and mouth, and a sophisticated, multi-stage process uses this information to separate speech from noise and fill in any gaps. Our hearing aid will act in much the same way. It will exploit visual information from a camera (e.g. using a Google Glass-like system), and novel algorithms for intelligently combining audio and visual information, in order to improve speech quality and intelligibility in real-world noisy environments.

The project brings together a critical mass of researchers with the complementary expertise necessary to make the audio-visual hearing aid possible. It will combine new, contrasting approaches to audio-visual speech enhancement developed by the Cognitive Computing group at Stirling and the Speech and Hearing Group at Sheffield: the Stirling approach uses the visual signal to filter out noise, whereas the Sheffield approach uses the visual signal to fill in 'gaps' in the speech. The vision processing needed to track a speaker's lip and face movements will use a revolutionary 'bar code' representation developed by the Psychology Division at Stirling. The MRC Institute of Hearing Research (IHR) will provide the expertise needed to evaluate the approach with listeners with real hearing loss. Phonak AG, a leading international hearing aid manufacturer, will provide the advice and guidance necessary to maximise the potential for industrial impact.

The project has been designed as a series of four workpackages that address the key research challenges related to each component of the device's design. These challenges have been identified by preliminary work at Sheffield and Stirling, and include developing improved techniques for visually-driven audio analysis; designing better metrics for weighting audio and visual evidence; and developing techniques for optimally combining the noise-filtering and gap-filling approaches. A further key challenge is that, for a hearing aid to be effective, the processing cannot delay the signal by more than 10ms.

In the final year of the project a fully integrated software prototype will be clinically evaluated using listening tests with hearing-impaired volunteers in a range of noisy, reverberant environments. Evaluation will use a new purpose-built speech corpus designed specifically for testing this new class of multimodal device. The project's clinical research partner, the Scottish Section of MRC IHR, will provide advice on experimental design and analysis throughout the trials. Industry leader Phonak AG will provide advice and technical support for benchmarking real-time hearing devices. The final clinically-tested prototype will be made available to the whole hearing community as a testbed for further research, development, evaluation and benchmarking.

Planned Impact

This multi-disciplinary project has been designed to have impact beyond the academic environment:

*Sufferers of hearing loss*

The aim of the proposal is to demonstrate a totally new class of hearing device that, by using visual input, is able to deliver an unparalleled level of speech intelligibility in noisy situations where current audio-only hearing aids are known to fail. The proposal, by supplying the enabling research for this technology, has potential for significant long-term societal impact. Reduced ability to understand speech in noise is one of the most debilitating symptoms of hearing loss. An effective hearing device would improve the quality of life of millions of hearing loss sufferers (over 10m in the UK alone [1], receiving around £300M of treatment from the NHS annually [2]). Notably, even mild age-related hearing loss (something which affects us all) can cause speech to become hard to understand in situations where many people are speaking at the same time (e.g., social gatherings), or where speech is heard from a distance and degraded by reverberation (e.g., classrooms). Even a small improvement in performance could be the difference that allows someone to continue in their job (e.g., a teacher in a noisy classroom) or to remain socially active, avoiding potential isolation and depression. Note that, as the device will work by complementing the visual processing performed routinely in speech perception, it will be of particular benefit to hearing loss sufferers who are also visually impaired.

*The hearing aid industry*

A new class of audio-visual (AV) hearing aid would have an impact on the hearing aid industry itself: demand for AV aids would rapidly displace demand for inferior audio-only devices. There are clear precedents for hearing science rapidly transforming hearing technology, e.g., multiple-microphone processing and frequency compression have been commercialised to great effect. We foresee AV processing as the next step forward. Previous barriers to AV processing are falling: reliable wireless technology frees the computation from having to be performed on the device itself, and wearable computing devices are becoming sufficiently powerful to perform real-time face tracking and feature extraction. AV aids will also impact industry standards for hearing aid evaluation and clinical standards for hearing loss assessment. Plans for realising these industrial impacts (including through an international workshop, an AV hearing device challenge/competition and open-source dissemination) are detailed in the Pathways to Impact document.

*Applications beyond hearing aids*

The project has potential impact in other speech processing applications, including:

-Cochlear implants (CI). CI users have even more severe problems coping with noise. With further research the technologies we are proposing could be used directly in CI signal processing.

-Telecommunications. Here we imagine video signals captured at the transmission end being used to filter and enhance acoustic signals arriving at the receiver end. Note that this could be useful in teleconferencing, built into conventional audio-only receivers, or for people with visual impairment who are unable to see visual cues directly.

-Speech enhancement for normal hearing. AV speech intelligibility enhancement may also be useful for users with no hearing loss, e.g., where ear defenders are being worn - factories, emergency response, military, etc.

[1] http://www.patient.co.uk/doctor/deafness-in-adults
[2] http://www.publications.parliament.uk/pa/cm201415/cmhansrd/cm140624/text/140624w0002.htm#140624w0002.htm_wqn4
 
Description Cognitive SenticNet and Multimodal Topic Structure Parsing Techniques for Chinese and English Languages (joint project with Tsinghua University, Beijing, China, under the RSE-NNSFC Joint Research Scheme)
Amount £12,000 (GBP)
Organisation The Royal Society of Edinburgh (RSE) and The National Natural Science Foundation of China (NNSFC) 
Sector Public
Country United Kingdom of Great Britain & Northern Ireland (UK)
Start 07/2014 
End 03/2017
 
Title A Novel Enhanced Visually-Derived Wiener Filtering Approach for Speech Enhancement 
Description A novel two-stage enhanced Visually-derived Wiener Filtering (EVWF) approach for speech enhancement is being developed. The first stage employs a neural-network-based, data-driven approach to approximate clean audio features from temporal visual-only features (lip reading). The second stage proposes the novel use of an inverse filterbank (FB) transformation in place of the cubic spline interpolation employed in the state-of-the-art visually-derived Wiener Filtering (VWF) approach. The novel EVWF has demonstrated an enhanced capability to estimate the clean high-dimensional audio power spectrum from low-dimensional visually-derived audio filterbank features, compared to the state-of-the-art VWF approach. A short code sketch illustrating the two-stage idea follows this entry.
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact Ongoing performance evaluation in reverberant domestic environments with multiple real-world sound sources (using the benchmark CHiME2 and audio-visual GRID corpora) has shown that the proposed EVWF method is more reliable than both the state-of-the-art visually-derived Wiener filtering approach and audio-only speech enhancement methods such as spectral subtraction and log-minimum mean-square error (log-MMSE), with significant performance improvements demonstrated in terms of quantitative and qualitative speech enhancement measures. 
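The following is a minimal NumPy sketch of the two-stage EVWF idea described above: a visually-derived estimate of the clean log-filterbank vector is mapped back to a full-resolution power spectrum via the filterbank's pseudo-inverse (the inverse FB transformation, in place of cubic spline interpolation), and a Wiener gain is then built and applied. The filterbank construction, dimensions and function names are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of the two-stage EVWF idea (illustrative, not the
# project's code). Stage 1 (a trained visual-to-audio network) is
# assumed to have produced `visual_fb_estimate` upstream.
import numpy as np

def mel_filterbank(n_filters, n_fft_bins):
    """Toy triangular filterbank matrix (n_filters x n_fft_bins)."""
    fb = np.zeros((n_filters, n_fft_bins))
    edges = np.linspace(0, n_fft_bins - 1, n_filters + 2).astype(int)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i, lo:mid + 1] = np.linspace(0, 1, mid - lo + 1)
        fb[i, mid:hi + 1] = np.linspace(1, 0, hi - mid + 1)
    return fb

def evwf_frame(noisy_power, visual_fb_estimate, noise_power, fb):
    """Enhance one frame's power spectrum.

    visual_fb_estimate: clean log-filterbank vector predicted from
    lip features (stage 1). Stage 2 inverts the filterbank with its
    pseudo-inverse and applies the resulting Wiener gain.
    """
    clean_fb_power = np.exp(visual_fb_estimate)      # undo the log
    fb_pinv = np.linalg.pinv(fb)                     # inverse FB transform
    clean_power = np.maximum(fb_pinv @ clean_fb_power, 1e-10)
    wiener_gain = clean_power / (clean_power + noise_power)
    return wiener_gain * noisy_power
```

In a real-time device the noise power would come from a running noise tracker and the gain would be applied to the noisy STFT frame before resynthesis; here both are simply passed in.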
 
Title Novel Long Short-Term Memory based Lip-Reading For Speech Enhancement in Cognitively-Inspired Multi-Modal Hearing-Aids 
Description A novel Long Short-Term Memory (LSTM) based, data-driven approach has been developed to approximate clean audio features from temporal visual features (lip reading). A short code sketch follows this entry.
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact We have carried out preliminary simulation experiments using a new audiovisual (AV) dataset developed for AV speech mapping, based on the benchmark AV GRID corpus (originally developed by our project partners at Sheffield for speech perception and automatic speech recognition research). Comparative results show that the proposed LSTM AV mapping model delivers significantly more accurate clean audio feature estimation than our previously reported benchmark Multi-Layer Perceptron (MLP) based AV speech modelling approach. 
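As an illustration of the kind of model described above, here is a minimal PyTorch sketch of an LSTM that maps a window of visual (lip) feature vectors to the aligned clean log-filterbank audio frame. The layer sizes, feature dimensions (50 DCT coefficients, 23 filterbank channels) and training snippet are illustrative assumptions, not the project's reported configuration.

```python
# Minimal LSTM visual-to-audio mapping sketch (illustrative dimensions).
import torch
import torch.nn as nn

class VisualToAudioLSTM(nn.Module):
    def __init__(self, n_dct=50, n_fb=23, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_dct, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_fb)

    def forward(self, visual_seq):
        # visual_seq: (batch, n_frames, n_dct) window of lip features
        h, _ = self.lstm(visual_seq)
        return self.out(h[:, -1, :])  # predict the aligned audio frame

# Regression training step: minimise MSE between the predicted and true
# clean log-filterbank vectors (the 28-frame window echoes the largest
# visual-to-audio pairing in the project's dataset entry below).
model = VisualToAudioLSTM()
visual = torch.randn(8, 28, 50)   # batch of visual context windows
target = torch.randn(8, 23)       # matching clean audio features
loss = nn.functional.mse_loss(model(visual), target)
loss.backward()
```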
 
Title Audiovisual Dataset for audiovisual speech mapping using the Grid Corpus 
Description This new publicly available dataset is based on the benchmark audio-visual GRID corpus, which was originally developed by our project partners at Sheffield for speech perception and automatic speech recognition research. The new dataset contains a range of joint audiovisual vectors: 2D-DCT visual features paired with the equivalent audio log-filterbank vectors. The visual vectors were extracted by tracking and cropping the lip region in 1,000 videos from each of five GRID speakers (5,000 videos in total), then transforming each cropped region with the 2D-DCT. The audio vectors were extracted by windowing the audio signal and transforming each frame into a log-filterbank vector. The visual signal was then interpolated to match the audio frame rate, and a number of large datasets were created, with frames shuffled randomly to prevent bias and with different pairings, ranging from one visual frame per audio frame up to 28 visual frames per audio frame. A short feature-extraction sketch follows this entry. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact This new publicly available dataset has been developed as a benchmark for the speech enhancement community. It enables researchers to evaluate how well audio speech can be estimated using visual information only. Specifically, it allows novel speech enhancement algorithms (including those based on advanced machine learning) to be evaluated for their potential to exploit visual cues for speech enhancement. 
URL http://hdl.handle.net/11667/81
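To make the construction concrete, here is a minimal Python sketch of how such feature pairs could be derived: a 2D-DCT of a cropped grayscale lip image paired with the log-filterbank vector of the aligned audio frame, plus a helper mirroring the dataset's one-to-one up to 28-to-one visual/audio pairings. The crop size, coefficient-selection scheme and function names are illustrative assumptions rather than the published extraction pipeline.

```python
# Illustrative feature extraction for audiovisual speech mapping.
import numpy as np
from scipy.fft import dctn

def lip_dct_features(lip_region, n_coeffs=50):
    """2D-DCT of a cropped grayscale lip image (>= 10x10 pixels);
    keeps the low-order coefficients (zig-zag scan omitted for brevity)."""
    coeffs = dctn(lip_region, type=2, norm='ortho')
    return coeffs[:10, :10].ravel()[:n_coeffs]

def log_filterbank(frame, fb_matrix, eps=1e-10):
    """Log-filterbank vector of one Hamming-windowed audio frame.
    fb_matrix: (n_filters x n_fft_bins) triangular filterbank."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    return np.log(fb_matrix @ spectrum + eps)

def make_pairs(visual_feats, audio_feats, n_visual=1):
    """Pair n_visual consecutive (rate-matched) visual frames with each
    audio frame, from 1-to-1 up to 28-to-1 as in the dataset."""
    pairs = []
    for t in range(n_visual - 1, len(audio_feats)):
        v = np.concatenate(visual_feats[t - n_visual + 1:t + 1])
        pairs.append((v, audio_feats[t]))
    return pairs
```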
 
Description MRC Network (Cardiff) 
Organisation Cardiff University
Country United Kingdom of Great Britain & Northern Ireland (UK) 
Sector Academic/University 
PI Contribution We are collaborators on an MRC Network grant for a Hearing Aid Research Network, contributing presentations and discussions on the topic of disruptive technologies for hearing aids.
Collaborator Contribution Attendance, presentations at meetings, etc.
Impact No outputs yet.
Start Year 2015
 
Description British Society of Audiology Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact We will be attending the British Society of Audiology Conference, 25-27 April 2016, which brings together researchers, clinical practitioners, and industry presenters, to present an initial poster titled "Audiovisual Speech Processing: Exploiting visual features for joint-vector modelling". The aim is to present the new project to a wider audience.
Year(s) Of Engagement Activity 2016
URL https://www.eventsforce.net/fitwise/frontend/reg/thome.csp?pageID=127483&eventID=323&eventID=323
 
Description CHAT-2017 Workshop, Stockholm, August 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We have organised a 1-day international workshop entitled 'Challenges for Hearing Assistive Technology (CHAT-2017)' that will run as a satellite event of the large, week-long Interspeech Conference in Stockholm in August 2017. The workshop has been granted recognition and financial support from the International Speech Communication Association (ISCA). We have recruited a Scientific Committee of 25 leading international researchers representing both academia and the hearing aid industry. The workshop will serve as a meeting place for the hearing aid industry and researchers working in speech technology. We hope that interaction between these communities can stimulate fresh ideas for new directions in hearing assistive technology. The workshop will also provide an opportunity to promote the work being conducted under our EPSRC-funded AV-COGHEAR project.
Year(s) Of Engagement Activity 2017
URL http://spandh.dcs.shef.ac.uk/chat2017/
 
Description Impact for Access - Stirling University 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Approximately 70 school pupils (aged 14 to 16) visited the research organisation (University of Stirling) to learn more about studying Computing Science. As part of this, an image processing tutorial and interactive demo was given to small groups, which prompted questions and discussions about both the direct research topic (signal processing) and studying Computing Science more generally.
Year(s) Of Engagement Activity 2016
 
Description MRC Hearing Aid Network (Cardiff) Presentation 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Discussed the work of the Stirling project on behalf of Prof. Hussain at the Cardiff MRC Network meeting (hosted by Prof. Culling).
Year(s) Of Engagement Activity 2015
URL https://www.mrc.ac.uk/documents/pdf/hearing-aid-research-networks/
 
Description MRC-EPSRC Network workshop meeting on Microphone Technologies for Hearing Aids, Cardiff, September 21, 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Along with other speakers (Daniel Robert, Allan Belcher, and John Culling), Ahsan Adeel presented the EPSRC AV-COGHEAR vision, objectives, ongoing and future work, and immediate challenges, in a talk entitled "Video/audio-visual information in hearing aids: a brief review and some future directions". The meeting was attended by around 30-40 people, consisting of academics from other funded projects and staff from the EPSRC and MRC. The meeting aimed to review grant impact mechanisms and discuss impact emphasis and requirements.
Year(s) Of Engagement Activity 2017
 
Description MRC-EPSRC Network workshop meeting on Microphone Technologies for Hearing Aids, organized at Stirling, 2 Feb 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact AV-COGHEAR PI Prof. A. Hussain (Stirling), jointly with MRC Network PI Prof. J. Culling (Cardiff), organized the MRC-EPSRC Network meeting at Stirling as a one-day interactive workshop, attended by approximately 40 participants from multi-disciplinary backgrounds. All AV-COGHEAR project partners attended: PI Prof. Hussain; CI Dr J. Barker (Sheffield); the project postdocs Dr A. Adeel, Dr R. Marxer and Dr A. Abel; the project's clinical partner, Dr W. Whitmer (MRC IHR Glasgow); and the industry partner, Dr P. Derleth (Phonak Hearing). All actively participated in discussions to explore and develop synergies between AV-COGHEAR and other related MRC Network and EPSRC project partners. In particular, plans for developing and sharing audio-visual speech data were discussed and agreed, along with a detailed proposal for jointly organizing a first-of-its-kind international workshop (chaired by A. Hussain, J. Barker, J. Culling and J. Hansen) on "Challenges for Hearing Assistive Technology (CHAT-2017)", as part of INTERSPEECH 2017 in Stockholm, Sweden, on 19th August 2017.
Year(s) Of Engagement Activity 2017
 
Description MRC/EPSRC workshop on hearing aid technology research, June 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Dr Jon Barker presented the aims and initial progress of the project at a joint EPSRC/MRC workshop on hearing aid technology research. The meeting was attended by around 60-70 people consisting of academics from other funded projects and staff from the EPSRC and MRC.

The meeting was useful in that it allowed us to identify links with ongoing projects working with audio-visual speech data.
Year(s) Of Engagement Activity 2016