Data-driven articulatory modelling: foundations for a new generation of speech synthesis

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Technology to automatically generate artificial speech (speech synthesis) has come to sound natural enough within the past five years that its use has widened dramatically. Leaders in industry have integrated text-to-speech (TTS) systems into useful real-world applications, such as automated call-centres and call routing, telephone-based information systems (e.g. telephone banking or news services), readers for the visually impaired, and hands-free interfaces, such as car navigation systems.

However, in spite of this success, state-of-the-art TTS systems are still severely limited in terms of control. In short, we can readily control what synthesisers say, but not how they say it. Therefore, although such systems are suitable for giving factual information in speech form, they are completely inadequate where a high level of expressiveness is required. By expressiveness we mean the ability to indicate questions or emphasis on selected words, or to convey emotion. Furthermore, the process of generating new synthetic voices is costly and labour-intensive. It is the aim of this project to develop an alternative to current speech synthesis technology with a comparable level of intelligibility and naturalness, but which affords far greater flexibility and control.

The current dominant method, unit selection, uses large collections of pre-recorded speech to perform synthesis by merely gluing together appropriate fragments in sequence. There is in effect little or no modelling of speech involved. In contrast, this project aims to develop a new model which is trained on pre-recorded speech and interprets it in a novel way: on the basis of its underlying articulation. The aim of this model is to produce synthetic speech which not only retains the qualities of the original speech used for training, but which is also much more versatile and therefore has the potential to be used in new and exciting ways.
 
Description At the time this project began, the dominant method for text-to-speech synthesis, whereby a computer is made to convert text to audible artificial speech, was called unit selection. This method relies on gluing together fragments of speech carefully chosen from several hours of recordings of a real human talking. The benefits of the approach are its simplicity and that it sounds exactly like the human that made the original recordings. The downsides, though, are the limited scope for changing the qualities of the synthesised speech, and the expense of making large numbers of high quality audio recordings for each new synthetic voice.

This project has pursued a different approach to synthesising speech, which is generally termed statistical parametric synthesis. Instead of merely gluing together pre-recorded snippets of speech, this approach applies powerful statistical models to examples of spoken sentences in order to learn how to produce new speech. For example, the model will learn how the underlying sounds of English combine to produce a word, and how words combine to produce a natural-sounding sentence. Over the course of this project, this approach has rapidly gained popularity, and research in this direction has intensified around the world.

However, though the new statistical models indeed offer a great deal more flexibility in theory, in practice they are hugely complex and can be unwieldy to control, so exploiting that flexibility can be difficult. The major aim of this project has been to address this problem and to find ways to incorporate extra information into the statistical model, which can in turn be used to control and manipulate synthetic speech in a straightforward, transparent way. Specifically, we have sought to incorporate information about the human speech production mechanism (i.e. the articulators, such as the tongue, lips and jaw, and the vocal cords).

The key findings of this project are demonstrations of how this may be successfully achieved. First, it has been demonstrated that knowledge about the vibrations of the vocal cords can be incorporated in order to improve the voice quality of the synthesised speech, as well as offering explicit control over how the voice sounds. Second, it has been found possible to incorporate information about movements of the mouth, and then to control synthesis in terms of mouth movements. As examples of this control, we have demonstrated changing one sound to another, thus changing the identity of a word, for example changing "bed" to "bad", simply by changing the position of the model's tongue. We have also shown it is possible to create speech sounds that are completely new and which match the general quality of the synthetic voice. This means the accent of the speech synthesiser may be modified, or foreign sounds may be incorporated seamlessly into the synthetic speech, allowing the synthesiser to speak in multiple languages and accents with the same voice.

Though this project has dealt primarily with articulatory data as the extra information, the general approach has since been expanded to work with other representations. For example, work is currently being undertaken to capture the noise environment as additional information for the synthesiser. When humans talk in a noisy environment, they are known to change the way they speak. Ideally, a computer synthesiser will do the same to make synthetic speech more intelligible in varying noise conditions.

Beyond speech synthesis alone, an additional key finding of the project is a method to accurately predict articulatory movements from text. This is especially useful, for example, in applications such as animated talking heads and computer animation for films and games.

As a final key finding, the research work conducted during this project has confirmed both how useful articulatory data is and how difficult it is to collect in large quantities. Because of this difficulty, only small amounts of articulatory data have previously been released to the research community. Therefore, data recorded as part of this project has been released to other researchers worldwide at no cost.
Exploitation Route These findings have been taken forward in a number of ways. The work on articulatory speech synthesis, supported by a Royal Society of Edinburgh / National Science Foundation China grant, has been further developed by researchers at Edinburgh and at the University of Science and Technology of China.

Work on articulatory modelling has had an impact on speech therapy, through collaboration with Queen Margaret University, in part supported by the EPSRC project Ultrax.

Work on statistical speech synthesis has resulted in a new area, voice banking, which has enabled the construction of natural-sounding, personalised synthetic voices from recordings of speech from people with disordered speech due to conditions such as Parkinson's disease or motor neurone disease. These synthetic voices are used in assistive technology devices that allow people with these conditions to communicate more easily and effectively.
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software),Healthcare
URL http://www.mngu0.org
 
Description The speech synthesis techniques developed in this project have been integrated in the Festival and HTS open-source toolkits. The mngu0 corpus is a collection of articulatory data of different forms (electromagnetic articulography, 3D MRI, video, 3D scans of upper/lower jaw, audio etc.) acquired from one male British English speaker. This data provides a unique resource that shows how a typical talker's mouth is used when producing speech. For example, the 3D MRI head scans together with the dental scans provide a detailed measurement of the anatomy of the speaker, while electromagnetic articulography captures dynamic movements of the speech articulators. In this technique, sensors are placed on the articulators (e.g. tongue, lips and jaw), and the movements of these points are recorded. The mngu0 corpus contains the largest amount of speech recorded in this way in one sitting that is yet available (over an hour of continuous speech). Taken together, the different modalities of data in this corpus offer an unparalleled resource to support research into speech production and for developing multiple speech technology applications by incorporating knowledge of human speech production. This corpus has been released via a dedicated, forum-style web site under a licence that allows it to be used free of charge for research purposes. Beneficiaries: Researchers working in the field of speech production research, and speech technology applications.
First Year Of Impact 2010
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Cultural,Societal,Economic
 
Title The Festival Speech Synthesis System 
Description Festival offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number of APIs: from shell level, through a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multilingual (currently British and American English, and Spanish), though English is the most advanced. Other groups release new languages for the system. Full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). The software was first released in the 1990s, and has been under continuous development, improvement, and maintenance since then. v2.1 was released in November 2010.
Type Of Technology Software 
Open Source License? Yes  
Impact Festival is distributed by default in a number of standard Linux distributions, including Arch Linux, Fedora, CentOS, RHEL, Scientific Linux, Debian, Ubuntu, openSUSE, Mandriva, Mageia and Slackware, and can easily be installed on any Linux distribution that supports apt-get. More recently, our work on statistical parametric speech synthesis and the algorithms for adaptation have been incorporated into the HTS toolkit, which integrates with Festival; one of the HTS coordinators (Yamagishi) is from Edinburgh. These toolkits are the most used open-source speech synthesis systems and have also formed the high-performing baseline systems for the international Blizzard evaluation of commercial and research speech synthesis, which is also organised by Edinburgh.
URL http://www.cstr.ed.ac.uk/projects/festival/
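
As an illustration of the shell-level interface mentioned in the Festival description, the following is a minimal sketch in Python that passes text to the festival command-line program in its text-to-speech mode (festival --tts). It assumes Festival is installed and on the PATH. The helper names build_tts_command and speak are hypothetical and are not part of Festival itself.

```python
import shutil
import subprocess


def build_tts_command(executable="festival"):
    """Return the argument list for Festival's text-to-speech mode.

    Hypothetical helper for this sketch; not part of Festival.
    """
    # --tts makes Festival speak its input as plain text
    # instead of interpreting it as Scheme commands.
    return [executable, "--tts"]


def speak(text, executable="festival"):
    """Speak `text` with the festival CLI.

    Returns False if the executable cannot be found (assumption: Festival
    is normally installed as `festival` on the PATH), True otherwise.
    """
    if shutil.which(executable) is None:
        return False
    # The text is supplied on standard input, as in:
    #   echo "Hello" | festival --tts
    subprocess.run(
        build_tts_command(executable),
        input=text.encode("utf-8"),
        check=True,
    )
    return True


if __name__ == "__main__":
    if not speak("Hello from Festival."):
        print("festival executable not found on PATH")
```

The same text could instead be sent through the Scheme command interpreter or the C++ library; the command-line route is used here only because it needs no additional bindings.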