Article - Issue 20, August/September 2004

Lip-reading by phone

Fabian Acker


Breakthroughs in telecommunications technology will soon enable hard-of-hearing people to ‘lip-read’ telephone conversations. Acker reports on RNID’s advanced tools that allow people to use the telephone via an avatar that ‘mouths’ incoming conversations and mimics the phonemes used in most European languages.

Hard-of-hearing people may soon be able to get help understanding a telephone conversation from an electronic face that they can lip-read as it appears to speak the words they can't quite hear. The Royal National Institute for Deaf People (RNID) has just announced the development of a powerful tool that will allow the hard-of-hearing (about 9 million in the UK, a figure expected to rise to 10 million in the next ten years) to use the telephone routinely, by providing an avatar, an electronically generated face, that mouths an incoming conversation so that the viewer can lip-read it.

This provides a more cost-effective solution than communicating via videophones, which require compatible equipment at both ends, and offers an alternative to communicating from a phone to a textphone (or vice versa) through a relay service such as RNID Typetalk.

The key lies in an electronic face displayed on a computer screen, programmed to turn words into lip movements. Viewing the avatar in conjunction with some audible cues, the viewer/listener will be able to understand a conversation with almost the same degree of confidence as with conventional face-to-face lip-reading.

The basis of the system is standard speech-recognition software, which analyses the incoming speech into phonemes, the individual units of sound from which words are built in most European languages (English has 45 phonemes). The phonemes are then converted into visemes, their visual equivalents, and displayed on the screen by the avatar, which moves its lips and mouth to correspond with the incoming conversation. But to build that conversion from phonemes to visemes, a human speaker first had to be videoed speaking, and her (it was a woman) face and lip movements captured electronically.
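As a rough illustration of the mapping step, the Python sketch below shows how a phoneme-to-viseme lookup might work. The phoneme symbols, viseme names and groupings are illustrative assumptions, not the project's actual tables; the key point is that several phonemes that sound different can share one viseme because they look alike on the lips.

# Hypothetical phoneme-to-viseme lookup. The symbols and groupings
# here are illustrative, not those used by the project.
PHONEME_TO_VISEME = {
    "p": "lips_closed",   "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",     "v": "lip_teeth",
    "th": "tongue_teeth", "dh": "tongue_teeth",
    "w": "lips_rounded",  "uw": "lips_rounded",
    "aa": "jaw_open",     "ae": "jaw_open",
}

def phonemes_to_visemes(phonemes):
    """Map a recognised phoneme sequence to the viseme sequence
    that drives the avatar's lip and mouth movements."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# The word "both" (/b/ /ow/ /th/) would drive three mouth shapes:
print(phonemes_to_visemes(["b", "ow", "th"]))
# -> ['lips_closed', 'neutral', 'tongue_teeth']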

To do this, 28 motion sensors were attached to the lips and to the parts of the face surrounding them. An induction sensor was also attached to the tongue so that its motion could be detected even while the mouth was closed. The sensors' 3-D coordinates were recorded at a rate of 60 frames per second (see Figure 1).
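In data terms, each captured frame might be represented along the following lines; the field names and layout are assumptions for the sake of illustration.

# Hypothetical layout for one motion-capture frame: 28 facial markers
# plus the tongue induction sensor, each as a 3-D coordinate, sampled
# at 60 frames per second.
from dataclasses import dataclass
from typing import List, Tuple

Coord3D = Tuple[float, float, float]   # (x, y, z)

FRAME_RATE = 60  # capture rate, frames per second

@dataclass
class CaptureFrame:
    timestamp: float               # seconds since start of recording
    face_markers: List[Coord3D]    # 28 sensors on and around the lips
    tongue: Coord3D                # induction sensor; works with the mouth closed

def frame_time(frame_index: int) -> float:
    """Timestamp of the n-th frame at 60 fps."""
    return frame_index / FRAME_RATE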

All the movements made in the lower half of the face when pronouncing a phoneme have now been recorded, and these are run in sequence to produce lip movements that can be 'read' by a lip-reader. A 200 ms delay built into the speech reception gives the processor time to search for and recognise each phoneme, and to move the lips in synchronism with the speech.
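The synchronising delay can be pictured as a short queue: each slice of incoming audio is held back while the recogniser works, then released together with the matching lip movement. The sketch below assumes one audio slice per animation frame; the function names are hypothetical.

# Sketch of the 200 ms synchronisation buffer. At 60 frames per
# second, 200 ms corresponds to 12 frames of delay.
from collections import deque

DELAY_MS = 200
FRAME_RATE = 60
DELAY_FRAMES = DELAY_MS * FRAME_RATE // 1000   # = 12 frames

buffer = deque()   # (audio_slice, viseme) pairs awaiting release

def process_frame(audio_slice, recognise, play, animate):
    """Queue each audio slice with its recognised viseme, and release
    the pair once it is DELAY_FRAMES old, so that the delayed sound
    and the avatar's lips come out together."""
    buffer.append((audio_slice, recognise(audio_slice)))
    if len(buffer) > DELAY_FRAMES:
        old_audio, old_viseme = buffer.popleft()
        play(old_audio)       # audio, 200 ms late
        animate(old_viseme)   # lip movement, in step with the audio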

Because the visemes have been recorded in 3-D, the face can be rotated on the screen through a wide angle, allowing the viewer to see all of the movements in a quasi-3-D mode; on a flat screen, the face is often easiest to read in three-quarter profile.
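Since the marker data are three-dimensional, turning the face is simply a matter of rotating the coordinates before drawing them on the flat screen. A minimal sketch, assuming a rotation about the vertical axis and a simple orthographic projection (both conventions are assumptions):

# Rotate a 3-D face marker about the vertical axis, then project it
# onto the flat screen.
import math

def rotate_y(point, angle_deg):
    """Rotate a 3-D point (x, y, z) about the vertical (y) axis."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))

def project(point):
    """Simple orthographic projection: drop the depth coordinate."""
    x, y, _z = point
    return (x, y)

# Draw a lip marker in three-quarter profile (roughly 45 degrees):
marker = (10.0, -5.0, 2.0)
print(project(rotate_y(marker, 45)))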

The project, which started in October 2001, is funded by the EU. The R&D is being carried out by KTH, a university in Stockholm; University College London; Viataal, a Dutch Deaf organisation; and Babel Infovox, a commercial organisation, along with RNID. Much of RNID's work has involved arranging for a panel of hard-of-hearing people to use the system and gauge its usefulness.

As a result of the Dutch and Swedish input, the system can work in Dutch and Swedish as well as English.

The general view so far is that the system is helpful but needs to be more accurate. The inaccuracy lies in the speech-recognition software rather than in the conversion into lip movements: whatever phonemes the recogniser delivers, even when they are wrong, are translated into lip movements very efficiently.

As communications move into the broadband era, Synface, as the system is called, will need less hardware at the user's end. At the moment a computer loaded with the appropriate software is needed, but in the three or four years probably required to bring the system into general use, more advanced communications techniques will make this unnecessary.

Fabian Acker

Freelance Science Writer
