
Hard of Hearing
- M-iD, January 2005
Speech recognition can be infuriating, but it is improving fast.
Almost everyone who has used a speech recognition system has been singularly unimpressed, it seems. Yet, this perception of speech technology is rapidly becoming outdated, as many organisations are discovering to their profit.
“People's negative experiences are based on early interactive voice response (IVR) systems, where you were asked to respond to a specific question and were then frustrated that the answer played back wasn't what you said,” says Alan Barr, chief operating officer of Streamdoor and a veteran of the speech recognition industry. “You'd try to book a cinema ticket and be asked what cinema you wanted; you'd say 'Epsom' and it would respond with 'Liverpool'. People immediately thought the technology wasn't up to the job.”
However, the speech recognition systems of even three or four years ago bear little resemblance to those being deployed today. Systems can now boast accuracy rates of up to 99%, far greater flexibility in coping with different accents and many more possible applications, all for less work and less expenditure by the organisations deploying them.
That computers can recognise speech at all is impressive. People speak at different rates, with different dialects and emotions and in different pitches. Even the same person can say the same words in many different ways, and then there is the problem of separating speech from background noise.
There have been two main approaches to speech recognition in the last decade. The first involves 'neural networks' - a branch of artificial intelligence based on computer simulations of the brain that can learn from their experiences. By exposing neural nets to recorded speech and allowing them to reorganise themselves in the same way as neurons in the brain do, researchers have been able to get reasonable recognition of limited sets of words; the nets have even been able to extrapolate language and derive past tenses of verbs from their own self-taught rules.
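How such a network learns a limited word set can be sketched in a few lines. The example below is a toy illustration only, with invented 'acoustic' feature vectors and a made-up four-word vocabulary; real recognisers work from far richer features and vastly more data.

```python
# Toy sketch: a small feed-forward network trained to map acoustic-style
# feature vectors onto a limited vocabulary. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["yes", "no", "stop", "go"]                   # limited word set
N_FEAT, N_HIDDEN, N_WORDS = 12, 16, len(VOCAB)

# Invented feature vectors: one noisy cluster of examples per word.
centres = rng.normal(size=(N_WORDS, N_FEAT))
X = np.vstack([centres[c] + 0.3 * rng.normal(size=(200, N_FEAT)) for c in range(N_WORDS)])
y = np.repeat(np.arange(N_WORDS), 200)

# One hidden layer and a softmax output, trained by plain gradient descent.
W1 = rng.normal(scale=0.1, size=(N_FEAT, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_WORDS)); b2 = np.zeros(N_WORDS)

def forward(X):
    h = np.tanh(X @ W1 + b1)                          # hidden activations
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)        # word probabilities

lr = 0.5
for _ in range(300):
    h, p = forward(X)
    grad = p.copy(); grad[np.arange(len(y)), y] -= 1; grad /= len(y)
    gh = (grad @ W2.T) * (1 - h ** 2)                 # backprop through tanh
    W2 -= lr * (h.T @ grad); b2 -= lr * grad.sum(axis=0)
    W1 -= lr * (X.T @ gh);   b1 -= lr * gh.sum(axis=0)

_, p = forward(X)
print("training accuracy:", (p.argmax(axis=1) == y).mean())
```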
Using this technique, researchers at the University of Southern California have developed a system that they claim is better than human listeners at picking out a limited range of words from background noise. Yet while this approach is elegant and may prove more successful in the long term, it faces considerable obstacles in the short term.
Neural networks only model small parts of the brain, so whether they will be able to interpret the full range of vocabulary, accents and voice tones is debatable. It also takes humans many years of learning, exposure to different people and education to build up this ability; it will take a similar amount of time to train a neural network, even once one is sufficiently powerful.
So commercial systems have instead focused on pattern recognition as the best current means of speech recognition. Pattern recognition is based on 'hidden Markov models' that represent typical speech sounds mathematically, but include some leeway to take into account the differences in how people may speak those sounds. To trigger recognition, the sound must be within a certain range of frequencies and a certain range of energies.
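In outline, and with every probability and sound unit below invented purely for illustration, the hidden Markov approach can be sketched as a chain of hidden 'sound' states emitting observed acoustic frames, decoded with the standard Viterbi algorithm:

```python
# Toy hidden Markov model: hidden states stand in for speech sounds, the
# observations for quantised acoustic frames. Viterbi decoding recovers the
# most likely sound sequence, with leeway for variation built into the
# emission probabilities. All parameters here are random and illustrative.
import numpy as np

states = ["sil", "eh", "p", "s", "m"]        # hypothetical sound units
n_frame_symbols = 8                          # number of quantised frame types

rng = np.random.default_rng(1)
trans = rng.dirichlet(np.ones(len(states)), size=len(states))    # P(next sound | sound)
emit = rng.dirichlet(np.ones(n_frame_symbols), size=len(states)) # P(frame | sound)
start = np.full(len(states), 1.0 / len(states))

def viterbi(frames):
    """Most probable hidden sound sequence for a list of observed frames."""
    logp = np.log(start) + np.log(emit[:, frames[0]])
    back = []
    for f in frames[1:]:
        cand = logp[:, None] + np.log(trans)     # score every transition
        back.append(cand.argmax(axis=0))         # best predecessor per state
        logp = cand.max(axis=0) + np.log(emit[:, f])
    path = [int(logp.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([0, 3, 3, 5, 7, 2]))               # one plausible decoding of six frames
```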
To build up these models of speech sounds, vendors have to record huge databases of speech and analyse them for these sounds. Making it even harder is the fact that many of these sounds are pronounced differently, depending upon the other sounds near them.
Streamdoor's Barr trialled one of the first speech recognition systems in the UK for Argos, which would handle store enquiries from shoppers. He discovered first-hand how much speech sampling was needed for a successful application. “We discovered that the UK, out of all the European countries, has the most dialects with the exception of Italy,” he says. “Dialects make a huge difference. In that testing environment, we had to make 100,000 different calls with different dialects to get an algorithm with 90% recognition.”
Telephonetics' director Paul Welham has a similar story to tell. “There are a large number of hospital sites using speech for call routing in the UK. They have to understand what the general public is going to ask for, but even simple requests aren't easy to predict. We discovered that when people wanted to talk to the bedding department in a hospital, there were 100 ways for them to actually ask for that.”
Careful design of the questions used to elicit responses in speech recognition applications is key to improving accuracy and reducing the amount of sampling required. By phrasing questions in particular ways, it is possible to limit the range of words callers are likely to use, reduce the size of the speech database that needs to be searched and improve both recognition accuracy and the maximum speech rate possible.
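The effect of constraining the caller's answers can be sketched very simply. In the toy example below (the prompts, place names and matching threshold are all invented), each prompt activates only a short list of expected phrases, so a noisy recognition hypothesis has to be matched against a handful of candidates rather than an open vocabulary:

```python
# Toy grammar constraint: each prompt carries its own small set of expected
# answers, and a fuzzy match maps the recogniser's hypothesis onto that set.
import difflib

GRAMMARS = {
    "Which cinema would you like?": ["epsom", "kingston", "wimbledon", "croydon"],
    "How many tickets?": ["one", "two", "three", "four", "five"],
}

def interpret(prompt, hypothesis):
    """Map a possibly misrecognised hypothesis onto the prompt's small grammar."""
    allowed = GRAMMARS[prompt]
    match = difflib.get_close_matches(hypothesis.lower(), allowed, n=1, cutoff=0.5)
    return match[0] if match else None            # None: re-prompt the caller

print(interpret("Which cinema would you like?", "epsam"))   # -> 'epsom'
print(interpret("How many tickets?", "tree"))               # -> 'three'
```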
But pattern-matching approaches to speech recognition require one thing in great abundance: real-time number-crunching capability. Combined with the huge amounts of speech data that vendors have been collecting to improve their models, it is the great leaps in processing power made available at low cost over the last few years that have allowed more sophisticated algorithms and larger speech databases to be used in real time, and so made speech recognition viable.
As well as improvements to the underlying hardware and software, changes within the industry have made speech recognition more viable. In particular, open standards have made it easier, and cheaper, for integrators and end-users to put together applications with a voice interface (see box, VoiceXML versus SALT). Vendors have also begun to put together packaged applications oriented to particular vertical industries, complete with appropriate vocabulary databases. And the increasing number of deployments of speech technology has made it easier for vendors and developers to build up databases of speech and applications that can be reused by other customers.
The result is that speech recognition is being deployed in many more applications than it used to be. Simon Edwards, director of international marketing at Intervoice, highlights a few applications that his company has implemented. “Customer satisfaction surveys at the end of call centre calls get far higher response rates because of their context and it is far cheaper if speech technology is used. Manufacturers that want information on their end users are able to get more data if they provide a telephone number for product registration than if they provide an easily disposable registration card. The vast majority of calls to helpdesks are for password and PIN information, something easily automated with a speech system. And phone directories, whether customer-facing or employee-facing, are relatively easy to speech enable.”
Call routing, ticket booking and automated helpdesk enquiries are the main applications at the moment, but companies such as IBM and Aungate, a division of search engine company Autonomy, have also started to put speech recognition to use in other areas.
IBM is trialling the combination of speech recognition and search engine technology to search call centre databases during calls with customers: the speech recognition software picks out keywords in the caller's speech, finds relevant information and displays it on the operator's screen before the caller has even finished explaining what they want.
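In rough outline (and with the article titles and keywords below invented for the purpose), the idea is to score knowledge-base entries against whatever words have been recognised so far:

```python
# Sketch of keyword spotting against a knowledge base: as the recogniser
# streams words, articles whose keywords appear in the partial transcript are
# ranked and surfaced for the operator. All content here is invented.
KNOWLEDGE_BASE = {
    "Resetting a forgotten password": {"password", "reset", "locked"},
    "Changing a delivery address":    {"delivery", "address", "change"},
    "Cancelling an order":            {"cancel", "order", "refund"},
}

def suggest(transcript_so_far):
    """Rank articles by how many of their keywords appear in the transcript."""
    words = set(transcript_so_far.lower().split())
    scored = [(len(keywords & words), title) for title, keywords in KNOWLEDGE_BASE.items()]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

# Suggestions update before the caller has finished speaking.
print(suggest("hi i think my account is locked and i need a new password"))
```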
Similarly, Aungate's technology tries to get the general meaning of the caller's conversation and then looks to see if other similar calls have been received recently. It means trends can be picked up quickly and information relevant to both management and caller can be passed on, making speech recognition a valuable business intelligence tool.
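A loose sketch of that trend-spotting idea (the call summaries and similarity threshold below are invented, and Aungate's actual technology is far more sophisticated) might compare each new call against recent ones by word overlap:

```python
# Toy similarity check: reduce each call summary to word counts and flag
# earlier calls whose cosine similarity crosses a threshold, so recurring
# issues surface quickly. Summaries and threshold are illustrative only.
from collections import Counter
import math

def similarity(a, b):
    """Cosine similarity between the word counts of two call summaries."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

recent_calls = [
    "broadband connection drops every evening",
    "bill shows a charge i do not recognise",
    "connection keeps dropping at night",
]
new_call = "my broadband keeps dropping in the evening"
print([c for c in recent_calls if similarity(new_call, c) > 0.3])
```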
Speech enablement is now a viable option for many organisations and the list of applications is growing. Whether it will ever be as good as human speech recognition remains to be seen. But the days of Epsom being mixed up with Liverpool are long gone.