
Hard of Hearing

Speech recognition can be infuriating, but it is improving fast.


Almost everyone who has used a speech recognition system, it seems, has been singularly unimpressed. Yet this perception of speech technology is rapidly becoming outdated, as many organisations are discovering to their profit.

“People's negative experiences are based on early interactive voice response (IVR) systems, where you were asked to respond to a specific question and were then frustrated that the answer played back wasn't what you said,” says Alan Barr, chief operating officer of Streamdoor and a veteran of the speech recognition industry. “You'd try to book a cinema ticket and be asked what cinema you wanted; you'd say 'Epsom' and it would respond with 'Liverpool'. People immediately thought the technology wasn't up to the job.”

However, the speech recognition systems of even three or four years ago bear little resemblance to those being deployed today. Systems can now boast accuracy rates of up to 99%, far greater flexibility in coping with different accents and many more possible applications, all for less work and less expenditure by the organisations deploying them.

That computers can recognise speech at all is impressive. People speak at different rates, with different dialects and emotions and in different pitches. Even the same person can say the same words in many different ways, and then there is the problem of separating speech from background noise.

There have been two main approaches to speech recognition in the last decade. The first involves 'neural networks' - a branch of artificial intelligence based on computer simulations of the brain that can learn from experience. By exposing neural nets to recorded speech and allowing them to reorganise themselves in the same way as neurons in the brain do, researchers have been able to get reasonable recognition of limited sets of words; the nets have even been able to extrapolate language and derive the past tenses of verbs from their own self-taught rules.
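As a rough illustration of that approach (a minimal sketch in Python, not any research group's actual system; the feature size and vocabulary are assumptions), a small feed-forward network can learn to map a fixed-length vector of acoustic features for an utterance to one of a limited set of words:

import torch
import torch.nn as nn

N_FEATURES = 13   # assumed: e.g. 13 averaged MFCC coefficients per utterance
N_WORDS = 10      # assumed: a limited vocabulary, as in early research systems

# A small feed-forward net: feature vector in, one score per word out.
model = nn.Sequential(
    nn.Linear(N_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, N_WORDS),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, word_labels):
    # One 'exposure': show the net labelled speech features and let its
    # weights reorganise to better separate the words.
    optimizer.zero_grad()
    loss = loss_fn(model(features), word_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

Repeating such steps over many labelled utterances is the 'exposure' described above; recognition then amounts to running new features through the net and taking the highest-scoring word.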

Using this technique, researchers at the University of Southern California have developed a system that they claim is better than human listeners at picking out a limited range of words from background noise. Yet while this approach is elegant and may prove more successful in the long term, it faces considerable obstacles in the short term.

Neural networks model only small parts of the brain, so whether they will be able to interpret the full range of vocabulary, accents and voice tones is debatable. It also takes humans many years of learning, exposure to different people and education to build up this ability; it will take a similar amount of time to train a neural network, even once one is sufficiently powerful.

So commercial systems have instead focused on pattern recognition as the best current route to speech recognition. Pattern recognition is based on 'hidden Markov models', which represent typical speech sounds mathematically but include some leeway to take into account the differences in how people may pronounce those sounds. To trigger recognition, a sound must fall within a certain range of frequencies and a certain range of energies.
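A minimal sketch of that idea in Python (illustrative parameters only, not any vendor's real models): each word gets its own hidden Markov model with Gaussian emissions, and the 'leeway' comes from a sound still matching when it falls near, rather than exactly on, the model's typical frequencies and energies. Recognition scores the incoming audio features against each word's model and picks the best:

import numpy as np

def log_gaussian(x, mean, var):
    # Log-likelihood of feature vector x under a diagonal Gaussian:
    # high when x is close to the typical sound, falling off gradually,
    # which is what gives the model its leeway.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_log_likelihood(obs, start, trans, means, variances):
    # Forward algorithm: total log-probability that this word's HMM
    # produced the observed sequence of feature vectors.
    n_states = len(start)
    alpha = np.log(start) + np.array(
        [log_gaussian(obs[0], means[s], variances[s]) for s in range(n_states)])
    for x in obs[1:]:
        emit = np.array([log_gaussian(x, means[s], variances[s])
                         for s in range(n_states)])
        alpha = emit + np.array(
            [np.logaddexp.reduce(alpha + np.log(trans[:, s]))
             for s in range(n_states)])
    return np.logaddexp.reduce(alpha)

# Recognition: pick the word whose model best explains the audio, e.g.
#   best = max(models, key=lambda w: forward_log_likelihood(obs, *models[w]))
# where models maps each word to its (start, trans, means, variances).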

To build up these models of speech sounds, vendors have to record huge databases of speech and analyse them for these sounds. Making it even harder is the fact that many of these sounds are pronounced differently, depending upon the other sounds near them.

Streamdoor's Barr trialled one of the first speech recognition systems in the UK, handling store enquiries from shoppers for Argos. He discovered first-hand how much speech sampling was needed for a successful application. “We discovered that the UK, out of all the European countries, has the most dialects with the exception of Italy,” he says. “Dialects make a huge difference. In that testing environment, we had to make 100,000 different calls with different dialects to get an algorithm with 90% recognition.”

Telephonetics' director Paul Welham has a similar story to tell. “There are a large number of hospital sites using speech for call routing in the UK. They have to understand what the general public is going to ask for, but even simple requests aren't easy to predict. We discovered that when people wanted to talk to the bedding department in a hospital, there were 100 ways for them to actually ask for that.”

