Using DSP Technology to Optimize Speech Recognition Performance
Overcoming Background Noise
One of the major obstacles to achieving high-performance speech recognition is 'noise'. In a car, this noise comes from a number of sources: the road, the engine, the radio, the wind and perhaps even the passengers. On a mobile phone, the noise might be background music, traffic, wind or passers-by talking.
Noise is a problem because it corrupts the acoustic characteristics that are extracted from the speech to form the sequence of feature vectors. The resulting errors in the feature vectors carry through to their corresponding phonemes.
Early attempts to apply software noise reduction techniques to speech recognition had limited success because, in most cases, the techniques had been developed to improve human-to-human communication systems. With such technology, there is always some misidentification of 'noise' and 'speech'. Noise that is misidentified as speech is transmitted, producing speech-like artefacts that can sound like a babbling brook and are very disturbing for a human listener. Conversely, speech that is misidentified as noise is removed, which can make the speech sound distorted.
Achieving optimal performance from a noise reduction technology involves a trade-off between introducing watery artefacts and causing speech distortion. In human-to-human communication, watery artefacts are usually less acceptable than losing small parts of speech, particularly since the brain tends, to some extent, to fill in the missing bits of speech to make sense of the output. In a speech recognition system, on the other hand, even a small amount of speech distortion can make words unrecognisable, while watery artefacts are often ignored. Consequently, it is usually necessary to design noise reduction technology specifically for enhancing the performance of speech recognition systems.
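One way to picture this trade-off is a minimal sketch of power spectral subtraction, a widely used noise reduction technique that is chosen here only for illustration and is not named in the text above. The function name `subtraction_gain` and the parameters `over_subtraction` and `floor` are assumptions made for this example. Each frequency bin is scaled by a gain derived from an estimate of the noise power in that bin.

```python
def subtraction_gain(noisy_power, noise_estimate, over_subtraction=1.0, floor=0.05):
    """Return a per-bin power gain for spectral subtraction (illustrative only).

    noisy_power     -- power of the noisy signal in each frequency bin
    noise_estimate  -- estimated noise power in each frequency bin
    over_subtraction -- values above 1 subtract more than the estimate, which
                        suppresses residual noise peaks (fewer watery artefacts)
                        but also attenuates speech more (more distortion)
    floor           -- minimum gain, so no bin is ever removed completely
    """
    gains = []
    for power, noise in zip(noisy_power, noise_estimate):
        if power <= 0.0:
            # Nothing measurable in this bin; fall back to the floor.
            gains.append(floor)
            continue
        gain = 1.0 - over_subtraction * noise / power
        gains.append(max(gain, floor))
    return gains


# Four hypothetical bins: strong speech, weak speech, noise only, moderate speech.
noisy = [10.0, 1.2, 1.0, 4.0]
noise = [1.0, 1.0, 1.0, 1.0]

gentle = subtraction_gain(noisy, noise, over_subtraction=1.0)
aggressive = subtraction_gain(noisy, noise, over_subtraction=2.0)
```

With the gentle setting, the weak speech bin keeps part of its energy but residual noise is more likely to survive. With the aggressive setting, the same bin drops to the floor, which is the kind of speech loss a human listener tolerates and a recogniser may not. A system tuned for recognition would presumably sit closer to the gentle end.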
Another interesting aspect is that in normal human-to-human conversation, say over a hands-free in-car phone, we talk over each other only about 6% of the time. During this time, echo cancellation technology removes one of the voices to avoid echo and enhance clarity. For a speech recognition system, there is often competing background noise and 'echo' 100% of the time, whether it comes from the radio, the speech recognition system itself or even a nearby passenger.
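For sources whose signal is known to the system, such as the radio or the recogniser's own prompts, echo is often removed with an adaptive filter. The following is a minimal sketch using the normalised least-mean-squares (NLMS) algorithm, a common choice that the text above does not specifically name. The function name `nlms_echo_cancel` and its parameters are assumptions made for this example.

```python
def nlms_echo_cancel(reference, microphone, taps=4, step=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `reference` from `microphone`.

    reference  -- samples of the known source (e.g. the signal sent to the loudspeaker)
    microphone -- samples picked up by the microphone (echo plus anything else)
    Returns (residual, weights): the echo-reduced signal and the final filter taps.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for ref_sample, mic_sample in zip(reference, microphone):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate
        # Normalising by the reference energy keeps the step size stable
        # whether the reference is loud or quiet.
        norm = sum(x * x for x in history) + eps
        weights = [w + step * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

In this sketch the filter adapts on every sample, which suits a case where only the echo is present. When the user speaks at the same time as the reference source, practical systems typically slow or freeze adaptation so that the filter does not start cancelling the speech it is meant to preserve.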
Because of these different operational requirements, noise and echo cancellation solutions for enhancing speech recognition need to be optimized differently, and solutions aimed at human listeners are often far from ideal. The quality of such solutions depends on how well they minimise distortion of the speech whilst reducing background noise and echo.
