Using DSP Technology to Optimize Speech Recognition Performance

By: Rishi Nag, Senior DSP Software Engineer, NCT


It may be that in 50 years' time we will have a family android that converses with us about the weather or even our favorite sports team's mid-season performance. If so, an important component of this icon of the future will be its ability to recognize speech as well as humans do. For the moment, though, speech recognition is an important emerging technology that is playing a key role in automotive telematics, mobile phone technology, conferencing systems and similar telecom applications. This article discusses some of the obstacles that such systems need to overcome in order to move towards a human level of performance, and how DSP noise reduction can help optimize their performance.

The Main Principles of Speech Recognition

Speech recognition is the process of converting a talker's sampled speech into the sequence of words representing what the talker has said. The basic building block of speech is the phoneme. There is one phoneme for every basic sound in the language. For example, the word 'cat' is constructed from three phonemes: 'k', 'a' and 't'. A speech recognition engine needs to construct the sequence of phonemes in the speech before it can produce the sequence of words. This is typically carried out in a number of distinct stages.

First, each short segment of speech is analyzed and its important acoustic characteristics are placed into a feature vector. The feature vector is compared to a database of feature vectors for the various phonemes in order to find the closest match. This process is repeated for each short segment of speech to produce a sequence of phonemes.
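The matching step can be illustrated with a minimal sketch. It assumes three-dimensional feature vectors, a Euclidean distance measure and a tiny hand-made template table; the values, the names PHONEME_TEMPLATES, closest_phoneme and decode_segments, and the choice of distance are all illustrative assumptions, not a description of any particular engine.

```python
import math

# Hypothetical reference feature vectors, one per phoneme (made-up values).
PHONEME_TEMPLATES = {
    "k": (0.9, 0.1, 0.2),
    "a": (0.2, 0.8, 0.3),
    "t": (0.7, 0.2, 0.9),
}


def closest_phoneme(feature_vector, templates=PHONEME_TEMPLATES):
    """Return the phoneme whose template is nearest in Euclidean distance."""
    return min(templates, key=lambda p: math.dist(feature_vector, templates[p]))


def decode_segments(feature_vectors):
    """Match each short segment's feature vector to its closest phoneme."""
    return [closest_phoneme(v) for v in feature_vectors]


# Three made-up segments, each lying close to one of the templates above.
segments = [(0.85, 0.15, 0.25), (0.25, 0.75, 0.35), (0.65, 0.25, 0.85)]
print(decode_segments(segments))  # ['k', 'a', 't']
```

A real engine would use far larger feature vectors and a statistical comparison rather than a plain distance, but the structure is the same: one feature vector per segment in, one phoneme per segment out.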

The next stage uses a pronunciation dictionary to create a number of possible word sequences. A pronunciation dictionary contains a list of words, each with the sequence of phonemes corresponding to its pronunciation. Using this dictionary in reverse, the phoneme sequences are put together to make known words. A single sequence of phonemes can, however, correspond to a number of different word sequences that have the same pronunciation. For example, 'car key' can be pronounced the same as 'khaki'. Consequently, this stage will result in a number of alternative word sequences.
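Using the dictionary in reverse can be sketched as follows. The dictionary entries and phoneme symbols ('aa', 'iy') are hypothetical, chosen so that 'car key' and 'khaki' share a pronunciation, and the exhaustive search in word_sequences is an assumption made for clarity rather than the method a production engine would use.

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence.
PRONUNCIATIONS = {
    "car": ("k", "aa"),
    "key": ("k", "iy"),
    "khaki": ("k", "aa", "k", "iy"),
    "cat": ("k", "a", "t"),
}


def build_reverse_index(pronunciations):
    """Invert the dictionary: phoneme sequence -> list of matching words."""
    reverse = {}
    for word, phonemes in pronunciations.items():
        reverse.setdefault(phonemes, []).append(word)
    return reverse


def word_sequences(phonemes, reverse):
    """Return every way to split the phoneme sequence into known words."""
    phonemes = tuple(phonemes)
    if not phonemes:
        return [[]]  # one valid way to cover nothing: the empty sequence
    results = []
    for end in range(1, len(phonemes) + 1):
        for word in reverse.get(phonemes[:end], []):
            for rest in word_sequences(phonemes[end:], reverse):
                results.append([word] + rest)
    return results


REVERSE = build_reverse_index(PRONUNCIATIONS)
candidates = word_sequences(["k", "aa", "k", "iy"], REVERSE)
print(candidates)  # [['car', 'key'], ['khaki']]
```

The single phoneme sequence yields two candidate word sequences, which is exactly the ambiguity that the following stage has to resolve.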

A language model then examines the context, and possibly the grammar, of the suggested strings of words to narrow the possibilities down to a single word sequence that makes sense: the recognized word sequence.
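One simple way to picture this narrowing is a bigram model, which scores each candidate by how often its adjacent word pairs have been seen before. The sketch below assumes such a model with add-one smoothing; the counts in BIGRAM_COUNTS, the vocabulary size and the function names are invented for illustration, and the article does not state which kind of language model a given engine uses.

```python
import math

# Hypothetical bigram counts, as if gathered from a small text corpus.
BIGRAM_COUNTS = {
    ("<s>", "where"): 5,
    ("where", "is"): 5,
    ("is", "my"): 4,
    ("my", "car"): 8,
    ("car", "key"): 8,
    ("my", "khaki"): 1,
}
VOCAB_SIZE = 6  # assumed vocabulary: where, is, my, car, key, khaki


def log_score(words):
    """Sum of add-one-smoothed bigram log-probabilities for a word sequence."""
    tokens = ["<s>"] + list(words)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        pair_count = BIGRAM_COUNTS.get((prev, cur), 0)
        prev_count = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev)
        total += math.log((pair_count + 1) / (prev_count + VOCAB_SIZE))
    return total


def pick_best(candidates):
    """Choose the candidate word sequence the model finds most plausible."""
    return max(candidates, key=log_score)


candidates = [
    ["where", "is", "my", "khaki"],
    ["where", "is", "my", "car", "key"],
]
best = pick_best(candidates)
print(" ".join(best))  # where is my car key
```

With these made-up counts the context 'where is my' favors 'car key' over 'khaki'; different counts, reflecting a different context, could favor the other reading.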

To summarize, a typical speech recognition engine breaks down the speech into a sequence of feature vectors, capturing the important acoustic characteristics of the speech. The feature vectors are converted into a sequence of phonemes, which are built up into suggested sequences of words. These word sequences are then narrowed down to the recognized sentence.
