Extending the Bandwidth of NarrowBand Speech Using Cepstral Linear Prediction

6. Objective Results

Simulations are performed using 16-bit speech files sampled at 16 kHz from the TIMIT (Texas Instruments M.I.T.) database.

The enhanced speech outputs, with the high-band spectral envelope extended, are then subjected to both objective and subjective quality testing against the original 16 kHz speech signal. For the objective tests, we use an average log spectral distortion measure.

Figure 3 shows the spectrogram plot of an original speech file sampled at 16 kHz.

Figure 3: Spectrogram of original speech file (sa1.wav) sampled at 16 kHz.

Note that the fricatives account for most of the energy in the high sub-bands.

Figure 4 shows the output of our proposed technique, which reproduces the high-energy bands well using spectral extrapolation based on the cepstral method.

Figure 5 shows the enhanced speech produced by the spectral folding and spectral shaping technique proposed in [2]. The spectrogram shows that the high-frequency re-synthesis obtained with spectral folding is not pronounced. It also shows an inherent limitation of spectral folding: because the fold occurs at the Nyquist frequency, it cannot reproduce the fricative spectrum that typically lies in the 4 - 5 kHz range, as noted by Heide et al. [5].
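The folding behaviour can be illustrated with a minimal sketch. It assumes that folding is implemented as zero-insertion upsampling from 8 kHz to 16 kHz, which is a common formulation and not necessarily the exact procedure of [2]; the function name `spectral_fold` and the 1 kHz test tone are hypothetical choices made for illustration only.

```python
import numpy as np

def spectral_fold(x_nb):
    """Upsample by 2 with zero insertion (assumed form of spectral folding).

    Without a low-pass filter, the narrowband spectrum is mirrored
    about the old Nyquist frequency into the new high band.
    """
    y = np.zeros(2 * len(x_nb))
    y[::2] = x_nb
    return y

fs_nb = 8000                      # narrowband sampling rate in Hz
tone_hz = 1000.0                  # hypothetical test tone
n = np.arange(800)                # 0.1 s, an integer number of tone periods
x = np.sin(2 * np.pi * tone_hz * n / fs_nb)

y = spectral_fold(x)
fs_wb = 2 * fs_nb                 # 16 kHz after folding
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1.0 / fs_wb)

# Frequencies of the two strongest components, in ascending order.
peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peaks)
```

The tone at 1000 Hz reappears at 8000 - 1000 = 7000 Hz. The high band is therefore only a mirror image of the low band, so any region of the high band whose mirror position in the narrowband signal holds little energy stays empty after folding.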

Figure 4: Spectrogram of enhanced speech file (sa1.wav) using cepstral linear prediction.

Figure 5: Spectrogram of enhanced speech (sa1.wav) using spectral folding method.

Table 1 compares the average log spectral distortion, measured relative to the original speech sampled at 16 kHz, for three signals: speech enhanced using cepstral linear prediction, speech enhanced using the zero crossing enhanced spectral folding method proposed in [7], and the narrowband signal. The comparison covers 10 speech files from the TIMIT database.

The first five files are from female speakers and the last five are from male speakers.

Some spectral distortion is to be expected because we use a simple linear mapping, with a common W for both voiced and unvoiced frames for computational ease. Voiced frames generally show a monotonically decreasing spectral shape, while unvoiced frames show a monotonically increasing one, so a common linear predictor W carries a certain level of distortion into the re-synthesis.
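The cost of sharing one W can be shown with a toy sketch. Everything in it is an assumption made for illustration: the feature sizes, the synthetic random data, and the least-squares fit stand in for the actual features and training procedure, which are not reproduced here. Two frame classes are generated from two different linear maps, and a single W is then fitted to both.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_in, n_out = 200, 10, 4     # hypothetical sizes

# Synthetic stand-ins for narrowband features of the two frame classes.
X_voiced = rng.normal(size=(n_frames, n_in))
X_unvoiced = rng.normal(size=(n_frames, n_in))

# Each class follows its own (hypothetical) linear map to the high band.
W_voiced = rng.normal(size=(n_in, n_out))
W_unvoiced = rng.normal(size=(n_in, n_out))
Y_voiced = X_voiced @ W_voiced
Y_unvoiced = X_unvoiced @ W_unvoiced

# One common W fitted by least squares over all frames.
X = np.vstack([X_voiced, X_unvoiced])
Y = np.vstack([Y_voiced, Y_unvoiced])
W_common, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A W fitted to voiced frames only, for comparison.
W_v_only, *_ = np.linalg.lstsq(X_voiced, Y_voiced, rcond=None)

err_common = np.mean((X @ W_common - Y) ** 2)
err_voiced_only = np.mean((X_voiced @ W_v_only - Y_voiced) ** 2)
print(err_common, err_voiced_only)
```

In this toy setting the class-specific fit is essentially exact, while the common W leaves a residual error because it has to compromise between the two maps. The trade-off is the one described above: a single W is cheaper to compute and store, at the price of some distortion.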

Nevertheless, the results show that, relative to the original wideband speech signal, the cepstral linear prediction method gives lower spectral distortion than the narrowband signal for every file. It also gives lower distortion than the zero crossing enhanced spectral folding method proposed in [7] for nine of the ten files (Sa2.wav is the exception) and on average.

The objective test is based on the average log spectral distance (LSD) computed in the Fourier domain. It is used instead of a time-domain SNR, which would not produce meaningful results because the phase of the estimated high-frequency components does not match that of the original 16 kHz signal. The average LSD between the original wideband speech, A, and the reconstructed wideband speech, B, is defined as:

[Equation: average log spectral distance]

where K is the number of frames.
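As a minimal sketch, one common form of this measure takes the root-mean-square difference between the log power spectra of A and B in each frame and averages it over the K frames. The exact definition used here may differ; the frame length, hop size, Hann window, dB scaling, and the small constant `eps` are all assumptions, and `average_lsd` is a hypothetical name.

```python
import numpy as np

def average_lsd(a, b, frame_len=512, hop=256, eps=1e-10):
    """Average log spectral distance in dB (assumed form, see lead-in)."""
    n = min(len(a), len(b))
    if n < frame_len:
        raise ValueError("signals are shorter than one frame")
    window = np.hanning(frame_len)
    distances = []
    for start in range(0, n - frame_len + 1, hop):
        A = np.fft.rfft(window * a[start:start + frame_len])
        B = np.fft.rfft(window * b[start:start + frame_len])
        log_a = 10.0 * np.log10(np.abs(A) ** 2 + eps)
        log_b = 10.0 * np.log10(np.abs(B) ** 2 + eps)
        # RMS difference of the log power spectra for this frame.
        distances.append(np.sqrt(np.mean((log_a - log_b) ** 2)))
    return float(np.mean(distances))   # mean over the K frames

# Why a time-domain SNR is misleading: flip the phase, keep the magnitude.
rng = np.random.default_rng(0)
a = rng.normal(size=16000)             # stand-in for one second at 16 kHz
b = -a                                 # same magnitude spectrum, opposite phase

lsd = average_lsd(a, b)
snr_db = 10.0 * np.log10(np.sum(a ** 2) / np.sum((a - b) ** 2))
print(lsd, snr_db)
```

The phase-inverted copy has a magnitude spectrum identical to the original, so its LSD is zero, yet its time-domain SNR is about -6 dB. This is the mismatch that motivates a spectral measure over a time-domain SNR.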

Table 1: Log Spectral Distortion Comparison

TIMIT speech file          LSD of cepstral method   LSD of ZCR   LSD of narrow band
Sa1.wav (female)           1.0214                   1.0419       2.1203
Sa2.wav (female)           1.0265                   0.9853       2.3559
Fdac1_sa1.wav (female)     1.0343                   1.1053       2.4812
Si908.wav (female)         1.0419                   1.1566       2.3093
Si1538.wav (female)        1.0509                   1.1439       2.3310
Mwbt0_sa2.wav (male)       1.0856                   1.5299       2.7205
Mdab0_sx139.wav (male)     0.9776                   1.0605       2.1092
Mwbt0_sa1.wav (male)       1.0602                   1.4577       2.6536
Sx370.wav (male)           0.9410                   1.1647       2.1678
Sx280.wav (male)           0.9868                   1.1100       2.3482
Average                    1.02262                  1.17558      2.3597
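The averages in Table 1 can be checked with a short script. The values are copied from the table; the variable names and the derived quantities (the per-file comparison count and the relative reductions) are added here for illustration.

```python
from statistics import mean

# LSD values from Table 1, in file order.
lsd = {
    "cepstral": [1.0214, 1.0265, 1.0343, 1.0419, 1.0509,
                 1.0856, 0.9776, 1.0602, 0.9410, 0.9868],
    "zcr": [1.0419, 0.9853, 1.1053, 1.1566, 1.1439,
            1.5299, 1.0605, 1.4577, 1.1647, 1.1100],
    "narrowband": [2.1203, 2.3559, 2.4812, 2.3093, 2.3310,
                   2.7205, 2.1092, 2.6536, 2.1678, 2.3482],
}

averages = {name: mean(values) for name, values in lsd.items()}

# Number of files where the cepstral method beats the ZCR method.
wins_over_zcr = sum(c < z for c, z in zip(lsd["cepstral"], lsd["zcr"]))

# Relative reduction of the average LSD achieved by the cepstral method.
reduction_vs_zcr = 1 - averages["cepstral"] / averages["zcr"]
reduction_vs_nb = 1 - averages["cepstral"] / averages["narrowband"]

print(averages)
print(wins_over_zcr, reduction_vs_zcr, reduction_vs_nb)
```

The computed means match the Average row. On these ten files the cepstral method has the lower LSD in nine cases, and its average LSD is roughly 13% below that of the ZCR method and roughly 57% below that of the narrowband signal.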
