Simulations are performed using 16-bit speech files sampled at 16 kHz from the TIMIT (Texas Instruments M.I.T.) database.
The enhanced speech outputs, with their high-band spectral envelope extended, are then subjected to both objective and subjective quality testing against the original 16 kHz speech signal. For the objective tests, we use an average log spectral distortion measure.
Figure 3 shows the spectrogram plot of an original speech file sampled at 16 kHz.
Figure 3: Spectrogram of original speech file (sa1.wav) sampled at 16 kHz.
Note that the fricatives of voiced speech carry the greater share of the energy in the high sub-bands.
Figure 4 shows the output of our proposed technique, which reproduces the high-energy bands well using spectral extrapolation based on the cepstral method.
Figure 5 shows the enhanced speech obtained with the spectral folding and spectral shaping technique proposed in . The spectrogram shows that the high-frequency re-synthesis produced by spectral folding is not pronounced. It also shows an inherent limitation of spectral folding: because the fold occurs at the Nyquist interval, the method cannot reproduce the fricative spectrum typically found in the 4 - 5 kHz range, as mentioned by Heide et al. in .
Figure 4: Spectrogram of enhanced speech file (sa1.wav) using cepstral linear prediction.
Figure 5: Spectrogram of enhanced speech (sa1.wav) using spectral folding method.
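The mirroring behaviour behind spectral folding can be illustrated with a minimal sketch. It assumes an 8 kHz narrowband input (so the fold sits at 4 kHz) and implements folding in its simplest textbook form, zero insertion between samples; the function name `fold_spectrum` and the test tone are illustrative choices, not the implementation compared in this paper, and no spectral shaping is applied.

```python
import numpy as np


def fold_spectrum(narrowband):
    """Double the sampling rate by inserting a zero after every sample.

    Zero insertion leaves the baseband untouched and adds a mirror image
    of it above the original Nyquist frequency.
    """
    narrowband = np.asarray(narrowband, dtype=float)
    wideband = np.zeros(2 * len(narrowband))
    wideband[::2] = narrowband
    return wideband


# Assumed narrowband rate: 8 kHz, so the fold occurs at 4 kHz.
fs_narrow = 8000
fs_wide = 2 * fs_narrow

# A 1 kHz test tone with an integer number of cycles in the frame.
n = np.arange(1024)
tone = np.sin(2 * np.pi * 1000 * n / fs_narrow)

folded = fold_spectrum(tone)
spectrum = np.abs(np.fft.rfft(folded))
freqs = np.fft.rfftfreq(len(folded), d=1.0 / fs_wide)

# The two strongest components: the original tone and its mirror image.
peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peaks)
```

The tone at 1 kHz reappears at 8 kHz minus 1 kHz, that is 7 kHz. Energy just above 4 kHz can therefore only come from narrowband content just below 4 kHz, which is why a pure fold does not regenerate fricative energy that was absent from the narrowband signal.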
Table 1 compares the average log spectral distortion, measured relative to the original speech file sampled at 16 kHz, for three signals: speech enhanced using cepstral linear prediction, speech enhanced using the zero crossing enhanced spectral folding method proposed in , and the narrowband signal. The comparison covers 10 speech files from the TIMIT database.
The first five files are from female speakers and the last five are from male speakers.
Some spectral distortion is to be expected, because we use a simple linear mapping and, for computational ease, a common W for both voiced and unvoiced frames. Generally, voiced frames tend to show a monotonically decreasing spectral shape while unvoiced frames show a monotonically increasing one, so a common linear predictor W carries a certain level of distortion into the re-synthesis.
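The cost of sharing one linear map between two classes with opposite trends can be seen in a toy sketch. Everything in it is an assumption made for illustration: the data are synthetic random features rather than cepstral coefficients from TIMIT, the dimensions are arbitrary, the two classes are given exactly opposite mappings, and W is fitted by ordinary least squares, which may differ from how W is obtained in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_linear_map(X, Y):
    """Return the least-squares W minimising ||X @ W - Y||^2."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def mean_squared_error(X, Y, W):
    return float(np.mean((X @ W - Y) ** 2))


# Synthetic stand-ins for narrowband features (X) and high-band targets (Y).
X_voiced = rng.normal(size=(200, 4))
X_unvoiced = rng.normal(size=(200, 4))

# Hypothetical "true" mappings with opposite trends for the two classes.
W_voiced_true = rng.normal(size=(4, 3))
W_unvoiced_true = -W_voiced_true

Y_voiced = X_voiced @ W_voiced_true
Y_unvoiced = X_unvoiced @ W_unvoiced_true

# One common W fitted on all frames.
X_all = np.vstack([X_voiced, X_unvoiced])
Y_all = np.vstack([Y_voiced, Y_unvoiced])
W_common = fit_linear_map(X_all, Y_all)
common_error = mean_squared_error(X_all, Y_all, W_common)

# Separate W per class, averaged over the two equally sized classes.
split_error = 0.5 * (
    mean_squared_error(X_voiced, Y_voiced, fit_linear_map(X_voiced, Y_voiced))
    + mean_squared_error(X_unvoiced, Y_unvoiced, fit_linear_map(X_unvoiced, Y_unvoiced))
)

print(common_error, split_error)
```

On the training frames, a pooled least-squares fit can never have a lower error than class-specific fits, since each class-specific W minimises the error on its own frames. The common W trades that extra distortion for a single, cheaper predictor.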
The results nevertheless clearly show lower spectral distortion, relative to the original wideband speech signal, for the cepstral linear prediction method than for both the zero crossing enhanced spectral folding method proposed in  and the narrowband signal.
The objective test is based on the average log spectral distance (LSD) computed in the Fourier domain. It is used instead of a time-domain SNR, which would not produce meaningful results because the phase of the estimated high-frequency components will not match that of the original 16 kHz signal. The average log spectral distance, LSD, between the original wideband speech, A, and the reconstructed wideband speech, B, is defined as:
where K is the number of frames.
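A measure of this kind can be sketched as follows. The sketch assumes one commonly used form of the LSD: for each frame, the root-mean-square difference in dB between the magnitude spectra of A and B, averaged over the K frames. The frame length of 512 samples, the Hann window, the non-overlapping frames, and the function name `average_lsd` are assumptions for illustration; the exact definition used for Table 1 may differ.

```python
import numpy as np


def average_lsd(original, reconstructed, frame_len=512, eps=1e-10):
    """Average log spectral distance in dB over K non-overlapping frames.

    Only magnitude spectra are compared, so phase mismatch in the
    reconstructed high band does not affect the result.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)

    n_frames = min(len(original), len(reconstructed)) // frame_len  # K
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")

    window = np.hanning(frame_len)
    distances = []
    for k in range(n_frames):
        seg = slice(k * frame_len, (k + 1) * frame_len)
        mag_a = np.abs(np.fft.rfft(window * original[seg]))
        mag_b = np.abs(np.fft.rfft(window * reconstructed[seg]))
        # eps guards against log of zero in silent bins.
        diff_db = 20.0 * np.log10((mag_a + eps) / (mag_b + eps))
        distances.append(np.sqrt(np.mean(diff_db ** 2)))
    return float(np.mean(distances))


# Usage with a synthetic signal (white noise standing in for speech).
rng = np.random.default_rng(0)
signal = rng.normal(size=16000)

lsd_same = average_lsd(signal, signal)
lsd_scaled = average_lsd(signal, 2.0 * signal)
print(lsd_same, lsd_scaled)
```

Identical signals give a distance of zero, and a uniform gain of 2 gives about 6.02 dB in every bin, which is a quick sanity check on the dB scaling.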
Table 1: Log Spectral Distortion Comparison
| TIMIT speech file | LSD, cepstral method | LSD, ZCR-enhanced spectral folding | LSD, narrowband |