Hello, and welcome to Paper of the Day (Po’D): Missing Data Imputation in Noise Robust Speech Recognition Edition. Today we have another paper from the PSI Speech research group at the Katholieke Universiteit Leuven, Belgium: J. F. Gemmeke, H. Van hamme, B. Cranen and L. Boves, “Compressive Sensing for Missing Data Imputation in Noise Robust Speech Recognition”, IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 272-287, Apr. 2010.
For noisy speech recognition based on time-frequency exemplars (see Po’D here), one must invariably address the cold fact that portions of the data can be so corrupted by noise that they are useless in the best case, or work against recognition in the worst case.
Here, Gemmeke et al. adapt the concept of image inpainting by sparse methods to approximate the degraded portions of the time-frequency features (e.g., spectrograms).
As long as one can pinpoint the locations of bad data,
one can predict them by assuming the entire spectrogram is sparsely represented
in some dictionary of clean features.
Thus, this approach attempts to find a good sparse approximation of the reliable dimensions while ignoring the bad ones,
and, once done, fills in the bad dimensions with the corresponding entries of the exemplars used in the approximation.
These corrected features are then used for speech recognition.
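To make the procedure above concrete, here is a minimal numpy sketch of masked sparse imputation. The dictionary, mask, and solver below are my own illustrative choices, not the paper's exact setup: Gemmeke et al. solve an l1-regularized problem, whereas this sketch uses a simple greedy matching pursuit restricted to the reliable rows.

```python
import numpy as np

def sparse_impute(y, mask, D, n_nonzero=5):
    """Impute unreliable entries of a feature vector (illustrative sketch).

    y    : observed feature vector (length d), corrupted where mask is False
    mask : boolean array, True where y is reliable
    D    : dictionary (d x n) whose columns are clean exemplar features
    """
    Dr = D[mask]            # dictionary restricted to reliable rows
    yr = y[mask]            # reliable observations
    norms = np.linalg.norm(Dr, axis=0) + 1e-12
    residual = yr.copy()
    support = []
    for _ in range(n_nonzero):
        # select the (normalized) atom most correlated with the residual
        j = int(np.argmax(np.abs(Dr.T @ residual) / norms))
        if j not in support:
            support.append(j)
        # least-squares refit of the coefficients on the current support
        coef, *_ = np.linalg.lstsq(Dr[:, support], yr, rcond=None)
        residual = yr - Dr[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    recon = D @ x           # dictionary reconstruction over ALL dimensions
    out = y.copy()
    out[~mask] = recon[~mask]   # keep reliable entries, impute the rest
    return out
```

The key point is that the fit uses only the reliable rows, but the reconstruction `D @ x` is defined everywhere, so the masked-out entries come "for free" from the exemplars.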
This approach is also taken in the more recent paper on audio inpainting, which I review here. In that work, the authors perform data imputation in the time domain rather than a feature domain, and for listenability rather than classification.
While finding clipped samples in time-domain signals is easy,
it is not so easy to find time-frequency features corrupted by noise.
The hard part here is figuring out which data are and are not reliable.
Gemmeke et al. compare a couple of different methods for estimating a mask against an oracle mask, which they can compute because they synthesize their test data.
This provides a good indication of how much better things can be made.
They also compare the results to two other imputation techniques.
Their results clearly show that sparse imputation (with the oracle mask) at an SNR of -5 dB (which translates to about 82% of the features being imputed) improves recognition accuracy on a single-digit task by over 30% compared to the other methods (reaching 92%).
When we have to estimate the mask, however, things are not so rosy.
Now, with a mask estimated using harmonicity, the sparse approximation approach to imputation performs about 5% worse than the best method even at 5 dB SNR (about 85% accuracy versus 90%).
Gemmeke et al. attribute this turn of events to false positives and false negatives in the estimated mask,
and to time-frequency overlap of noise and speech. This seems entirely reasonable to me.
So clearly, the area that needs the most help here is estimating that mask… or perhaps dropping masks altogether and finding sparse approximations of speech and noise with exemplar dictionaries of each.
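That closing idea — decomposing the noisy observation over concatenated speech and noise exemplar dictionaries, so no mask is needed at all — can be sketched in a few lines. Everything below is my own hypothetical illustration, not the paper's method; I use non-negative least squares as a stand-in for a proper sparse decomposition of magnitude-spectrogram-like features.

```python
import numpy as np
from scipy.optimize import nnls

def separate_speech(y, D_speech, D_noise):
    """Hypothetical sketch: decompose noisy features over concatenated
    speech and noise exemplar dictionaries, then keep only the part
    explained by the speech exemplars. No reliability mask is required."""
    D = np.hstack([D_speech, D_noise])  # joint dictionary
    x, _ = nnls(D, y)                   # non-negative decomposition
    n_s = D_speech.shape[1]
    return D_speech @ x[:n_s]           # speech-only reconstruction
```

The appeal is that the noise dictionary "absorbs" the corruption, replacing the hard mask-estimation problem with a modeling problem: how well the noise exemplars span the actual noise.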