Hello, and welcome to Paper of the Day (Po’D): Practical Implementations of Exemplar-based Noise Robust Automatic Speech Recognition Edition. Today I begin reviewing a series of interesting papers coming from the PSI Speech research group at the Katholieke Universiteit Leuven, Belgium. Today’s paper is: J. F. Gemmeke, A. Hurmalainen, T. Virtanen and Y. Sun, “Toward a practical implementation of exemplar-based noise robust ASR”, in Proc. EUSIPCO, pp. 1490–1494, 2011.
I have talked about Gemmeke’s excellent work before: here and here and here.
In this paper, he and his collaborators discuss in part
the very real problem of making a practical implementation of speech recognition by
sparse classification with large dictionaries of speech and noise exemplars.
To reach mean accuracies above 90% for noisy speech
(not your ordinary AWGN, but real noise, like subway and babble) at 5 dB SNR,
the authors needed their dictionary to contain 16,000 exemplars of speech spectral features and 4,000 exemplars of noise spectral features, each of dimensionality 690 (I think). That is a matrix of 13.8 million entries: not really a big matrix, but big enough that it is going to make “real-time” hard to achieve.
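To make the sparse-classification idea concrete, here is a minimal sketch in Python/NumPy (not the authors' MATLAB implementation): a noisy observation is decomposed as a nonnegative sparse combination of columns from a joint speech-plus-noise exemplar dictionary, using standard multiplicative updates for a sparsity-penalized KL objective. The dimensions, data, and parameter values are toy choices of my own, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (the paper's dictionary is ~690-dimensional with
# 16,000 speech + 4,000 noise exemplars; everything is shrunk here).
dim, n_speech, n_noise = 60, 40, 10

# Nonnegative exemplar dictionary: speech columns, then noise columns.
A_speech = rng.random((dim, n_speech))
A_noise = rng.random((dim, n_noise))
A = np.hstack([A_speech, A_noise])

# A "noisy observation": one speech exemplar mixed with one noise exemplar.
y = 0.8 * A_speech[:, 3] + 0.5 * A_noise[:, 7]

def sparse_activations(A, y, lam=0.1, n_iter=200):
    """Nonnegative sparse coding by multiplicative updates,
    approximately minimizing KL(y || A x) + lam * sum(x)."""
    eps = 1e-12
    x = np.full(A.shape[1], 1e-2)
    for _ in range(n_iter):
        x *= (A.T @ (y / (A @ x + eps))) / (A.sum(axis=0) + lam + eps)
    return x

x = sparse_activations(A, y)
x_speech, x_noise = x[:n_speech], x[n_speech:]

# Classification is driven by the speech activations; the noise
# activations simply "explain away" the noise part of the observation.
print(int(np.argmax(x_speech)), int(np.argmax(x_noise)))
```

Roughly speaking, in the real system each windowed observation is decomposed this way and the speech activations are mapped to state likelihoods for decoding; in this toy case the dominant activations simply identify which exemplars built the mixture.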
Gemmeke et al. have looked at how much faster things get when parallelism is brought into play with a GPU, specifically a GTX 460 with 1 GB of memory.
Luckily, programming for such a platform isn’t as problematic as it once was.
We can continue to use the safe prototyping environment of MATLAB with only a few minor changes and the GPUmat toolbox.
For a dictionary of 8,000 exemplars, Gemmeke et al. find that what took over 92 seconds now takes just over 3 seconds (for a speech sample of duration 1.82 seconds). What is more, though the GPU works only with single-precision floating point, Gemmeke et al. find that this does not significantly hurt recognition accuracy.
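The single-precision point is easy to sanity-check numerically. Below is a small NumPy sketch (my own illustration, not the authors' code) comparing a dictionary-sized matrix-vector product in double and single precision; the relative error introduced by float32 is orders of magnitude below the variability of noisy speech features.

```python
import numpy as np

rng = np.random.default_rng(1)

# A product shaped like the decomposition's inner loop: 690-dimensional
# features against a few thousand exemplars (shrunk from 20,000).
A64 = rng.random((690, 2000))
x64 = rng.random(2000)
y64 = A64 @ x64

# The same computation in single precision, as on the GPU.
y32 = A64.astype(np.float32) @ x64.astype(np.float32)

rel_err = np.abs(y32.astype(np.float64) - y64) / np.abs(y64)
print(rel_err.max())  # well below 1e-4
```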
This paper contains many other interesting experiments related to practical sparse-classification-based noise-robust speech recognition: how to increase noise robustness by augmenting the noise exemplar dictionary with artificial noise exemplars; the optimization of the Lagrangian weights in their objective function; and how mean accuracy decreases with the dictionary size. In this case, we see that at 5 dB SNR mean accuracy decreases only from 90% to 84% when the speech dictionary shrinks to 1,000 exemplars (the noise dictionary stays at 4,000). Since real-time operation requires about 100 such decompositions each second, parallel computing with GPUs brings it within reach.
A curious thing we see in these experiments, however, is that mean accuracy for clean speech is about 2% higher with dictionaries of exemplars spanning 20 frames of speech data than with those spanning 30 frames.
Does this have to do with the dictionary being fatter and more redundant in the former case?
In other words, the set of possible speech signals observed over 20 frames (200 ms) is smaller than the set observed over 30 frames (300 ms), so a fixed number of exemplars covers it more densely.
I will have to ask Jort about this.