# Paper of the Day (Po’D): Exemplar-based Sparse Representations for Noise Robust Automatic Speech Recognition

Hello, and welcome to Paper of the Day (Po’D): Exemplar-based Sparse Representations for Noise Robust Automatic Speech Recognition. Today’s paper is a very recent contribution: J. F. Gemmeke, T. Virtanen and A. Hurmalainen, “Exemplar-based Sparse Representations for Noise Robust Automatic Speech Recognition,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 19, no. 7, pp. 2067-2080, Sep. 2011.
Subsumed into the work described in this paper are the following works:

• J. Gemmeke and T. Virtanen, “Noise robust exemplar-based connected digit recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Dallas, TX, USA), March 2010 (Po’D here)
• J. Gemmeke, L. ten Bosch, L. Boves, and B. Cranen, “Using sparse representations for exemplar based continuous digit recognition,” in Proc. EUSIPCO, (Glasgow, Scotland), pp. 1755-1759, Aug. 2009 (Po’D here)
• J. Gemmeke and B. Cranen, “Noise robust digit recognition using sparse representations,” in Proc. of ISCA Speech Analysis and Processing for Knowledge Discovery, 2008.
• J. Gemmeke, H. Van hamme, B. Cranen and L. Boves, “Compressive sensing for missing data imputation in noise robust speech recognition,” IEEE J. Selected Topics in Signal Process., vol. 4, no. 2, pp. 272-287, 2010.

This paper shows how sparse representation of speech in a dictionary of speech and noise features can be used to automatically recognize speech in difficult environments. The features are concatenated time-frequency energy distributions of short signal segments, not unlike the shingles proposed by Slaney et al. The authors construct two dictionaries, one containing exemplars of speech (taken from clean speech), and another containing exemplars of noise (taken from noise alone). With these, one can build a sparse representation of noisy speech, and then use only the speech portion of that representation to determine its content, by looking at the labels associated with the active exemplars. Thus, Gemmeke et al. are using sparse representation for source separation, as well as for content classification.
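As a rough sketch of how such stacked exemplars might be assembled from a spectrogram (the windowing, feature scaling, and sampling of exemplars in the paper differ; the function name and layout here are my own illustration):

```python
import numpy as np

def make_exemplars(spectrogram, T):
    """Stack T consecutive time frames of an (n_bins, n_frames) energy
    spectrogram into column exemplars of length n_bins * T.
    (Illustrative layout only; not the authors' exact procedure.)"""
    n_bins, n_frames = spectrogram.shape
    cols = [spectrogram[:, t:t + T].reshape(-1, order="F")  # frames stacked in time order
            for t in range(n_frames - T + 1)]
    return np.stack(cols, axis=1)  # shape: (n_bins * T, n_frames - T + 1)
```

Each column then represents a short time-frequency patch of the signal, so the dictionary captures temporal context rather than single frames.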

For a real noisy speech segment $$\vy$$, Gemmeke et al. propose to find its sparse representation in a speech and noise exemplar dictionary $$\MA$$ of $$N$$ real exemplars of dimension $$M$$, by solving the following problem
$$\begin{multline} \min \sum_{m=1}^{M} [\MA\vx]_m - [\vy]_m \left [1 - \log ( [\vy]_m/[\MA\vx]_m) \right ] + \sum_{n=1}^N \lambda_n [\vx]_n \\ \textrm{subject to} \; \vx \succeq 0. \end{multline}$$
First, the non-negativity constraint ensures that the solution adds exemplars together instead of subtracting them, which mimics the additive nature of signals in the energy domain. Second, the generalized Kullback-Leibler divergence measures the distortion between the noisy speech segment and its representation. The authors motivate this choice by its success in source separation; it also treats large and small values in a more balanced way than, e.g., the squared Euclidean distance. Finally, sparsity is controlled by a weighted $$\ell_1$$-norm, with the weights defined differently for speech and noise exemplars. Giving more weight to the speech exemplars thus pushes the representation to be sparser in its selection of speech units than in its selection of noise units. That is a unique and intuitive idea. And, believe it or not, solving the above has a simple iterative structure.
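That iterative structure is of the multiplicative-update family familiar from NMF with the KL divergence. Here is a minimal NumPy sketch of one standard such update for this objective, assuming `y` and the columns of `A` are nonnegative energy features and `lam` holds the per-exemplar sparsity weights (the function name, initialization, and iteration count are my own choices, not the paper's):

```python
import numpy as np

def sparse_kl_updates(y, A, lam, n_iter=200, eps=1e-12):
    """Nonnegative sparse coding of y in dictionary A under the generalized
    KL divergence plus a weighted l1 penalty lam.
    Multiplicative update: x <- x * (A.T @ (y / (A @ x))) / (A.T @ 1 + lam),
    which keeps x nonnegative and decreases the objective."""
    M, N = A.shape
    x = np.ones(N)                    # nonnegative starting point
    denom = A.T @ np.ones(M) + lam    # column sums of A plus sparsity weights
    for _ in range(n_iter):
        ratio = y / (A @ x + eps)     # elementwise y / (Ax)
        x *= (A.T @ ratio) / (denom + eps)
    return x
```

Setting larger entries of `lam` for the speech columns than for the noise columns would implement the differential weighting described above.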

With a sparse representation in terms of the speech dictionary, the authors create label matrices describing the likelihood of particular speech states across the entire time span covered by the frames in $$\vy$$. From these, one can find the most likely state sequence, and thus the content of the noisy speech signal.
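A toy sketch of this labeling step, assuming each speech exemplar comes with a 0/1 label matrix of shape (states × frames) marking the state of each of its frames (this layout, and the greedy per-frame decoding standing in for a proper Viterbi search, are my simplifications):

```python
import numpy as np

def activations_to_states(x, label_mats):
    """Weight each exemplar's (states x frames) label matrix by its
    activation and sum, giving a score for every state at every frame;
    then pick the best state per frame. A real recognizer would run a
    Viterbi search over these scores instead of a greedy argmax."""
    scores = sum(xn * L for xn, L in zip(x, label_mats))
    return scores.argmax(axis=0)
```

The key point is that the activations $$\vx$$ are never converted back into a signal; the labels attached to the active exemplars carry the recognition result directly.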

In their experiments, the authors find that as the signal-to-noise ratio degrades, exemplar-based sparse representation begins to perform much better than the baseline approaches. Also, as the number of time frames in the exemplars increases, so does performance. This shows that one can indeed build a noise robust speech recognizer from sparse representation with long exemplars. Interestingly, on clean signals this approach does not perform as well as the baseline systems.
I wonder if this is due to some effort being lost on trying to find noise in the representation, which is what the authors also posit. Gemmeke et al. provide two other possibilities: the dictionary does not cover the entire space of vocalizations, and might itself contain mislabeled content; and/or it could be a problem with using magnitude spectra as features.

Regardless, this is fantastic work with a lot of potential. I wondered aloud to Jort about problems with the specificity of the features: for instance, taking magnitude spectra of several people speaking words and fitting those to completely different people speaking words at different speeds, and so on. Jort said that it still works… and the proof, as they say, is in the pudding.