Paper of the Day (Po’D): Speech Recognition by Sparse Approximation Edition, No. 2

Now that I am finished with a submission to Asilomar 2010, I have a brief respite before beginning to read 8 student project reports, and preparing two grant proposals. So, about 10 minutes.

Hello, and welcome to Paper of the Day (Po’D): Speech Recognition by Sparse Approximation Edition, No. 2. Today’s paper describes using sparse approximation to aid in the automatic recognition of spoken connected digits corrupted by noise: J. Gemmeke and T. Virtanen, “Noise robust exemplar-based connected digit recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Dallas, TX, USA), March 2010. This work extends that which I reviewed in a previous Po’D here.

In this work, the authors adopt a slightly different approach to decomposing and classifying frames of speech to recognize spoken digits. They are still looking at features derived from a time-frequency energy distribution (I still don’t know the window size and hop), but here they are using mel frequency linear power instead of the log power of their previous approach. This is to take into account that the signal of interest is corrupted by additive noise: the power spectra of uncorrelated speech and noise add approximately in the linear domain, but not in the log domain. (Is this also the reason why they are not using cepstra?) However, they are not just dealing with noise, as in your average stationary noise distributed as a Gaussian. They are dealing with interfering signals described as “subway, car, babble, exhibition hall, restaurant, street, airport, and train station.” Hence, it is no longer acceptable to assume that we will have low magnitude correlations between the dictionary of speech exemplars and the noise. Thus, the authors augment the dictionary of speech exemplars \(\MA\) with labeled features derived from these different types of interfering signals \(\MN\). (How they acquire those exemplars I don’t know. Are they taken from the speech signals themselves when no speech is present?)
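To make "mel frequency linear power" concrete, here is a minimal NumPy sketch of how such features could be computed. It is not the authors' front end. The 25 ms window and 10 ms hop come from the correspondence quoted at the end of this post; the 8 kHz sample rate, the 23 mel bands, the Hamming window, the triangular filters, and the function names are all my assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_bands + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_bands, fft_freqs.size))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b:b + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_power(signal, sr, window_ms=25.0, hop_ms=10.0, n_bands=23):
    """Return an (n_bands, n_frames) matrix of linear mel-band power (no log)."""
    win = int(round(sr * window_ms / 1000.0))
    hop = int(round(sr * hop_ms / 1000.0))
    n_fft = 1 << (win - 1).bit_length()          # next power of two >= win
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hamming(win), n=n_fft, axis=1)) ** 2
    return mel_filterbank(n_bands, n_fft, sr) @ spec.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_signal = rng.standard_normal(8000)       # one second of toy "audio" at 8 kHz
    print(mel_power(toy_signal, sr=8000).shape)
```

The point of stopping before the logarithm is that every entry stays non-negative and approximately additive across sources, which is what the non-negative model below relies on.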

Anyhow, as before we seek to express some known set of vectorized speech linear power spectrograms \(\MX\) as a non-negative linear combination of labeled vectors in the dictionary \([\MA | \MN]\). Thus, the signal model is given by
$$\MX = [\MA | \MN]\MS \; \textrm{subject to} \; \MS \ge 0$$
where the constraint means that all elements of \(\MS\) are greater than or equal to zero.
In this form, we can see that to solve this problem we seek a non-negative matrix factorization of our set of signals \(\MX\). The matrix \(\MS\) then describes the “activations” of each of the spectrograms in our dictionary of labeled templates.
The authors solve this factorization using an iterative approach that minimizes a cost function based on the Kullback-Leibler divergence between two linear power spectra, and that promotes sparsity in the activations. (In their previous paper, the authors used the Euclidean distance between log power spectra.)
Finally, as in the previous paper, each of the unlabeled speech segments is mapped to state likelihoods by using the activations in \(\MS\).
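As a rough illustration of the decomposition described above, here is a minimal sketch that holds the dictionary \([\MA | \MN]\) fixed and estimates only the activations. It uses the standard multiplicative update for the generalized Kullback-Leibler divergence with an added L1 penalty. I am assuming that this matches the spirit of the authors' iteration; the penalty weight, the iteration count, and the one-label-per-exemplar scoring are hypothetical simplifications of mine, not details from the paper.

```python
import numpy as np

def sparse_kl_activations(X, D, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations S such that X is approximately D @ S.

    X: (F, T) non-negative observations, one vectorized spectrogram per column.
    D: (F, K) fixed non-negative dictionary, e.g. np.hstack([A, N]).
    sparsity: weight of the L1 penalty on S (hypothetical value).
    """
    S = np.ones((D.shape[1], X.shape[1]))
    col_sums = D.sum(axis=0)[:, None]            # D^T 1, shape (K, 1)
    for _ in range(n_iter):
        ratio = X / (D @ S + eps)                # elementwise X ./ (D S)
        S *= (D.T @ ratio) / (col_sums + sparsity)
    return S

def label_scores(S, speech_labels, n_labels):
    """Sum the speech activations per label; noise rows of S are ignored.

    Assumes the speech exemplars occupy the first len(speech_labels) rows of S
    and that each exemplar carries a single integer label (a simplification).
    """
    k_speech = len(speech_labels)
    onehot = np.zeros((n_labels, k_speech))
    onehot[speech_labels, np.arange(k_speech)] = 1.0
    return onehot @ S[:k_speech]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((20, 5))                      # toy speech exemplars
    N = rng.random((20, 3))                      # toy noise exemplars
    D = np.hstack([A, N])
    X = D @ rng.random((8, 4))                   # toy noisy observations
    S = sparse_kl_activations(X, D)
    print(label_scores(S, np.array([0, 0, 1, 1, 2]), n_labels=3).shape)  # (3, 4)
```

Because the update only ever multiplies by non-negative quantities, activations that start positive can shrink toward zero but never turn negative, so the constraint on \(\MS\) is satisfied without any explicit projection.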

For six different SNR values in \([-5, 20]\) dB, and dictionaries of 8000 exemplars (4000 speech and 4000 noise), the authors show that the performance of their digit recognition approach is affected by interfering noise to a much smaller extent than the baseline approach (I don’t know what the baseline approach is). At a signal to interference power ratio of -5 dB, their approach using long scale features (30 frames, how long is a frame?) has an accuracy of 55.8%, while the baseline is severely punished with an accuracy of 17.1%. At 10 frames the accuracy of their approach is still at 41.0%.
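Regarding my question about frame length: with the 25 ms window and 10 ms hop given in the correspondence at the end of this post, the span of a multi-frame exemplar is easy to work out. The sketch below does that arithmetic and also shows one way to stack consecutive frames into vectorized exemplars. The sliding step of one frame and the column-major stacking order are my assumptions, and the function names are hypothetical.

```python
import numpy as np

def exemplar_duration_ms(n_frames, window_ms=25.0, hop_ms=10.0):
    """Time span covered by n_frames overlapping analysis windows."""
    return window_ms + (n_frames - 1) * hop_ms

def stack_frames(features, n_frames):
    """Turn a (B, T) feature matrix into vectorized windows of n_frames frames.

    Returns a (B * n_frames, T - n_frames + 1) matrix; each column is one
    window with its frames concatenated in time order.
    """
    n_windows = features.shape[1] - n_frames + 1
    windows = [features[:, t:t + n_frames].reshape(-1, order="F")
               for t in range(n_windows)]
    return np.stack(windows, axis=1)

if __name__ == "__main__":
    print(exemplar_duration_ms(30))   # 315.0
    print(exemplar_duration_ms(10))   # 115.0
    toy = np.arange(12).reshape(3, 4)
    print(stack_frames(toy, 2).shape) # (6, 3)
```

So under these numbers, the 30-frame exemplars span roughly a third of a second, and the 10-frame exemplars a little over a tenth.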

This work shows that sparse approximation of unlabeled features using additive labeled features provides good robustness when recognizing speech embedded in interfering signals that can be as non-stationary as speech. However, a problem here is that the dictionary of speech exemplars must either be very uncorrelated with the interfering signals present, or the dictionary must contain features that can represent the interfering signals more efficiently than can the speech features. In other words, we hope that in the non-negative matrix factorization, the noise features are activated before the speech features. Otherwise, non-speech will be labeled speech, which might not be so bad when using a state-based decoding process offered by HMMs trained on each digit. So I wonder, if a speech signal is corrupted by “rodeo”, and the dictionary only has noise exemplars “subway, car, babble, exhibition hall, restaurant, street, airport, and train station”, how will this method perform?

Ok, now onto writing some grant proposals. And then some project reports.

The following content was added after private correspondence with Jort, June 3, 2010:

Hop is given on page 3 and is 10ms. Window size somehow disappeared but
should be 25ms.

Also on page 3, you can read we make the noise exemplars from the noises
used in the Aurora-2 multicondition trainset which contains the same noise types as
testset A. Incidentally, we have submitted a paper to Interspeech in which
we do extract noise exemplars from the noisy speech itself, as well as
considering the use of artificial noise exemplars.

>> (I don’t know what the baseline approach is)

Also in the experimental setup section on page 3 (I know, those sections are
boring…) it is written the baseline is a MDT system using the missing data
mask described in ref [15]

>> So I wonder, if a speech signal is corrupted by “rodeo”, and the
dictionary only has noise exemplars “subway, car, babble, exhibition hall,
restaurant, street, airport, and train station”, how will this method perform?

Well, given the task, this should be ok, especially since some of the
phones in rodeo don’t even exist in the digit-task. Aside from that; even if
nothing of rodeo is captured in the noise dictionary, we would hope we would
simply get competing state activations during the interfering word. It's then
up to Viterbi whether or not the competing state activations can form a path
with high enough likelihood to compete with the actual spoken digit.

Of course, if we would place extra constraints on the minimization, such as
group sparsity (for example only female/male exemplars allowed at once) or
constraints on transitions between exemplars, this should only improve

