Hello, and welcome to Paper of the Day (Po’D): Speech Recognition by Sparse Approximation Edition. Today’s paper describes using sparse approximation to aid in the automatic recognition of spoken connected digits: J. Gemmeke, L. ten Bosch, L. Boves, and B. Cranen, “Using sparse representations for exemplar based continuous digit recognition,” in Proc. EUSIPCO, pp. 1755-1759, Glasgow, Scotland, Aug. 2009 (pdf). I see that this method is related to that in: P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. Instead of recognizing words, those authors recognize musical instruments from essentially atomic “exemplars” learned from sparse approximations of a training dataset. (And they even use the Viterbi algorithm!) Also closely related is non-negative matrix factorization of musical spectra, for example: C. Fevotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, Mar. 2009.

Gemmeke et al. approach the problem of speech recognition by comparing time-frequency features of an unknown to those of labeled “exemplars.” They classify the unlabeled data by path finding using a modified Viterbi algorithm. The authors compare two approaches to word recognition (word-based and state-based) using features computed from either sparse approximation or K-nearest neighbors. They find that the sparse approach with a state-based model consistently produces the higher accuracy of the two feature design methods. While they only use single spoken digits, their method extends naturally to accommodate other ways of speaking numbers, e.g., “twenty-one hundred and eleven.” It is just a question of having exemplars of the speech we expect. Having painted this broad overview, I will now dig deeper into what the authors are really doing, and some of the problems I see with this approach.

Let’s take a sampled speech signal of the spoken digit “one,” and find its spectrogram — the magnitudes of the short-time Fourier transform at positive frequencies — and convert this two-dimensional matrix into a column vector of length \(K\) by concatenating all of its columns. Now, consider that we do this for \(N_1\) different spoken versions of the same number (with the same time and frequency resolution). Sometimes this digit is spoken as a single digit, thus separated from others, and other times it is naturally connected to other digits, as in “one-oh” or “eightwon.” So, we will associate with each columnized spectrogram a label: a binary vector of length 12 (1, 2, …, 9, oh, zero, silence) describing the presence of each digit, or no digit at all.
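As a concrete sketch of this columnization (the window length, hop size, and function names below are my own choices, not the paper's), in NumPy:

```python
import numpy as np

def columnize_spectrogram(x, win_len=256, hop=128):
    """Magnitude STFT at positive frequencies, stacked into one column vector."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, time frames)
    return spec.flatten(order="F")                 # concatenate the columns

# toy "speech" signal standing in for a spoken digit
rng = np.random.default_rng(0)
a = columnize_spectrogram(rng.standard_normal(2048))
a /= np.linalg.norm(a)                             # unit norm, as each exemplar requires
```

Each exemplar \(\va_i\) is then just such a vector, paired with its 12-dimensional binary label.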

This creates the set of exemplars for the class “one”:

$$\mathcal{E}_1 = \{ \{\va_i \in \mathcal{R}^K,\vy_i \in \mathcal{B}^{12}\} : i = 1, \ldots, N_1, ||\va_i||_2 = 1\}$$

where \(\mathcal{B} = \{0,1\}\).

Implicit here is that each spectrogram must be the same size, or equivalently that zeros are appended so that each vector is of length \(K\).

Once we do this for all the words in our expected vocabulary, we obtain the database of exemplars \(\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_Z, \mathcal{E}_s\}\).

Alternatively, we denote the matrix of \(N := N_1 + \ldots + N_Z + N_s\) exemplar columnized spectrograms as \(\MA\), and the binary class vectors as \(\MY\).

Now suppose that we have sampled some speech signal of an unspecified number of digits. We find its magnitude short-time Fourier transform with the same time and frequency resolution as the exemplars, and columnize it into several \(K\)-length vectors, one of which is \(\vx_j\).

To determine which digits are present in \(\vx_j\) we could simply find the \(k\) nearest neighbors in \(\mathcal{E}\) with respect to some distance measure in \(\mathcal{R}^K\), e.g., \(|| \va_i - \vx_j||_2\). Then we can assign a score to our unknown vector by summing together the label vectors of the \(k\) closest neighbors to create the label vector \(\vf_j\).

(I wonder why the authors do not create \(\vf_j\) by a weighted sum of the label vectors, or at least a mean?)
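A minimal sketch of this k-NN scoring, using the matrix forms \(\MA\) and \(\MY\) defined above (the toy data and names are mine, not the paper's):

```python
import numpy as np

def knn_label_score(x, A, Y, k=30):
    """Sum the label vectors of the k exemplars nearest to x in Euclidean distance."""
    d = np.linalg.norm(A - x[:, None], axis=0)   # distance to every exemplar column
    nearest = np.argsort(d)[:k]
    return Y[:, nearest].sum(axis=1)             # 12-dimensional score f_j

# toy usage: 200 exemplars of dimension 50
rng = np.random.default_rng(0)
A = rng.random((50, 200)); A /= np.linalg.norm(A, axis=0)
Y = (rng.random((12, 200)) < 0.2).astype(int)
f = knn_label_score(A[:, 3], A, Y, k=5)
```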

The authors propose an alternative approach using sparse approximation.

If we see \(\MA\) as a dictionary, we can build \(\vx_j\) as a linear combination

$$\vx_j = \MA\vs_j + \vr_j$$

where \(\vs_j \in \mathcal{R}^{N}\) is a vector of weights, and \(\vr_j \in \mathcal{R}^K\) is a residual vector.

Since we assume a single voice is speaking at any time, we might assume our unknown speech signal can be well-approximated by only a few columns of \(\MA\). And thus, the authors propose solving the exact sparse representation problem

$$\min_{\vs_j} ||\vs_j||_0 \; \text{subject to} \; \vx_j = \MA\vs_j.$$

(Here it is unclear why the authors want an exact sparse solution, and not just a small yet good subset of the dictionary of exemplars. Also around this time, the authors confuse the idea of \(N > K\) with overcompleteness (I doubt their exemplar dictionary is even complete), and state that work in compressed sensing shows the solution to this (NP-hard) problem can be found. No, no, no. There is no compressed sensing here, no random matrices, and you cannot guarantee recovery of \(\vs_j\).)

Anyhow, we can use our favorite algorithm (the authors use the lasso implementation in SparseLab, and stop after 30 iterations) to generate a sparse *approximation* of \(\vx_j\) such that \(||\vx_j – \MA\vs_j||_2 \le \epsilon\) for some maximum error \(\epsilon \ge 0\).
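I will not reproduce SparseLab's lasso here; as a rough stand-in, an \(\ell_1\)-penalized fit can be sketched with ISTA (iterative soft thresholding), where the penalty, iteration count, and toy data are my own arbitrary choices:

```python
import numpy as np

def ista(A, x, lam=0.05, n_iter=30):
    """Minimize 0.5*||x - A s||_2^2 + lam*||s||_1 by iterative soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ s - x)              # gradient of the quadratic term
        z = s - g / L
        s = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return s

# toy dictionary with unit-norm columns, and a 3-sparse synthetic signal
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200)); A /= np.linalg.norm(A, axis=0)
s_true = np.zeros(200); s_true[[5, 42, 117]] = [1.0, -0.7, 0.5]
x = A @ s_true
s = ista(A, x)
```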

From the solution, the authors then create a score for this vector by \(\vf_j = \MY|\vs_j|\), where \(|\vs_j|\) denotes the element-wise magnitude of the solution to the sparse approximation problem.

The authors do this because a negative weight makes no sense within the “label space.” (However, I do not see why the authors did not impose non-negativity as a constraint when solving the sparse representation problem above, where the \(\ell_0\) pseudonorm is relaxed to the \(\ell_1\)-norm.)

Regardless of using either k-NN or the sparse approximation approach, the authors finally create the binary label vector for \(\vx_j\) by

$$[\vy_j]_l = \begin{cases} 1, & [\vf_j]_l \ne 0 \\ 0, & [\vf_j]_l = 0 \end{cases}, \quad l = 1, \ldots, 12$$

which describes which digit classes could be present.
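In code, with toy values of my own, the score \(\vf_j = \MY|\vs_j|\) and its binarization are one line each:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = (rng.random((12, 200)) < 0.2).astype(int)           # binary labels of 200 exemplars
s = np.zeros(200); s[[5, 42, 117]] = [0.8, -0.3, 0.1]   # a sparse solution

f = Y @ np.abs(s)               # f_j = Y|s_j|, a 12-dimensional score
y = (f != 0).astype(int)        # binary indicator of which classes could be present
```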

(However, a few sentences later the authors say that they leave the sparsely created features, e.g., \(\vf_j\), as positive weights, and do not convert them to binary as they do for the k-NN features. Why? I think this could make the k-NN features much less specific.)

Finally, for an entire utterance of length \(W\) feature vectors, we create a \(12 \times W\) matrix of these features \(F\). Within this matrix the authors perform path finding, according to one of two paradigms: 1) “word-based,” where each feature vector corresponds to a word; 2) “state-based,” where each feature vector corresponds to a specific state in a word. While the word-based approach is straightforward, the authors do not describe how they perform the state-based approach, e.g., how are the binary (or not) features associated with a particular state in a particular word? Anyhow, the authors detail how they have modified the Viterbi algorithm to accommodate their particular problem.
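For reference, a standard Viterbi recursion over such a \(12 \times W\) score matrix looks like the following (this is the textbook algorithm with an invented transition matrix, not the authors' modified version):

```python
import numpy as np

def viterbi(F, T):
    """Best state path through an (n_states, W) score matrix F,
    given a log-transition matrix T of shape (n_states, n_states)."""
    n_states, W = F.shape
    delta = F[:, 0].copy()                 # best score ending in each state
    back = np.zeros((n_states, W), dtype=int)
    for t in range(1, W):
        scores = delta[:, None] + T        # scores[i, j]: move from state i to j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores[back[:, t], np.arange(n_states)] + F[:, t]
    path = np.zeros(W, dtype=int)          # backtrack from the best final state
    path[-1] = int(np.argmax(delta))
    for t in range(W - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path

# toy usage: 12 classes, 20 frames, mild preference for self-transitions
rng = np.random.default_rng(0)
F = rng.random((12, 20))
T = np.full((12, 12), -1.0); np.fill_diagonal(T, 0.0)
path = viterbi(F, T)
```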

In their experiments, the authors use as features the 23-band Mel-frequency log power spectra, starting at 100 Hz, and with a time resolution of 10 ms. (What was the window size?) Their dataset is a clean connected spoken digit set that was automatically labeled using an HMM (I think). The set of exemplars was created by randomly selecting 16,000 windows (of what length?) from a training set. (Were all pieces of all digit classes equally represented?) The authors state that they used six different window lengths (all with 10 ms resolution?) from 1 to 50 “frames.” (How long is a frame? 1 sample?) To create the data features, they use either 30 nearest neighbors, or find a 30-sparse solution. (Why 30?) It is not clear how large the test set is in relation to the subset of exemplars, or how many trials they ran. Finally, the results show that the best performance is found for state-based word models with large sets of exemplars found from sparsely-built middle-sized “frames.” The largest accuracy we see is 98.2%, compared to 99.5% using HMMs and other features. (What was the precision?)

I have several concerns about this approach.

First of all, the dictionary of exemplars is translation variant, which can significantly affect the performance of any approach to sparse representation. Consider, for instance, the plosive “t” in “eight”. Depending on the analysis window size, this will be well-localized in time. Shouldn’t we then consider many possible shifts of this sound, rather than just some arbitrary ones?

Second, the features the authors use are not timbre invariant. Why not use cepstra to better capture the characteristics of the formants? Are the high accuracies we observe here due to homogeneity in the test and training sets? It seems to me that the dictionary could be significantly reduced in size if cepstra were used.

Third, some words have similar sounds, like the “f” in “four” and “five.” So why should we label the “f” sound as belonging to one of those classes and not the other? For this reason it is clear to me that a state-based model will perform better than a word-based model for shorter durations of observation.

And finally, why should we consider each feature vector to be composed of 30 exemplar feature vectors?

Anyhow, I am interested to see how this approach performs with speech in a difficult environment. I would also like to see whether there are any real benefits to using \(\ell_1\) minimization to build the sparse representation over a simpler iterative approach, such as matching pursuit (MP) or orthogonal MP. (I predict that there will not be much.)
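For the record, plain matching pursuit over a unit-norm dictionary is only a few lines (my own sketch, with synthetic data):

```python
import numpy as np

def matching_pursuit(A, x, n_iter=30):
    """Greedy sparse approximation: at each step, subtract the projection of the
    residual onto the most correlated unit-norm dictionary column."""
    r = x.copy()
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        c = A.T @ r                        # correlations with the residual
        k = int(np.argmax(np.abs(c)))
        s[k] += c[k]                       # columns are unit norm
        r -= c[k] * A[:, k]
    return s

# toy usage: the signal is exactly one dictionary column
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200)); A /= np.linalg.norm(A, axis=0)
x = A[:, 7] * 1.0
s = matching_pursuit(A, x, n_iter=10)
```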