# A Funny Thing Happened on the Way to the Computer

In the classic paper cited often as “Mallat and Zhang” (S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. Signal Process., vol. 41, pp. 3397-3415, Dec. 1993), the authors present a mathematically simple way to proceed in calculating the inner products of the new intermediate residual signal $$\vr(n+1)$$ with all the dictionary elements given the previous residual signal $$\vr(n)$$ and the selected atom $$\vh_n$$. We can thus find the next atom simply by $$\vh_{n+1} := \arg \max_{\vd \in \mathcal{D}} | \langle \vr(n+1), \vd \rangle | = \arg \max_{\vd \in \mathcal{D}} |\langle \vr(n), \vd \rangle – \langle \vr(n), \vh_n \rangle \langle \vh_n, \vd \rangle|.$$ This comes about because we define the $$n$$th-order residual signal as $$\vr(n+1) := \vr(n) – \langle \vr(n), \vh_n \rangle \vh_n.$$ The implication of this is that all one needs to do in an implementation of matching pursuit (MP) to find each atom is to first, compute and store the set of inner products $$\{ \langle \vr(0), \vd \rangle : \vd \in \mathcal{D}\}$$, and perhaps the dictionary Gramian $$\MG := \MD^H\MD$$; then after each atom selection, update the set of inner products of the new residual and each dictionary element by 1 multiply, 1 addition, and 1 table-lookup into $$\MG$$. Thus, we have significantly reduced the computational complexity of the already computationally burdensome MP by avoiding the direct computation of the set $$\{ \langle \vr(n), \vd \rangle : \vd \in \mathcal{D}\}$$ at each iteration.

On paper at least; and until you attempt to implement MP yourself using large (i.e., useful) dictionaries for high-dimensional signals with the complexity of audio data.

In regards to research in audio signal processing, this particular optimization (in particular precomputing and storing the dictionary Gramian) has been referenced numerous times. For instance, in 1997

1. M. M. Goodwin and M. Vetterli, “Atomic decompositions of audio signals,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, New York), pp. 4-8, Oct. 1997

and in 1999 for a dictionary of damped sinusoids

1. M. M. Goodwin and M. Vetterli, “Matching pursuit and atomic signal models based on recursive filter banks,” IEEE Trans. Signal Process., vol. 47, pp. 1890-1902, July 1999

and in 2004, 2005, and 2006

1. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, D. Martinez-Muñoz, and F. Lopez-Ferreras, “Transient modeling by matching pursuits with a wavelet dictionary for parametric audio coding,” IEEE Signal Process. Lett., vol. 11, pp. 349-352, Mar. 2004.
2. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, D. Martinez-Muñoz, and J. Curpián-Alonso, “New matching pursuit based sinusoidal modelling method for audio coding,” IEEE Proc. – Visual Image Signal Processing, vol. 151, no. 1, pp. 21-28, 2004.
3. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, J. C. Cuevas-Martínez, and F. López-Ferreras, “Adaptive signal models for wide-band speech and audio compression,” in Pattern Recognition and Image Analysis, vol. 3523/2005, pp. 571-578, Springer Berlin, 2005.
4. P. Vera-Candeas, N. Ruiz-Reyes, J. C. Cuevas-Martínez, M. Rosa-Zurera, and F. López-Ferreras, “Sinusoidal modeling using perceptual matching pursuits in the bark scale for parametric audio coding,” IEE Proc.-Visual Image Signal Process., vol. 153, pp. 431-435, Aug. 2006.

and most recently in 2010

1. N. Ruiz-Reyes and P Vera-Candeas, “Adaptive signal modeling based on sparse approximations for scalable parametric audio coding,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 447-460, Mar. 2010.

But this optimization is completely disregarded in the documentation of the de facto standard implementation of MP for audio signals, MPTK (S. Krstulovic and R. Gribonval, “MPTK: Matching pursuit made tractable,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 3, (Toulouse, France), pp. 496-499, Apr. 2006). Why?

The reason is patently obvious: when creating a multiresolution and overcomplete dictionary that is useful for audio decomposition (e.g., one that provides shift-invariant decompositions), the dictionary will be huge (hundreds of millions), and thus the Gramian of the dictionary will be that size squared. At such sizes it does not make sense to precompute and store the set of inner products of all pairs of the atoms — even when the vast majority of these inner products could be zero (which increases as the coherence decreases). The memory requirements and access time outweighs the benefit. It may be that there exists a closed-form solution for the inner product of two real discrete Gabor atoms, as shown in Mallat and Zhang, or other special atom shapes. But there is no such formulation for dictionaries of learned atoms. Instead, the frequency-domain approach taken in MPTK is, I believe, one great way to reduce the computational complexity of MP for audio data using shift-invariant scale-time-frequency dictionaries; and it is even amenable to implementing orthogonal MP (B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst, “Local orthogonal greedy pursuits for scalable sparse approximation of large signals with shift-invariant dictionaries,” in Proc. Signal Process. Adaptive Sparse Structured Representations, (St. Malo, France), 2009).

But I wonder if there is a way to take advantage of the structure and sparsity of the dictionary Gramian matrix to make the “only-on-paper” optimization above a possibility…

## 2 thoughts on “A Funny Thing Happened on the Way to the Computer”

1. Bob,
Indeed, the Gramian of a dictionary can be sparse, with a minimum sparsity of $$1$$ out of $$N$$ for an orthogonal dictionary of $$N$$ atoms (coherence (Donoho) $$\mu = 0$$) spanning $$R^N$$. If we take the union of two orthogonal bases both spanning $$R^N$$, e.g., sines and spikes providing a minimal increase in coherence of $$\mu = 1/\sqrt{N}$$, then the sparsity of the Gramian will be greatly diminished to less than $$N+1$$ out of $$2N$$. If we instead consider the union of two Gabor bases of different scales, both spanning $$R^N$$, then the Gramian will be more sparse than this by virtue that the atoms are time-localized, but the dictionary will have a higher coherence ($$\mu \gg 1/\sqrt{N}$$). Maybe then the coherence has nothing so definite to say about the sparsity of the Gramian… but I do see what you are about to say: potato chips and salsa. I was going to say it too. ;)