BLS edit (02122010): added description of \(n\) parameter in algorithm.

Hello and welcome to Paper of the Day (Po’D): Probabilistic Matching Pursuit Edition. I learned of today’s paper from Igor Carron and his excellent, check-more-often-than-the-New-York-Times blog, Nuit Blanche: A. Divekar and O. K. Ersoy, “Probabilistic Matching Pursuit for Compressive Sensing,” Electrical and Computer Engineering Technical Report, Purdue University, May 2010. They have made available some attendant code as well.

This work is interesting for a couple of reasons. First of all, Matching Pursuit (MP) is what I like to call a “SCLIDMSDoHCD” (pronounced “sclidmsdohcd”): Straightforward and Computationally-light Iterative Descent Method for Signal Decomposition over Highly Coherent Dictionaries. Do you want to find a multiresolution time-frequency approximation of a high-dimensional audio signal that has the structurally-diverse content that drives orthogonal transforms crazy? Then MP is the only way to go; and MPTK is the only tool out there up for the heavy lifting. However, MP is a 10-kilo sledgehammer when what we really need is a 200-gram ball peen, maybe also equipped with a laser.

A usual day at the lab running MP on the latest song by Lady Ga\(^T\)Ga. (She truly deserves to win a “Gramian.”)

MP, like Mr. T, pities the fool that is non-compressible signals, and still busts all up in there without regard for blueprints, or appropriate safety precautions. In the end, we may be left with a nice flat spectrum, but we will be surrounded by splinters and shards of useless material that aren’t even fit for an incinerator in Baton Rouge. (MP, like Mr. T, can bring a true environmental disaster, allegorically.) Excusing my dramatic license with a lifeless signal processing algorithm, MP, as well as its highfalutin cityslickin cousins, orthogonal MP (OMP), and optimized OMP (OOMP — she is the one with the Mercedes, sauna, and room for a pony), frequently make egregious errors in finding good descriptions of any but the most trivial signals. And my work shows how MP will always be prone to problems. (Less so for OMP and OOMP, but still, as it “runs in the family,” they have problems too.) Hence, I am always interested in ways to improve MP, OMP, and OOMP, and make them act like the precision instrument I need.

The second reason I am interested in this particular work is its use of the notion of probability. With regards to how MP operates, I personally believe that MP are unable to do so because our education like such as in, South Africa and, the Iraq and everywhere like such as. Surely, a better way to decompose a signal over an overcomplete dictionary is by using information in the residual *and* the growing signal model. That is in a sense what I attempt to do with “interference adaptation” (B. L. Sturm and J. J. Shynk, “Sparse approximation and the pursuit of meaningful signal models with interference adaptation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 461-472, Mar. 2010). So I am intrigued by how this Probabilistic MP performs.

There have been other attempts at incorporating probability and/or non-determinism into MP:

- S. E. Ferrando, E. J. Doolittle, A. J. Bernal, and L. J. Bernal, “Probabilistic matching pursuit with Gabor dictionaries,” Signal Processing, vol. 80, no. 10, pp. 2099-2120, Oct. 2000.
- P. J. Durka, D. Ircha, and K. J. Blinowska, “Stochastic time-frequency dictionaries for matching pursuit,” IEEE Trans. Signal Process., vol. 49, pp. 507-510, Mar. 2001.

In both of these works, the authors argue for an averaging of many decompositions found by randomizing the parameters in the dictionary. The first proposes it as a way to denoise a signal; and the second proposes it as a way to remove the presence of artifacts introduced by MP. But it doesn’t appear to me, on first pass, that either attempts to do what the authors of today’s Po’D are doing: incorporating the notion of probability into the atom selection criterion of OMP.

Now, having completed that extremely long-winded introduction, I will formulate the algorithm, PrOMP, in today’s Po’D (or what I think it is), using my notation (which is completely different from that used by the authors, but I think is much clearer).

Let us define the \(n\)th-order *representation* of a signal \(\vx \in \mathcal{R}^K\) by \(\mathcal{X}_n := \bigl \{ \MH(n), \va(n), \vr(n) \bigr \}\). This embodies the \(n\)th-order signal *model* \(\vx = \MH(n)\va(n)+\vr(n)\). The \(n\) elements of the *representation basis*, \(\mathcal{H}_n\), expressed as the \(n\) columns in the matrix \(\MH(n)\), are \(n\) atoms selected from the overcomplete dictionary of \(N\) atoms \(\mathcal{D}_N := \{ \vd_i \in \mathcal{R}^K : || \vd_i ||_2 = 1\}\). “Overcomplete” means not only that \(N > K\), but also that for any \(\vx \in \mathcal{R}^K\) there always exists a way to represent it using elements of \(\mathcal{D}_N\) such that the norm of the residual is 0.

Now, OMP updates \(\mathcal{X}_n\) according to

$$
\mathcal{X}_{n+1} = \left \{
\begin{array}{@{}r@{\;}l@{}}
\MH(n+1) = & [ \MH(n) | \vh_{n} ], \\
\va(n+1) = & \MH^\dagger(n+1)\vx, \\
\vr(n+1) = & \vx - \vx_{\mathcal{H}_{n+1}}
\end{array}
\right \},
$$

where the pseudoinverse \(\MH^\dagger(n) := \left [ \MH^T(n)\MH(n) \right ]^{-1}\MH^T(n)\), and \(\vx_{\mathcal{H}_n} := \MH(n)\MH^\dagger(n)\vx\) is the orthogonal projection of the signal \(\vx\) onto the span of \(\mathcal{H}_n\).

The atom selection criterion of OMP is defined at iteration \(n\) as

$$
\vh_{n} = \arg \max_{\vd \in \mathcal{D}_N} | \langle \vr(n), \vd \rangle |.
$$

OMP proceeds like this until the residual norm is “small enough for tax purposes,” or we have performed enough iterations.
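For concreteness, the update and selection rule above can be sketched in a few lines of NumPy. This is my own sketch (not MPTK, and not the authors' attendant code), with `lstsq` standing in for the explicit pseudoinverse:

```python
import numpy as np

def omp(x, D, n_max, tol=1e-10):
    """Orthogonal Matching Pursuit sketch.

    x : (K,) signal; D : (K, N) dictionary with unit-norm columns.
    Returns the selected atom indices, the coefficients, and the residual.
    """
    r = x.copy()
    support = []                                    # indices of selected atoms
    for _ in range(n_max):
        # atom selection: largest magnitude correlation with the residual
        corr = D.T @ r
        support.append(int(np.argmax(np.abs(corr))))
        H = D[:, support]                           # representation basis H(n+1)
        a, *_ = np.linalg.lstsq(H, x, rcond=None)   # a(n+1) = pinv(H) x
        r = x - H @ a                               # residual after orthogonal projection
        if np.linalg.norm(r) <= tol:                # "small enough for tax purposes"
            break
    return support, a, r
```

For a trivially easy dictionary (the identity), this recovers the support of a sparse signal in as many iterations as there are nonzero terms; the interesting (and failure-prone) cases are of course the coherent ones.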

To make PrOMP, the authors change two things in OMP. First, they redefine the atom selection criterion in terms of a probability. Second, they prescribe several re-initialized PrOMP decompositions (with the new atom selection criterion) until the residual error decreases sufficiently, which reminds me a little of simulated annealing. The authors say we have found the solution, i.e., the “true components” in \(\vx\), when \(||\vx - \vx_{\mathcal{H}_{n}}||_2 = 0\). (Within compressive sensing this claim makes much more sense than in sparse representation of audio signals, which should make no claim as to the real existence of said atoms in the signals.)

Thus, instead of the atom selection criterion of OMP, the authors propose a maximum likelihood rule:

$$
\vh_{n} = \arg \max_{\vd \in \mathcal{D}_N} P[ \vd \in \mathcal{H}^* | \vr(n) ]
$$

where the set \(\mathcal{H}^*\) contains the “true components” of the signal, and they define the probability measure on the dictionary simply as

$$
P[ \vd \in \mathcal{H}^* | \vr(n) ] := \begin{cases}
0, & | \langle \vr(n), \vd \rangle | = 0 \\
(1-\epsilon)/m, & | \langle \vr(n), \vd \rangle | \ge f_m \\
\epsilon/(N-n-m), & 0 < | \langle \vr(n), \vd \rangle | < f_m
\end{cases}
$$

where \(f_m\) is the \(m\)th-largest magnitude correlation of the dictionary with the residual, and \(\epsilon > 0\) is small. (With this set-up, there is no need for all the crazy index sets the authors use.) We must subtract \(n\) because we have already selected \(n\) atoms. So what this probability measure implies is that in the dictionary there are three groups of atoms: 1) atoms with zero probability; 2) \(m\) highly correlated atoms with the same high probability; 3) \(N-n-m\) weakly correlated atoms with the same low probability. What is not clear from this article is which atom is selected when there are several with the same high probability. I assume (and I will soon look in the attendant code) that they select an atom randomly from the \(m\) atoms with the highest magnitude correlations, which would thus make each PrOMP decomposition non-deterministic. Finally, this algorithm is run several times from the same initialization, each time with the same number of iterations \(n_{\max}\) (which is also the sparsity of the solution), until the residual norm is zero.
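If my reading is right, a single PrOMP selection step could be sketched like this in NumPy. All names here are my own, and the random draw among the \(m\) high-probability atoms is my assumption about how ties are resolved, not something the paper states:

```python
import numpy as np

def promp_select(r, D, support, m=8, eps=0.01, rng=None):
    """One PrOMP atom selection step, as I read it: draw an atom index at
    random according to the three-group probability measure defined above.
    (My own sketch, not the authors' code.)"""
    rng = np.random.default_rng() if rng is None else rng
    N = D.shape[1]
    c = np.abs(D.T @ r)            # magnitude correlations with the residual
    c[support] = 0.0               # already-selected atoms get probability 0
    order = np.argsort(-c)
    top, rest = order[:m], order[m:]
    rest = rest[c[rest] > 0]       # zero-correlation atoms get probability 0
    p = np.zeros(N)
    p[top] = (1.0 - eps) / m       # the m highest correlations: high probability
    if rest.size:                  # the remaining ~N-n-m atoms: low probability
        p[rest] = eps / rest.size
    p /= p.sum()                   # renormalize in case a group is empty
    return int(rng.choice(N, p=p))
```

Note this sketch assumes at least \(m\) atoms have nonzero correlation with the residual; the authors' index sets presumably handle the edge cases more carefully.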

The authors show averaged results of several experiments with PrOMP for different values: \(K = 800\), \(N \in \{900, 1000, 1100\}\), \(n_{\max} \in \{400, 450, 500, 550\}\). They also set \(\epsilon = 0.01\) and \(m = 8\). The dictionaries are constructed by concatenating a Dirac basis (the \(K\times K\) identity) with a Bernoulli random dictionary of \(N-K\) atoms having elements identically distributed over \(\{\pm 1/\sqrt{K}\}\). (50 realizations of the dictionaries were used.) Their results show that most of the time the “true components” are recovered, even when more than half the dictionary is in the signal; and the mean number of times it is necessary to run PrOMP is nearly one. This latter result is very strange to me, considering that PrOMP may be selecting randomly from a set of 8 atoms with the highest magnitude correlations, but with magnitude correlations very unevenly spread; and also since I know OMP is just as prone as MP to selecting atoms poorly. However, the mean coherence of this dictionary is rather small, I think not much larger than \(1/\sqrt{K}\). So in their case the propensity of OMP to select bad atoms might be less problematic.
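Their dictionary is easy to reproduce. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def dirac_bernoulli_dictionary(K, N, rng=None):
    """Concatenate a Dirac basis (the K x K identity) with N-K Bernoulli
    atoms whose entries are drawn uniformly from {+1/sqrt(K), -1/sqrt(K)}."""
    rng = np.random.default_rng() if rng is None else rng
    B = rng.choice([-1.0, 1.0], size=(K, N - K)) / np.sqrt(K)
    return np.hstack([np.eye(K), B])   # every column has unit norm
```

Each Bernoulli atom has \(K\) entries of magnitude \(1/\sqrt{K}\), so all \(N\) columns are unit-norm by construction, matching the dictionary \(\mathcal{D}_N\) defined earlier.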

When it comes to compressive sensing, or any other problem where we know a signal is an exact linear combination of a known number of terms, I can see the utility of this approach to discovering that combination without using convex optimization, and hence for improving the probability of recovery when sparsity is too low for \(\ell_1\)-minimization to work. For decomposing a complex audio signal over a highly coherent dictionary, however, where we are not sure of the “correct” model order, I don’t see much value in PrOMP. Even bad atoms have high correlations with the signal, or really, the residual, which might contain errors left over from previously selected atoms. Furthermore, like MP and OMP, PrOMP is still only using the information in the residual, and is not taking into consideration the characteristics of atoms selected before. I predict that PrOMP is a suboptimal approach for sparse representation, but maybe with a little tweaking it might discover better models. Now it is time to experiment.

I am also intrigued by MP-type methods and by probabilistic approaches. As I understand the PrOMP approach, you need to select elements not based on the highest probability in each iteration, but randomly, with each element selected with probability proportional to that measure. Otherwise, the algorithm becomes deterministic and will select the same elements in each iteration, so running it several times does not help, i.e., you would select exactly the same elements again.

Also, why not use the full Bayesian approach and a Gibbs sampler started with the zero vector? Wouldn’t that do something similar?


Hello random commenter,

You may be right. This week I will be looking into it.

I will be looking more into the Bayesian approaches as well.

Cheers.
