# Greedy Methods, Dynamic Range, and Coherence

Warning: I go on and on, without any real direction, on various topics. Finally, I present some interesting results, which may or may not be related to something I have overlooked in my code.

Commenter Phil Schniter sent me the following explanation of my results from the past few weeks, with an insightful cross-disciplinary link:

Greedy methods are said to work better when there is a large dynamic
range among the (active) coefficient magnitudes. Intuitively, if the
largest coefficient is easy to spot, then a greedy method can identify
it and subtract its contribution from the measurements. Then, if the
largest coefficient in the residual is easy to spot, the greedy method
will next identify it, and so on. But if, instead, there are multiple
large coefficients with similar magnitudes, a greedy method can get
confused. This behavior has been rigorously analyzed in the context of
multiuser communications under the guise of “successive interference
cancellation for CDMA”.

I think this points in the right direction, but misses the importance of dictionary coherence. A greedy algorithm can make a bad choice when several atoms in the dictionary are highly correlated, yet only one of them is involved in the original signal. Of course, it only takes one of these errors to render a recovery unsuccessful. As I am not incorporating noise in my simulations, this points to the “EXACT-SPARSE” condition in J. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Info. Theory, vol. 50, pp. 2231-2242, Oct. 2004.

In this paper, Tropp shows “the best possible” performance for OMP for all $$m$$-sparse signals, i.e.,
OMP will recover exactly a unique $$m$$-sparse signal when (Theorem 3.1, sufficient condition)
$$\max_{\vd \in \MD\backslash\MH(m)} \left | \left| [\MH^H(m)\MH(m)]^{-1}\MH^H(m) \vd \right |\right|_1 < 1$$
where $$\MD$$ is our fat dictionary (or sensing matrix),
$$\MH(m)$$ is the optimal set of $$m$$ atoms that represent our measurement precisely (a skinny matrix),
and $$\MD\backslash\MH(m)$$ are the remaining dictionary atoms.
(Forgive the extreme abuse of notation!)
This says, if the maximum sum of the magnitude projection weights is small enough, or if the remaining atom most similar to the ones we have selected is not that similar, then OMP will recover the $$m$$-sparse signal.

To make this theorem useful (how do we know $$\MH(m)$$?), Tropp finds the following test (Theorem 3.5):
The exact recovery condition above holds for every $$m$$-sparse signal when
$$\mu(m-1) + \mu(m) < 1$$
where $$\mu(m)$$ is the $$m$$-th order cumulative coherence of the dictionary defined
$$\mu(m) := \max_{\vh \in \MH(m)} \max_{\vd_i \in \MD\backslash\MH(m)} \sum_{i=1}^m | \langle \vd_i, \vh \rangle | \le m \mu(1)$$
for $$m = 1, 2, \ldots$$.

The proof of this goes essentially like Phil suggests above: one after another, the projections of each residual onto the correct atoms remaining have to be larger than on the incorrect atoms remaining, and so on until $$m$$ atoms have been recovered after $$m$$ iterations. (Is there any turning back for OMP? Once it has selected an incorrect atom, can OMP remove it by nulling its weight and the select the correct one? My experiments are showing that if I run OMP to at least the number of measurements — instead of stopping after $$m$$ atoms have been found for a signal known to be $$m$$-sparse — I have a much higher success rate. Note to self: investigate my test for successful recovery for any funny business.)

From these, Tropp shows that with OMP (and BP) we can recover any $$m$$-sparse signal in a dictionary $$\MD$$ as long as (Corollary 3.6):
$$m < 0.5 \left (1 + \frac{1}{\mu(1)} \right ).$$
Nothing is mentioned here about the prior distribution of the sparse coefficients; corollary 3.6 says verbatim, "any superposition of $$m$$ atoms from $$\MD$$."
Thus, I predict that one of these conditions is failing for my dictionary and OMP, and that such problems with coherence are exacerbated by the sparse vector being distributed in a particular way.
I am finding for my 50×250 measurement matrices (or dictionaries) that their coherence is in the neighborhood of $$\mu(1) \approx 0.55$$.
This means that I am guaranteed exact recovery by OMP only when sensing a 1-sparse signal! Woohoo! Not a very useful result. But hey, these are sufficient conditions, and not necessary conditions.

Phil’s insight into OMP-like techniques used in collision detection multiple access (CDMA) communications is greatly appreciated; I will look into that literature soon.

Now for some results of experiments I ran today with 13-sparse vectors (a little under 50% recovery rate for BP) distributed in different ways sensed by the same 50×250 sensing matrix (elements iid standard normal, columns normalized).

Here we see the recovery rates of 2000 independent trials. OMP is run at most 100 iterations.
I used the following distributions for generating the non-zero elements (the positions of which are uniformly distributed over 250 samples) of the sparse vector:

• N1 ~ standard normal
• N2 ~ zero mean Gaussian with variance 4
• N3 ~ zero mean Gaussian with variance 1/4
• N4 ~ sum of two unit-variance Gaussian, one with mean +3, one with mean -3
• L1 ~ Laplacian with $$\lambda = 1$$
• L2 ~ Laplacian with $$\lambda = 1/10$$
• L3 ~ Laplacian with $$\lambda = 10$$
• U1 ~ uniform over $$[-2, -1] \cup [1, 2]$$
• U2 ~ uniform over $$[-1, 1]$$
• U3 ~ uniform over $$[-1, -0.005] \cup [0.005, 1]$$
• U4 ~ uniform over $$[-1, -0.1] \cup [0.1, 1]$$
• D1 ~ equiprobable Bernoulli on $$\{-1, 1\}$$
• D2 ~ discrete uniform on $$\{-2, -1, 1, 2\}$$

BP stays remarkably stable across all these distributions.
However, we see OMP has significant trouble with N4, U1, D1, and D2.
The lack of performance change in response to variance (N1-N3, L1-L3)
tells me that this is not an issue with my success criteria, i.e., $$||\vs – \vs’||_2^2 \le 10^{-10}$$.
The recovery rates for N4 and U1 are still strange to me.
Why do these bimodal distribution cause problems?
U3 and U4 are also bimodal, but not as “strongly” as U1.
On top of this, I am finding no significant differences in the dynamic ranges between the successful and failed signal reconstructions, or high coherences between the atoms in $$\MH(m)$$ above (plots to come).
And can we get higher success rates for some other distribution (other than all probability mass at 0, ha ha)?

This requires a deeper reading of Tropp, the CDMA literature, and the odd answers in the back of the book (but these pages seem to be missing!).