Algorithm Power Hour: Regularized Orthogonal Matching Pursuit

This algorithm power hour is devoted to Regularized Orthogonal Matching Pursuit (ROMP), presented in the paper: D. Needell and R. Vershynin, “Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit,” IEEE J. Selected Topics in Signal Process., vol. 4, pp. 310-316, Apr. 2010.

Given an $$s$$-sparse vector $$\vx \in \mathcal{R}^N$$,
we sense it by a matrix $$\mathbf{\Phi} \in \mathcal{R}^{m\times N}$$ with $$N > m$$,
which produces the measurement vector $$\vu = \mathbf{\Phi}\vx$$.
Define the index of the columns of $$\mathbf{\Phi}$$ as $$\Omega = \{1, 2, \ldots, N\}$$.
ROMP proceeds in the following way.
Given the index set $$\Omega_k \subset \Omega$$ at iteration $$k$$,
and defining $$\mathbf{\Phi}_{\Omega_k}$$ as the columns from $$\mathbf{\Phi}$$ indexed by $$\Omega_k$$,
ROMP augments this set $$\Omega_{k+1} = \Omega_k \cup J^*$$
where $$J^* \subset \Omega$$ is a set of indices that collectively satisfy an adaptive threshold. (This is not described clearly in the paper, or the arxiv paper, so I am translating the code.
Also, from where does this regularization come? Why 1/2?)

First, ROMP defines the set
$$I := \{i \in \Omega : |[\mathbf{\Phi}^*\vr_k]_i| > 0\}$$
where $$\vr_k$$ is the orthogonal projection of $$\vx$$ onto the range of $$\mathbf{\Phi}_{\Omega_k}$$.
Obviously, $$I\cap\Omega_k = \emptyset$$.
Next, ROMP finds the set of disjoint sets $$\mathcal{J} = \{J_1, J_2, \ldots\}$$ where
\begin{align} J_1 & := \{ j \in I : | [\mathbf{\Phi}^*\vr_k]_j | \ge \frac{1}{2} \max_{i\in I} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ J_2 & := \{ j \in I\backslash J_1 : | [\mathbf{\Phi}^*\vr_k]_j | \ge \frac{1}{2} \max_{i\in I\backslash J_1} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ J_3 & := \{ j \in I\backslash J_1\cup J_2 : | [\mathbf{\Phi}^*\vr_k]_j | \ge \frac{1}{2} \max_{i\in I\backslash J_1\cup J_2} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ \vdots \end{align}
Each $$J_i \in \mathcal{J}$$ describes the indices that contain a similar amount of “transform energy”, which Needell and Vershynin call “comparable coordinates.”
I see this as a type of adaptive thresholding.
Take all magnitude projections,
order them from biggest to largest,
then find the last index that has a magnitude above half the largest magnitude.
That is one set.
Then start from the index not in that set that has the greatest remaining magnitude, find the last index that has a magnitude half that size, and that is another set.
Continue.
From these sets, ROMP now chooses the best set of indices by the criterion
$$J^* := \arg \max_{J \in \mathcal{J} } \sum_{j \in J} | [\mathbf{\Phi}^*\vr_k]_j |^2.$$
ROMP then updates the residual, and repeats the steps above
until either $$|| \vr_k ||_2$$ is below a threshold,
or a particular number of iterations have occurred,
i.e., when $$|\Omega_k| = s$$.
For initialization, $$\Omega_0 = \emptyset$$, and $$\vr_0 = \vu$$.

A few things worry me.
First, when the dictionary is not orthonormal, ROMP is not computing the real energy in the transform domain. Thus, two atoms that are close to the signal will have large projection magnitudes; but why should both atoms be included in the representation?
Second, in the code I see that all sets in $$\mathcal{J}$$ are disjoint.
Why shouldn’t there be some overlap, e.g.,

w = 1;
best = -1;
J0 = zeros(1);
counter = 1;
while w <= n
first = Jvals(w);
firstw = w;
energy = 0;
while ( w = 1/2 * first )
energy = energy + Jvals(w)^2;
w = w+1;
end
if energy > best
best = energy;
J0 = J(firstw:w-1);
end
counter = counter + 1;
w = counter;
end


w=1;
best = -1;
J0 = zeros(1);
while w <= n
first = Jvals(w);
firstw = w;
energy = 0;
while ( w = 1/2 * first )
energy = energy + Jvals(w)^2;
w = w+1;
end
if energy > best
best = energy;
J0 = J(firstw:w-1);
end
end


When I try this change, the empirical phase transition looks the same at low problem indeterminaces, but worse at others.
In the figure below, I compare the empirical phase transitions of ROMP with and without my “correction.”
My sparse signals have an ambient dimension $$N=400$$,
with non-zero elements distributed Bernoulli in $$\{-1,1\}$$ over a variety of indeterminaces and sparsities. I average the results from 50 realizations.
(Notice the y-axis is not scaled to [0, 1].)

I wonder if this is coming from not choosing the atom with the largest projection. Digging around a bit more, I tried the following change, where I am now comparing the squared coefficients:

w=1;
best = -1;
J0 = zeros(1);
while w <= n
first = Jvals(w);
firstw = w;
energy = 0;
while ( w = 1/2 * first^2 )
energy = energy + Jvals(w)^2;
w = w+1;
end
if energy > best
best = energy;
J0 = J(firstw:w-1);
end
end


This produces the results below:

What if we were to select subsets based on the following
\begin{align} J_1 & := \{ j \in I : | [\mathbf{\Phi}^*\vr_k]_j | \ge \beta \max_{i\in I} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ J_2 & := \{ j \in I\backslash J_1 : | [\mathbf{\Phi}^*\vr_k]_j | \ge \beta \max_{i\in I\backslash J_1} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ J_3 & := \{ j \in I\backslash J_1\cup J_2 : | [\mathbf{\Phi}^*\vr_k]_j | \ge \beta \max_{i\in I\backslash J_1\cup J_2} | [\mathbf{\Phi}^*\vr_k]_i | \} \\ \vdots \end{align}
When $$\beta = 1$$, then we have OMP.
When $$\beta = 1/2$$, we have ROMP.
What about $$\beta = 3/4$$?
Below we see those results:

And $$\beta = 1/3$$?
Below we see those results:

So, what’s so special about $$\beta = 1/2$$?
I also tried $$\beta = 1/3$$, but comparing the squared magnitude projections?Below we see those results:

Anyhow, it seems to me that ROMP may have good guarantees, but it requires tuning, and is not performing as well as most of the other algorithms I am currently studying.