# Paper of the Day (Po’D): Enhancing Sparsity by Reweighted $$\ell_1$$-norm Minimization Edition

Welcome to Paper of the Day (Po’D): Enhancing Sparsity by Reweighted $$\ell_1$$-norm Minimization Edition: E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted $$\ell_1$$ minimization,” J. Fourier Analysis Applications, vol. 14, no. 5-6, pp. 877-905, 2008. This paper presents an intriguing way to make $$\ell_1$$-norm minimization favor sparsity more strongly than it already does, at the cost of additional computational complexity. However, with fast convex optimization algorithms at our disposal, this may not be so serious for some applications.

Given the observed vector $$\vy \in \mathcal{R}^M$$ formed by
$$\vy = \bf{\Phi}\vx_0$$
where $$\vx_0 \in \mathcal{R}^K$$ is a $$k$$-sparse vector,
i.e., $$||\vx_0||_0 := | \{i : x_i \ne 0 \} | = k \le K$$,
and $$\bf{\Phi} \in \mathcal{R}^{(M\times K)}$$ has full row-rank,
we would like to recover $$\vx_0$$ even though $$M < K$$,
i.e., there are more unknowns than equations.
In compressed sensing, $$\bf{\Phi}$$ is a {\em measurement matrix},
with entries that are independently and identically distributed.
This matrix “samples” the sparse signal $$\vx_0$$,
and maps it to a lower-dimensional space.
In sparse representation, $$\bf{\Phi}$$ is an {\em overcomplete dictionary}
that efficiently describes our observed vector $$\vy$$.
Its columns are the atoms we use to build up the observed signal.
(We could say $$\bf{\Phi}$$ is really $$\bf{\Phi}\bf{\Psi}$$ where $$\bf{\Psi}$$ is the overcomplete dictionary in which the signal we are observing is at least compressible. In this case, $$\bf{\Phi}\bf{\Psi}$$ is the "sensed dictionary.")
We consider the "best" solution to both of these problems as the sparsest solution,
which is given by the program
$$\min_{\vx \in \mathcal{R}^K} ||\vx||_0 \; \textrm{subject to} \; \vy = \bf{\Phi}\vx$$
which is more difficult to solve than a sudoku puzzle initialized by only half a number.

It has been shown that we can recover with high probability
any solution to the underdetermined system $$\vy = \bf{\Phi}\vx_0$$
having sparsity less than $$\mathcal{O}(M/\log (K/M))$$
by the convex optimization program
$$\min_{\vx \in \mathcal{R}^K} ||\vx||_1 \; \textrm{subject to} \; \vy = \bf{\Phi}\vx$$
as long as the measurement matrix $$\bf{\Phi}$$ satisfies a stability property called the Restricted Isometry Property.
Alternatively, in sparse representation,
this means (I think, and correct me if I am wrong!) that the atoms in $$\bf{\Phi}$$ (or $$\bf{\Psi}$$) are not too “coherent,”
i.e., $$||\bf{\Psi}^T\bf{\Psi} - \MI||_\infty := \max_{ij} | [\bf{\Psi}^T\bf{\Psi} - \MI ]_{ij} | \ll 1$$
(assuming all atoms have unit norm).
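For intuition, this coherence measure takes only a few lines of NumPy to compute; the dictionary below is a made-up toy example of mine, not one from the paper.

```python
import numpy as np

def mutual_coherence(Psi):
    """Largest |inner product| between two distinct unit-norm atoms (columns)."""
    Psi = Psi / np.linalg.norm(Psi, axis=0)  # normalize each atom to unit norm
    G = np.abs(Psi.T @ Psi)                  # magnitudes of the Gram matrix
    np.fill_diagonal(G, 0.0)                 # drop each atom's product with itself
    return G.max()

# Toy dictionary: two orthogonal atoms plus one sitting "in between" them.
Psi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
print(mutual_coherence(Psi))  # 1/sqrt(2), i.e., about 0.7071
```

An orthonormal dictionary has coherence zero; the closer this number is to one, the more two atoms resemble each other, and the harder sparse recovery becomes.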

The authors of this paper show that
we can improve upon the results of this $$\ell_1$$-norm minimization program
so as to reduce the number of measurements $$M$$ that we need,
which is equivalent to increasing the number of atoms in the dictionary
(and consequently increasing its coherence).
Their method, iteratively reweighted $$\ell_1$$-norm minimization (IR$$\ell_1$$M),
is extremely simple, and proceeds as follows.
With an initial weighting matrix $$\MW_0 = \MI$$,
in the $$i$$th step:

1. $$\hat \vx_i \leftarrow \arg \min_{\vx} || \MW_i \vx ||_1 \; \text{subject to} \; \vy = \bf{\Phi}\vx$$
2. $$\MW_{i+1} \leftarrow \text{diag}(|\hat \vx_i| + \epsilon)^{-1}$$
3. $$i \leftarrow i+1$$, and terminate when, e.g., $$i = i_{\max}$$

where $$\epsilon > 0$$ is “wisely” set to avoid local minima.
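The three steps above can be sketched in a few lines of Python, solving each weighted $$\ell_1$$ program as a linear program with scipy.optimize.linprog by splitting $$\vx$$ into nonnegative parts. This is only a sketch: the function names and the fixed iteration count are my own choices, standing in for whatever stopping rule one prefers.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_min(Phi, y, w):
    """Solve min ||diag(w) x||_1 subject to Phi x = y as a linear program.

    Writing x = u - v with u, v >= 0 gives ||diag(w) x||_1 = w.(u + v),
    a linear objective with the linear equality constraint Phi(u - v) = y.
    """
    K = Phi.shape[1]
    c = np.concatenate([w, w])          # objective on the stacked [u; v]
    A_eq = np.hstack([Phi, -Phi])       # Phi u - Phi v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:K] - res.x[K:]        # recover x = u - v

def irl1(Phi, y, eps=0.1, n_iter=5):
    """Iteratively reweighted ell-1 minimization, following steps 1-3."""
    w = np.ones(Phi.shape[1])           # W_0 = I
    for _ in range(n_iter):
        x = weighted_l1_min(Phi, y, w)  # step 1: weighted ell-1 program
        w = 1.0 / (np.abs(x) + eps)     # step 2: small entries get large weights
    return x                            # step 3: stop after i_max iterations
```

The diagonal weighting matrix $$\MW_i$$ never needs to be formed explicitly; the vector of weights is enough.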
The use of this weighting matrix
causes the program to progressively act as if
it is solving the $$\ell_0$$-norm minimization program above.
The authors show that this is equivalent to a
majorization-minimization (MM) algorithm solving the problem
$$\min_{\vx \in \mathcal{R}^K} \bf{1}^T \log(|\vx| + \epsilon) \; \textrm{subject to} \; \vy = \bf{\Phi}\vx.$$
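As I understand the paper's derivation, step 2's weights fall out of majorizing this concave objective: since $$\log(\cdot + \epsilon)$$ lies below its tangent line, at the current iterate $$\hat\vx_i$$ we have, componentwise,
$$\log(|x_k| + \epsilon) \le \log(|[\hat\vx_i]_k| + \epsilon) + \frac{|x_k| - |[\hat\vx_i]_k|}{|[\hat\vx_i]_k| + \epsilon}.$$
Minimizing this upper bound over the feasible set, the terms that do not depend on $$\vx$$ drop out, leaving exactly the weighted program of step 1 with the weights $$(|[\hat\vx_i]_k| + \epsilon)^{-1}$$ of step 2.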
I really like their figure explaining the action of the weighting, shown below.

If we have observed the $$1$$-sparse vector $$\vx_0 = [0, 1, 0]^T$$
using the measurement matrix
$$\bf{\Phi} = \left [ \begin{array}{ccc} 2 & 1 & 1 \\ 1 & 1 & 2 \end{array} \right ]$$
then our observed vector is $$\vy = [1, 1]^T$$.
The set of all solutions to this problem, the feasible set, is given by $$\{\vx \in \mathcal{R}^3 : \bf{\Phi} \vx = [1, 1]^T\}$$, which is the red line in each plot. On the left we see the $$\ell_1$$ unit ball in $$\mathcal{R}^3$$, which has on its surface the set of points $$\{\vx \in \mathcal{R}^3 : ||\vx||_1 = 1\}$$, of which $$\vx_0$$ is a member. Now, the $$\ell_1$$-norm minimization program will find the point of the feasible set with the smallest $$\ell_1$$-norm, which here lies inside the $$\ell_1$$ unit ball (blue dot). This solution is not wrong in the sense that it does in fact have a smaller $$\ell_1$$-norm than $$\vx_0$$, the sparsest solution; but here we clearly see the danger of conflating minimizing the $$\ell_1$$-norm with maximizing the sparsity of the solution. We can avoid this scenario if we properly adjust the action of the measurement matrix, i.e., $$\bf{\Phi}' \leftarrow \bf{\Phi}\MW^{-1}$$, to squeeze the unit ball along those dimensions until the feasible set intersects it at only one point.
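We can check this example numerically. The sketch below again solves the (weighted) $$\ell_1$$ program as a linear program with scipy.optimize.linprog; the "oracle" weights, built from $$\vx_0$$ itself, are my own stand-in for a properly adjusted $$\MW$$.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_min(Phi, y, w):
    """min ||diag(w) x||_1 s.t. Phi x = y, via the split x = u - v, u, v >= 0."""
    K = Phi.shape[1]
    res = linprog(np.concatenate([w, w]),
                  A_eq=np.hstack([Phi, -Phi]), b_eq=y, bounds=(0, None))
    return res.x[:K] - res.x[K:]

Phi = np.array([[2., 1., 1.],
                [1., 1., 2.]])
y = np.array([1., 1.])   # observation of x0 = [0, 1, 0]

# Plain ell-1 (unit weights) picks the blue dot [1/3, 0, 1/3], not x0 ...
x_l1 = weighted_l1_min(Phi, y, np.ones(3))

# ... while weights that squeeze the ball along the first and third axes
# (large weights where x0 is zero) recover x0 exactly.
eps = 0.1
w = 1.0 / (np.array([0., 1., 0.]) + eps)
x_w = weighted_l1_min(Phi, y, w)

print(np.round(x_l1, 4))  # approximately [1/3, 0, 1/3]
print(np.round(x_w, 4))   # recovers [0, 1, 0]
```

Note that $$||\vx_0||_1 = 1 > 2/3$$, so no amount of unweighted $$\ell_1$$ minimization can find $$\vx_0$$ here; only the reweighting changes which feasible point wins.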

The authors provide numerous experimental results showing the benefits of using IR$$\ell_1$$M, for instance, a reduced oversampling factor. Even though the computational complexity is increased, a large performance gain appears even after only a few iterations. In the case of sparse Bernoulli random signals ($$\pm a, a > 0$$), however, we do not see much improvement over the $$\ell_1$$-norm minimization program. They also show that when the measurement vector is embedded in noise, IR$$\ell_1$$M (with the constraint $$||\vy - \bf{\Phi}\vx||_2 < \delta$$) often finds solutions with smaller reconstruction error ($$\ell_2$$-norm).