# Paper of the Day (Po’D): Sparse Decomposition Based on Smoothed $$\ell_0$$-norm Edition

Hello, and welcome to Paper of the Day (Po’D): Sparse Decomposition Based on Smoothed $$\ell_0$$-norm Edition. In the forthcoming weeks, I am preparing for the first part of my scientific mission at INRIA in Rennes, France (details to come). Today’s paper has fantastic results: G. H. Mohimani, M. Babaie-Zadeh, and C. Jutten, “A fast approach for overcomplete sparse decomposition based on smoothed $$\ell$$_0 norm,” IEEE Trans. Signal Process., vol. 57, pp. 289-301, Jan. 2009. In addition to this paper, there is also “Robust SL0” presented in: A. Eftekhari, M. Babaie-Zadeh, C. Jutten, and H. Abrishami Moghaddam, “Robust-SL0 for Stable Sparse Representation in Noisy Settings,” in Proc. ICASSP, Taipei, Taiwan, Apr. 2009. In the interest of extreme brevity, here is my one line description of the work in these papers:

Approach $$\ell_0$$-norm minimization by majorization, et voilà — we find sparse solutions with scalable complexity.

Given some overcomplete dictionary $$\MA \in \mathcal{R}^{n\times m}$$, and some signal $$\vx$$ that can be sparsely represented in $$\MA$$, i.e., $$\vx – \MA\vs = \ve$$ where the number of non-zero elements in $$\vs$$ is small, and the error $$\ve$$ has some desirable properties, e.g., variance smaller than $$\epsilon$$, we want to find the sparse representation $$\vs$$ having the fewest non-zero elements. The problem then could be stated,
$$\vs = \arg \min_\vs || \vs ||_0 \; \text{subject to} \; || \vx – \MA\vs ||_2^2 \le \epsilon$$
where $$|| \vs ||_0$$ counts the number of non-zero elements in $$\vs$$.
Since strict sense sparsity makes this impossible to solve in good company, i.e., with methods that are not combinatorial, e.g., convex optimization, Chen et al. 1993 suggested replacing it with a relaxed measure of sparsity, such as $$||\vs||_1$$ (BP).
With that, the problem is solvable in good time;
and the theory of compressed sensing tells us the conditions on $$\vx$$ and $$\MA$$ that should be satisfied in order to have a high probability of finding $$\vs$$ in this way.
However, replacing the zero-norm with the $$\ell_1$$-norm is only one kind of “majorization” — that is, replacing a difficult-to-minimize function with one that is easier to minimize but is at least as large as the other. (For a brief review, see P. Stoica and Y. Selén, “Cyclic minimizers, majorization techniques, and the expectation-maximization algorithm: A refresher,” IEEE Sig. Process. Mag., vol. 21, pp. 112-114, Jan. 2004.)

In this paper, the authors propose using majorization of the zero-norm to find the solution above.
Similar approaches to this problem include, reweighted-$$\ell_1$$, and
FOCUSS (B. D. Rao and K. Kreutz-Delgado, “An affine scaling methodology for best basis selection,” IEEE Trans. Signal Process., vol. 47, pp. 187-200, Jan. 1999),
but the authors show they are much slower than SL0.
Among several candidate functions, the authors here focus on one:
$$||\vs||_0 = m – \lim_{\sigma \to 0} J(\vs; \sigma)$$
where $$s_i$$ is the $$i$$th component of $$\vs$$, and
$$J(\vs; \sigma) := \sum_{i=1}^m \exp \left [ -s_i^2/ 2\sigma^2 \right ].$$
Substituting this above, and setting $$\epsilon = 0$$, the problem becomes dependent on $$\sigma$$:
$$\vs(\sigma) = \arg \max_\vs J(\vs; \sigma) \; \text{subject to} \; \vx = \MA\vs.$$
In their paper on Robust SL0 (RSL0), the authors make a slight adjustment to the SL0 algorithm. Instead, RSL0 now searches each iteration for the solution to
$$\vs(\sigma) = \arg \max_\vs J(\vs; \sigma) \; \text{subject to} \; || \vx – \MA\vs||_2 \le \epsilon$$
where $$\epsilon$$ is the standard deviation of the additive noise in the measurements.

The authors cast this in terms of Lagrange multipliers, and construct an algorithm (SL0) that uses steepest ascent and decreasing variances.
First, SL0 initializes with the minimum $$\ell_2$$-norm solution (which the authors show is equivalent to $$\vs(\infty)$$); and then SL0 constructs a set of $$J$$ decreasing variances (a geometrical sequence) based on the magnitude weights of that solution.
For each of these values, starting with the largest,
SL0 maximizes $$J(\vs; \sigma_j)$$ over $$L$$ steps using steepest ascent for the solutions in the feasible set ($$\vx = \MA\vs$$).
The steepest ascent approach involves soft-thresholding of the values in $$\vs$$ based on $$\sigma_j$$,
a small correction applied to $$\vs$$ (helping remove the small coefficients), and a reprojection back to the feasible set.
Finally, depending on the last $$\sigma_J$$, SL0 arrives at a solution that will be quite close to the unique sparsest solution.

In their analysis of SL0, the authors show that the solution by SL0 converges to the $$\ell_0$$ solution when every $$n\times n$$ submatrix of $$\MA$$ is invertible (and consequently, that the sparsest solution must be unique). The authors also provide a bound on the solution error when there is additive noise in the representation, i.e., $$|| \ve || > 0$$. The caveat with SL0 is that the approximation error depends on the size of $$\sigma_J$$. In order for the error to be small, this last value must be small, which requires many more iterations.

Even with this wrinkle, their experiments convincingly show SL0 to be a great competitor to BP, and BP denoising, not only in terms of speed (up to three orders of magnitude!), but also accuracy, and noise robustness. The authors test sparse vectors with elements distributed normal, with and without additive white Gaussian noise. We see that SL0 performs superbly better than BP for problem sparsities of $$0.25 < \rho < 0.5$$. (I wonder why, in their fig. 6, that the SNR is peculiarly low for small problem sparsities.)

The authors make their code available as well, which I will dutifully integrate into my own work and compare with other approaches. (And from that same page, I see that their 2006 submission to IEEE Signal Processing Letters was rejected! That does not bode well for my recent submission… I will start making a correspondence.)

## 6 thoughts on “Paper of the Day (Po’D): Sparse Decomposition Based on Smoothed $$\ell_0$$-norm Edition”

1. IgorCarron says:

Bob,
I had stronger words for this paper being rejected:
One thing about SL0 I like is how small it is and how it circumvents relying on L1 only. The downside in my view is that sometimes it fails on simple examples so I always use it as a first tool, then if it fails go for the larger codes.
One more thing, in order to get the most out of it (even if it is a little slower), I use the following parameters:
sigma_min = 0.00004;
sdf=0.95;
x = SL0(A, y, sigma_min,sdf);
sdf has to approach 1 from underneath. If now we could have a similar solver for rank minimization instead on having to rely on CVX (I love CVX but I am not against having a contender on the side).
Cheers,
Igor.

Like

2. Sergey Ten says:

What’s interesting here I suspect that J could be *any* redescending M-estimator.That’s one of the cases of clouse connection between compressive sensing and robust statistics. In this case it’s $$exp(-x^2/\sigma^2)$$ but I remember the paper about iterative reweighted least squares which correspond cauchy estimator $$1/(\sigma^2 + x^2)$$
used for SL0. M-estimators point of view can (may be) explain why it’s converge faster but sometimes fail. M-estimator derivative $$\psi = \partial J/\partial x$$ descend on big x, while L1 is constant, that allow for faster suppression of outliers for robust statistics or noise for compressive sensing.
Now 2 drawback of redescending M-estimator are:
1. possibility of multiple minima (non convexity) which could create problem with bad initial approximation.
2. vulnerability to structure of solution. For very small dispersion redescending M-estimator approximate mode of distribution – maximum value of distribution.For robust statistics it could create problem if outliers are concentrated and create false mode. I’m not sure how it translate to compressive sensing setting, have to think about it…

Like

3. Sergey Ten says:

I’m not sure my post is correct. I thought about norm in LASSO, but here it’s not the LASSO

Like

4. Thank you Graham for your insightful comments. Also, BP minimizes the $l_1$ and the other minimizes the majorization of the $l_0$, so a comparison of speed is further displaced. A better comparison would be between SL0 and reweighted $l_1$ since that also approaches minimizing the $l_0$.

Like

5. Graham says:

Hi Bob,
Nice as always to have your paper discussions, it is a good excuse for learning something new.
It is certainly good to have an interesting competitor/analogue to $l_1$ methods in relaxation of $l_0$. However, I don’t think the speed comparison made by the authors is fair.
From the abstract:

It is experimentally shown that the
proposed algorithm is about two to three orders of magnitude faster than the state-of-the-art interior-point
LP solvers, while providing the same (or better) accuracy.

So they are comparing their algorithm against interior point methods on L1 problems. And from the introduction:

By using
fast LP algorithms, specifically interior-point LP solvers, large-scale problems with thousands of sources
and mixtures become tractable.

It is the choice of interior-point methods to solve large-scale problems that is unfair. Interior-point methods are 2nd order, meaning they generate 2nd derivatives (Hessian) of size $N^2$. When $N$ is large, this can get to be a problem. As stated more eloquently by Becker in his solvers list:

I also distinguish interior point methods (IPM) from first-order methods (methods that only need the gradient, not the Hessian). Unless specified, all methods below are first-order. First-order methods typically require many more iterations than IPM, but each iteration is much, much cheaper. For these problems discussed here, when the problem size is large, the trade-off is almost always in the favor of first-order methods.

Their algorithm is based on gradient updates, so it is a first-order method. And comparing a second-order method against a first-order method for large problems seems unfair. Their speed test thus shows that they are faster than L1 methods implemented with interior point, not they are faster than L1 in general (for which they would need to compare with first-order L1).
Perhaps this is only clear in hindsight, because it seems like the first popular compressive solvers (l1-Magic) were based on interior point.
regards!
Graham

Like

6. IgorCarron says:

Graham,
I believe you are right, it was a question of time frame. When they made that statement, L1-magic was one of the few contenders. The first time I saw IRL1 (Candes) or IRLp (Chartrand) was 2007 while their first (rejected paper) showed up in 2006.
Cheers,
Igor.

Like