# Proper respect for the norms of atoms in dictionaries

Until recently, I have always thought about sparse approximation and recovery in terms of algorithms with dictionaries whose atoms $$\{\phi_i\}$$ share the same $$\ell_2$$ norm. It is typical in the literature to discuss and analyze algorithms with such dictionaries. The assumption of unit-norm atoms is often stated explicitly at the very beginning of work in this area, for instance in Tropp’s “Greed is Good,” and Mallat and Zhang’s “Matching Pursuit.” While it makes the mathematics cleaner, its ubiquity led me to underappreciate the role the norms of the atoms actually play.

To see what I mean, consider the exact sparse recovery problem:
$$\min_\vs \|\vs\|_0 \; \textrm{subject to} \; \vu = \MPhi \vs$$
where the columns of $$\MPhi$$ are made from $$\{\phi_i\}$$.
In many papers from the past 15 years, I see that authors (including me :) often say something like the following: the above problem is computationally difficult, and so we can replace the $$\ell_0$$-pseudonorm with its closest convex cousin, the $$\ell_1$$-norm:
$$\min_\vs \|\vs\|_1 \; \textrm{subject to} \; \vu = \MPhi \vs$$
which can be solved by numerous methods far less complex than those required to solve the original problem.
What is more, we know that there are many cases in which the solution to this solvable problem is identical to that of the harder problem.

However, the justification for posing the first problem as the second implicitly assumes the atoms $$\{\phi_i\}$$ have the same $$\ell_2$$-norm — otherwise, it is not correct!
When the atoms do not have the same norm, the geometry of the problem changes, and bad things happen if we do not take this into account.
In short, when the atoms in the dictionary have different norms, the $$\ell_1$$-norm of the solution does not act like its $$\ell_0$$-pseudonorm.
As posed above, the minimal $$\ell_1$$-norm solution will likely use atoms that are much longer than all the others, because their weights will be smaller than those for the shorter atoms.
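To make this concrete, here is a toy example of my own construction (not taken from any of the papers mentioned): the signal $$\vu$$ equals a unit-norm atom exactly, yet unweighted $$\ell_1$$ minimization, posed as a linear program, spends its budget on two atoms that are a hundred times longer.

```python
import numpy as np
from scipy.optimize import linprog

# Toy dictionary: phi_1 is unit-norm; phi_2 and phi_3 are 100x longer.
Phi = np.array([[1.0, 100.0,  100.0],
                [0.0, 100.0, -100.0]])
u = np.array([1.0, 0.0])   # u equals phi_1 exactly, so the sparsest solution is 1-sparse

# Solve min ||s||_1 s.t. Phi s = u as an LP by splitting s = p - n with p, n >= 0
m = Phi.shape[1]
c = np.ones(2 * m)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=u, bounds=(0, None))
s = res.x[:m] - res.x[m:]

print(np.round(s, 4))   # picks the two long atoms with tiny weights, not phi_1
```

The minimizer is $$(0, 0.005, 0.005)$$ with $$\ell_1$$-norm $$0.01$$, far smaller than the $$1$$ achieved by the true 1-sparse representation, so the program happily reports a denser answer built from the long atoms.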

Instead, the correct way to pose the convexification of the
exact sparse problem with the $$\ell_1$$-norm is
$$\min_\vs \|\MN\vs\|_1 \; \textrm{subject to} \; \vu = \MPhi \vs$$
where $$\MN$$ is a diagonal matrix with $$[\MN]_{ii} := \|\phi_i\|_2$$.
Now, the weights of all the atoms are treated equally, no matter their norms in the dictionary.
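A sketch of the fix, using a toy dictionary of my own construction: substituting $$\vt = \MN\vs$$ turns the weighted problem into a standard $$\ell_1$$ minimization over the normalized columns $$\MPhi\MN^{-1}$$, after which the 1-sparse solution wins.

```python
import numpy as np
from scipy.optimize import linprog

# Toy dictionary: phi_1 is unit-norm; phi_2 and phi_3 are 100x longer.
Phi = np.array([[1.0, 100.0,  100.0],
                [0.0, 100.0, -100.0]])
u = np.array([1.0, 0.0])   # u equals phi_1 exactly

# N is diagonal with the atom l2-norms; min ||N s||_1 s.t. Phi s = u
# is equivalent, via t = N s, to min ||t||_1 s.t. (Phi N^{-1}) t = u.
norms = np.linalg.norm(Phi, axis=0)
Phi_tilde = Phi / norms          # columns normalized to unit l2-norm

m = Phi.shape[1]
c = np.ones(2 * m)
A_eq = np.hstack([Phi_tilde, -Phi_tilde])
res = linprog(c, A_eq=A_eq, b_eq=u, bounds=(0, None))
t = res.x[:m] - res.x[m:]
s = t / norms                    # map back: s = N^{-1} t

print(np.round(s, 4))   # now recovers the 1-sparse solution (1, 0, 0)
```

With the weighting in place, using the long atoms would cost $$\sqrt{2} \approx 1.414$$ in weighted $$\ell_1$$-norm versus $$1$$ for the single unit-norm atom, so the sparsest solution is the minimizer.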
It may seem pedantic, but I have seen this implicit condition lead to some confusion about how to apply particular principles and conditions.
However, I have yet to find such a discussion in the literature, and so I hope to fill this gap in a brief article I am writing.

## 2 thoughts on “Proper respect for the norms of atoms in dictionaries”

1. I just want to let you know that the paper of Bruckstein et al., “From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images,” considers the case of non-normalized columns (see Eq. (8)). I don’t think they say much more than what you just said, though.
Note that we think their formulation of OMP for the non-normalized case in this paper has a typo. In the “exhibit 1” box (page 9), in the sweep step, you should divide by the norm of a, not by the squared norm of a.
Also, Michael Elad’s book (note that the paper mentioned above was the seed for that book) has some additional results related to this issue. He proves that if you normalize \Phi first and use OMP, you get the same result as if you use the original \Phi with the modified OMP.
If your library is subscribed to Springer, it is possible to get Elad’s book for “free.”


2. Good! (8) shows the proper way to state (P0) using constrained $$\ell_1$$-norm minimization. (I am happy I derived the same thing. :) That is not a typo in their statement of OMP, though, if a funny way to state a simple algorithm. z_j is the weight of a_j in the updated model. There are two factors of a_j in that expression, and so you need its norm squared. Just plug the expression for z_j into epsilon(j) to see it. Or, alternatively, take the derivative of epsilon(j) with respect to z_j and solve for its optimal value. Then you will recover the selection criterion as well.
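For what it is worth, the sweep step under discussion can be sketched in a few lines of Python (my own illustration; the dictionary A and residual r are made up, not taken from the paper). The optimal weight for atom a_j divides the correlation by the *squared* norm, which is what makes the error expression come out right:

```python
import numpy as np

def omp_sweep(A, r):
    """One OMP sweep over a non-normalized dictionary A given residual r.
    Minimizing ||r - z a_j||^2 over z gives z_j = a_j^T r / ||a_j||^2
    (the squared norm appears because a_j occurs twice in the expansion),
    with error eps(j) = ||r||^2 - (a_j^T r)^2 / ||a_j||^2."""
    corr = A.T @ r                      # a_j^T r for every column
    sqnorms = np.sum(A**2, axis=0)      # ||a_j||^2 for every column
    z = corr / sqnorms                  # optimal weight per atom
    eps = np.sum(r**2) - corr**2 / sqnorms
    j = int(np.argmin(eps))             # selected atom
    return j, z[j]

A = np.array([[1.0, 3.0],
              [0.0, 4.0]])
r = np.array([0.0, 1.0])
j, z = omp_sweep(A, r)
print(j, round(z, 3))   # selects column 1 with weight 4/25 = 0.16
```

Note that minimizing eps(j) is the same as maximizing $$(a_j^T r)^2 / \|a_j\|^2$$, i.e., correlating the residual against normalized atoms.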
I see in Devore and Temlyakov,
@article{DeVore1996,
Author = {R. A. DeVore and V. N. Temlyakov},