# Paper of the Day (Po’D): Compressed Sensing and Coherent Dictionaries Edition

Hello and welcome to Paper of the Day (Po’D): Compressed Sensing and Coherent Dictionaries Edition.Today’s paper was touted by Igor Carron and his excellent, tell-mom-to-wait-while-I-just-read-this-new-entry blog, Nuit Blanche: E. J. Candès, Y. C. Eldar, and D. Needell, “Compressed sensing with coherent and redundant dictionaries,” arXiv:1005.2613v1, May 2010. This paper is so new, it is in the process of being distributed to peers for review as we speak.

Consider that we model some real signal $$\vx \in \mathcal{R}^K$$ as a linear combination of atoms from some overcomplete dictionary $$\MD \in \mathcal{R}^{K\times N}$$. Let us assume our dictionary is “well-selected,” which means that we assume there exists for our signal a non-empty set of solutions
$$\mathcal{S}^* := \{\vs^* \in \mathcal{R}^N: || \vx – \MD\vs^* ||_2 \le \epsilon, \vs^* \;\text{is compressible}\}$$
where $$\epsilon > 0$$, and the compressibility part means, e.g., that $$||\vs^*||_0 \ll K$$, or less strictly, that we can order the magnitudes of each $$\vs^*$$ such that their decay is bounded from above by a decaying exponential function.
This generally means that for the dictionary, $$N \gg K$$,
which implies that the dictionary coherence is not zero (the maximum non-diagonal terms in $$\MD^T\MD > 0$$), and that there is a linear dependence among its columns.
Now, consider that we do not observe $$\vx$$ directly, but instead measure it using a fat random (zero mean, unit row variance) matrix $$\mathbf{\Phi} \in \mathcal{R}^{M\times K}$$ where $$M < K$$. Thus, our model becomes
$$\vy = \mathbf{\Phi}\vx + \vn = \mathbf{\Phi}\MD\vs + \vn$$
where $$\vn \in \mathcal{R}^M$$ is some zero-mean noise with power $$\epsilon^2$$.
Notice that our observation $$\vy$$ is in a smaller dimensional space than the column space of the dictionary. In this sense $$\vx$$ is "compressively sensed."

Now some important questions are the following:

1. When and how well can we recover $$\vx$$ from $$\vy$$, using the knowledge that $$\mathcal{S}^*$$ is not empty?
2. When and how can we find any element from $$\mathcal{S}^*$$?

In this paper, the authors show that we can solve the first question using $$\ell_1$$-analysis:
$$\hat \vx = \arg \min_{\vx’ \in \mathcal{R}^K} || \MD^T\vx’ ||_1 \; \text{subject to} \; ||\mathbf{\Phi} \vx’ – \vy ||_2 \le \epsilon$$
such that the solution has an upper bound on its error
$$|| \hat \vx – \vx ||_2 \le C_0 \epsilon + C_1 \frac{|| \MD^T\vx – (\MD^T\vx)_\Upsilon ||_1}{\sqrt{\Upsilon}}$$
where $$M = \mathcal{O}(\Upsilon\log(N/\Upsilon))$$, and $$\Upsilon$$ is estimated to be the minimum sparsity of the coefficient vector $$\vs$$, or at least the number of its significant components. (This error bound holds for a tight frame, but for a non-tight frame I believe this upper bound will be too liberal for some $$\vx \in \mathcal{R}^K$$, as long as the constants are found for the lower frame bound… but here I am just shooting from the hip.)
As the authors state, this error depends on the “tail” of the $$N-s$$ smallest coefficients. The more compressible a signal is in $$\MD$$, the smaller this error bound is given the same sparsity and noise conditions.

The good news here is that as long as we know our signal is compressible in some dictionary, and the magnitudes in $$\MD^T\vx$$ decay quickly e.g., $$\MD^T\vx$$ is itself sparse), we can recover the possibly non-sparse signal $$\vx$$ using $$\ell_1$$ analysis within some maximum error no matter how large is the coherency of the dictionary, e.g., $$1$$.

The authors do provide one instance of where the magnitudes in $$\MD^T\vx$$ do not decay quickly, but the original signal is sparse and still recoverable from projections onto random vectors. Unfortunately this example is artificial. Consider $$\vx$$ to be a modulo periodic Dirac comb with impulses spaced every $$\sqrt{n}$$ samples (assuming $$n$$ is a perfect square). Thus, $$\vx$$ is equally sparse in the standard basis (i.e., as itself), and in the Fourier series basis. (In this case we are assuming $$\epsilon = 0$$.) Now, if we concatenate these two bases into the dictionary $$\MD$$ of $$2n$$ atoms, we see that $$\MD^H\vx$$ will not have quickly decaying magnitudes. It will have $$2\sqrt{n}$$ of the same magnitudes (considering all atoms are unit norm). However, by the theorem above, if we make $$\Upsilon > 2\sqrt{n}$$ we can recover exactly $$\vx$$ using $$\ell_1$$ analysis with a number of measurements on the order of $$\Upsilon \log (2n/\Upsilon)$$. The second that $$\vx$$ displays a difference in the amplitude of any one of its impulses, or equivalently that $$\vx$$ takes on a non-zero mean, this guarantee will vanish since the tail of $$\MD^H\vx$$ will no longer be non-zero. So, in brief, this “little surprise” is indeed “little” but conceptually interesting. Note too that there exist an infinite number of ways to represent our comb signal in this dictionary, the two sparsest ways being $$\sqrt{n}$$ impulses or $$\sqrt{n}$$ sines. The sparsity of the next (infinite number of) sparest solutions to $$\vx = \MD\vs$$ jumps to $$2\sqrt{n}$$. Then after that all other solutions have a sparsity in the range $$[\sqrt{n} + 1 + n, 2n]$$. So there are large differences between the two sparsest solutions, and all others.

Though we have good news above, the bad news is that this approach says nothing about the second question: how can we find some of the efficient solutions in $$\mathcal{S}^*$$? This is the problem of sparse representation, and as a sparse approximologist specializing in the domain of high-dimensional audio signals, this is the question I am most interested in solving.

Instead of $$\ell_1$$ analysis though, we can consider $$\ell_1$$ synthesis, e.g., Basis Pursuit by a convex optimization with $$K$$ constraints:
$$\hat \vs = \arg \min_{\vs’ \in \mathcal{R}^N} || \vs’ ||_1 \; \text{subject to} \; \vx = \MD \vs’$$
and hopefully here, $$\hat \vs \in \mathcal{S}^*$$.
But consider what happens when we multiply our model from the left by a random sensing matrix $$\mathbf{\Phi} \in \mathcal{R}^{M\times K}$$. The problem above becomes equivalently
$$\hat \vs = \arg \min_{\vs’ \in \mathcal{R}^N} || \vs’ ||_1 \; \text{subject to} \; \mathbf{\Phi}\vx = \mathbf{\Phi}\MD \vs’$$
where now we have $$M < K$$ constraints.
This is the same problem as before as long as $$M$$ is large enough (which depends on the expected sparsity of $$\vs$$, but we have reduced the dimensionality of the problem.
Thereby we may more efficiently find a good solution to our model.
This approach is exactly that which is empirically considered in:
M. Christensen, J. Østergaard, and S. H. Jensen, "On compressed sensing and its application to speech and audio signals," in Proc. Asilomar Conf. Signals, Systems, and Computers, (Pacific Grove, CA), Nov. 2009. (This paper was my first Po’D.)