# Paper of the Day (Po’D): Bayesian Pursuit Algorithm for Sparse Representation Edition

Hello, and welcome to Paper of the Day (Po’D): Bayesian Pursuit Algorithm for Sparse Representation Edition. Today’s paper is: H. Zayyani, M. Babaie-Zadeh, and C. Jutten, “Bayesian pursuit algorithm for sparse representation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., (Taipei, Taiwan), pp. 1549-1552, Apr. 2009. I spoke about this paper on briefly in Paper of the Day (Po’D): The Other Probabilistic Matching Pursuits Edition and Paper of the Day (Po’D): The Other Other Probabilistic Matching Pursuits Edition. (To prepare myself more thoroughly for this task, I have been reading A First Course in Bayesian Statistical Methods.)

As we have seen many times before, the object is to find a sparse solution to the underdetermined problem
$$\vx = \MD\vs + \vr$$
where $$\vx$$ is our signal of interest that we model in terms of a residual $$\vr$$, and some linear combination of only a few columns from the fat but overcomplete dictionary $$\MD$$ of $$N$$ unit norm atoms. So, in one sense, I think we can say we want to find an expression for the posterior probability $$p(\vs | \vx)$$ so that we can find the best set of solutions pointed to by our signal. With Bayes’ rule we know
$$p(\vs | \vx) \propto p(\vx | \vs)p(\vs)$$
from which we see that we need to define a prior on the solution.

The authors employ a simple model where each element of the solution is independently and identically distributed with a probability distribution function
$$p(s_i) = p \delta(s_i) + (1-p) \mathcal{N}(0,\sigma^2)$$
where $$p$$ is the probability of any one of the solution elements being zero (which we call hypothesis $$H_2$$.
A non-zero element of the solution (hypothesis $$H_1$$) is distributed as zero mean normal with some variance.
We can express the model above, and these hypotheses, in terms of inner products between the signal and each of the $$N$$ dictionary atoms:
$$z_n := \langle \vx, \vd_n \rangle = s_n \langle \vd_n, \vd_n \rangle + \mathop{\sum_{i=1}^N}_{i\ne n} s_i \langle \vd_i, \vd_n \rangle + \langle \vr,\vd_n\rangle, \; 1 \le n \le N$$
and assuming that we have already estimated the other $$N-1$$ solution elements $$\{s_i’\}$$, we can say
$$z_n – \mathop{\sum_{i=1}^N}_{i\ne n} s_i’ \langle \vd_i, \vd_n \rangle = s_n + \mathop{\sum_{i=1}^N}_{i\ne n} (s_i – s_i’) \langle \vd_i, \vd_n \rangle + \langle \vr,\vd_n\rangle, \; 1 \le n \le N$$
($$\langle \vd_n, \vd_n \rangle=1$$).
Since we want to estimate $$s_n$$, our hypotheses of whether or not $$s_n$$ is “on” can be translated to the following:
\begin{align} H_1 (\textrm{on}) &: z_n – \mathop{\sum_{i=1}^N}_{i\ne n} s_i’ \langle \vd_i, \vd_n \rangle = s_n + \mathop{\sum_{i=1}^N}_{i\ne n} (s_i – s_i’) \langle \vd_i, \vd_n \rangle + \langle \vr,\vd_n\rangle \\ H_2 (\textrm{off}) &: z_n – \mathop{\sum_{i=1}^N}_{i\ne n} s_i’ \langle \vd_i, \vd_n \rangle = \mathop{\sum_{i=1}^N}_{i\ne n} (s_i – s_i’) \langle \vd_i, \vd_n \rangle + \langle \vr,\vd_n\rangle \end{align}
since $$s_n$$ will be zero when it is off.

Now, the authors propose to model each solution error $$s_i – s_i’$$ as Gaussian, and with each element of $$\vr$$ being Guassian and zero mean, we can assume
$$\mathop{\sum_{i=1}^N}_{i\ne n} (s_i – s_i’) \langle \vd_i, \vd_n \rangle + \langle \vr,\vd_n\rangle$$ will be distributed Gaussian as well, with some variance $$\sigma_j$$.
With this, and that we model each active solution element by a Gaussian with variance $$\sigma$$, we can write the distribution of $$z_n$$ conditioned on each of the hypotheses $$p(z_n|H_1), p(z_n|H_2)$$.
Given that we know the priors on the hypotheses, i.e., $$p(H_1) = p, p(H_2) = 1-p$$,
we can say that the $$n$$th solution element is active when $$p/(1-p) \frac{\sigma_n}{\sigma} \sqrt{2(\sigma^2+\sigma_n^2)\ln \left [ \frac{p}{1-p} \frac{\sqrt{\sigma^2+\sigma_n^2}}{\sigma_j}\right ] }. When I look at this I think “Oh, criminy!” But after a visit to the Wittenborg 7100, I look again and I see that the left hand side is essentially calculating how the signal is correlated with the \(n$$th dictionary atom, and how that atom is correlated with all others in the dictionary weighted by their contributions to the signal model. And the right hand side is comparing the uncertainty we have in the value of the $$n$$th solution element and the contribution of the $$n$$th atom to explaining the residual.
(I think I spy a problem however. That log term goes negative when $$p 0.5$$, however, and that the residual is not very well-correlated to the dictionary, as $$\sigma_n$$ becomes larger and larger, we will be inclined to set nearly all of the solution elements to zero. The opposite is true if $$\sigma_n \to 0$$ — in which case we already know the correct solution!
This is all well and good, but the problem remains of determining the parameters of the model: $$\{p, \sigma, \sigma_n\}$$… which I will take up tomorrow because the summer sun outside is making a lot of noise.