Hello, and welcome to Paper of the Day (Po’D): Fast Bayesian Pursuit Algorithm for Sparse Linear Models Edition. Today’s paper has been sitting on my stack for a while: P. Schniter, L. C. Potter, and J. Ziniel, “Fast Bayesian Matching Pursuit,” Proc. Workshop on Information Theory and Applications (ITA), (La Jolla, CA), Jan. 2008. (An expanded version is here: P. Schniter and L. C. Potter and J. Ziniel, “Fast Bayesian matching pursuit: model uncertainty and parameter estimation for sparse linear models.”) We reviewed another Bayesian approach here and here.

The authors of today’s Po’D present a method for finding a set of probable sparse solutions and their posterior probabilities for the signal model:

$$\vx = \MD\vs + \vr$$

where \(\vx, \vr \in \mathcal{R}^K\), \(\vs \in \mathcal{R}^N\), \(\MD\) is given, and we have \(N \gg K\).

The residual/noise is simply modeled as zero-mean white Gaussian noise with variance \(\sigma^2_\vr\).

The authors model sparsity in the solution using a random mixture vector \(\vm := [m_1, m_2, \ldots, m_N]^T\) where each \(m_i\) is iid Bernoulli with \(P\{m_i = 1\} = p\). Thus, the sparse solution conditioned on the mixture vector is distributed as

$$\vs | \vm \sim \mathcal{N}(\mathbf{0}, \MR(\vm))$$

where the covariance matrix is diagonal with \([\MR(\vm)]_{n,n} = \sigma^2_{m_n}\). Sparsity is encoded by taking \(p \ll 1\) and setting \(\sigma^2_0 = 0\), so that inactive coefficients are exactly zero. (\(\sigma^2_1\) can itself be modeled as some positive random variable as well.)
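Drawing a signal from this generative model takes only a few lines. A sketch in numpy, with made-up dimensions and hyperparameters (none of these values come from the paper):

```python
import numpy as np

# Hypothetical dimensions and hyperparameters (illustration only).
K, N, p, sigma1, sigma_r = 64, 256, 0.02, 1.0, 0.05
rng = np.random.default_rng(0)

D = rng.normal(size=(K, N)) / np.sqrt(K)          # some dictionary
m = rng.random(N) < p                             # iid Bernoulli(p) mixture vector
s = np.zeros(N)
s[m] = rng.normal(0.0, np.sqrt(sigma1), m.sum())  # active entries ~ N(0, sigma_1^2)
x = D @ s + rng.normal(0.0, sigma_r, K)           # observation x = D s + r
```

With \(\sigma^2_0 = 0\), the inactive entries of \(\vs\) are exactly zero, so the draw is \(||\vm||_0\)-sparse by construction.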

With these, we can define the distribution of the signal conditional on \(\vm\):

$$p(\vx | \vm) = p((\MD\vs + \vr) | \vm) = \frac{1}{(2\pi)^{K/2} |\Phi(\vm)|^{1/2}} \exp\left [-\frac{1}{2} \vx^T \Phi(\vm)^{-1} \vx \right ]$$

where \(\Phi(\vm) := \MD\MR(\vm)\MD^T + \sigma^2_\vr \MI_K\).
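This covariance is cheap to form directly, since \(\MR(\vm)\) is diagonal. A minimal sketch (the function name and signature are mine, not the authors'):

```python
import numpy as np

def Phi(D, m, sigma1, sigma_r2):
    """Covariance of x given the mixture vector m: D R(m) D^T + sigma_r^2 I_K."""
    R_diag = np.where(m, sigma1, 0.0)   # diagonal of R(m), with sigma_0^2 = 0
    # (D * R_diag) scales the columns of D, i.e. computes D @ diag(R_diag)
    return (D * R_diag) @ D.T + sigma_r2 * np.eye(D.shape[0])
```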

Given the model and assumptions above, this expression embodies the relationships between the signal we observe and the set of all possible mixture vectors

\(\mathcal{M} := \{ \vm : [\vm]_n \in \{0,1\}, n = 1, 2, \ldots, N\}\).

Since we observe \(\vx\), and we want to find solutions, or equivalently mixture vectors, we can do so by first switching this expression around using Bayes’ theorem

$$p(\vm | \vx) \propto p(\vx | \vm) p(\vm) = p(\vx | \vm) p^{||\vm||_0}(1-p)^{N-||\vm||_0}$$

where I have substituted the prior on the mixture vector.

Substituting the expression above for \(p(\vx | \vm)\) and taking the log, we find (up to an additive constant that does not depend on \(\vm\)) one of my favorite expressions:

$$\ln p(\vm | \vx) = \nu(\vm, \vx, p) + \mathrm{const}$$

where the authors define the “basis selection metric”

$$\nu(\vm, \vx, p) := -\frac{K}{2}\ln 2\pi - \frac{1}{2} \ln |\Phi(\vm)| - \frac{1}{2} \vx^T \Phi(\vm)^{-1} \vx + ||\vm||_0 \ln \frac{p}{1-p} + N\ln(1-p).$$
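The metric is straightforward to evaluate term by term: a Gaussian log-likelihood of \(\vx\) under \(\Phi(\vm)\), plus the log prior of the mixture vector. A direct (not fast) sketch, with a function name of my own choosing:

```python
import numpy as np

def nu(D, x, m, p, sigma1, sigma_r2):
    """Basis selection metric: log posterior of m given x, up to a constant."""
    K = len(x)
    R_diag = np.where(m, sigma1, 0.0)
    Phi = (D * R_diag) @ D.T + sigma_r2 * np.eye(K)       # D R(m) D^T + sigma_r^2 I
    _, logdet = np.linalg.slogdet(Phi)                     # stable log|Phi(m)|
    quad = x @ np.linalg.solve(Phi, x)                     # x^T Phi(m)^{-1} x
    k = int(m.sum())                                       # ||m||_0
    return (-0.5 * K * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
            + k * np.log(p / (1 - p)) + len(m) * np.log(1 - p))
```

Each call costs a \(K \times K\) factorization; the authors' fast algorithm avoids this by updating the relevant quantities recursively as elements are activated.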

To find the best mixture vectors (with respect to sparsity and our assumptions) we want to maximize \(\nu(\vm, \vx, p)\) given our signal and prior over the entire solution space \(\mathcal{M}\) … which is of size \(2^N\) … which means we might as well tackle easier NP-complete problems.

Instead, the authors approach this problem by designing an efficient means of finding \(\mathcal{M}_\star\), which is the set of mixture vectors that dominate \(p(\vm | \vx)\), in an iterative fashion.

(The authors provide the algorithm using all the subscripts, subsubscripts, superscripts, and supersubscripts one could hope for. So instead, here I rehash their simple description of the search procedure.)

In their process, they approximate \(\mathcal{M}_\star\) iteratively by adding more and more active elements to the mixture vectors based on their behavior with the basis selection metric \(\nu(\vm, \vx, p)\) … and thus the similarity to matching pursuit.

So first we look at the set of \(N\) 1-sparse mixture vectors \(\mathcal{M}_1 := \{\vm \in \mathcal{M} : ||\vm||_0 = 1\}\), compute the set \(\{\nu(\vm, \vx, p) : \vm \in \mathcal{M}_1\}\), and find those \(M\) vectors having the largest values to create the set \(\mathcal{M}_{1,\star}\).

Then we perform the same operation, adding a single active element to each of the mixture vectors in \(\mathcal{M}_{1,\star}\), to create the set \(\mathcal{M}_{2,\star}\): the 2-sparse mixture vectors with the \(M\) largest values of \(\nu(\vm, \vx, p)\).

We continue in this way, until some iteration \(\mathrm{MAX}\), to create the set of \(M \cdot \mathrm{MAX} \ll 2^N\) best mixture vectors $$\{ \mathcal{M}_{i,\star} : i = 1, 2, \ldots, \mathrm{MAX}\}.$$
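The search procedure above can be sketched as a beam-search-like loop. This is a naive implementation that re-evaluates the metric from scratch for every candidate support (the authors' fast algorithm instead updates the metric recursively); all names and the interface are my own:

```python
import numpy as np

def nu(D, x, m, p, sigma1, sigma_r2):
    """Basis selection metric: log posterior of m given x, up to a constant."""
    K = len(x)
    R_diag = np.where(m, sigma1, 0.0)
    Phi = (D * R_diag) @ D.T + sigma_r2 * np.eye(K)
    _, logdet = np.linalg.slogdet(Phi)
    quad = x @ np.linalg.solve(Phi, x)
    return (-0.5 * K * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
            + int(m.sum()) * np.log(p / (1 - p)) + len(m) * np.log(1 - p))

def batch_search(D, x, p, sigma1, sigma_r2, M=4, MAX=3):
    """At level i, keep the M supports of size i with the largest nu."""
    N = D.shape[1]
    levels, frontier = [], [frozenset()]
    for _ in range(MAX):
        # extend every surviving support by one new active element
        candidates = {supp | {n} for supp in frontier
                      for n in range(N) if n not in supp}
        scored = []
        for supp in candidates:
            m = np.zeros(N, dtype=bool)
            m[list(supp)] = True
            scored.append((nu(D, x, m, p, sigma1, sigma_r2), supp))
        scored.sort(key=lambda t: t[0], reverse=True)
        frontier = [supp for _, supp in scored[:M]]   # the set M_{i,star}
        levels.append(frontier)
    return levels
```

Note that the frontier at level \(i+1\) is grown only from the \(M\) survivors at level \(i\), which is why the overall cost is far below the \(2^N\) exhaustive search — and also why \(\mathcal{M}_\star\) is not guaranteed to be found.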

Somewhere in there could be \(\mathcal{M}_\star\), but I am not so sure it is guaranteed.

From any one of these, we can compute a solution by

$$\hat \vs := \MR(\vm)\MD^T [\MD\MR(\vm)\MD^T + \sigma^2_\vr \MI_K]^{-1}\vx.$$
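This is just the conditional-mean (MMSE) estimate of \(\vs\) given \(\vx\) and \(\vm\), and it reduces to a small linear solve. A sketch, with a hypothetical function name:

```python
import numpy as np

def mmse_estimate(D, x, m, sigma1, sigma_r2):
    """Conditional mean of s given x and m:  R(m) D^T Phi(m)^{-1} x."""
    R_diag = np.where(m, sigma1, 0.0)
    Phi = (D * R_diag) @ D.T + sigma_r2 * np.eye(D.shape[0])
    # Left-multiplying by the diagonal R(m) zeroes every inactive coefficient.
    return R_diag * (D.T @ np.linalg.solve(Phi, x))
```

Because \(\sigma^2_0 = 0\), the estimate is automatically zero off the support of \(\vm\): the sparsity pattern of the solution is exactly the mixture vector.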

That algorithm seems simple enough, and entirely implementable with the fast algorithm designed by the authors.

But how to interpret this criterion?

If we forget the noise term, the atom selection mechanism in this approach essentially finds, at each iteration given \(\vx\), the \(M\) smallest values of

$$ \ln\det(\MD\MR(\vm)\MD^T) + \vx^T [\MD\MR(\vm)\MD^T]^{-1} \vx.$$

At the \(n\)th iteration we are looking at \(n\)-sparse signals, and so \(\MR(\vm)\) will only have \(n\) non-zero elements on the diagonal.

Thus, \(\MD\MR(\vm)\MD^T\) will be a rank-\(n\) matrix: a weighted sum of the outer products of the \(n\) atoms being considered.

The first term acts (I think) like a scaling term to correct for those of the \(n\) atoms that are coherent with the rest of the dictionary;

and the second term looks (I think) at the projection of the signal onto its projection onto some space … more thought needed.

Anyhow, I see this approach as a “batch” version of MP, where at the \(n\)th iteration \(M\) candidate \(n\)-sparse solutions are selected instead of the one maximizing some selection criterion. And from those, we continue building, almost like a tree search. (I can’t seem to remember if this batch procedure has been done before…)