# Elucidations on Fast Bayesian Matching Pursuit

A few Po’Ds ago I discussed the work of Phil Schniter et al.: Paper of the Day (Po’D): Fast Bayesian Pursuit Algorithm for Sparse Linear Models Edition. I started fine, but I ended in a befuddled mess: a confused discussion of their atom selection criterion, simplified by moi, involving a singular matrix masquerading as an invertible one. My bad. Phil has very graciously contributed time to help me understand this work more clearly, and has provided a link to his MATLAB code. Huzzah for reproducible research!

Phil says,

like any matching pursuit (MP), our Bayesian MP (BMP) is built on the idea of adding one atom per iteration to a set of existing atoms, with the final goal of capturing the sparse-signal support. As you know, the atom selection criterion of standard MP is based on maximizing the correlation between each of the candidate atoms and the residual.
With BMP, each new atom is selected as the one that maximizes the posterior probability of the resulting support set. We conjectured that this atom selection criterion would be more principled than that of standard MP, and simulations do show improved MSE performance, though unfortunately we don’t have any rigorous proofs of convergence. By “convergence” I am referring to our algorithm finding the most (a posteriori) probable basis configuration out of the $$2^N$$ possible basis configurations, or yielding a parameter estimate that is within a certain distance of the MMSE estimate. We have some partial results in this direction, but we’re not really satisfied with them yet.

So, going back to the notation of my previous post, the atom selection criterion of BMP at iteration $$n+1$$ is (considering that we only select one atom each step)
$$\max_{\vm \in \mathcal{M}_{n+1} | \vm_n} p(\vm | \vx) \equiv \max_{\vm \in \mathcal{M}_{n+1} | \vm_n} p(\vx | \vm) p^{n+1}(1-p)^{N-n-1}$$
where $$\mathcal{M}_{n+1} | \vm_n$$ is the set of activation vectors with a 1 replacing a 0 in the previous activation vector $$\vm_n$$.
For standard MP, the analogous criterion is simply to select whichever new (or old) atom has the largest magnitude correlation with the residual. BMP, on the contrary, does not take such a naive position. It assumes a Normal model for $$p(\vx | \vm)$$, and a Bernoulli model for the prior on the activation vector. It is this latter model that allows one to tune the pursuit to emphasize sparsity.
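To make the contrast concrete, here is a minimal Python sketch (Phil's code is MATLAB) of the two selection rules at the first iteration. The dictionary, signal, and model parameters `sigma_s2`, `sigma_r2`, and `p` are hypothetical stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 32, 64
D = rng.standard_normal((M, N))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
x = rng.standard_normal(M)                # observed signal

# Standard MP: pick the atom most correlated with the residual
# (the residual is just x at the first iteration).
mp_pick = int(np.argmax(np.abs(D.T @ x)))

# BMP: pick the atom whose activation maximizes the posterior
# p(m|x) ∝ p(x|m) p^n (1-p)^(N-n).  Assumed model parameters:
sigma_s2 = 1.0    # variance of an active coefficient
sigma_r2 = 0.1    # noise variance
p = 0.05          # Bernoulli activation probability

def neg_log_posterior(active):
    """-2 ln p(m|x), up to a constant, for the support set `active`."""
    n = len(active)
    Da = D[:, active]
    Phi = sigma_s2 * Da @ Da.T + sigma_r2 * np.eye(M)
    _, logdet = np.linalg.slogdet(Phi)
    return (logdet + x @ np.linalg.solve(Phi, x)
            - 2 * n * np.log(p) - 2 * (N - n) * np.log(1 - p))

bmp_pick = min(range(N), key=lambda i: neg_log_posterior([i]))
```

With unit-norm atoms and a common activation prior, the two rules agree at the first iteration; they can diverge at later iterations, or when the atoms have unequal norms or coefficient variances.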

Phil adds further elucidation,

At the 1st iteration, we find the binary $$\vm$$ vector with a single non-zero entry that minimizes the metric
$$\ln\det(\MD\MR(\vm)\MD^T + \sigma_r^2 I) + \vx^T [\MD\MR(\vm)\MD^T + \sigma_r^2 I]^{-1} \vx$$
which is easily done by “testing” the $$N$$ candidate vectors.
At the 2nd iteration, given the $$\vm$$ we just found, we look for the single additional non-zero location to activate that minimizes the above metric. Here, there are $$N-1$$ possibilities that we need to check. At the 3rd iteration, given the two-coefficient $$\vm$$ that we just found, we look to add one more non-zero location to $$\vm$$, again to minimize the metric. Clearly, there are $$N-2$$ possibilities here. We continue with this procedure until a stopping rule tells us to terminate. (Stopping rules are discussed in the journal paper, and they require us to look at some of the terms in our “basis selection metric” that you dropped to form your simplified metric…the ones that involve the prior probability “p”.)
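The inflation search Phil describes can be sketched as follows. This is a toy Python rendering, not the reference implementation: the variances `sigma_s2` and `sigma_r2` are assumed values, and a fixed iteration count `K` stands in for the paper's actual stopping rule:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 24, 48, 3
D = rng.standard_normal((M, N))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
x = rng.standard_normal(M)
sigma_s2, sigma_r2 = 1.0, 0.1             # assumed coefficient and noise variances

def metric(active):
    """ln det(Phi) + x' Phi^{-1} x, with Phi = sigma_s2 D_a D_a' + sigma_r2 I."""
    Da = D[:, active]
    Phi = sigma_s2 * Da @ Da.T + sigma_r2 * np.eye(M)
    _, logdet = np.linalg.slogdet(Phi)
    return logdet + x @ np.linalg.solve(Phi, x)

support = []                               # activated locations, in discovery order
for _ in range(K):                         # N, N-1, N-2, ... candidates per pass
    candidates = [i for i in range(N) if i not in support]
    best = min(candidates, key=lambda i: metric(support + [i]))
    support.append(best)
```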

Now, to be complete, the tree-search that we describe in the paper goes beyond what I described above. In particular, it repeats the inflation search that I described above several times, being careful to avoid re-discovering past solutions. (For example, at the first iteration of the 2nd search, it looks for the one-coefficient $$\vm$$ that yields the 2nd-smallest value of the metric.) In this way, it builds up a number of different basis configurations and calculates the relative posterior probability for each one.
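To sketch that last step in Python, suppose three repeated searches have returned configurations with the made-up metric values below. Since the metric is $$-2\ln$$ of the posterior up to a constant (and, for distinct supports of equal size, the dropped prior terms cancel), relative probabilities follow by exponentiating and normalizing:

```python
import numpy as np

# Hypothetical basis-selection metric values (-2 ln posterior, up to a
# constant) returned by three repeated inflation searches:
metrics = np.array([10.2, 10.9, 13.5])

# Relative posterior probability of each discovered configuration:
# p(m_d | x) ∝ exp(-metric_d / 2), normalized over the discovered set.
w = np.exp(-(metrics - metrics.min()) / 2)   # shift by the minimum for stability
rel_prob = w / w.sum()
```

These weights are what let the algorithm average its parameter estimate over several basis configurations rather than committing to a single one.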

The cost of introducing these models, however, is at least three-fold (which may not make for interesting origami, but does give us pause):

1. We have seemingly increased the computational complexity (we are no longer computing just inner products).
2. We must estimate the parameters of the model (the Bernoulli parameter, the noise mean and variance, and the means and variances of the coefficient vector).
3. As with all methods of Bayesian inference, we are assuming our posterior and prior models are relevant.

As for the first problem, Phil says:

It’s perhaps not too surprising that the matrix inversion lemma can be employed to speed up the calculations above, and using that and a few other tricks we get the “fast” Bayesian matching pursuit (FBMP) algorithm. [We show] that it is possible to implement this more complicated atom selection criterion with a complexity of $$O(MNK)$$ [$$M$$ is the length of the signal, $$N$$ is the size of the dictionary, and $$K$$ is the number of iterations run, or the number of active atoms], which matches that of standard MP. Another feature of FBMP is that it builds up the components of the MMSE parameter estimate $$\hat{\vs}$$ at little additional cost. I should point out that the algorithm proposed in our original ITA paper was an early version that didn’t have this $$O(MNK)$$ complexity scaling.
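A sketch of the kind of speedup involved: activating one atom perturbs the covariance by a rank-one term, so the Sherman-Morrison identity updates its inverse, and the matrix determinant lemma updates the log-determinant, in $$O(M^2)$$ per candidate instead of an $$O(M^3)$$ recomputation. The variances below are assumed values; this illustrates the identities, not FBMP's actual bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 16
sigma_s2, sigma_r2 = 1.0, 0.1       # assumed coefficient and noise variances
d = rng.standard_normal(M)          # the atom being activated

# State before activation: Phi = sigma_r2 * I (empty support).
Phi = sigma_r2 * np.eye(M)
Phi_inv = np.eye(M) / sigma_r2
_, logdet = np.linalg.slogdet(Phi)

# Activating the atom adds the rank-one term sigma_s2 * d d^T to Phi.
u = Phi_inv @ d
denom = 1.0 + sigma_s2 * (d @ u)
Phi_inv_new = Phi_inv - (sigma_s2 / denom) * np.outer(u, u)  # Sherman-Morrison
logdet_new = logdet + np.log(denom)                # matrix determinant lemma
```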

As for the second problem, Phil says:

An additional contribution of the journal submission was an automatic way (based on expectation maximization) of choosing the statistical parameters of our data model to best match the data. Clearly, this is important because in practice you never know what statistical assumptions to use. The journal submission also generalized the ON/OFF signal model of the ITA paper to one that allows more than one flavor of “active” coefficient. This allows, for example, signal models where the coefficients live in the set $$\{-1, 0, 1\}$$.
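As a tiny illustration of that generalized model, here is a toy Python sketch drawing coefficients that are zero with probability $$1-p$$ and otherwise take one of two “active” flavors, $$\pm 1$$; the value of `p` and the flavor set are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 1000, 0.1                 # dictionary size, activation probability

# ON/OFF model generalized to multiple "active" flavors: each coefficient
# is 0 with probability 1-p; when active, it is drawn from {-1, +1}.
active = rng.random(N) < p
flavors = rng.choice([-1.0, 1.0], size=N)
s = np.where(active, flavors, 0.0)
```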

As for the third problem, there appears to be no solution other than checking and refining the models until they work, at which point we don’t fix them further, and, above all, we don’t claim that we have found the true model. (As regards this point, I am reading the excellent article: A. Gelman and C. R. Shalizi, “Philosophy and the practice of Bayesian statistics”, which I learned of from Nuit Blanche.)

Thank you very much for your time Phil!
