# Paper of the Day (Po’D): The Other Probabilistic Matching Pursuits Edition

After writing about probabilistic orthogonal matching pursuit (PrOMP) a few days ago, I thought it might be a good idea to look at a previous approach that actually uses MP: S. E. Ferrando, E. J. Doolittle, A. J. Bernal, and L. J. Bernal, “Probabilistic matching pursuit with Gabor dictionaries”, Signal Processing, vol. 80, no. 10, pp. 2099-2120, Oct. 2000. Boy, was I wrong.

For me, this paper is extremely difficult to understand, even though section 0 clearly lays out their notation. By the end, subscripts and superscripts are flying everywhere that I feel like I am trying to keep my eyes open while looking up into falling rain, or like I am trying to breath while my head is outside a fast moving car. Both are dangerous, and not just because of particulate matter. Anyhow, I will now attempt to describe the proposed algorithm (PrMP), which the authors very honestly state, generate “poorer estimates” than MP, and “is poor as a compression tool.”

The main application of this work appears to be similar to that given in the paper: P. J. Durka, D. Ircha, and K. J. Blinowska, “Stochastic time-frequency dictionaries for matching pursuit,” IEEE Trans. Signal Process., vol. 49, pp. 507-510, Mar. 2001. However, here Ferrando et al. are interested in removing the noise from a signal by way of teasing out the coherent structures from the noise with respect to the dictionary, and there Durka et al. are interested in removing the “noise” (artifacts) produced by the MP decomposition process to reveal the real time-frequency components underlying a signal. In the former case we end up with an average of decompositions providing an estimate of the original signal, and in the latter case we end up with an average time-frequency energy distribution of the signal without the artifacts of MP. And as far as I can tell, these two approaches are nearly the same.

Consider a signal model $$\vy = \vx + \sigma \vn$$ where we are interested in recovering $$\vx$$ from $$\vy$$, and $$\vn$$ is some unknown noise signal with unit power. To do this, we will assume that we have an overcomplete time-frequency dictionary $$\mathcal{D}$$ with (unit norm) atoms that are more highly correlated with $$\vx$$ than $$\vn$$ — which is not a bad assumption to make for a time-frequency dictionary, and when $$\sigma \ge 0$$ is small with respect to $$||\vx||$$. This means, e.g., that
$$\max_{\vd \in \mathcal{D}} |\langle \vx, \vd\rangle| \gg \max_{\vd \in \mathcal{D}} \sigma |\langle \vn, \vd\rangle|.$$
So we want to model our signal to some order $$n$$ as
$$\vx \approx \hat \vx(n) := \sum_{k=0}^{n-1} \alpha_k \vh_k$$
where the set $$\mathcal{H}_n := \{\vh_0, \ldots, \vh_{n-1}\}$$ is a rearranged subset of $$\mathcal{D}$$. Note that with this time-frequency dictionary, each $$\vd \in \mathcal{D}$$ is associated with three variables: scale $$s$$, translation $$u$$, and modulation frequency $$\omega$$.

We can iteratively build $$\mathcal{H}_n$$ using the deterministic law of MP, which is based only on the magnitude projections of the elements of $$\mathcal{D}$$ with an intermediate residual signal. Or we can define a conditional measure of probability over the dictionary, and select an atom at random from this distribution — which is what the authors do here (I think). They define this measure in the following way. Given we are at iteration $$k$$:
$$p_k(\vd | \mathcal{H}_{k}, \vy) \propto l_k(\vd | \mathcal{H}_{k}, \vy) \pi(\vd | \mathcal{H}_{k})$$
where
$$\pi(\vd | \mathcal{H}_{k}) = \pi(\vd) = \pi_s(\vd)\pi_u(\vd | s)\pi_\omega(\vd | s)$$
is defined as a product of a probability measure over atom scale, and two other probability measures over translation and modulation frequency given the scale. Each of these distributions is defined uniformly over a continuous region (scale), or over discrete sets (translation because we are dealing with sampled signals, and frequency because have defined a limited number of values). Finally, we define
$$l_k(\vd | \mathcal{H}_{k}, \vy) := \frac{|\langle \vy – \hat \vx(k), \vd \rangle|}{|| \vy – \hat \vx(k) ||} \chi(\vd; \tau)$$
where $$\chi(\vd; \tau)$$ is the “characteristic function” of the set of atoms in the dictionary that have a magnitude inner product with the residual signal that is bigger than that expected with the noise signal, which I believe is $$\tau$$. When there are no such atoms, we may stop the pursuit because then we are overmodeling the signal.

Thus, with a properly normalized $$p_k(\vd | \mathcal{H}_{k}, \vy)$$ we select atoms at random from the dictionary, up to some order where we begin to model the noise instead. We perform this procedure multiple times, generating the set $$\{\mathcal{H}_{k}, \mathcal{H}_{k_2}, \ldots, \mathcal{H}_{k_K} \}$$, and finally, we estimate our our signal by
$$\vx \approx \lim_{K\to \infty}\frac{1}{K} \sum_{k=1}^{K}\hat \vx(\mathcal{H}_k)$$
where $$\hat \vx(\mathcal{H}_k)$$ is the model produced by the atoms in $$\mathcal{H}_k$$.

I think.

This approach appears to only use the residual signal in assigning probabilities to elements in the dictionary, i.e., $$l_k(\vd | \mathcal{H}_{k}, \vy) \equiv l_k(\vd | \vy – \hat \vx(k))$$. Furthermore, $$\pi(\vd)$$ is uniformly distributed over the entire dictionary. A “more informed” approach should take into account how atoms are distributed in frequency and scale in the signal. For instance, for a music signal (say, everyone’s favorite solo glockenspiel), we expect there to be harmonically related large-scale phenomenon with significant energy, and some short-scale energetic excitations. Maybe not much stuff in-between. (At least this is how I see the most efficient and meaningful representation of a solo glockenspiel.) So it does not make sense after some initial decomposition iterations (probing the signal of its most coherent material) to continue using a prior that considers all atoms to be equally likely.

What I want to see is an atom selection rule (within the context of iterative descent methods of signal decomposition using overcomplete dictionaries) that looks like:
$$\vh_k = \arg \max_{\vd \in \mathcal{D}} P(\vd | \mathcal{H}_k, \vy )$$
where we select an atom based on the atoms already selected, and the original signal. In other words, the $$(k+1)$$th atom will be the most fit to “explain” the $$k$$ atoms we have already selected, and our original observation. This of course involves some idea of the structures we expect for a given class of signals, such as harmonics, long-held tones, and some idea of model order.

Perhaos toward this end, I look forward to reviewing the following papers:

• T. Blumensath and M. E. Davies, “Monte Carlo methods for adaptive sparse approximations of time-series,” IEEE Trans. Signal Process., vol. 55, pp. 4474-4486, Sep. 2007.
• H. Zayyani, M. Babaie-Zadeh, and C. Jutten, “Bayesian pursuit algorithm for sparse representation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., (Taipei, Taiwan), pp. 1549-1552, Apr. 2009.
• M. Elad and I. Yavneh, “A plurality of sparse representations is better than the sparsest one alone,” IEEE Trans. Info. Theory, vol. 55, pp. 4701-4714, Oct. 2009. (This article makes no reference to today’s Po’D by Ferrando et al. even though it also deals with signal denoising by greedy pursuits.)

I welcome any other suggestions.