Hello, and welcome to part 2 of the Po’D from yesterday.

We left off yesterday with the solution element activity decision rule, which brings us to state the pursuit in full.

- Given the \(j\)th-order estimation of the “activity vector” \(\vq(j)\) (where \(q_n = 1\) if \(s_n \ne 0\) and zero otherwise), update it to \(\vq(j+1)\) by the criterion $$\left | \langle \vx, \vd_n \rangle – \mathop{\sum_{i=1}^N}_{i\ne n} s_i’ \langle \vd_i, \vd_n \rangle \right | \mathop{ \mathop{>}_{<} }^{q_n = 1 }_{q_n = 0 } \frac{\sigma_n}{\sigma} \sqrt{2(\sigma^2+\sigma_n^2)\ln \left [ \frac{p}{1-p} \frac{\sqrt{\sigma^2+\sigma_n^2}}{\sigma_n}\right ] }$$
- Compute the \((j+1)\)th-order solution by linear least squares $$\vs(j+1) = \sigma^2 \textrm{diag}\{\vq(j+1)\} \MD^T\left [ \sigma^2 \MD\textrm{diag}\{\vq(j+1)\}\MD^T + \sigma_\vr^2\MI \right ]^{-1}\vx$$
- repeat.

This pursuit is arrived at the authors with the following assumptions:

- each element of the residual \(\vr\) is iid zero mean Gaussian with variance \(\sigma_\vr^2\);
- each element of the solution \(\vs\) has an independent probability \(p\) of being off (inactive); and if on, each element is iid as zero mean Gaussian with variance \(\sigma^2\);
- the error in our estimation of each solution coefficient is iid zero mean Gaussian, and thus the sum of these weighted errors (without the \(n\)th atom) and the inner product with the \(n\)th atom in the dictionary is iid zero mean Gaussian with variance \(\sigma_n^2\).

Thus, the problem remains of estimating the model parameters \(\{p, \sigma^2, \sigma_n^2, \sigma_\vr^2\}\).

With the assumptions made above, the authors propose to estimate the model parameters iteratively between repeating the pursuit steps above.

Clearly, a sane way to estimate \(p\) is from the average number of non-zero elements in \(\vq(j)\) (the authors made a mistake in the equation here): $$1 – \hat p = \mathbf{1}^T\vq(j)/N.$$

The authors propose to estimate the variance of the solution coefficients by $$\hat \sigma^2 = ||\vs(j)||_2^2/N$$ (although I don’t understand why it is not divided by N-1 … maybe we are assuming N is very large).

As for the residual variance, we can similarly estimate it with $$\hat \sigma_\vr^2 = ||\vx – \MD\vs(j)||_2^2/K$$ where \(K\) is the length of our signal \(\vx\). (Here I think the denominator should be \(K-1\) since \(K \ll N\).)

Finally, for the atom-dependent variance \(\sigma_n^2\), the authors propose to estimate it by

$$\sigma_n^2 = \hat \sigma_\vr^2 + \mathop{\sum_{i=1}^N}_{i\ne n} \langle \vd_i, \vd_n \rangle \alpha^{j}\sigma_{n,e}^2$$

where \(\sigma_{n,e}^2\) is the variance of a signal solution element estimate \(s_n’ – s_n\), and \(0 \le \alpha \le 1\) is some memory term controlling the convergence of this estimate.

The authors implement this pursuit in MATLAB, and compare its performance in recovering sparse solutions with other pursuit algorithms, such as the matching pursuits, gradient pursuit, and convex optimization with \(\ell\)_1 norm. In their experiments, they set \(N=512\), \(K=256\), \(\sigma^2 = 1\), and \(p=0.9\).

Their dictionary is randomly generated, and they synthesize several realizations of the input signal with different noise variances.

They run their algorithm for 20 iterations.

For the results of all the algorithms, the authors compare the solution SNR (\(||\vs(j)||_2^2/||\vs – \vs(j)||_2^2\)) as a function of the input SNR (\(\sigma^2/\sigma_\vr^2\)), and find that their pursuit maximizes solution SNR for nearly all input SNR by at most 4 dB.

They also find that their pursuit takes slightly more computation time than Expectation Maximization, and significantly more time than all other algorithms tested.

Now, having gone through all of that: Is this approach usable for finding sparse representations of audio signals in terms of overcomplete dictionaries of time-frequency atoms?

I do not think so.

First of all, I am extremely wary of the assumptions made.

I do not think that in this case we can make any of the assumptions above because audio signals have a lot of time-frequency structure, and thus many of the elements of the solution vector will not be independent (e.g., harmonics).

Also, the solution elements will not be identically distributed at all (active with probability \(1-p\)) because for many signal classes some frequencies and time-structures are much more favored than others.

Thus, this renders irrelevant the simple solution element activity decision rule of this paper because we cannot so easily generate posterior distribution for the hypotheses that a given element is active.

Finally, with such high coherent dictionaries as a union of multiple Gabor frames, the inner products between the atoms will greatly affect our uncertainty in the any solution element estimation.

On top of all of this,

I can hardly wait for an MP decomposition to finish, let alone this algorithm of much greater complexity.

However, I wonder if one could use MP to initialize parameters, and then refine the model with a Bayesian inference method?

Still, I shudder to think of the thought of the computer housekeeping required for such a task with dictionaries having millions of atoms, and Gramians that are millions by millions…