Having successfully understood MOD last week, and gained a proper respect for the importance of initialization and of dealing with evaporating atoms, I ran a set of experiments over the weekend testing MOD on a more difficult problem size. Before, I had MOD run for 1000 iterations to learn dictionaries of cardinality 5 of length-3 atoms from 10 measurements of 1-sparse signals, with OMP as the coding step (which makes it MP, since it only selects the single best atom). We often see that after some iterations the coding error of the dictionary produced by MOD reaches numerical precision. In my new experiments, I have MOD run for 1000 iterations, learning dictionaries of cardinality 128 of length-64 atoms (each sampled from the uniform spherical ensemble) from 256 measurements of 8-sparse signals (non-zero elements distributed Normally), again with OMP as the coding step. (I make sure the collection of 256 sparse signals spans \(\mathbb{R}^{128}\).)
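As a reference for what one iteration does, here is a minimal NumPy sketch of MOD with OMP coding. The function names are mine, and the dead-atom re-seeding at the end is just one simple way to handle evaporating atoms, not necessarily what my experiments used:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_columns(D):
    # Scale each atom to unit Euclidean norm.
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def omp(D, y, s):
    # Plain OMP: greedily select s atoms, re-solving least squares over
    # the current support at every step.
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(s):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x

def mod_step(D, Y, s):
    # One MOD iteration: sparse-code every column of Y with the current
    # dictionary, then update the whole dictionary in one least-squares
    # step, D <- Y X^+ (Moore-Penrose pseudoinverse of the coefficients).
    X = np.column_stack([omp(D, Y[:, n], s) for n in range(Y.shape[1])])
    err = np.linalg.norm(Y - D @ X)  # coding error before the update
    D_new = Y @ np.linalg.pinv(X)
    # An atom that was never used has a zero row in X and comes back as a
    # zero column here; re-seed it randomly so it does not stay dead.
    dead = np.linalg.norm(D_new, axis=0) < 1e-12
    D_new[:, dead] = rng.standard_normal((D.shape[0], int(dead.sum())))
    return normalize_columns(D_new), err
```

Iterating `mod_step` on synthetic data from a known dictionary reproduces the basic behavior discussed here: the coding error falls over the iterations.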

But here I test different stopping conditions for OMP. First, I have OMP select only 8 atoms for coding a measurement. Second, I have OMP select 16 atoms, from which I select the 8 with the largest-magnitude weights, and then orthogonally project the measurement onto the span of these.

Finally, I have OMP select 24 atoms, from which I select 8 and project in the same way.
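The second and third conditions both amount to a prune-and-reproject step after OMP. A minimal sketch of that step as I read it (the OMP routine itself is omitted; `prune_and_project` and its arguments are my own names, with the OMP support passed in as a list of column indices):

```python
import numpy as np

def prune_and_project(D, y, support, s):
    # Given the atoms OMP selected (indices in `support`), take the
    # least-squares weights over that support, keep only the s atoms
    # whose weights have the largest magnitude, and orthogonally project
    # y onto the span of the survivors by re-solving least squares.
    coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
    keep = np.asarray(support)[np.argsort(np.abs(coeffs))[-s:]]
    x = np.zeros(D.shape[1])
    sol, *_ = np.linalg.lstsq(D[:, keep], y, rcond=None)
    x[keep] = sol
    return x
```

Under the second condition this would be called with a 16-atom support and s = 8; under the third, with a 24-atom support.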

I perform each test for 100 realizations. Below are the results for each realization.

In the left plot, we show the mean coding error as a function of MOD iteration for each set of 100 realizations. It is clear that some learning is going on in all three cases. On the right, we plot the distribution (bars) of the final coding errors at iteration 1000, and their cumulative distributions (lines).

It appears that we aid MOD by increasing the number of atoms selected by OMP and then choosing from among them. The three distributions are clearly significantly different.

However, we do not see MOD reach anything close to numerical precision.

Now I wonder: what happens when the sparse vectors are distributed Bernoulli-Rademacher? Do any of the atoms produced by MOD match atoms in the true dictionary? What happens with a highly correlated, e.g., shift-invariant, dictionary?
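For the first two questions, here is how I would set up a quick check. This is a hedged sketch with my own function names: one routine draws Bernoulli-Rademacher coefficient vectors, and the other applies a common atom-recovery criterion (count a true atom as recovered when some learned atom has absolute inner product above, say, 0.99):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_rademacher(K, s):
    # s-sparse coefficient vector in R^K: support chosen uniformly at
    # random, non-zeros equal to +/-1 with equal probability, instead of
    # the Normally distributed amplitudes used above.
    x = np.zeros(K)
    idx = rng.choice(K, size=s, replace=False)
    x[idx] = rng.choice([-1.0, 1.0], size=s)
    return x

def recovered_fraction(D_learned, D_true, thresh=0.99):
    # Fraction of true atoms matched (up to sign and permutation) by some
    # learned atom. Assumes both dictionaries have unit-norm columns.
    G = np.abs(D_learned.T @ D_true)
    return np.mean(G.max(axis=0) > thresh)
```

The matching criterion is insensitive to atom order and sign flips, which MOD cannot be expected to preserve.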

And how does K-SVD perform? And how does the Olshausen and Field approach perform?

All of these questions are motivated by the interesting work in B. Mailhé and M. D. Plumbley, “Dictionary Learning with Large Step Gradient Descent for Sparse Representations,” Proc. 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2012), Tel Aviv, Israel, LNCS 7191, pp. 231-238, March 12-15, 2012.