# Dictionary learning with the Method of Optimal Directions

In K. Engan, S. O. Aase, and J. H. Husøy, “Method of optimal directions for frame design”, Proc. ICASSP, Vol. 5, pp. 2443 – 2446, 1999, the authors propose what is essentially a dictionary learning algorithm inspired by the generalized Lloyd algorithm.
Given a set of $$L$$ measurements $$\mathcal{U} := \{\vu_i \in \mathcal{R}^m\}$$,
expressed as the columns in the matrix $$\MU$$,
we want to learn a set of vectors in $$\mathcal{R}^m$$ that minimize some distortion.
So the generalized Lloyd algorithm proposes the following steps.

1. Define the number of atoms $$N$$ desired, and the order $$s$$ to use for each approximation of the elements of $$\mathcal{U}$$, and $$i \leftarrow 0$$.
2. Create an initial dictionary of $$N$$ atoms, expressed as a matrix $$\MPhi^{(0)}$$.
3. With $$\MPhi^{(i)}$$ and any algorithm of your choosing, find an $$s$$-order approximation of each element in $$\mathcal{U}$$ such that $$\tilde\vu_j := \MPhi^{(i)}\vx_j$$, where $$\vx_j \in \mathcal{R}^N$$ contains the $$s$$ non-zero approximation weights and zeros in all other elements; define the set of residuals $$\mathcal{E} := \{\vr_j := \vu_j - \tilde\vu_j \}$$.
4. Adjust the dictionary $$\MPhi^{(i+1)} \leftarrow f(\MU,\MPhi^{(i)},\MX)$$
5. $$i \leftarrow i+1$$
6. Go to step 3.
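The steps above can be sketched as a generic loop with the coder and the dictionary update left as pluggable functions. This is only a skeleton under my own naming conventions (`lloyd_dictionary`, `code_fn`, `update_fn` are not from the paper):

```python
import numpy as np

def lloyd_dictionary(U, N, s, code_fn, update_fn, iters=10, seed=0):
    """Generic skeleton of steps 1-6 above (names/signatures are illustrative).

    U : (m, L) matrix of measurements, one per column
    code_fn(Phi, U, s) -> (N, L) matrix X with at most s non-zeros per column
    update_fn(U, Phi, X) -> adjusted (m, N) dictionary
    """
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((U.shape[0], N))   # step 2: random initial atoms
    for _ in range(iters):                       # steps 3-6
        X = code_fn(Phi, U, s)                   # step 3: s-order approximation
        Phi = update_fn(U, Phi, X)               # step 4: dictionary adjustment
    return Phi
```

The stopping rule is left implicit here (a fixed iteration count), since the listed steps loop forever; in practice one stops when the residual norms plateau.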

While step 4 can be done in many ways,
the method of optimal directions (MOD) prescribes a specific approach.

MOD wants, as do many sane algorithms, to make the mean of the set of residual norms $$\{\|\vr_j\|_2\}$$ as small as possible.
So it takes the residuals and finds an adjustment matrix $$\MPsi \in \mathcal{R}^{m \times N}$$ solving the problem
$$\begin{multline} \min_\MPsi \sum_j \|\vr_j - \MPsi\vx_j\|_2^2 = \min_\MPsi \sum_j \|\vu_j - (\MPhi^{(i)} + \MPsi)\vx_j\|_2^2 \\ = \min_\MPsi \|\MU - (\MPhi^{(i)} + \MPsi)\MX\|_\textrm{FRO}^2 \\ = \min_\MPsi \textrm{tr}\{[\MU - (\MPhi^{(i)} + \MPsi)\MX][\MU - (\MPhi^{(i)} + \MPsi)\MX]^T\} \\ = \min_\MPsi \textrm{tr} [(\MPhi^{(i)} + \MPsi)\MX \MX^T(\MPhi^{(i)} + \MPsi)^T ] - 2 \textrm{tr} [(\MPhi^{(i)} + \MPsi)\MX \MU^T] \\ = \min_\MPsi \textrm{tr} [ \MPsi\MX \MX^T\MPsi^T] + 2\textrm{tr} [ \MPsi (\MX \MX^T(\MPhi^{(i)})^T - \MX\MU^T)] \end{multline}$$
where $$\MX$$ is the matrix of weights found by the approximation,
the big FRO denotes the Frobenius norm, and the last two lines drop the terms that do not depend on $$\MPsi$$.
Taking the derivative with respect to $$\MPsi$$ and setting to zero yields
$$\MPsi\MX\MX^T = \MU\MX^T – \MPhi^{(i)}\MX\MX^T$$
from which we see
$$\MPhi^{(i)} + \MPsi = \MU\MX^T(\MX\MX^T)^{-1}$$
as long as the outer product of the coefficients $$\MX\MX^T$$ is non-singular.

With this, then, MOD prescribes in the dictionary adjustment step
$$f(\MU,\MPhi^{(i)},\MX) := \MU\MX^T(\MX\MX^T)^{-1}$$
which is extremely strange to me since it only involves the original signals and the sparse coefficients — not to mention the discomfort that comes with knowing $$\MX$$ is sparse.
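In numpy, this closed-form update is a one-liner; a minimal sketch (the function name `mod_update` is mine, and solving the transposed least-squares system avoids forming $$(\MX\MX^T)^{-1}$$ explicitly):

```python
import numpy as np

def mod_update(U, X):
    """MOD dictionary update: closed-form minimizer of ||U - Phi X||_F^2.

    U : (m, L) matrix of measurements (columns are the signals u_j)
    X : (N, L) matrix of sparse coefficients from the coding step
    Returns the updated (m, N) dictionary Phi = U X^T (X X^T)^{-1}.
    """
    # lstsq solves min ||X^T Phi^T - U^T||_F, which is the same problem
    # transposed; it also gracefully handles a rank-deficient X.
    return np.linalg.lstsq(X.T, U.T, rcond=None)[0].T
```

When $$\MU = \MPhi\MX$$ exactly and $$\MX$$ has full row rank, this recovers $$\MPhi$$ itself.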

But now that I have had lunch and time to digest it, this update makes complete sense to me. It is nothing more than a weighted average of the measurements, where the $$n$$th column of $$\MX^T(\MX\MX^T)^{-1}$$ holds the weights forming the $$n$$th dictionary atom from the associated measurements.
And if it happens that $$\MX\MX^T$$ is singular, then the rows of $$\MX$$ are linearly dependent.
By the SVD of $$\MX = \MG\Sigma\MV^T$$, we have
$$(\MPhi^{(i)}+ \MPsi)\MG\Sigma\Sigma^T\MG^T = \MU\MV\Sigma^T\MG^T$$
and thus, no matter the rank of $$\MX$$ (replacing the inverse with the Moore-Penrose pseudoinverse when necessary), the dictionary update of MOD is
$$\MPhi^{(i)}+ \MPsi = \MU\MV\Sigma^T\MG^T(\MG\Sigma\Sigma^T\MG^T)^\dagger = \MU\MV\Sigma^T(\Sigma\Sigma^T)^\dagger \MG^T = \MU\MV\Sigma^\dagger \MG^T$$
where $$\Sigma^\dagger$$ is just an $$L\times N$$ diagonal matrix composed of the reciprocals of the non-zero elements in $$\Sigma$$, i.e., $$[\Sigma^\dagger]_{nn} = 1/[\Sigma]_{nn}$$ if $$[\Sigma]_{nn} > 0$$, and zero otherwise, for $$n = 1, \ldots, N$$.
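The rank-robust SVD form can be checked numerically against $$\MU\MX^+$$; a small sketch with hypothetical sizes ($$m = 6$$, $$N = 4$$, $$L = 10$$) and rows of $$\MX$$ made deliberately dependent so that $$\MX\MX^T$$ is singular:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((6, 10))
X = rng.standard_normal((4, 10))
X[3] = X[0] + X[1]                     # dependent rows, so X X^T is singular

# X = G Sigma V^T; build the L x N matrix Sigma^dagger described above.
G, s, Vt = np.linalg.svd(X)            # s holds the N singular values
tol = 1e-10 * s.max()
Sigma_dag = np.zeros((X.shape[1], X.shape[0]))   # L x N
for n, sv in enumerate(s):
    if sv > tol:
        Sigma_dag[n, n] = 1.0 / sv

Phi_new = U @ Vt.T @ Sigma_dag @ G.T   # the update U V Sigma^dagger G^T

# Sanity check: this is exactly U X^+, with X^+ the Moore-Penrose pseudoinverse.
assert np.allclose(Phi_new, U @ np.linalg.pinv(X, rcond=1e-10))
```

The tolerance cutoff (`1e-10` here, an arbitrary choice) decides which singular values count as "non-zero".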

So, going the distance now, given a set of $$L$$ measurements $$\mathcal{U}$$, the MOD for dictionary learning proposes the following steps to learn a dictionary of
at most $$N \le L$$ atoms:

1. Define the maximum order $$s \le m$$ to use for each approximation of the measurements. Set $$i \leftarrow 0$$.
2. Create an initial dictionary of $$N$$ atoms, expressed as a matrix $$\MPhi^{(0)}$$.
3. With $$\MPhi^{(i)}$$ and any coding algorithm of your choosing, find an $$s$$-order approximation of each element in $$\mathcal{U}$$ such that $$\tilde\vu_j := \MPhi^{(i)}\vx_j$$, where $$\vx_j \in \mathcal{R}^N$$ contains the at most $$s$$ non-zero approximation weights and zeros in all other elements.
4. Compute the SVD $$\MX = \MG\Sigma \MV^T$$.
5. Adjust the dictionary $$\MPhi^{(i+1)} \leftarrow \MU\MV\Sigma^\dagger \MG^T$$
6. $$i \leftarrow i+1$$
7. Go to step 3.

The quality of the results, of course, depends heavily upon the coding algorithm used in step 3, as well as the maximum prescribed sparsity $$s$$.
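Putting the whole loop together, here is a compact sketch of the seven steps, using a basic orthogonal matching pursuit as the (interchangeable) coder in step 3. All names and the unit-norm random initialization are my own choices, not prescriptions of the paper:

```python
import numpy as np

def sparse_code(Phi, u, s):
    """Order-s greedy coding (basic orthogonal matching pursuit); any coder works here."""
    x = np.zeros(Phi.shape[1])
    r, support = u.copy(), []
    coef = np.zeros(0)
    for _ in range(s):
        corr = np.abs(Phi.T @ r)
        corr[support] = -1.0                    # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        coef = np.linalg.lstsq(Phi[:, support], u, rcond=None)[0]
        r = u - Phi[:, support] @ coef          # orthogonalized residual
    x[support] = coef
    return x

def mod_learn(U, N, s, iters=10, seed=0):
    """Sketch of the full MOD loop (steps 1-7 above)."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((U.shape[0], N))  # step 2: random initial dictionary
    Phi /= np.linalg.norm(Phi, axis=0)
    X = np.zeros((N, U.shape[1]))
    for _ in range(iters):                      # steps 3-7
        X = np.column_stack([sparse_code(Phi, u, s) for u in U.T])   # step 3
        Phi = U @ np.linalg.pinv(X)             # steps 4-5: U V Sigma^dagger G^T
    return Phi, X
```

Since `pinv` computes $$\MX^+$$ via the SVD, the dictionary update line is exactly the rank-robust form derived above.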

## 3 thoughts on “Dictionary learning with the Method of Optimal Directions”

1. Andru Putra says:

Brilliantly explained. Thanks for making it comprehensible for dummies, like me. I bookmarked your “blog” for further reading. Thanks Bob!


2. Alex Nelson says:

In other words, the dictionary update is just Phi = U X+, where X+ is the pseudoinverse of X. Correct?
