# Paper of the Day (Po’D): Cyclic Matching Pursuit Edition

Hello, and welcome to Paper of the Day (Po’D): Cyclic Matching Pursuit Edition. Today’s paper is: M. G. Christensen and S. H. Jensen, “The cyclic matching pursuit and its application to audio modeling and coding,” in Proc. Asilomar Conf. Signals, Syst., Comput., (Pacific Grove, CA), Nov. 2007.
Note that cyclic matching pursuit (CMP) is not related to complementary matching pursuit (CMP); but in this post I will refer to the former by CMP, and to the latter as TOCMP (the other CMP).
Good short acronyms are hard to come by.

Sparse approximation by matching pursuit (MP) is essentially analysis by iterative granular synthesis:
each atom is selected and subtracted from an intermediate residual,
and this residual is used to select the next atom, ad infinitum,
or at least until Eyjafjallajökull is plugged (which might make “sparse” meaningless).
In its atom selection, MP can make some very poor decisions,
which leads to diminished efficiency and meaningfulness in the signal model.
What I mean by diminished efficiency is that the signal model includes more waveforms than it needs;
and what I mean by diminished meaningfulness is that the signal model
includes atoms that do not accurately portray the characteristics of the signal,
e.g., energy at a time-frequency location where there is none.
Orthogonal MP (OMP), and optimized OMP (OOMP, also known as
“orthogonal least squares,” and “MP with prefitting”),
address the problem of diminished efficiency by
adjusting the weights of every atom so far selected at each step.
(OOMP takes this into consideration when selecting the next atom,
while OMP does not.)
With this change to the model weights,
OMP and OOMP are guaranteed to outperform MP
with respect to the signal-to-residual-energy ratio
in all cases but the most trivial.
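To make the weight re-fitting concrete, here is a minimal numpy sketch of OMP (the function name and structure are my own, not taken from any cited paper); the point to notice is the joint least-squares re-solve over all selected atoms at each step, which plain MP omits:

```python
import numpy as np

def omp(x, D, L):
    """Minimal OMP sketch: select the atom most correlated with the
    residual, then re-solve least squares over ALL atoms selected so
    far. (Plain MP would keep the earlier weights fixed instead.)"""
    residual = x.copy()
    selected = []
    for _ in range(L):
        # atom selection: maximize |<r, d_k>| (columns assumed unit-norm)
        k = int(np.argmax(np.abs(D.conj().T @ residual)))
        selected.append(k)
        # jointly re-fit the weights of every selected atom
        A = D[:, selected]
        w, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ w
    return selected, w
```

With an orthonormal dictionary OMP and MP coincide; the re-fitting only makes a difference when the selected atoms are correlated with one another.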

While this change increases the efficiency of a signal model,
it does not guarantee any increase in its meaningfulness.
Once a poor atom is included in the model by ([O]O)MP, there it remains, a saboteur
camouflaged by the hustle and bustle of the other atoms pushing and pulling
in competition to best fit some waveform —
the shape of which might not even matter
as much as the statistics of the random process that underlies the signal,
or at least the content we are interested in.
All allegorical writing aside, what we want is a means to modify
not only the model weights of a sparse approximation,
but also the parameters of the model elements, for instance,
modulation frequency and phase, atom translation and scale and shape, and so on.

CMP proposes a way to do this based on an iterative estimation framework;
but it only provides a mechanism to adjust atom modulation frequency and phase
for MP using a monoresolution dictionary of Fourier atoms with no translation parameters.
Though this seems restrictive, CMP is proposed within the context of frame-wise audio coding
using sinusoidal models.
Thus, the problem it solves is in the selection of the best frequencies and phases
with respect to a perceptual distortion measure using MP
for a frame of audio data.
Its simplicity is very nice, and ever since I first read the paper
I have been wondering how to extend it to a multiresolution framework.
So, let’s have a look at CMP.

CMP provides a simple means to reassess
the parameters of sinusoidal atoms (frequency, phase, weight) in a signal model built by MP
with a continuous Fourier dictionary.
Unlike ([O]O)MP, it is capable of correcting poorly selected atoms,
i.e., those model elements that diminish efficiency and meaningfulness.
In addition to this, CMP maintains the simple structure of MP,
even while using a dictionary of infinite size.
The periodic review and adjustment of atom parameters
can optimize a signal model built by a greedy pursuit
with respect to some distortion measure,
and thus increase model efficiency and meaningfulness, we hope.

For a given $$K$$-length frame of audio data, $$\vx \in \mathcal{C}^K$$ (here we assume it is in analytic form),
we want to find an $$L$$-order sinusoidal model ($$L \ll K$$)
such that some distortion measure is minimized.
We can thus consider selecting elements from the infinite dictionary of
length-$$K$$ complex sinusoids
$$\mathcal{Z} := \left \{ \vz \in \mathcal{C}^K : \vz = e^{j \omega\vk}, [\vk]_k = k-1, k \in \{1, 2, \ldots, K \} \right \}_{\omega \in [-\pi, \pi)}$$
and combine $$L$$ of these with weights $$\va(L) \in \mathcal{C}^L$$.
Thus, our $$L$$-th order approximation of $$\vx$$ is
$$\hat{\vx}(L) = \left [ e^{j\omega_0 \vk} | e^{j\omega_1 \vk} | \cdots | e^{j\omega_{L-1} \vk} \right ] \left [ \begin{array}{@{}c@{}} a_0 \\ a_1 \\ \vdots \\ a_{L-1} \end{array} \right ] = \MZ(L)\va(L)$$
and we define some distortion measure
$$J[\vr(L)] := || \MW [ \vx - \hat{\vx}(L) ] ||_2 ^2 = || \MW [\vx -\MZ(L)\va(L)]||_2 ^2.$$
We take $$\MW$$ to be the identity for now.
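In code, the model matrix and distortion might look like the following minimal numpy sketch (the names `Z_matrix` and `distortion` are mine, chosen for illustration):

```python
import numpy as np

def Z_matrix(omegas, K):
    """Stack the complex sinusoids e^{j*omega*k}, k = 0..K-1, as columns."""
    k = np.arange(K)
    return np.exp(1j * np.outer(k, np.asarray(omegas)))

def distortion(x, omegas, a, W=None):
    """J[r(L)] = || W (x - Z(L) a) ||_2^2, with W = I by default."""
    r = x - Z_matrix(omegas, len(x)) @ np.asarray(a)
    if W is not None:
        r = W @ r
    return np.linalg.norm(r) ** 2
```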

Building such a model from the infinite dictionary $$\mathcal{Z}$$
requires estimating the set of parameters: $$\{ (\omega_l, a_l) \}_{l=0}^{L-1}$$.
CMP proposes the following iterative procedure
to estimate these $$L$$ (complex) amplitudes and frequencies;
subsequent steps will then cycle through these
estimates to refine them to further decrease the distortion.
Let $$\vr(0) = \vx$$, and consider all possible distortions
of the $$(l+1)$$th residual ($$l \ge 0$$):
$$J[\vr(l+1)] = \left | \left | \vr(l) - a e^{j\omega\vk} \right | \right |_ 2 ^2.$$
Expanding this and taking the partial derivative
with respect to $$a^*$$, we find that the best $$a$$ is
$$a_o = \frac{ \langle \vr(l), e^{j\omega \vk}\rangle}{||e^{j\omega \vk}||_2^2}.$$
Substituting this into the distortion equation gives
$$J[\vr(l+1)] = ||\vr(l)||_2^2 - \frac{ \left | \langle \vr(l), e^{j\omega \vk}\rangle \right |^2}{||e^{j\omega \vk}||_2^2}$$
which is minimized by maximizing the right-hand term.
Thus, the best frequency $$\omega \in [-\pi, \pi)$$ is that given by
$$\hat{\omega}_l = \arg \max_\omega \frac{|\langle \vr(l), e^{j\omega \vk}\rangle|}{||e^{j\omega \vk}||_2}.$$
This result is no surprise:
the numerator is essentially the
discrete-time FT (DTFT)
of $$\vr(l)$$,
and the denominator is a scaling
that accounts for the weighting of $$e^{j\omega \vk}$$ by $$\MW$$,
which here is $$\sqrt{K}$$ since $$\MW = \MI$$.
This frequency estimate can be substituted into the equation
for the optimal weight to find the best $$\hat a_l(\hat\omega_l)$$.
With these estimated parameters,
we define the new residual
$$\vr(l+1) := \vr(l) - \hat{a}_l(\hat\omega_l) e^{j\hat\omega_l\vk}$$,
and repeat the procedure above $$L-1$$ times
to obtain the set of estimates $$\{(\hat \omega_l, \hat{a}_l) \}_{l=0}^{L-1}$$.
This is essentially MP using the infinite dictionary $$\mathcal{Z}$$;
and being MP, it is not likely that this set of estimates will
give the best signal model with respect to the distortion measure.
Thus, starting from this set of $$L$$ initial estimates,
CMP refines each amplitude and frequency to decrease the distortion further.
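Here is one way to sketch this initial MP stage in numpy, approximating the continuous maximization over $$\omega$$ with a zero-padded FFT (the oversampling factor, the grid approximation, and the function name are my choices, not the paper's):

```python
import numpy as np

def mp_init(x, L, oversample=8):
    """MP over an (approximately) continuous Fourier dictionary.
    The maximization over omega in [-pi, pi) is approximated on a grid
    of oversample*K frequencies via a zero-padded FFT."""
    K = len(x)
    N = oversample * K
    k = np.arange(K)
    r = x.astype(complex)
    omegas, amps = [], []
    for _ in range(L):
        # <r, e^{j w k}> = sum_k r[k] e^{-j w k}: exactly what fft samples
        spectrum = np.fft.fft(r, n=N)
        m = int(np.argmax(np.abs(spectrum)))
        w = 2 * np.pi * m / N
        if w >= np.pi:           # map to [-pi, pi)
            w -= 2 * np.pi
        atom = np.exp(1j * w * k)
        a = np.vdot(atom, r) / K  # optimal weight <r, z> / ||z||^2
        r = r - a * atom
        omegas.append(w)
        amps.append(a)
    return np.array(omegas), np.array(amps), r
```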

So define $$\theta_l := \{\omega_l, a_l\}$$ to be the
parameters associated with the $$l$$th complex sinusoid
in the signal model;
and let $$\{\hat{\theta}_l^{(i)}\}_{l=0}^{L-1}$$
be the set of $$L$$ parameter estimates
at the $$i$$th iteration of CMP.
With these estimates,
we can iteratively refine each one
by minimizing the following $$L$$ distortion measures:

1. $$\hat{\theta}_0^{(i+1)} = \arg \min_{\theta_0} J(\theta_0, \{\hat{\theta}_l^{(i)}\}_{l=1}^{L-1})$$
2. $$\hat{\theta}_1^{(i+1)} = \arg \min_{\theta_1} J(\hat{\theta}_0^{(i+1)}, \theta_1, \{\hat{\theta}_l^{(i)}\}_{l=2}^{L-1})$$
3. $$\vdots$$
4. $$\hat{\theta}_{L-1}^{(i+1)} = \arg \min_{\theta_{L-1}} J(\{\hat{\theta}_l^{(i+1)}\}_{l=0}^{L-2}, \theta_{L-1})$$

where we define, e.g., the distortion measure given the new estimate of the $$0$$th sinusoid,
the parameters of the $$1$$st sinusoid, and the previous estimates of the other $$L-2$$ sinusoids, as
$$J(\hat{\theta}_0^{(i+1)}, \theta_1, \{\hat{\theta}_l^{(i)}\}_{l=2}^{L-1}) := \left | \left | \vx – \left ( \hat a_0^{(i+1)} e^{j \hat\omega_0^{(i+1)}\vk } + a_1 e^{j\omega_1\vk} + \sum_{l=2}^{L-1} \hat a_l^{(i)} e^{j \hat\omega_l^{(i)}\vk }\right ) \right | \right |_2^2.$$
With this measure, we estimate the optimal modulation frequency of the $$1$$st sinusoid,
and then compute its complex amplitude.
We can then repeat this procedure over the entire model several times
until some stopping criterion is met.
It really is that simple.
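The refinement cycle can be sketched as follows (again my own minimal numpy reading of the procedure, with $$\MW = \MI$$ and the continuous frequency search approximated by a zero-padded FFT; names are mine):

```python
import numpy as np

def cmp_refine(x, omegas, amps, cycles=10, oversample=8):
    """One reading of CMP's refinement stage: cycle through the sinusoids;
    for each, add it back into the residual, re-estimate its frequency and
    complex amplitude against that partial residual, and subtract it."""
    K = len(x)
    N = oversample * K
    k = np.arange(K)
    omegas = np.array(omegas, dtype=float)
    amps = np.array(amps, dtype=complex)
    # current residual under the full model
    r = x - sum(a * np.exp(1j * w * k) for w, a in zip(omegas, amps))
    for _ in range(cycles):
        for l in range(len(omegas)):
            # residual with the l-th sinusoid put back in
            r_l = r + amps[l] * np.exp(1j * omegas[l] * k)
            # re-estimate frequency on the fine FFT grid
            spectrum = np.fft.fft(r_l, n=N)
            m = int(np.argmax(np.abs(spectrum)))
            w = 2 * np.pi * m / N
            if w >= np.pi:
                w -= 2 * np.pi
            atom = np.exp(1j * w * k)
            a = np.vdot(atom, r_l) / K  # optimal weight <r_l, z> / ||z||^2
            omegas[l], amps[l] = w, a
            r = r_l - a * atom
    return omegas, amps, r
```

Starting from slightly mistuned frequencies, a few cycles are typically enough to pull well-separated sinusoids back onto their true parameters and shrink the residual.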

The figure above shows how CMP stacks up to MP using the same dictionary
(here using perceptual weighting à la R. Heusdens, R. Vafin, and W. B. Kleijn,
“Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,”
IEEE Signal Process. Lett., vol. 9, no. 8, pp. 262-265, 2002).
The model order is 100, and the frame size is 30 ms.
CMP refined the model parameters 10 times.
While the waveform structure of the model created by CMP is
more similar to the original than that of MP,
we can see the standard problems of pre-echo, and some distortions in the tail of the sound.
It is clear that CMP is able to reduce distortion more than MP as a function of model order.

One question I have concerns the sensitivity of CMP to the model order,
which is really a problem for all signal models.
The choice of $$L$$, made before decomposition,
will greatly affect the benefits of the optimization steps in CMP.
Consider a signal composed of
$$L$$ groups of two closely spaced, equal-amplitude sinusoids:
if we make the model order $$L$$, the frequencies estimated by CMP
will sit centered between each pair, no matter how much refinement we do.
So maybe an additional step can be added where the model order is increased,
and the changes to the model parameters and distortion are gauged
in an attempt to find the best model order.

The ability to refine a signal model is important.
OMP and OOMP do this only when it comes to approximation weights;
but more is necessary because it is not just the weight of an atom
that can cause problems.
CMP extends this refinement to atom frequency as well,
but only within the context of a monoresolution dictionary.
Can and should CMP be extended to signal models built from
multiscale dictionaries?
This is a question I have been pondering for a long time.