Hello, and welcome to Paper of the Day (Po’D): Cyclic Matching Pursuit Edition. Today’s paper is: M. G. Christensen and S. H. Jensen, “The cyclic matching pursuit and its application to audio modeling and coding,” in Proc. Asilomar Conf. Signals, Syst., Comput., (Pacific Grove, CA), Nov. 2007.

Note that cyclic matching pursuit (CMP) is not related to complementary matching pursuit (CMP); in this post I will refer to the former as CMP, and to the latter as TOCMP (the other CMP).

Good short acronyms are hard to come by.

Sparse approximation by matching pursuit (MP) is essentially analysis by iterative granular synthesis:

each atom is selected and subtracted from an intermediate residual,

and this residual is used to select the next atom, ad infinitum,

or at least until Eyjafjallajökull is plugged (which might make “sparse” meaningless).
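That loop can be sketched in numpy (my own illustration, not code from the paper; `D` is a hypothetical finite dictionary with unit-norm atoms as its columns):

```python
import numpy as np

def matching_pursuit(x, D, n_atoms):
    """Greedy MP over a toy dictionary D whose columns are unit-norm atoms."""
    residual = x.astype(complex).copy()
    indices, weights = [], []
    for _ in range(n_atoms):
        # select the atom most correlated with the current residual
        corr = D.conj().T @ residual
        m = int(np.argmax(np.abs(corr)))
        indices.append(m)
        weights.append(corr[m])
        # subtract its contribution; the new residual drives the next selection
        residual = residual - corr[m] * D[:, m]
    return indices, weights, residual
```

Each pass touches only the current residual, which is what makes MP cheap, and also what lets a poor early selection go uncorrected.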

In its atom selection, MP can make some very poor decisions,

which leads to diminished efficiency and meaningfulness in the signal model.

What I mean by diminished efficiency is that the signal model includes more waveforms than it could;

and what I mean by diminished meaningfulness is that the signal model

includes atoms that do not accurately portray the characteristics of the signal,

e.g., energy at a time-frequency location where there is none.

Orthogonal MP (OMP), and optimized OMP (OOMP, also known as

“orthogonal least squares,” and “MP with prefitting”),

address the problem of diminished efficiency by

adjusting the weights of every atom so far selected at each step.

(OOMP takes this into consideration when selecting the next atom,

while OMP does not.)

With this change to the model weights,

OMP and OOMP are guaranteed to outperform MP

with respect to the signal-to-residual-energy ratio

in all cases but the most trivial.

While this change increases the efficiency of a signal model,

it does not guarantee any increase in its meaningfulness.
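In numpy terms, that weight adjustment is just a least-squares re-fit over the atoms selected so far (a minimal sketch of the OMP projection step, assuming a matrix `D_sel` holding the selected atoms as columns):

```python
import numpy as np

def refit_weights(x, D_sel):
    """Least-squares re-fit of all selected atoms' weights (the OMP step)."""
    a, *_ = np.linalg.lstsq(D_sel, x, rcond=None)
    # the resulting residual is orthogonal to the span of the selected atoms
    return a, x - D_sel @ a
```

Note that only the weights move; the atoms themselves, good or bad, stay fixed.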

Once a poor atom is included in the model by ([O]O)MP, there it remains, a saboteur

camouflaged by the hustle and bustle of the other atoms pushing and pulling

in competition to best fit some waveform —

the shape of which might not even matter

as much as the statistics of the random process that underlies the signal,

or at least the content we are interested in.

All allegorical writing aside, what we want is a means to modify

not only the model weights of a sparse approximation,

but also the parameters of the model elements, for instance,

modulation frequency and phase, atom translation and scale and shape, and so on.

CMP proposes a way to do this based on an iterative estimation framework;

but it only provides a mechanism to adjust atom modulation frequency and phase

for MP using a monoresolution dictionary of Fourier atoms with no translation parameters.

Though this seems restrictive, CMP is proposed within the context of frame-wise audio coding

using sinusoidal models.

Thus, the problem it solves is in the selection of the best frequencies and phases

with respect to a perceptual distortion measure using MP

for a frame of audio data.

Its simplicity is very nice, and ever since I first read the paper

I have been wondering how to extend it to a multiresolution framework.

So, let’s have a look at CMP.

CMP provides a simple means to reassess

the parameters of sinusoidal atoms (frequency, phase, weight) in a signal model built by MP

with a continuous Fourier dictionary.

Unlike ([O]O)MP, it is capable of correcting poorly selected atoms,

i.e., those model elements that diminish efficiency and meaningfulness.

In addition to this, CMP maintains the simple structure of MP,

even while using a dictionary of infinite size.

The periodic review and adjustment of atom parameters

can optimize a signal model built by a greedy pursuit

with respect to some distortion measure,

and thus increase model efficiency and meaningfulness, we hope.

For a given \(K\)-length frame of audio data, \(\vx \in \mathcal{C}^K\) (here we assume it is in analytic form),

we want to find an \(L\)-order sinusoidal model (\(L \ll K\))

such that some distortion measure is minimized.

We can thus consider selecting elements from the infinite dictionary of

length-\(K\) complex sinusoids

$$\mathcal{Z} := \left \{ \vz \in \mathcal{C}^K : \vz = e^{j \omega\vk},

[\vk]_k = k-1, k \in \{1, 2, \ldots, K \} \right \}_{\omega \in [-\pi, \pi)}$$

and combine \(L\) of these with weights \(\va(L) \in \mathcal{C}^L\).

Thus, our \(L\)-th order approximation of \(\vx\) is

$$\hat{\vx}(L) = \left [ e^{j\omega_0 \vk} | e^{j\omega_1 \vk} | \cdots | e^{j\omega_{L-1} \vk} \right ]

\left [

\begin{array}{@{}c@{}}

a_0 \\

a_1 \\

\vdots \\

a_{L-1}

\end{array}

\right ]

= \MZ(L)\va(L)$$

and we define some distortion measure

$$J[\vr(L)] := || \MW [ \vx - \hat{\vx}(L) ] ||_2 ^2 = || \MW [\vx - \MZ(L)\va(L)]||_2 ^2.$$

We will take \(\MW\) to be the identity for now.
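Concretely, with \(\MW = \MI\) the model and distortion look like this in numpy (a toy sketch; the frequencies, weights, and test signal are arbitrary choices of mine, just to show the shapes):

```python
import numpy as np

K = 16
k = np.arange(K)                              # [k]_k = k - 1 in the post's notation
omegas = np.array([0.3, 1.1, 2.0])            # hypothetical model frequencies
a = np.array([1.0 + 0.5j, 0.8, 0.2 - 0.1j])   # hypothetical complex weights a(L)
Z = np.exp(1j * np.outer(k, omegas))          # K x L matrix Z(L), one sinusoid per column
x_hat = Z @ a                                 # the L-th order approximation

# distortion with W = I: squared l2 norm of the residual
x = np.random.default_rng(0).standard_normal(K) + 0j
J = np.linalg.norm(x - x_hat) ** 2
```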

Building such a model from the infinite dictionary \(\mathcal{Z}\)

requires estimating the set of parameters: \(\{ (\omega_l, a_l) \}_{l=0}^{L-1}\).

CMP proposes the following iterative procedure

to estimate these \(L\) (complex) amplitudes and frequencies;

subsequent steps will then cycle through these

estimates to refine them to further decrease the distortion.

Let \(\vr(0) = \vx\), and consider all possible distortions

of the \((l+1)\)th residual (\(l \ge 0\)):

$$J[\vr(l+1)] = \left | \left | \vr(l) - a e^{j\omega\vk} \right | \right |_2^2.$$

Expanding this and taking the partial derivative

with respect to \(a^*\), we find that the best \(a\) is

$$

a_o = \frac{ \langle \vr(l), e^{j\omega \vk}\rangle}{||e^{j\omega \vk}||_2^2}.

$$

Substituting this into the distortion equation gives

$$

J[\vr(l+1)] = ||\vr(l)||_2^2 - \frac{ \left | \langle \vr(l), e^{j\omega \vk}\rangle \right |^2}{||e^{j\omega \vk}||_2^2}

$$

which is minimized by maximizing the right-hand term.

Thus, the best frequency \(\omega \in [-\pi, \pi)\) is that given by

$$

\hat{\omega}_l = \arg \max_\omega \frac{|\langle \vr(l), e^{j\omega \vk}\rangle|}{||e^{j\omega \vk}||_2}.

$$

This result is no surprise:

the numerator is essentially the

discrete-time FT (DTFT)

of \(\vr(l)\),

and the denominator is a scaling

that takes into account the weighting by \(\MW\)

onto \(e^{j\omega \vk}\),

which here is \(\sqrt{K}\) considering \(\MW = \MI\).

This frequency estimate can be substituted into the equation

for the optimal weight to find the best \(\hat a_l(\hat\omega_l)\).

With these estimated parameters,

we define the new residual

\(\vr(l+1) := \vr(l) - \hat{a}_l(\hat\omega_l) e^{j\hat\omega_l\vk}\),

and repeat the procedure above \(L-1\) times

to provide the set of estimates \(\{(\hat \omega_l, \hat{a}_l) \}_{l=0}^{L-1}\).
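This initialization can be sketched in numpy, approximating the continuous frequency axis with a zero-padded FFT grid (the grid approximation, oversampling factor, and names are my assumptions, not the paper's implementation; a true estimate over \([-\pi,\pi)\) would refine the grid maximum further):

```python
import numpy as np

def fourier_mp(x, L, oversample=8):
    """MP over complex sinusoids, frequencies picked from a dense FFT grid."""
    K = len(x)
    N = oversample * K                      # dense grid over [0, 2*pi), equivalent
    r = x.astype(complex).copy()            # mod 2*pi to the [-pi, pi) in the text
    k = np.arange(K)
    omegas, amps = [], []
    for _ in range(L):
        spectrum = np.fft.fft(r, N)         # <r(l), e^{j w k}> on the grid (the DTFT)
        w = 2 * np.pi * np.argmax(np.abs(spectrum)) / N
        z = np.exp(1j * w * k)
        a = np.vdot(z, r) / K               # optimal weight <r, z> / ||z||^2
        r = r - a * z                       # residual update
        omegas.append(w)
        amps.append(a)
    return np.array(omegas), np.array(amps), r
```

Since all atoms have norm \(\sqrt{K}\), maximizing \(|\langle \vr(l), e^{j\omega\vk}\rangle|\) over the grid is exactly an argmax over the zero-padded FFT magnitude.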

This is essentially MP using the infinite dictionary \(\mathcal{Z}\);

and being MP, it is not likely that this set of estimates will

give the best signal model with respect to the distortion measure.

Thus, starting from this set of \(L\) initial estimates,

CMP refines each amplitude and frequency to decrease the distortion further.

So define \(\theta_l := \{\omega_l, a_l\}\) to be the

parameters associated with the \(l\)th complex sinusoid

in the signal model;

and let \(\{\hat{\theta}_l^{(i)}\}_{l=0}^{L-1}\)

be the set of \(L\) parameter estimates

at the \(i\)th iteration of CMP.

With these estimates,

we can iteratively refine each one

by minimizing the following \(L\) distortion measures:

- $$\hat{\theta}_0^{(i+1)} = \arg \min_{\theta_0} J(\theta_0, \{\hat{\theta}_l^{(i)}\}_{l=1}^{L-1})$$
- $$ \hat{\theta}_1^{(i+1)} = \arg \min_{\theta_1} J(\hat{\theta}_0^{(i+1)}, \theta_1, \{\hat{\theta}_l^{(i)}\}_{l=2}^{L-1})$$
- $$\vdots$$
- $$\hat{\theta}_{L-1}^{(i+1)} = \arg \min_{\theta_{L-1}} J(\{\hat{\theta}_l^{(i+1)}\}_{l=0}^{L-2}, \theta_{L-1})$$

where we define, e.g., the distortion measure given the new estimate of the \(0\)th sinusoid,

the free parameters \(\theta_1\) of the \(1\)st sinusoid, and the previous estimates of the \(L-2\) other sinusoids, as

$$

J(\hat{\theta}_0^{(i+1)}, \theta_1, \{\hat{\theta}_l^{(i)}\}_{l=2}^{L-1}) :=

\left | \left | \vx - \left ( \hat a_0^{(i+1)} e^{j \hat\omega_0^{(i+1)}\vk } + a_1 e^{j\omega_1\vk} + \sum_{l=2}^{L-1} \hat a_l^{(i)} e^{j \hat\omega_l^{(i)}\vk }\right ) \right | \right |_2^2.

$$

With this measure, we estimate the optimal modulation frequency of the \(1\)st sinusoid,

and then compute its complex amplitude.

We can then repeat this procedure over the entire model several times

until some stopping criterion is met.
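The cycle can be sketched on the same dense FFT grid as before: pull one atom out of the model, fold its contribution back into the residual, and re-estimate its frequency and weight against everything else (again my own sketch under the \(\MW = \MI\) and grid assumptions, not the paper's code; `omegas` and `amps` would come from the initial MP pass):

```python
import numpy as np

def cmp_refine(x, omegas, amps, n_cycles=10, oversample=8):
    """Cyclically re-estimate each sinusoid of an MP model, CMP-style."""
    K = len(x)
    N = oversample * K
    k = np.arange(K)
    omegas, amps = omegas.astype(float).copy(), amps.astype(complex).copy()
    # residual with all current atoms removed
    r = x - sum(a * np.exp(1j * w * k) for w, a in zip(omegas, amps))
    for _ in range(n_cycles):
        for l in range(len(omegas)):
            # add atom l back: r_l is the signal minus all *other* atoms
            r_l = r + amps[l] * np.exp(1j * omegas[l] * k)
            spectrum = np.fft.fft(r_l, N)
            w = 2 * np.pi * np.argmax(np.abs(spectrum)) / N
            z = np.exp(1j * w * k)
            a = np.vdot(z, r_l) / K
            omegas[l], amps[l] = w, a
            r = r_l - a * z                 # residual with atom l updated
    return omegas, amps, r
```

Each inner step is exactly one MP selection applied to \(\vx\) minus the other \(L-1\) atoms, so the distortion can never increase, and a poorly placed atom gets a chance to move.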

It really is that simple.

The figure above shows how CMP stacks up to MP using the same dictionary

(here using perceptual weighting à la R. Heusdens, R. Vafin, and W. B. Kleijn,

“Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,”

IEEE Signal Process. Lett., vol. 9, no. 8, pp. 262-265, 2002).

The model order is 100, and the frame size is 30 ms.

CMP refined the model parameters 10 times.

While the waveform structure of the model created by CMP is

more similar to the original than that of MP,

we can see the standard problems of pre-echo, and some distortions in the tail of the sound.

It is clear that CMP is able to reduce distortion more than MP as a function of model order.

One question I have is the sensitivity of CMP to the model order,

which is really a problem of all signal models.

The choice of \(L\), made before decomposition,

will greatly affect the benefits of the optimization steps in CMP.

Consider a signal composed of

\(L\) pairs of closely spaced, equal-amplitude sinusoids:

if we set the model order to \(L\), each frequency estimated by CMP

will sit between a pair no matter how much refinement we do.

So maybe an additional step can be added where the model order is increased,

and the changes to the model parameters and distortion are gauged

in an attempt to find the best model order.

The ability to refine a signal model is important.

OMP and OOMP do this only when it comes to approximation weights;

but more is necessary because it is not just the weight of an atom

that can cause problems.

CMP extends this refinement to atom frequency as well,

but only within the context of a monoresolution dictionary.

Can and should CMP be extended to signal models built from

multiscale dictionaries?

This is a question I have been pondering for a long time.