Hello, and welcome to the Paper of the Day (Po’D): Stereo Matching Pursuit Edition. Today’s interesting paper comes from almost eight years ago: R. Gribonval, “Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 3, (Orlando, FL), pp. 3057-3060, May 2002.

Consider a matrix of \(I\) real length-\(K\) sampled audio signals \(\MX := [\vx_1| \vx_2 | \cdots | \vx_I ]\) mixed into a two-channel “stereo” signal \(\MY = [\vy_l | \vy_r ]\), modeled in the z-domain by $$\MY(z) = \MX(z) \Lambda \mathbf{\Theta}(z) + \MN(z)$$ where \(\Lambda := \textrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_I)\) is a diagonal square matrix of real scalars that amplify each of the sources, and $$ \mathbf{\Theta}(z) := \left [ \begin{array}{ccc}

\cos \theta_1 & \sin \theta_1z^{-n_1} \\

\cos \theta_2 & \sin \theta_2z^{-n_2} \\

\vdots & \vdots \\

\cos \theta_I & \sin \theta_Iz^{-n_I} \\

\end{array} \right ]$$ denotes the stereo placements (\(\theta_i \in [0,\pi/2]\)) and integer time-delays (\(n_i \in \MZ\)) of the sources. The matrix \(\MN = [\vn_l | \vn_r ]\) is noise independent of the sources and mixing. In this model we are assuming that the only filtering of the sources is an integer time-delay and a constant gain for each channel. The challenge is to discover all of the \((3+K)I + 2K\) unknown components of this model, i.e., \(\MX, \Lambda\), and \(\mathbf{\Theta}(z)\), using just the \(2K\) measurements \(\MY\). This is clearly an underdetermined problem no matter what \(I \ge 1\) is. In this paper, Gribonval proposes an estimation method that uses greedy iterative descent sparse approximation with a parameterized dictionary of stereo atoms.

Define a dictionary \(\mathcal{D} = \{ \vg \in \MR^K : ||\vg||_2 = 1\}\) of time-localized atoms such that \(\text{span} \; \mathcal{D} = \MR^K\), and consider that we can model each column of \(\MX\) in terms of these atoms, which means $$\lim_{H \to |\mathcal{D}|} \Norm \vx_i – \sum_{k=1}^H \alpha_{i,k} \vg_k \Norm = 0.$$ With this, our model of \(\MY\) becomes $$\MY(z) = \sum_{i=1}^I \lambda_i \sum_k \alpha_{i,k} [ \cos \theta_i \vg_k(z) | \sin \theta_i z^{-n_i} \vg_k(z) ] + \MN(z).$$ Now consider defining a dictionary of “stereo” atoms from those in \(\mathcal{D}\): $$\mathcal{D}_s(z) := \{ \vs(z) := [\cos \phi\vg(z) | \sin \phi z^{-m} \vg(z) ] : \phi \in [0,\pi/2], |m| \le m_{\max}, \vg \in \mathcal{D}\}$$ which trivially spans the space in which \(\MY \in \MR^{K\times 2}\) exists. This means we can model \(\MY\) in terms of these atoms as we did with each \(\vx_i\), i.e., $$\lim_{L \to |\mathcal{D}_s|} \Norm \MY – \sum_{l=1}^L \beta_{l} \vs_l \Norm_F = 0.$$ If we assume that

$$\sum_{l=1}^L \beta_{l} \vs_l(z) = \sum_{l=1}^L \beta_{l} [\cos \phi_l \vg_l(z) | \sin \phi_l z^{-m_l} \vg_l(z) ]$$ $$\approx \sum_{i=1}^I \lambda_i \sum_k \alpha_{i,k} [ \cos \theta_i \vg_k(z) | \sin \theta_i z^{-n_i} \vg_k(z) ]$$ then we can begin to tease out the various unknowns in our model of the stereo signal. Notice too that the noise signal is no longer being considered as we will assume it is not represented in the \(L\)th-order model of \(\MY\). If we assume that each atom in the approximation of \(\MY\) corresponds to one atom in the \(k\)th source in the model of \(\MY\),

then we can estimate the number of sources, \(\hat I\), by looking for clusters in the sets \(\{\phi_l\}\) and \(\{m_l\}\), and then find a set of disjoint partitions to estimate each source

$$\widehat{\lambda_i \vx_i} = \sum_{l \in \mathcal{L}_i} \beta_l \vg_l$$ where \(\mathcal{L}_i\) contains the indices of the \(i\)th cluster. The problem now is to generate the decomposition of \(\MY\).

To solve this, Gribonval proposes stereo matching pursuit, which goes as follows using the \((M-1)\)th-order residual \(\MR_\MY^{(M-1)}\), where \(\MR_\MY^{(0)} = \MY\).

- Find the atom and its delay having the largest magnitude projection onto the residual: $$(\vg_M, m_M) = \arg \max_{\vg \in \mathcal{D}, |m| \le m_{\max}} || \MP_{\vg,m} \MR_\MY^{(M-1)} ||$$ where \(\MP_{\vg,m}\vy = [p_l, p_r]\) is the orthogonal projection of \(\vy\) onto the space spanned by \(\vg\) in the left channel, and its \(m\)-delayed version in the right channel.
- Set \(\beta_M = || \MP_{\vg_M,m_M} \MR_\MY^{(M-1)} ||\) and \(e^{j\phi_M} = (p_l + jp_r)/\sqrt{p_l^2 + p_r^2}\).
- Compute the new residual: $$\MR_\MY^{(M)}(z) = \MR_\MY^{(M-1)}(z) – \beta_M [ \cos \phi_M \vg_M | \sin \phi_M z^{-m_M} \vg_M].$$
- Repeat.

Gribonval’s experiments used the much-loved LastWave implementation of matching pursuit, with a discretized dictionary of complex Gabor atoms. (Of historical note, he says that 2000 iterations of stereo MP on a sampled audio signal of length 19,200 samples (8 kHz) took about 30 minutes on a Pentium III 750 MHz laptop. I remember the long waits accompanying the decomposition process; but the interface where one can drag around the atoms and play them back was great! While sparse approximation methods are still far from real-time, there have been some incredible advances since 2002.) His experiment with a mixture of “cello,” “drums,” and “piano,” shows that it can be demixed very well with respect to the best linear demixing.

I wonder though whether the mixture of these sources is completely artificial. If they are from the same music composition, e.g., where they are playing at the same tempo, and are in the same key, etc., then their correlation might break the assumption that each atom corresponds to one source. Also, since the clustering is sensitive to the number of atoms used to represent each source, I wonder how this method would fare if “piano” was replaced by a single sinusoid, which presumably would require far fewer atoms to represent than the other wideband sources. Regardless, with reference to the problem I stated before, we see here that under the condition of sparsity in the time-frequency domain, as well as correlation between channels, it is not unreasonable to assume “one atom one source” in a sparse approximation of a mixture.