Hello and welcome to Paper of the Day (Po’D): Content-adaptive Dictionaries Edition. Today’s paper comes from a pile of papers I have been meaning to read more closely. And it completes my 5-paper survey of papers related to the problem I posed last week: N. Cho, Y. Shiu, and C.-C. J. Kuo, “Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD),” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, NY), pp. 287-290, Oct. 2007.

The authors propose a method to separate speech and music source types from a single-channel mixture without reference to any model of the mixing other than a simple sum. They approach this problem by assuming that there exist two only slightly overlapping signal spaces describing speech and music signals. They create a “content-adaptive dictionary” (CAD) to span one of these spaces by combining atoms of a generalized overcomplete time-frequency dictionary \(\mathcal{D}\) into units learned from musical training samples decomposed by matching pursuit (MP). For example, given some training source signal \(\vx_i\) and the set of atoms in its sparse approximation (of some order), \(\mathcal{D}_i = \{ \vg_{\gamma_i} \in \mathcal{D} \}\), the corresponding “atom” in the CAD is the set of unit-norm elements spanned by \(\mathcal{D}_i\), i.e., $$\mathcal{H}_i := \{\vh_i = \MD_i \vb : \vb \in \MR^{|\mathcal{D}_i|}, \|\vh_i\| = 1\},$$ where \(\MD_i\) is the matrix having the atoms of \(\mathcal{D}_i\) as its columns. With several of these subspaces learned, their union produces a music subspace that hopefully overlaps minimally with the speech subspace. Performing sparse approximation (e.g., MP) of a mixture using this CAD then extracts the musical source type from the mixture.
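To make the idea concrete, here is a minimal sketch (not the authors' implementation) of MP where each "atom" is one of these learned subspaces \(\mathcal{H}_i\): at each iteration the residual is projected onto every subspace, and the projection with the largest energy is subtracted. Function names and the QR-based subspace construction are my own illustrative choices.

```python
import numpy as np

def learn_subspace(atoms):
    """Orthonormal basis for the span of the MP atoms of one training signal.

    `atoms` is an (N, K) matrix whose columns are the K dictionary atoms
    selected by matching pursuit for that signal (illustrative).
    """
    q, _ = np.linalg.qr(atoms)  # reduced QR: q has orthonormal columns
    return q

def subspace_mp(x, subspaces, n_iter=10):
    """Matching pursuit over subspace 'atoms'.

    Each step projects the residual onto every learned subspace H_i and
    subtracts the projection of largest norm.
    """
    residual = np.array(x, dtype=float)
    approx = np.zeros_like(residual)
    for _ in range(n_iter):
        projections = [Q @ (Q.T @ residual) for Q in subspaces]
        best = max(projections, key=np.linalg.norm)
        approx += best
        residual -= best
    return approx, residual
```

With orthogonal subspaces, a signal lying entirely in one of them is recovered in a single iteration; in the separation setting, the sum of the projections selected from the music CAD forms the estimated music source.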

The authors test their approach on mixtures of speech and clarinet music excerpts. They generate the CAD using 40 isolated clarinet notes taken from the RWC music database, built from a Gabor dictionary of 8000 atoms spanning some number of scales, the first being 92.8 ms, with atoms at each scale having 800 modulation frequencies uniformly spread between 0 and the Nyquist frequency. (I assume then that the CAD contained 40 “atoms,” which are really subspaces onto which the intermediate residuals are projected. I still don’t know the order of the approximations.) Their plot of the power spectral density of (one atom from ?) each \(\mathcal{H}_i\) shows a dictionary with nice harmonic content, like those seen in, e.g., S. A. Abdallah and M. D. Plumbley, “Unsupervised analysis of polyphonic music by sparse coding,” IEEE Trans. Neural Networks, vol. 17, pp. 179-196, Jan. 2006. With respect to the signal-to-distortion ratio, the proposed approach significantly outperforms other separation techniques based on independent subspace analysis and non-negative matrix factorization. The authors provide a preliminary result using two CADs, one for music and one for speech, which appears to improve the separation performance. The obvious extension of this approach is that taken in P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. In that work, CADs are learned for specific musical instruments — but more for polyphonic music transcription than for source separation and reconstruction.
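For readers unfamiliar with such dictionaries, here is a hedged sketch of generating a real Gabor dictionary in the spirit described above: a Gaussian window modulated by cosines at frequencies spread uniformly between 0 and Nyquist, over a few scales. The specific window parameterization and the exclusion of the 0 and Nyquist endpoints are my assumptions, not the paper's.

```python
import numpy as np

def gabor_atom(n_samples, scale, freq, fs):
    """Unit-norm real Gabor atom: Gaussian window times a cosine.

    `scale` is the effective window duration in samples; `freq` is the
    modulation frequency in Hz. Parameterization is illustrative.
    """
    t = np.arange(n_samples)
    window = np.exp(-0.5 * ((t - n_samples / 2) / (scale / 4)) ** 2)
    atom = window * np.cos(2 * np.pi * freq * t / fs)
    return atom / np.linalg.norm(atom)

def gabor_dictionary(n_samples, scales, n_freqs, fs):
    """Matrix of atoms, one column per (scale, modulation frequency) pair.

    Frequencies are spread uniformly over (0, fs/2), endpoints excluded
    so no atom is identically zero.
    """
    freqs = np.linspace(0, fs / 2, n_freqs + 2)[1:-1]
    return np.stack([gabor_atom(n_samples, s, f, fs)
                     for s in scales for f in freqs], axis=1)
```

At the paper's stated sizes (800 modulation frequencies across several scales, 8000 atoms total) the same construction applies; the toy sizes here just keep the example small.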

Happy Easter!