Hello, and welcome to the Paper of the Day (Po’D): Sparsity and Classifiability Edition. Today’s paper comes from just last month: M. Moussalam, T. Fillon, G. Richard and L. Daudet, “How Sparsely Can a Signal be Approximated while Keeping its Class Identity?”, in Proc. of Music and Machine Learning Workshop at ACM Multimedia, Firenze, Italy, Oct. 2010.
The authors essentially explore the classifiability of signal approximations created by matching pursuit (MP) decompositions with an overcomplete time-frequency dictionary. They ask the interesting question: to what precision must we approximate a signal before we can classify its content (or discriminate between known content) as well as we can using features computed from the original signal? In this work, the authors use the 8xMDCT dictionary of Ravelli et al. (related Po’D here) and several hours of radio broadcast. From small frames of the decomposition they create two features. First, they count the number of atoms selected as a function of scale (“scale coefficient count”); second, they compute the ell_1 norm of the weights as a function of scale, normalized by the ell_1 norm of all the weights (“scale amplitude repartition”). From the sets of these two features they learn the best discriminating plane for classification, either by linear discriminant analysis or a support vector machine.
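As I read them, the two features could be sketched roughly as follows. This is my own reconstruction, not the authors' code: I assume each atom selected in a frame is summarized by a (scale, weight) pair, and the variable names are mine.

```python
import numpy as np

def scale_features(atoms, scales):
    """Sketch of the two per-frame features as I understand them.

    atoms:  list of (scale, weight) pairs selected by MP in one frame
    scales: the dictionary's set of scales (8 for the 8xMDCT dictionary)
    """
    counts = np.zeros(len(scales))
    l1 = np.zeros(len(scales))
    for scale, weight in atoms:
        idx = scales.index(scale)
        counts[idx] += 1          # "scale coefficient count"
        l1[idx] += abs(weight)
    # "scale amplitude repartition": per-scale ell_1 norm of the weights,
    # normalized by the ell_1 norm of all the weights
    repartition = l1 / max(l1.sum(), 1e-12)
    return counts, repartition
```

With eight scales, each frame thus yields a 16-dimensional feature vector on which the discriminating plane is learned.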
In their first experiment, the authors decompose six-minute segments (why six minutes?) to several signal-to-residual energy ratios (SRR, but their equation for this appears to be a signal-approximation to residual energy ratio), and from these they create their features using frames of size 16384 samples, with 50% overlap. (What is the sampling rate? How are these frames imposed? What if an atom exists across frames?) Testing for speech/non-speech and music/non-music discriminability, the authors find that with the combination of their features built from decompositions of SRR 5 dB they are able to discriminate with an accuracy close to that using MFCCs and delta MFCCs. (How many MFCCs did they keep? Are the MFCCs computed over the entire frame?) The error rates at this SRR are 14% for music/non-music, and 6% for speech/non-speech. Staying with 5 dB SRR, the authors test the discriminability using these features between music, speech, and mixtures in 36 hours of TV and radio broadcast. (How did they obtain labels for all the data? Maybe QUAERO data is already labeled.) The authors find that their features perform better only slightly, and only for speech, whereas MFCCs and delta MFCCs detect music and mixtures better.
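For concreteness, the SRR as I read their definition (ratio of approximation energy to residual energy, in decibels) is simply:

```python
import numpy as np

def srr_db(approx, residual):
    """Signal-approximation to residual energy ratio, in decibels.

    approx:   the MP approximation (sum of weighted atoms)
    residual: the residual, i.e., signal minus approximation
    """
    return 10 * np.log10(np.sum(approx ** 2) / np.sum(residual ** 2))
```

So a 5 dB SRR means the approximation carries only about three times the energy of the residual, which is quite a coarse model of the signal.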
While speech/music classification is essentially a solved problem (with accuracies as high as 98%, C. Sénac and E. Ambikairajah, “Audio Classification for Radio Broadcast Indexing: Feature Normalization and Multiple Classifiers Decision,” in Advances in Multimedia Info. Process., Lecture Notes in Computer Science, Vol. 3332/2005, pp. 882-889, 2005), the authors here use this task to detect when enough is enough in the world of atomic decompositions of audio signals. I like that idea; however, I see several problems in their approach. First, the authors assume that the SRR of the decomposition of a six-minute audio segment reflects the SRR of the smaller frames of length 16384 samples. The mean SRR of the entire six-minute segment will be 5 dB, but there could very well be content at smaller scales that has yet to be touched by a single atom. MP and its greedy cousins are not democratic in how they distribute atoms, so when we look at the local residual energies we can find SRRs that are much lower. This means that there could exist many frames with few atoms just because the local energy is small, and not because the signal content belongs to one class or the other. Second, the authors assume that the residual energy reflects “the agreement between the model and data.” This assumption is common, but it does not hold in general (see my PhD work, B. L. Sturm and J. J. Shynk, “Sparse approximation and the pursuit of meaningful signal models with interference adaptation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 461-472, Mar. 2010, and references therein detailing the problems with building models using MP and its variants). One can have a model producing a high SRR, but it could still provide a horrible representation of the data… and all things are possible with an overcomplete dictionary and a greedy cousin.
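The first objection is easy to illustrate numerically. Here is a toy sketch of my own (not the authors' data): a residual whose energy is concentrated in a few frames can yield a respectable global SRR even though the local SRR in those frames is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
frames, n = 10, 1024

# toy approximation with roughly uniform energy across frames
approx = rng.standard_normal((frames, n))
# residual: tiny everywhere except one frame MP has barely touched
residual = 0.05 * rng.standard_normal((frames, n))
residual[3] = rng.standard_normal(n)  # untouched local content

def srr_db(a, r):
    return 10 * np.log10(np.sum(a ** 2) / np.sum(r ** 2))

global_srr = srr_db(approx, residual)                       # looks fine
local_srrs = [srr_db(approx[i], residual[i]) for i in range(frames)]
# ...but the local SRR of frame 3 is near 0 dB: its class-relevant
# content never made it into the approximation
```

In other words, a healthy global SRR says nothing about whether any particular 16384-sample frame has been modeled at all.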
These things being said, I think this paper motivates an interesting idea for an iterative pursuit: construct an approximation of a signal given some dictionary such that its contents are highly discriminable, yet the approximation is very sparse. Instead of selecting at each iteration the atom that removes the most energy, select the atom that most increases the “class presence” of some content…
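A crude sketch of what such a selection rule might look like, entirely hypothetical on my part: `class_score` stands in for any measure of “class presence” of an approximation, and the greedy update is otherwise the usual MP one over a matrix of unit-norm atoms.

```python
import numpy as np

def discriminative_mp(signal, dictionary, class_score, n_iters):
    """Hypothetical greedy pursuit that picks, at each iteration, the
    atom whose update most increases a class-presence score, rather
    than the atom removing the most residual energy.

    dictionary:  (n_atoms, len(signal)) array of unit-norm atoms
    class_score: function mapping an approximation to a scalar
    """
    residual = signal.astype(float).copy()
    approx = np.zeros_like(residual)
    for _ in range(n_iters):
        weights = dictionary @ residual  # correlations with the residual
        # score the candidate approximation after each possible update
        gains = [class_score(approx + w * atom)
                 for w, atom in zip(weights, dictionary)]
        best = int(np.argmax(gains))
        approx += weights[best] * dictionary[best]
        residual -= weights[best] * dictionary[best]
    return approx, residual
```

Of course, scoring every candidate atom at every iteration is expensive for a real overcomplete dictionary, and nothing guarantees the residual energy decreases quickly; that trade-off between sparsity, fidelity, and discriminability is exactly what would make such a pursuit interesting.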