Hello, and welcome to Paper of the Day (Po’D): Sparse Coding for Drum Sound Classification Edition. Today’s paper will be presented at the 3rd International Workshop on Machine Learning and Music (MML’10): S. Scholler and H. Purwins, “Sparse coding for drum sound classification and its use as a similarity measure,” in Proc. Int. Workshop Machine Learning Music ACM Multimedia, (Firenze, Italy), Oct. 2010.
Today’s paper is related to a previous Po’D: Sparse Time-relative Auditory Codes and Music Genre Recognition Edition, except instead of attacking the problem of music genre classification, the authors here look at the use of sparse features for discriminating between three different percussive sounds (bass drum, snare, and hi-hat). For this task, the authors compare three different sets of features (computed over the entire sound):
- histograms of atoms and weights found in a matching pursuit decomposition over a dictionary of gammatone atoms;
- histograms of atoms and weights found in a matching pursuit decomposition over a dictionary of learned atoms, à la E. Smith and M. S. Lewicki, “Efficient Coding of Time-relative Structure Using Spikes,” Neural Computation, vol. 17, pp. 19-45, 2005;
- the catch-all features: MFCCs.
In their brief tests, they find that the features from decompositions using a learned dictionary perform the best in the three-class discrimination task (using a random forest classifier).
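The matching pursuit behind the first two feature sets can be sketched in a few lines. This is a toy illustration, not the authors’ code: the dictionary here is just random unit-norm vectors standing in for gammatone (or learned) atoms, and a real time-relative implementation would also search over time shifts of each atom.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=10):
    """Greedy matching pursuit: at each step pick the (unit-norm)
    atom most correlated with the residual and subtract its
    contribution."""
    residual = signal.astype(float).copy()
    picks = []  # (atom index, weight) pairs
    for _ in range(n_atoms):
        correlations = dictionary @ residual      # one inner product per atom
        k = np.argmax(np.abs(correlations))       # best-matching atom
        w = correlations[k]
        residual -= w * dictionary[k]             # remove its contribution
        picks.append((k, w))
    return picks, residual

# Toy dictionary: 64 random unit-norm atoms of length 256.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=1, keepdims=True)

x = rng.standard_normal(256)
picks, res = matching_pursuit(x, D, n_atoms=20)

# Histogram-of-atoms feature: total absolute weight per atom index,
# in the spirit of the paper's first two feature sets.
feature = np.zeros(D.shape[0])
for k, w in picks:
    feature[k] += abs(w)
assert np.linalg.norm(res) < np.linalg.norm(x)    # residual energy shrinks
```

Such a per-atom weight histogram is what one would then hand to a classifier such as the random forest used in the paper.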
More interesting to me, however, are two of their figures.
Above we can see how sparse coding results in learned atoms
that are more significant than the generic time-frequency tiles
offered by Gabor or gammatone dictionaries.
Each of these atoms can be associated with soft labels, like
59% bass drum, or 95% hi-hat. In my opinion, this is where
sparse approximation becomes much more interesting than transforms
over orthogonal bases. However, the main problem is that 1) the
learned waveforms are not parametric, and 2) the learned waveforms are
too time-domain specific for these random processes! What we need
instead are high-level models that embody the statistics of the
phenomena involved in “bass drum”, for instance.
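One hypothetical way to arrive at such soft labels (my sketch, not the paper’s procedure): count how often each learned atom is selected when decomposing training sounds of each class, and normalize each atom’s counts into a distribution over classes. The counts below are invented to mirror the percentages above.

```python
import numpy as np

# Hypothetical usage counts: rows = atoms, columns = classes
# (bass drum, snare, hi-hat) -- how often each atom was selected
# when decomposing training sounds of each class.
counts = np.array([[59, 30, 11],    # an atom used mostly for bass drum
                   [ 2,  3, 95],    # an atom used mostly for hi-hat
                   [40, 45, 15]])   # an ambiguous atom

# Soft label of an atom: its usage distribution over the classes.
soft_labels = counts / counts.sum(axis=1, keepdims=True)
print(soft_labels[0])   # the "59% bass drum" atom
```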
In Fig. 2 we see the distribution of the feature centroids for additive combinations of realizations of the sound classes, projected onto the first two principal components.
Since MFCCs are non-linear, I don’t expect mixtures of sounds to have
mixtures of MFCCs; and so we see it is more difficult to position a mixture of two sounds as lying between the two separate sources using MFCC features. But here we see that sparse approximation with good dictionaries (learned or generic) is able to separate sources a bit before we go into a feature space.
Thus, features from combinations of sources result in combinations of features of sources … more or less, but much more so than with those pesky MFCC features.
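A toy numerical sketch of this point (my assumptions, not the paper’s exact features): a linear projection onto dictionary atoms is exactly additive, while the log inside MFCC-style features destroys additivity (the mel filterbank and DCT are linear steps; the log is not).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
# Two toy "sources" and their mixture.
s1, s2 = rng.standard_normal(n), rng.standard_normal(n)
mix = s1 + s2

# Linear feature: inner products with a fixed dictionary.
# These are exactly additive: f(s1 + s2) == f(s1) + f(s2).
D = rng.standard_normal((32, n))
lin = lambda x: D @ x
assert np.allclose(lin(mix), lin(s1) + lin(s2))

# MFCC-like feature: a log power spectrum (the log is where
# MFCC-style features lose additivity).
logspec = lambda x: np.log(np.abs(np.fft.rfft(x))**2 + 1e-12)
err = np.linalg.norm(logspec(mix) - (logspec(s1) + logspec(s2)))
assert err > 1.0   # far from additive
```

Matching pursuit is of course itself nonlinear (the greedy atom selection), which is why the additivity of sparse features holds only “more or less”; but the underlying correlation step is linear, unlike the log.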
With that, and the discussion at a recent presentation regarding the perceptual significance behind sparse approximation, I will soon be reviewing the following papers:
- E. Smith and M. S. Lewicki, “Efficient coding of time-relative structure using spikes,” Neural Computation, vol. 17, no. 1, pp. 19-45, 2005.
- E. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature, vol. 439, pp. 978-982, Feb. 2006.
- T. J. Gardner and M. O. Magnasco, “Sparse time-frequency representations,” Proc. National Academy of Sciences, vol. 103, pp. 6094-6099, Apr. 2006.
- R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, 2010.