I have been experimenting with the approach to feature extraction proposed in J. Andén and S. Mallat, “Multiscale scattering for audio classification,” Proc. Int. Soc. Music Info. Retrieval, 2011. Specifically, I have substituted these “scattering coefficients” for the features that Bergstra et al. 2006 feed to AdaBoost for music genre recognition.

The idea behind the features reminds me of the temporal modulation analysis of Panagakis et al. 2009, which itself comes from S. A. Shamma, “Encoding sound timbre in the auditory system,” IETE J. Research, vol. 49, no. 2, pp. 145-156, Mar.-Apr. 2003. One difference is that these scattering coefficients are not psychoacoustically derived, yet they appear just as powerful as those that are.

Let’s take a signal \(x(t)\), then filter it with a bandpass filter of some time-width (support) \(a^{j_1}\), \(a > 0\), i.e., \(x(t)\star \psi_{j_1}(t)\), where \(\star\) denotes time-domain convolution. Then we compute the mean of the magnitude of the output at a scale \(a^J\) by \(|x(t) \star \psi_{j_1}(t)|\star \phi_J(t)\),

where \(\phi_J(t)\) is a lowpass filter with time-width \(a^J\). I see the output of this as essentially the average modulation information at a scale \(a^{j_1}\). So, computed over a variety of scale indices \(\{j_1\}\), we have an idea of the variation in \(x(t)\) in several frequency bands. We can do the same thing again to obtain “co-occurrence” information. We take the bandpass-filtered signal \(x(t)\star \psi_{j_1}(t)\), then filter its magnitude with another bandpass filter, \(|x(t)\star \psi_{j_1}(t)|\star\psi_{j_2}(t)\), and then compute the mean of the magnitude of that output at a scale \(a^J\) by \(||x(t)\star \psi_{j_1}(t)|\star\psi_{j_2}(t)|\star \phi_J(t)\).

Now we have co-occurrence coefficients at the scales \(a^{j_1}\) and \(a^{j_2}\).
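To make the two cascades concrete, here is a minimal sketch of the first- and second-order computations. It assumes NumPy, circular convolution via the FFT, \(a = 2\), and Gaussian-shaped filters defined directly in the frequency domain. The filter shapes and the names `gaussian_hat`, `conv` and `scattering` are my own stand-ins for illustration; they are not the wavelets or the code of Andén and Mallat.

```python
import numpy as np

def gaussian_hat(n, center, sigma):
    """Frequency response of a Gaussian bump centred at `center` (cycles/sample).

    center = 0 gives a lowpass filter (phi); center > 0 gives a one-sided
    bandpass filter (psi). Illustrative only, not a true wavelet.
    """
    freqs = np.fft.fftfreq(n)  # frequencies in cycles/sample
    return np.exp(-0.5 * ((freqs - center) / sigma) ** 2)

def conv(x, h_hat):
    """Circular convolution of x with a filter given by its frequency response."""
    return np.fft.ifft(np.fft.fft(x) * h_hat)

def scattering(x, num_scales=4, J=6, a=2.0):
    """Return first- and second-order coefficients as dicts of time series."""
    n = len(x)
    # psi_j: centre frequency shrinks (time-width grows) like a**j.
    centers = [0.25 / a**j for j in range(num_scales)]
    psi_hats = [gaussian_hat(n, c, c / 4.0) for c in centers]
    # phi_J: lowpass filter whose time-width grows like a**J.
    phi_hat = gaussian_hat(n, 0.0, 0.5 / a**J)

    first, second = {}, {}
    for j1, psi1 in enumerate(psi_hats):
        u1 = np.abs(conv(x, psi1))                  # |x * psi_j1|
        first[j1] = conv(u1, phi_hat).real          # |x * psi_j1| * phi_J
        for j2, psi2 in enumerate(psi_hats):
            u2 = np.abs(conv(u1, psi2))             # ||x * psi_j1| * psi_j2|
            second[(j1, j2)] = conv(u2, phi_hat).real
    return first, second
```

For a steady tone the envelope \(|x \star \psi_{j_1}|\) is nearly constant, so the second-order outputs should be close to zero; they only become appreciable when the envelope itself fluctuates at a rate that some \(\psi_{j_2}\) passes.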

We can continue in this way to get co-co-co-occurrence information as well, but Andén and Mallat find that second-order scattering coefficients carry most of the energy of the signal, and their experiments are done with no more than one level of co-occurrence.

Furthermore, with so much redundancy in the transform, they propose taking a DCT along each dimension \(j_1\) and \(j_2\) separately, to reduce the dimensionality. Selecting from among these coefficients then forms features for use in classification.
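As a rough illustration of that reduction step, the sketch below applies an orthonormal DCT-II along \(j_1\) and then along \(j_2\) of a second-order coefficient array for one frame, and keeps a low-order block. The array layout `s2[j1, j2]`, the function names, and the `keep1`/`keep2` parameters are hypothetical; which coefficients to keep is a choice the description above leaves open.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are basis vectors)."""
    k = np.arange(n)[:, None]   # coefficient index
    m = np.arange(n)[None, :]   # sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)   # rescale the constant row so the matrix is orthonormal
    return mat

def reduce_second_order(s2, keep1, keep2):
    """DCT along j1 (axis 0) and along j2 (axis 1), then keep a low-order block."""
    d1 = dct_matrix(s2.shape[0])
    d2 = dct_matrix(s2.shape[1])
    coeffs = d1 @ s2 @ d2.T     # separable transform: one DCT per dimension
    return coeffs[:keep1, :keep2]
```

Because the transform is orthonormal, it only re-expresses the coefficients; any reduction in dimensionality comes from the truncation to `keep1` by `keep2` entries.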

I have essentially done no more than take my code for Bergstra et al.’s AdaBoost music genre classifier and replace the features (means and variances of MFCCs, zero crossings, spectral quantiles, etc.) with second-order scattering coefficients computed at scales of 371 ms or 743 ms (J=128 at 22050 and 11025 Hz sampling rates, respectively). (In order to do this, the code here seems to require that the input signal have a power-of-2 number of samples, so the most I can take from each excerpt in GTZAN is 23.8 s.) The feature dimensionality here is 386. (In my features for Bergstra et al. it is 120.)
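The 23.8 s figure follows from simple arithmetic, sketched here under the assumption that each GTZAN excerpt is nominally 30 s long. The helper name `largest_pow2_at_most` is mine.

```python
def largest_pow2_at_most(n):
    """Largest power of two that does not exceed the positive integer n."""
    return 1 << (n.bit_length() - 1)

def usable_seconds(fs, excerpt_seconds=30):
    """Duration left after truncating an excerpt to a power-of-2 sample count."""
    total = excerpt_seconds * fs            # samples in the nominal excerpt
    return largest_pow2_at_most(total) / fs

for fs in (22050, 11025):
    print(fs, round(usable_seconds(fs), 1))
```

At 22050 Hz a 30 s excerpt has 661500 samples, and the largest power of two below that is \(2^{19} = 524288\), or about 23.8 s. Halving the sampling rate halves both numbers, so the usable duration is the same.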

Then I use AdaBoost.MH to learn 1000 decision stumps, for each fold in 5-fold stratified cross validation.
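For readers unfamiliar with the stratified part, here is a small sketch of how excerpts might be assigned to folds so that every fold has the same class balance. It uses only the standard library; the function name and the round-robin dealing are my own illustration, not the code used for these experiments, and it does not include the AdaBoost.MH training itself.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split item indices into k folds, keeping class proportions equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Deal each class's items round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds
```

With 10 genres of 100 excerpts each, every fold then holds 200 excerpts, 20 per genre; each fold is used once for testing while the stumps are learned from the other four.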

Then, I classify each frame individually, or entire 23.8 s excerpts by selecting the class with the maximum posterior over all frames.
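The two decision rules can differ on the same data, as this sketch shows. It takes one reading of "maximum posterior over all frames", namely summing the per-frame posteriors and picking the largest total; that reading, and the function names, are assumptions on my part.

```python
def classify_frames(frame_posteriors):
    """Label each frame with its own most probable class."""
    return [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]

def classify_excerpt(frame_posteriors):
    """Label a whole excerpt by the class with the largest summed posterior."""
    num_classes = len(frame_posteriors[0])
    totals = [0.0] * num_classes
    for post in frame_posteriors:
        for c, p in enumerate(post):
            totals[c] += p
    return max(range(num_classes), key=totals.__getitem__)

# Three frames, two classes: two frames weakly favour class 0,
# one frame strongly favours class 1.
frames = [[0.6, 0.4], [0.1, 0.9], [0.55, 0.45]]
print(classify_frames(frames), classify_excerpt(frames))
```

Here the frame-level labels are mostly class 0, yet the excerpt-level rule picks class 1, because one confident frame outweighs two lukewarm ones.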

Below we see for the scale of 371 ms, the mean accuracies (well, over 2 folds since it is taking so long) of these features. The mean overall accuracy in this case is 80.3%.

And for the scale of 743 ms, the mean accuracies (again, only over 2 folds) are seen below. The overall mean accuracy in this case is 81.3%.

Of course, these are excellent results; but we have seen this behavior before.

Now it is time to look deeper.