I am working with a student now on STWOPetal: studying the work of

- Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., Glasgow, Scotland, pp. 1-5, Aug. 2009;
- Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., Kobe, Japan, pp. 249-254, Oct. 2009;
- Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010.

(Po’D here.) The first part of their process entails creating auditory spectrograms of segments of recorded music. For this, they point to the following work: X. Yang, K. Wang, and S. A. Shamma, “Auditory representations of acoustic signals,” IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 824-839, Mar. 1992.

In their paper, Yang et al. propose what is essentially a signal-processing algorithm modeling the early stages of auditory processing in humans. First (after sound has impinged upon the ear drum), we have the frequency-dividing cochlea, which can be modeled by a bank of constant-Q bandpass filters (from 20 Hz to 20 kHz). Second, at the output of these filters, we have sudden changes in electric potential caused by the bending of cilia on the basilar membrane, which can be modeled by differentiation (high-pass filtering), compression (adapting to sound pressure levels), and low-pass filtering (cutoff around 4-5 kHz). And finally, we have leakage across hair cell membranes (across bands of the original filterbank) and then potential firing and integration (I think), which can be modeled by a mixed time-space derivative with a smoothing spatial filter, half-wave rectification, and integration over a long time constant (10-20 ms). The output of each channel is then sent along the auditory nerve for further processing; and from this Panagakis et al. create the auditory spectrogram, and then the features they employ for music genre recognition.
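The three stages above can be sketched as follows. This is a Python outline of my reading of the model (the real code is MATLAB); the function name, the `filterbank` argument, and the stage details are my assumptions, and the ~4.5 kHz lowpass and the final leaky integration are omitted for brevity:

```python
import numpy as np

def early_auditory_model(x, filterbank, gamma=10.0):
    """Sketch of the Yang et al. early auditory stages (our reading).

    `filterbank` is a hypothetical list of functions, one constant-Q
    bandpass filter per cochlear channel."""
    # Stage 1: cochlear filtering -> one band-passed signal per channel
    bands = np.stack([bp(x) for bp in filterbank])      # (channels, samples)
    # Stage 2: hair cells -- first difference (high-pass), sigmoid compression
    bands = np.diff(bands, axis=1, prepend=bands[:, :1])
    bands = 1.0 / (1.0 + np.exp(-gamma * bands)) - 0.5
    # Stage 3: leakage across channels (spatial difference), then
    # half-wave rectification; integration over 10-20 ms would follow here
    bands = np.diff(bands, axis=0, prepend=bands[:1, :])
    return np.maximum(bands, 0.0)
```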

However, in their description Panagakis et al. do not mention all the details necessary to recreate the auditory spectrogram they use. They use 96 constant-Q bandpass filters spanning four octaves, but do not say what the first center frequency is. (We use the Gammatone filterbank from Malcolm Slaney’s toolbox.) Here, we assume 100 Hz, which puts the center frequency of the last filter at 1600 Hz. We assume their differentiation is just a first-order time difference. They also do not specify the non-linear compression, which Yang et al. define as the sigmoid

$$
g(u) = \frac{1}{1+e^{-\gamma u}} - \frac{1}{2},
$$

and state “… saturation in a given fiber is limited to 30-40 dB.”

For now, we set \(\gamma = 10\), but assume it will have to be adapted to the sound pressure level. We also make the lowpass filter after compression a sixth-order Butterworth with a cutoff frequency of 4.5 kHz.
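In Python, our assumptions so far look like the sketch below. Everything here is our guesswork, not anything Panagakis et al. specify: the 96 centers are spaced logarithmically from the assumed 100 Hz up to 1600 Hz, \(\gamma = 10\), and the lowpass is our sixth-order Butterworth at 4.5 kHz:

```python
import numpy as np
from scipy.signal import butter, lfilter

# Assumed filterbank centers: 96 constant-Q filters spaced logarithmically
# over four octaves, from 100 Hz (our guess) up to 1600 Hz.
cf = 100.0 * 2.0 ** (4.0 * np.arange(96) / 95.0)  # cf[0] = 100 Hz, cf[-1] = 1600 Hz

def compress_and_lowpass(band, fs, gamma=10.0, cutoff=4500.0, order=6):
    """Sigmoid compression g(u) = 1/(1 + exp(-gamma*u)) - 1/2, followed by
    our assumed sixth-order Butterworth lowpass with a 4.5 kHz cutoff."""
    g = 1.0 / (1.0 + np.exp(-gamma * band)) - 0.5
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return lfilter(b, a, g)
```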

For their spatial derivative, we assume a first-order difference across each pair of adjacent frequency bands (but what to do about the first band?). (Also, Panagakis et al. do not mention any spatial smoothing like that done by Yang et al.) The half-wave rectification is clear as a bell; but for the integration, Panagakis et al. use an exponential window with a time constant of 8 ms (in EUSIPCO 2009) or 2-8 ms (in TASLP 2010). We set the time constant to 5 ms.
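These last steps might be sketched as follows in Python, under our assumptions (the first band is passed through unchanged, and the exponential window is implemented as a one-pole leaky integrator with a 5 ms time constant):

```python
import numpy as np
from scipy.signal import lfilter

def lateral_and_integrate(bands, fs, tau=0.005):
    """Across-band first difference (first band left as-is), half-wave
    rectification, then leaky integration with time constant tau (5 ms)."""
    d = np.empty_like(bands)
    d[0] = bands[0]                  # no lower neighbour for the first band
    d[1:] = np.diff(bands, axis=0)   # difference across adjacent channels
    r = np.maximum(d, 0.0)           # half-wave rectification
    a = np.exp(-1.0 / (tau * fs))    # one-pole coefficient for exp. window
    return lfilter([1.0 - a], [1.0, -a], r, axis=1)
```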

And voilà. We have an auditory spectrogram:

Code is here. (I am unsure whether this is completely correct. If I input a sinusoid of 500 Hz, I see peaks around 800 Hz…) *Edit 20110409: the code is correct. MATLAB just doesn’t like non-linearly spaced vectors input to “imagesc”.*