I have been working with a student on STWOPetal: studying the work of Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., (Glasgow, Scotland), pp. 1-5, Aug. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., (Kobe, Japan), pp. 249-254, Oct. 2009; and Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. (Po’D here.) In Part 1, we tackled the first part, which was creating auditory spectrograms. The second part entails using “modulation-scale analysis” to derive “auditory temporal modulations” from these spectrograms. For this, Panagakis et al. point to the paper: S. Sukittanon, L. E. Atlas, and J. W. Pitton, “Modulation-scale analysis for content identification,” IEEE Trans. Signal Process., vol. 52, no. 10, pp. 3023-3035, Oct. 2004.

In their paper, Sukittanon et al. motivate classification using features extracted across multiple time scales. Not only does this make sense for acoustic signals, which contain phenomena occurring over multiple time scales, it also appears relevant to how humans perceive sound, using characteristics that exist and change over short and long time scales, and scales in between. Thus, they propose a joint-frequency analysis of sound. Essentially, one computes a transform (e.g., Fourier or wavelet) of each demodulated output of a signal passed through a filterbank. A simple example is taking a short-term Fourier transform, and then applying another Fourier decomposition across time for each frequency. The output of this analysis then lives in a two-dimensional frequency-modulation space. Analyzing music signals in this way can reveal not only the approximate pitches of instruments (a short-term feature), but also rhythmic periodicity (a long-term feature). The “modulation-scale” analysis they propose refers to applying a wavelet transform to each demodulated filterbank output, i.e., each row of a magnitude spectrogram. This of course maps a 1-d signal into a translation-scale space, but by integrating the energy over translation we create a modulation-scale analysis. (They then apply this representation to audio fingerprinting, which isn’t really content identification… And another thing: the tempo of a piece will not be detectable until the periodic attacks are wideband.)
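To make that simple example concrete, here is a rough sketch in Python/NumPy (the function name, STFT parameters, and test signal are all mine, purely for illustration): compute an STFT magnitude, then take a second Fourier decomposition across time in each frequency bin, yielding a frequency-versus-modulation-rate representation.

```python
import numpy as np

def modulation_spectrum(x, fs, win=256, hop=128):
    # short-term Fourier transform magnitude (Hann window), frames in rows
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win, hop)]
    S = np.abs(np.fft.rfft(np.array(frames), axis=1)).T   # (freq, time)
    # second Fourier decomposition along time for each frequency row
    M = np.abs(np.fft.rfft(S, axis=1))                    # (freq, modulation)
    mod_rates = np.fft.rfftfreq(S.shape[1], d=hop / fs)   # modulation axis (Hz)
    return M, mod_rates

# demo: a 1 kHz tone amplitude-modulated at 4 Hz, sampled at 8 kHz
fs = 8000
t = np.arange(0, 2.0, 1 / fs)
x = (1 + np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
M, rates = modulation_spectrum(x, fs)
```

In the demo, the row of M belonging to the 1 kHz carrier bin should peak (ignoring the DC term) near a modulation rate of 4 Hz, i.e., the short-term feature (pitch) and long-term feature (modulation) appear on the two different axes.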

For us then, we essentially take each row of an auditory spectrogram, compute its wavelet transform, sum the energy over translation at each scale, and put each number in the right place in the modulation-scale domain.
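A minimal NumPy sketch of that per-row procedure, under my own assumptions: a hand-rolled first-derivative-of-Gaussian wavelet standing in for MATLAB’s 'gaus1', and a simple 1/scale energy normalization (one of several reasonable choices; see the normalization question below).

```python
import numpy as np

def gaus1(n, scale):
    # first derivative of a Gaussian, length n, dilated by `scale`
    t = (np.arange(n) - n // 2) / scale
    return -t * np.exp(-t**2 / 2)

def modulation_scale_row(row, scales):
    # one spectrogram row -> one energy per wavelet scale (modulation rate)
    out = np.empty(len(scales))
    for j, s in enumerate(scales):
        n = min(len(row), int(10 * s) | 1)                  # odd length, ~10 scales of support
        coef = np.convolve(row, gaus1(n, s), mode='same')   # CWT at this one scale
        out[j] = np.sum(coef**2) / s                        # integrate energy over translation
    return out

# demo: a row modulated at a known 8 Hz rate, sampled at 1 kHz
fs = 1000.0
t = np.arange(0, 2, 1 / fs)
rates = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
scales = fs / (5 * rates)            # the rate-to-scale mapping used below
energies = modulation_scale_row(np.sin(2 * np.pi * 8 * t), scales)
```

With this mapping of rates to scales, the energy vector should peak at the scale tuned to the 8 Hz modulation.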

Panagakis et al. use a bank of Gabor filters with a variety of passbands to detect modulations around the following frequencies: 2.^[1:8] Hz, i.e., 2, 4, 8, …, 256 Hz.

I assume that this can be done simply using a wavelet transform using a first-derivative Gaussian wavelet at eight appropriate scales:

modulationRates = 2.^[1:8]'; % in Hz
% Find the appropriate wavelet scales. The division by 5 is apparently
% because the center frequency of the 'gaus1' wavelet is about 0.2
% (cf. centfrq('gaus1')), so scale = 0.2*Fs/frequency = Fs/(5*frequency).
waveletScales = (Fs./modulationRates)/5;
% check the approximate frequencies with
% scal2frq(waveletScales,'gaus1',1/Fs)
dt = 1/Fs; % to approximate dt in the integration
ASTM = zeros(size(AS,1),length(modulationRates));
% for each row (frequency channel) of the auditory spectrogram
for ii = 1:size(AS,1)
    ASCWT = cwt(AS(ii,:),waveletScales,'gaus1');
    % for each modulation rate (scale), sum the energy over translation
    for jj = 1:length(waveletScales)
        % divide by sqrt of the scale or not?
        ASTM(ii,jj) = sqrt(1/waveletScales(jj))* ...
            dt*ASCWT(jj,:)*ASCWT(jj,:)';
    end
end

where “AS” is the auditory spectrogram (rows are frequencies).

I have tested this using synthetic signals with known modulations, and it appears to work well. I am still not clear on the scale-dependent scaling of the energies, but the results do not appear to change much without it.

Another thing I have done is downsample each row of the auditory spectrogram, since the outputs of the filters are bandlimited. This saves a lot of time in computing the wavelet transform; and I can’t see much difference with a downsampling factor of 100, given an original sampling rate of 16 kHz.
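A quick NumPy sketch of that shortcut (the moving-average lowpass is my crude stand-in for a proper anti-aliasing filter such as MATLAB’s decimate): smooth each row, then keep every D-th sample. Note that the wavelet scales must then be computed from the reduced sampling rate Fs/D, not the original Fs.

```python
import numpy as np

def downsample_rows(AS, D=100):
    # crude decimation of each spectrogram row by a factor D:
    # moving-average lowpass, then keep every D-th column
    kernel = np.ones(D) / D
    smooth = np.vstack([np.convolve(r, kernel, mode='same') for r in AS])
    return smooth[:, ::D]

# a fake 5-channel "spectrogram" with one second of 16 kHz samples
AS = np.random.default_rng(0).standard_normal((5, 16000))
ASd = downsample_rows(AS, 100)   # effective row sampling rate is now 160 Hz
```

Each row shrinks by the factor D, so the CWT in the loop above touches 100 times fewer samples per row.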

Et voilà. Below we see an auditory spectrogram of 30 seconds of rap music.

And here are the auditory temporal modulations.

And here is the code: TempMod.zip