After a long week of Easter and massive amounts of peer reviewing, I am returning to reproducing the results in Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., (Glasgow, Scotland), pp. 1-5, Aug. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., pp. 249-254, Kobe, Japan, Oct. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. We have received some help from the authors and are eagerly awaiting the necessary details to get moving again.
The first thing we need to know is this business about a “bank of 96 overlapping bandpass filters with center frequencies uniformly distributed … over four octaves.” In Part 1, we assumed this begins at 100 Hz, which means four octaves later is only 1.6 kHz. But this doesn’t appear to be the case. In Fig. 2 of their TASLP paper, we see the following image of an auditory spectrogram (AS):
Just by luck I found this is from the first example of blues music in the Tzanetakis dataset.
Below I show the short-time Fourier transform of the first 3 seconds of this music. The resemblance between the two is quite clear. However, while the frequency axis of the spectrogram is spaced linearly, that of the AS is logarithmic. Also, the units on the x-axis of the AS do not appear to be time in seconds. (The y-axis of the AS just indexes the filterbank outputs in order of increasing center frequency.)
By comparing these two figures, I want to find the frequency at which the four octaves begin, assuming that the image of the AS truly reflects the feature extraction process. I can see from the fundamental of the note centered at 2 seconds (approx. 275 Hz) that the lowest center frequency of the filterbank has to be less than 200 Hz. So my initial assumption of 100 Hz may not be that poor. The first harmonic, which I assume to be an octave above the fundamental (I see it is around 550 Hz), should be 24 filterbank channels higher than the fundamental if we have 24 filters per octave. But in the AS image, it appears to be only 12 channels higher. Furthermore, in the AS image I count at least 9 harmonics of that note, which means that the highest filterbank center frequency must be higher than about 2.5 kHz.
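The channel counting above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the center frequencies are spaced uniformly in log-frequency, and the function name `channel_offset` is made up for illustration. The 275 Hz fundamental is the approximate value read off the spectrogram.

```python
import math

def channel_offset(f_ref_hz, f_hz, filters_per_octave):
    """Number of filterbank channels separating f_hz from f_ref_hz,
    assuming center frequencies spaced uniformly in log-frequency."""
    return filters_per_octave * math.log2(f_hz / f_ref_hz)

fundamental_hz = 275.0  # approximate, read off the spectrogram

# An octave above the fundamental under the two candidate resolutions.
for filters_per_octave in (24, 12):
    offset = channel_offset(fundamental_hz, 2 * fundamental_hz, filters_per_octave)
    print(f"{filters_per_octave} filters/octave: octave is {offset:.0f} channels up")

# The 9th harmonic sets a lower bound on the highest center frequency.
ninth_harmonic_hz = 9 * fundamental_hz
print(f"9th harmonic: {ninth_harmonic_hz:.0f} Hz")
```

An octave spans exactly `filters_per_octave` channels no matter where the fundamental sits, so a 12-channel gap in the AS image points to 12 filters per octave even if the 275 Hz estimate is a little off.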
All this leads me to conclude that the authors actually use 12 filters per octave, stretching over 8 octaves. Since they resample each signal to 16 kHz, I don’t think it unreasonable to create a constant-Q filterbank with 12 bands per octave, stretching from 20 to 5120 Hz. The lowest frequency can be no higher than 31.25 Hz; otherwise eight octaves above it would exceed the Nyquist frequency of 8 kHz.
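To make these numbers concrete, here is a rough sketch of the center frequencies such a filterbank might have. It assumes geometrically spaced centers starting exactly at the lowest frequency, which is my guess and not something confirmed by the authors; the function name `constant_q_centers` is hypothetical, and no filters are actually designed here.

```python
def constant_q_centers(f_min_hz, bands_per_octave, n_octaves):
    """Geometrically spaced center frequencies (an assumed layout):
    f_k = f_min * 2**(k / bands_per_octave), k = 0 .. n_bands - 1."""
    n_bands = bands_per_octave * n_octaves
    return [f_min_hz * 2.0 ** (k / bands_per_octave) for k in range(n_bands)]

sample_rate_hz = 16000.0          # signals are resampled to 16 kHz
nyquist_hz = sample_rate_hz / 2   # 8 kHz

centers = constant_q_centers(20.0, bands_per_octave=12, n_octaves=8)
upper_edge_hz = 20.0 * 2.0 ** 8   # eight octaves above 20 Hz

# Largest starting frequency for which eight octaves still fit below Nyquist.
max_f_min_hz = nyquist_hz / 2.0 ** 8

print(len(centers), "bands")
print(f"lowest center {centers[0]:.2f} Hz, highest center {centers[-1]:.2f} Hz")
print(f"upper edge {upper_edge_hz:.0f} Hz, max starting frequency {max_f_min_hz:.2f} Hz")
```

Note that 12 bands over 8 octaves still gives 96 bands, matching the count quoted in the papers. With this layout the highest center frequency sits one band below 5120 Hz, so whether 5120 Hz is a center or an upper band edge depends on the convention chosen.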
Tomorrow, I will tackle the constant-Q filterbank issue. The authors stated to me that their tests with a Gammatone filterbank did not produce as good results as with the constant-Q filterbank… which makes me wonder why we should put effort into mimicking the human auditory system here when the filters that best model the cochlea are not as good as others.