I have been working with Pardis Noorzad on STWOPetal: studying the work of Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., (Glasgow, Scotland), pp. 1-5, Aug. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., pp. 249-254, Kobe, Japan, Oct. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. (Po’D here.) In Part 1, we tackled the first step: creating auditory spectrograms. In Part 2, we performed “modulation-scale analysis” on the auditory spectrograms to derive “auditory temporal modulations.” In Part 3, we implemented and tested sparse representation classification (SRC), and compared it with other classification methods.
For nearly 70 hours this weekend, my little laptop computed the features of the 500 minutes of music in the 1.2 GB Tzanetakis dataset. That gave this 2.9 MB set of features: Features20110418.zip. We ran it through our classifiers (which we had tested on written digits) with 10-fold cross-validation, but the results were awful. Then I saw in Panagakis et al. EUSIPCO 2009 that they make sure every 30-second audio signal has zero mean and unit variance. So I reran the feature extraction on my desktop, which took less than 9 hours, and produced these features: Features20110419.zip (feature extraction code included there).
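To make that normalization step concrete, here is a minimal sketch of what "zero mean, unit variance" means for a single excerpt. It is not the code in Features20110419.zip; the function name `standardize_signal`, the 22050 Hz sampling rate, and the synthetic noise standing in for audio are all assumptions for illustration.

```python
import numpy as np

def standardize_signal(x, eps=1e-12):
    """Return a copy of x shifted to zero mean and scaled to unit variance."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean()
    std = centered.std()
    if std < eps:
        # A constant (e.g. silent) excerpt has no variance to normalize,
        # so only the mean is removed.
        return centered
    return centered / std

# Synthetic stand-in for one 30-second excerpt (assumed 22050 Hz sampling rate).
rng = np.random.default_rng(0)
excerpt = 0.3 * rng.standard_normal(30 * 22050) + 0.05
normalized = standardize_signal(excerpt)
```

The normalization is applied per excerpt, before any spectrogram is computed, so that differences in recording level between excerpts do not leak into the features.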
Now, we return to classifying the data. Still, the results look poor. Below, we see the classification accuracy of SVM with a linear kernel (top) and of nearest neighbor (bottom). (Our sparse representation classifier gives 10% accuracy no matter what we pass it, which is exactly chance for the ten genres of this dataset, so debugging is needed.)
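As a sanity check on the evaluation itself, the sketch below runs 10-fold cross-validation with a plain Euclidean nearest neighbor classifier on synthetic data. It is not our actual classification code: all function names and the toy data are assumptions. The point is the two reference values it produces. Well-separated classes should score near 100%, while features whose labels carry no information should score near 10% for ten balanced classes, which is what a broken classifier looks like.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample to a fold so every class is spread evenly across folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

def nearest_neighbor_predict(train_X, train_y, test_X):
    """Label each test row with the class of its closest training row (Euclidean)."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

def cross_validated_accuracy(X, y, n_folds=10, seed=0):
    """Mean accuracy over the folds, holding out one fold at a time."""
    fold_of = stratified_folds(y, n_folds, seed)
    accuracies = []
    for k in range(n_folds):
        test = fold_of == k
        pred = nearest_neighbor_predict(X[~test], y[~test], X[test])
        accuracies.append(np.mean(pred == y[test]))
    return float(np.mean(accuracies))

# Toy data (an assumption): 10 classes, 20 samples each, 8-dimensional features.
rng = np.random.default_rng(1)
n_classes, per_class, dim = 10, 20, 8
centers = 10.0 * rng.standard_normal((n_classes, dim))
y = np.repeat(np.arange(n_classes), per_class)
X = centers[y] + rng.standard_normal((len(y), dim))

acc_separable = cross_validated_accuracy(X, y)                 # informative labels
acc_shuffled = cross_validated_accuracy(X, rng.permutation(y)) # labels carry no information
```

Running the same harness on real features with shuffled labels gives a quick baseline: if a classifier scores no better on the true labels than on the shuffled ones, the problem lies in the classifier or the features rather than in the cross-validation.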
For reference, below are the relevant plots presented in Panagakis et al. EUSIPCO 2009: (c) is the linear SVM, and (e) is the nearest neighbor. The only thing resembling our plots is the ordering of the data reduction methods for SVM. Our accuracies are about half what they should be. And for nearest neighbor, nothing looks correct.
Looking at our auditory temporal modulations, I also do not see any resemblance to the plots in Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. For comparison purposes, here is their plot.
(Panagakis et al. appear to evaluate the wavelet transform at more modulation rates than the 8 stated in their papers.) And here is ours:
So, apparently, something is not correct in our feature extraction procedure…