Hello, and welcome to the Paper of the Day (Po’D): Music Genre Classification Edition. Today’s paper comes from ICASSP 2005: A. Meng, P. Ahrendt, and J. Larsen, “Improving Music Genre Classification by Short-time Feature Integration,” in Proc. ICASSP, vol. 5, pp. 497-500, Philadelphia, PA, 2005.
Automatic classification of music by genre has applications in organizing music databases and in music recommendation systems, both of which are commercially attractive. From a signal processing perspective, this problem is interesting because one must move from a low-level sampled audio signal to the perceptual and cognitive high-level-hard-to-define-but-know-it-when-I-hear-it domain of genre. Creating a good music genre classifier thus requires a means of teasing out of a music signal those features that provide the best tell-tale signs of “Rock and Roll,” “Pop,” “Jazz,” “Classical,” and “Grindcore.” Taking a page from how humans perceive and understand music, we expect that the best classifier will make its decision using features from both long time scales (e.g., rhythm) and short ones (e.g., spectral centroid).
The authors propose a music genre feature model with descriptors at three time scales. At the shortest scale (30 ms), they use the first 6 MFCCs (I am not sure whether the 0th coefficient is included). Over 740 ms, they compute: the mean and variance of the MFCCs; the power spectral density (PSD) of the MFCCs in 4 bands; an autoregressive model of the PSD of each MFCC, with model order selected to minimize classification error (on a test set, I presume); the percentage of frames with a high number of zero-crossings; and the percentage of frames with low short-time energy. Finally, over almost 10 seconds, they compute the means, variances, and autoregressive models of the previous descriptors (except the ratios), a beat spectrum computed from the MFCCs, and a beat histogram computed from the audio samples.
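The basic move here, collapsing a block of short-time descriptors into one medium-scale vector, is easy to sketch. Below is a minimal numpy illustration (my own, not the authors’ code): it summarizes a hypothetical block of MFCC frames by per-coefficient mean, variance, and least-squares AR coefficients. The frame count, AR order, and function names are all assumptions for the sake of the example.

```python
import numpy as np

def ar_coefficients(x, order):
    """Fit an AR(order) model to a 1-D descriptor trajectory by least squares."""
    # Design matrix: row for time t holds [x[t-1], ..., x[t-order]], predicting x[t].
    X = np.array([[x[t - j] for j in range(1, order + 1)]
                  for t in range(order, len(x))])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def integrate_block(mfcc_frames, order=3):
    """Collapse a (frames x coefficients) block of short-time MFCCs into one
    medium-scale vector: per-coefficient mean, variance, and AR coefficients."""
    means = mfcc_frames.mean(axis=0)
    variances = mfcc_frames.var(axis=0)
    ar = np.concatenate([ar_coefficients(mfcc_frames[:, c], order)
                         for c in range(mfcc_frames.shape[1])])
    return np.concatenate([means, variances, ar])

# Stand-in for a block of hopped 30 ms frames: 74 frames of 6 MFCCs.
rng = np.random.default_rng(0)
block = rng.standard_normal((74, 6))
features = integrate_block(block)
print(features.shape)  # 6 means + 6 variances + 6*3 AR coefficients -> (30,)
```

The same pattern then repeats at the next scale up: the medium-scale vectors themselves become the trajectories that are averaged and AR-modeled over the almost-10-second windows.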
The authors consider early and late information fusion schemes. Early fusion of descriptors occurs in the construction of autoregressive models of low-level descriptors computed over several frames, as well as in the estimation of the means, variances, and power spectral densities of each descriptor. The authors also reduced the dimensionality of some of the features to about 20 by PCA. The late fusion comes from combining the votes of several trained classifiers. The authors trained two different classifiers: 1) a single-layer neural network; and 2) a Gaussian classifier (Bayes’ rule with Gaussian class-conditional densities, I presume?).
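Both fusion ingredients fit in a few lines of numpy. The sketch below (my own, with made-up function names; not from the paper) shows PCA by SVD for reducing early-fused feature vectors to about 20 dimensions, and a per-sample majority vote over classifier predictions for late fusion.

```python
import numpy as np

def pca_reduce(X, n_components=20):
    """Project the rows of X onto their top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)          # center each feature dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # coordinates in the reduced space

def majority_vote(predictions):
    """Late fusion: per-sample majority vote over integer class labels.

    predictions: (n_classifiers, n_samples) array of predicted labels.
    """
    preds = np.asarray(predictions)
    return np.array([np.bincount(preds[:, i]).argmax()
                     for i in range(preds.shape[1])])

# Three classifiers voting on three samples.
votes = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [2, 1, 2]])
print(majority_vote(votes))  # [0 1 2]

X = np.random.default_rng(1).standard_normal((50, 40))
print(pca_reduce(X).shape)  # (50, 20)
```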
To test the performance of their features and the two types of feature fusion, the authors used two datasets: one consisting of 100 pieces distributed equally among four genres (hard rock, jazz, pop, and techno), and one consisting of 354 30-second music samples acquired from Amazon.com, distributed equally among six genres (classical, country, jazz, rap, rock, and techno). Training and test data were created using 30-second segments. For the first dataset, the authors confirmed the labels with a human genre classification experiment.
The authors performed several tests using combinations of all the features presented. Classification error on dataset 1 was lowest (5%) using autoregressive models of MFCC features computed over 10 seconds, which represents one way to fuse low-level features into long time-scale features. The same features performed much worse on the second dataset, with around 30% classification error.
Although I am not convinced music genre is something that can and should be defined, I can see its benefits for organizing music data. I believe a better approach is letting the data define itself by clustering, instead of imposing labels like “country” and “jazz”. However, it is clear that any clusters found by considering only features from short time-scales without considering longer time scales will have limited usefulness.
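For what it is worth, that label-free alternative is simple to prototype. Here is a bare-bones Lloyd’s k-means in numpy (my sketch, not anything from the paper) that groups feature vectors without any genre labels; a real system would run it on integrated audio features rather than the toy 2-D points used here, and the deterministic initialization is just a convenience for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Bare-bones Lloyd's k-means: returns a cluster label for each row of X."""
    # Deterministic init: k samples evenly spaced through the dataset.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center ...
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated toy "genres" in feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
labels = kmeans(X, 2)
print(np.unique(labels[:20]), np.unique(labels[20:]))  # [0] [1]
```

Whether the clusters that emerge from real audio features would resemble anything like “country” or “jazz” is, of course, exactly the open question.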