I have started to poke at some of the questions I raised a few months ago about music genre recognition. In particular, I want to see how the faults in this well-used dataset affect classification results. One might assume these faults can only reduce the performance of algorithms; however, the faults of the dataset (1000 excerpts) are so varied that I cannot be sure until I do further testing.
For instance, with so many exact replicas (54), the same features can appear in both the training and test sets during cross-validation, which will of course inflate the mean performance in particular folds. There are also many excerpts from the same artist and/or album (e.g., 28 from Bob Marley, 24 from Britney Spears), from the same recording (12), and versions (12). Thus, the producer effect and artist effect will inflate performance. With all the mislabelings (118), however, accuracy could be hurt. And when the training set contains multiple copies of the same features, the training data is not as rich as it appears, which will decrease performance. All in all, the good and bad effects of the faults may cancel each other out; indeed, the results of classifiers run on this faulty dataset do not appear too different from those obtained using other music genre datasets (which might have similar problems, but I am not sure). So this is an interesting question.
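To see why replicas leaking across folds inflate performance, here is a small synthetic illustration (random features and labels, nothing from the dataset itself): since the labels are pure noise, true accuracy is chance, yet naive cross-validation scores above chance simply because exact duplicates of test points sit in the training folds; group-aware splitting, which keeps duplicates on one side of each split, removes the leak.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GroupKFold, KFold

rng = np.random.default_rng(0)

# 200 distinct "excerpts" with noise features and random labels from 4 classes.
X = rng.normal(size=(200, 20))
y = rng.integers(0, 4, size=200)
groups = np.arange(200)

# Add exact replicas of 50 excerpts (same features, same label, same group).
dup = rng.choice(200, size=50, replace=False)
X = np.vstack([X, X[dup]])
y = np.concatenate([y, y[dup]])
groups = np.concatenate([groups, groups[dup]])

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Naive CV: a replica can land in the test fold while its twin is in training.
naive = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))

# Group-aware CV: replicas always fall on the same side of the split.
grouped = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(5))

print(naive.mean(), grouped.mean())  # naive is above chance; grouped is near 0.25
```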
Secondly, I remain to be convinced that, for the many algorithms proposed to recognize genre, it is actually the music (rhythm and instrumentation, for instance), and not extramusical features (such as compression), driving the recognition. In other words, I doubt that even the simplest problems have been solved.
So, to investigate these questions, I am using the best-performing method I have found. This approach is well described in:
- J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, pp. 473-484, June 2006.
- J. Bergstra, “Algorithms for classifying recorded music by genre,” Master’s thesis, Université de Montréal, Montréal, Canada, Aug. 2006.
The description of this work leaves out so few details that I needed only a few hours to program the system and find results validating those reported by Bergstra et al. (I had to make a few assumptions, but nothing that apparently kills the system.)
Their approach first finds features for 46.4 ms FRAMES (1024 samples). These features comprise 40 MFCCs; the zero-crossing rate; the spectral mean, variance, and 16 quantiles; and the error of a 32nd-order linear predictor. I then take the mean and variance of each dimension over SEGMENTS of 129 frames, giving a 120-dimensional feature vector for each segment. Segments do not overlap, so each 30 s sound excerpt yields about 10 labeled feature vectors.
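The frame-then-segment computation can be sketched in numpy as follows. For brevity only the zero-crossing rate and spectral statistics are computed per frame; the 40 MFCCs and the linear-prediction error would simply be additional per-frame columns handled identically by the segment aggregation, so the dimensions here are smaller than the 120 described above. The frame size, segment length, and sample rate follow the text; everything else is my own scaffolding.

```python
import numpy as np

FRAME = 1024   # 46.4 ms at 22050 Hz
SEG = 129      # frames per segment

def frame_features(x):
    """Per-frame features of a 1-D signal: zero-crossing rate, spectral
    mean, spectral variance, and 16 spectral quantiles (19 dims/frame)."""
    n = len(x) // FRAME
    frames = x[:n * FRAME].reshape(n, FRAME)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    mag = np.abs(np.fft.rfft(frames, axis=1))        # magnitude spectrum
    smean = mag.mean(axis=1)
    svar = mag.var(axis=1)
    quant = np.quantile(mag, np.linspace(0.05, 0.95, 16), axis=1).T
    return np.column_stack([zcr, smean, svar, quant])

def segment_features(F):
    """Mean and variance of each per-frame dimension over non-overlapping
    segments of SEG frames: one (2 * dims)-vector per segment."""
    m = F.shape[0] // SEG
    S = F[:m * SEG].reshape(m, SEG, -1)
    return np.concatenate([S.mean(axis=1), S.var(axis=1)], axis=1)

sr = 22050
x = np.random.default_rng(0).normal(size=30 * sr)  # stand-in for a 30 s excerpt
X = segment_features(frame_features(x))
print(X.shape)
```

With non-overlapping frames, a 30 s excerpt gives 645 frames and hence 5 segments; overlapping the frames (as one may assume from the "about 10 feature vectors per excerpt") doubles that count.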
Separating the songs with 5-fold stratified cross-validation, I train a set of decision stumps with AdaBoost.MH, a multiclass extension of AdaBoost. (A stump is a decision tree with two leaves.) After about 1000 training iterations, I test the classifier, taking the mode of the roughly 10 classified segments of each excerpt to determine its genre. This process is made extremely easy by MultiBoost and a short MATLAB-to-ARFF script I had to modify.
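The classify-segments-then-vote step might look like the following sketch. Note that scikit-learn provides no AdaBoost.MH; its AdaBoostClassifier over depth-1 trees stands in here for boosted decision stumps, and the toy data (all sizes and offsets are my invention) exists only to exercise the mode-voting logic, so this is an illustration rather than a reproduction of the method.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: 40 excerpts x 10 segments each, 120-dim features, 4 genres.
n_exc, n_seg, dim = 40, 10, 120
excerpt_genre = rng.integers(0, 4, size=n_exc)
# Shift each genre's features so the toy task is learnable.
X = rng.normal(size=(n_exc * n_seg, dim)) + np.repeat(excerpt_genre, n_seg)[:, None]
y = np.repeat(excerpt_genre, n_seg)   # each segment inherits its excerpt's label

# Boosted depth-1 trees (stumps) as a stand-in for AdaBoost.MH.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
clf.fit(X, y)

# Classify each segment, then take the mode over each excerpt's segments.
seg_pred = clf.predict(X).reshape(n_exc, n_seg)
exc_pred = np.array([np.bincount(p).argmax() for p in seg_pred])
accuracy = np.mean(exc_pred == excerpt_genre)
```

Even when individual segments are misclassified, the mode over an excerpt's segments tends to recover the correct label, which is the point of the voting step.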
Below we see the confusion matrix resulting from this method, which in this case has an accuracy of about 73.4%. This uses only decision stumps and 1000 training iterations, yet the results are in the neighborhood of the 83% accuracy Bergstra et al. report using larger trees and 2500 training iterations.
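For reference, the matrix and the overall accuracy are tabulated from the excerpt-level predictions along these lines; the labels below are simulated to mimic a roughly 73%-accurate classifier, not the actual results, and only the ten genre names come from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=1000)
# Simulated predictions: correct about 73% of the time, random otherwise.
correct = rng.random(1000) < 0.73
y_pred = np.where(correct, y_true, rng.integers(0, 10, size=1000))

C = confusion_matrix(y_true, y_pred)   # row: true genre, column: predicted
acc = accuracy_score(y_true, y_pred)
print(acc)
```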
Now, knowing the faults in this dataset, it is not worthwhile to wax poetic on how some of these numbers make sense, e.g., "Rock is close to Country, Disco, and Metal, so those confusions are not unexpected," or "Hiphop and Reggae share the same roots, and we can see that here." Instead, armed with this classification method, it is time to run this process many times and see which excerpts are the repeat offenders, and how the variety of faults in this dataset benefits or hurts the results. I wonder too about the resilience of the AdaBoost.MH approach; I will definitely have to consider that in my analysis by comparing the results against another classification approach.