Hello, and welcome to the Paper of the Day (Po’D): Automatic classification of musical genres using inter-genre similarity edition. Today’s paper is: U. Bağci and E. Erzin, “Automatic classification of musical genres using inter-genre similarity,” IEEE Signal Proc. Letters, vol. 14, pp. 521-524, Aug. 2007. Related to this work are the following four:
- U. Bağcı and E. Erzin, “Boosting classifiers for music genre classification,” in Proc. 20th Int. Symp. Comput. Inform. Sci. (ISCIS’05), Istanbul, Turkey, Oct. 2005, pp. 575-584.
- U. Bagci and E. Erzin, “Inter genre similarity modeling for automatic music genre classification,” in Proc. IEEE Signal Process. Comm. Apps., pp. 1-4, Apr. 2006.
- U. Bagci and E. Erzin, “Inter genre similarity modeling for automatic music genre classification,” in Proc. DAFx 2006.
- U. Bagci and E. Erzin, “Inter genre similarity modelling for automatic music genre classification,” arXiv:0907.3220v1, 2009.
(Is the DAFx paper an English translation of the IEEE conference paper written in Turkish?)
This work is next on my docket for reproduction, not least because it reports a classification accuracy of over 92% on the GTZAN dataset.
Most of the time in our day-to-day machine learning experiences, we want more data in order to build better models. In this paper, however, the approach is to find the music excerpts of each genre that are the most difficult to classify, and then use only those to build the classifier. The idea is that by learning the difficult examples, we will also learn the easy ones.
The features used in this work are 13 MFCCs (with or without the zeroth coefficient?), spectral centroid, roll-off, flux, and zero-crossing rate, all computed from each 25 ms window shifted by 10 ms. (I think. They talk about “25 ms overlapping audio windows … for each 10 ms frame,” and later it is written: “The resulting … feature vectors are extracted for each 10 ms audio frame with an overlapping window of size 25 ms.” Huh?)
Then, for a set of time-ordered features, the system computes weighted first and second derivatives.
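Finally, the system creates for each analysis window (except the first few because of the derivatives) a single feature vector by concatenating all of these features. It has 51 dimensions: 13 MFCCs plus 4 spectral features gives 17, and adding their first and second derivatives gives 3 × 17 = 51.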
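To make the 51 dimensions concrete, here is a minimal sketch of the stacking step. It is not the authors' code. I am assuming the usual regression-style delta over ±2 frames as the "weighted" derivative, and I trim frames at both ends where the derivative window does not fit; the function names and the value of `K` are mine.

```python
def delta(frames, K=2):
    """Regression-style derivative over +/- K frames (an assumed formula).

    frames: list of equal-length feature vectors, one per 10 ms hop.
    Edge frames are padded by repetition here and trimmed later.
    """
    T = len(frames)
    norm = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[0])):
            acc = 0.0
            for k in range(1, K + 1):
                ahead = frames[min(t + k, T - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def stack_features(static, K=2):
    """Concatenate static, delta, and delta-delta features per frame.

    static: list of 17-dimensional vectors (13 MFCCs + 4 spectral features).
    Returns 51-dimensional vectors, dropping 2*K frames at each end where
    the second derivative would depend on the edge padding.
    """
    d1 = delta(static, K)
    d2 = delta(d1, K)
    full = [s + a + b for s, a, b in zip(static, d1, d2)]
    return full[2 * K: len(full) - 2 * K]
```

With 20 input frames of 17 features each, this returns 12 vectors of 51 values: positions 0-16 hold the static features, 17-33 the first derivatives, and 34-50 the second derivatives.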
The system next models each genre by mixtures of Gaussians in this 51-dimensional space using features from the training data. Then, the system classifies each “frame” (is that a 25 ms window?) of the same training data. Taking only the misclassified frames in each genre, the system constructs a new model for these features, which represents the “inter-genre similarity” (IGS) distribution. (This kind of reminds me of a universal background model in speaker identification.)
Finally, the system updates its models of each genre by using only the correctly classified frames.
This leaves us with 10 genre models and 1 model of the IGS.
This can be done in an iterative fashion as well, so as to tune all the models.
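Here is how I picture one pass of that procedure, as a rough sketch only. To keep it short and dependency-free I swap the mixtures of Gaussians for a single diagonal Gaussian per genre, which is a simplification of mine and not what the paper does. All names here (`fit_gaussian`, `log_lik`, `split_and_refit`) are hypothetical, and the iterative tuning is not shown.

```python
import math


def fit_gaussian(frames):
    """Fit one diagonal Gaussian (stand-in for the paper's mixtures)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def log_lik(model, x):
    """Log-likelihood of one frame under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def split_and_refit(frames_by_genre):
    """One pass: fit genre models, classify the training frames,
    then refit genres on correct frames and IGS on misclassified ones."""
    models = {g: fit_gaussian(fr) for g, fr in frames_by_genre.items()}
    kept = {g: [] for g in frames_by_genre}
    confused = []
    for g, frames in frames_by_genre.items():
        for x in frames:
            best = max(models, key=lambda h: log_lik(models[h], x))
            if best == g:
                kept[g].append(x)
            else:
                confused.append(x)
    # Keep the old model if a genre had no correctly classified frames.
    genre_models = {g: fit_gaussian(fr) if fr else models[g]
                    for g, fr in kept.items()}
    igs_model = fit_gaussian(confused) if confused else None
    return genre_models, igs_model
```

Calling `split_and_refit` on a dictionary mapping each genre name to its list of training frames returns the refit genre models and the IGS model (or `None` if nothing was misclassified).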
Now, to classify an excerpt producing a set of feature vectors (in a decision window of, say, 3 or 30 s duration), the system computes for each genre a weighted sum of log posteriors, with weights acting as a 1-0 loss function with respect to the IGS model (i.e., if the conditional probability of observing a feature given the model of genre \(n\) is smaller than the conditional probability of observing that feature given the IGS model, then the weight of that posterior is zero; otherwise it is one).
This acts to minimize the uncertainty imparted to a classifier by frames that are difficult to model by any genre.
For comparison, the authors set all these weights to one so the IGS model is never used in classification (called the “flat classifier”).
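The decision rule is simple enough to sketch directly from the description above. This is my reading of it, not the authors' implementation. The function takes per-frame log posteriors and log-likelihoods as given, since I am not sure how the paper computes the posteriors; the name `classify_window` and its arguments are hypothetical.

```python
def classify_window(log_post, log_lik=None, igs_log_lik=None):
    """Pick a genre for one decision window.

    log_post:    dict genre -> list of per-frame log posteriors.
    log_lik:     dict genre -> list of per-frame log-likelihoods.
    igs_log_lik: list of per-frame log-likelihoods under the IGS model.
    Leaving the last two as None gives the flat classifier
    (all weights equal to one).
    """
    use_igs = log_lik is not None and igs_log_lik is not None
    scores = {}
    for genre, posts in log_post.items():
        total = 0.0
        for t, lp in enumerate(posts):
            weight = 1.0
            # 1-0 weight: drop the frame for this genre when the IGS
            # model explains it better than the genre model does.
            if use_igs and log_lik[genre][t] < igs_log_lik[t]:
                weight = 0.0
            total += weight * lp
        scores[genre] = total
    best = max(scores, key=scores.get)
    return best, scores
```

For example, take three frames where the middle one is explained best by the IGS model. With `log_post = {"rock": [-0.2, -3.0, -0.3], "metal": [-1.7, -0.05, -1.4]}` the flat classifier picks metal, because the middle frame dominates. Passing `log_lik = {"rock": [-10, -30, -11], "metal": [-12, -9, -13]}` and `igs_log_lik = [-20, -5, -20]` zeroes the middle frame for both genres, and the decision flips to rock.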
To test this approach, the article uses the GTZAN dataset with 2-fold cross-validation, and computes mean classification accuracy (no std. dev.).
Testing various numbers of Gaussians and different feature sets,
the article reports for the flat classifier mean accuracies of about 70% using only MFCCs, and their first and second weighted derivatives modeled by mixtures of 16-48 Gaussians (with diagonal covariances) over 30 s decision windows.
For these same features and parameters but with the 1-0 loss decision function and the IGS model (not tuned), the article reports mean accuracies around 85%.
(With a mixture of 48 Gaussians we see mean accuracy above 88%.)
The confusion matrix in this article shows some interesting items.
First, when using the IGS model vs. the flat classifier, most confusions decrease.
We go from 19 Rock confused by the flat classifier as Country to only 4 using the IGS model.
But in two cases the confusions increase by a lot: we go from 4 confusions of Country as Metal by the flat classifier, to 11 by the IGS model. What are these Country Metal tunes? (I think one must be “White Lightning” by George Jones, a real boozer of a rock star, by the way.)
And we go from 9 confusions of Rock as Metal by the flat classifier to 17 by IGS, which might not be so troublesome considering excerpts by Queen appear in both categories (including the exact same excerpt).
Finally, the paper reports that the approach created by tuning the genre and IGS models five times obtains a mean classification accuracy of 92.4% over a 30 s decision window.