Test results for inter-genre similarity, part 1

Yesterday I reviewed the work on music genre classification by boosting with inter-genre similarity (IGS), and I now have some results.
After adapting my MFCC code to fit the description in the text,
and using the excellent PRTools toolbox,
it took me only a matter of minutes to write the code.
Running it, however, takes hours.

Essentially, the core of the algorithm looks like this:

% create initial models (mixture of Gaussians classifier)
W = mogc(traindataset,numGaussians);
% classify
results = traindataset*W;
predictedLabels = results*labeld;
% find those frames that are wrongly classified
idx_wrong = ~(predictedLabels == traindataset.nlab);
% create new dataset from those wrongly classified
newlabels = traindataset.nlab;
newlabels(idx_wrong) = 0; % this is the IGS class
traindataset = dataset(traindataset.data,newlabels);
traindataset = setprior(traindataset,0); % prior 0: treat all classes (genres and IGS) as equally probable
% create final models
W = mogc(traindataset,numGaussians);
% test
results = testdataset*W;
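To make the relabeling step concrete, here is a toy sketch in plain MATLAB with no PRTools calls. The two label vectors are made up for illustration; they stand in for traindataset.nlab and for the labels predicted by the first-pass models.

```matlab
% Hypothetical true and predicted labels for 8 frames from 3 genres
truelabels      = [1 1 1 2 2 3 3 3]';
predictedLabels = [1 2 1 2 2 1 3 2]';

% frames the first-pass models got wrong
idx_wrong = (predictedLabels ~= truelabels);

% move the wrongly classified frames to the IGS class (label 0)
newlabels = truelabels;
newlabels(idx_wrong) = 0;

% fraction of all frames that ends up in the IGS class
fracIGS = mean(idx_wrong);

disp(newlabels');
fprintf('%.1f%% of frames relabeled as IGS\n', 100*fracIGS);
```

In this toy case frames 2, 6 and 8 are misclassified, so 3 of the 8 frames (37.5%) are handed to the IGS class before the final models are trained.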

I am assuming the priors for the genre categories and the IGS model are all the same, 1/11 (ten genres plus the IGS class). (Nothing about this is mentioned in the paper.)
Then, to classify an excerpt, I take the posteriors of each class for a set of 2770 frames.
(Bagci and Erzin say they use 3000 frames for all excerpts, but I don’t believe it, since some excerpts in GTZAN are shorter than 30 seconds.)

sumlogpost = -Inf*ones(numclasses,1);
numframestoconsider = zeros(numclasses,1);
% compare posteriors of each class to those of IGS model (column 1)
for jj=1:numclasses
  posteriorsforthisclass = posteriors(:,jj+1);
  % find those frames that have posteriors above that of IGS
  idx = (posteriorsforthisclass > posteriors(:,1));
  numframestoconsider(jj) = sum(idx);
  if numframestoconsider(jj)
    % mean log posterior over the frames that beat IGS
    sumlogpost(jj) = sum(log(posteriorsforthisclass(idx)))/numframestoconsider(jj);
  end
end
% choose the class with the largest mean log posterior
[~,pred] = max(sumlogpost);
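As a quick check of the rule above, the sketch below runs it on a made-up posterior matrix for 4 frames and 2 genre classes, with the IGS posterior in column 1 as in my code. The numbers are invented purely for illustration, and fracused is a name I introduce here for the share of frames each class gets to use.

```matlab
% Hypothetical posteriors: rows are frames, columns are [IGS, class 1, class 2]
posteriors = [0.6 0.3 0.1;
              0.2 0.5 0.3;
              0.1 0.2 0.7;
              0.3 0.1 0.6];
numclasses = size(posteriors,2) - 1;
numframes  = size(posteriors,1);

sumlogpost = -Inf*ones(numclasses,1);
numframestoconsider = zeros(numclasses,1);
for jj = 1:numclasses
  p = posteriors(:,jj+1);
  idx = (p > posteriors(:,1));   % frames where this class beats IGS
  numframestoconsider(jj) = sum(idx);
  if numframestoconsider(jj)
    sumlogpost(jj) = sum(log(p(idx)))/numframestoconsider(jj);
  end
end
[~,pred] = max(sumlogpost);

% fraction of frames each class was allowed to use
fracused = numframestoconsider/numframes;
```

Here class 2 wins using 3 of the 4 frames, while class 1 gets only 2. Frame 1 is dropped for both classes because the IGS posterior is the largest there.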

And that is all.

When we do not check whether a frame is better explained by the IGS model (the flat classifier), we get the following confusion table for 2-fold CV in GTZAN using GMMs of order 8. (I use full covariance matrices, however, instead of the diagonal ones specified in the text, and 27.7 s decision windows instead of 30 s.)
(Recall is along diagonal, precision is the right-hand column, F-measure is last row, and element at lower right is classification accuracy.)
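For anyone who wants to reproduce the summary statistics in such a table, here is a minimal sketch in plain MATLAB. The counts are made up, and I assume that rows index predicted labels and columns index true labels, which is what the layout described above suggests.

```matlab
% Hypothetical counts: C(i,j) = excerpts of true class j labeled as class i
C = [8 1 2;
     1 7 3;
     1 2 5];

recall    = diag(C)' ./ sum(C,1);    % per true class (column sums)
precision = diag(C)' ./ sum(C,2)';   % per predicted class (row sums)
fmeasure  = 2*precision.*recall ./ (precision + recall);
accuracy  = sum(diag(C)) / sum(C(:));
```

With these toy counts, 20 of the 30 excerpts land on the diagonal, so the accuracy is 20/30.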
That is a classification accuracy of about 63%.
The mean classification accuracy
reported by Bagci and Erzin is about 67%.
I think we are certainly in the neighborhood.

Now, when we take the IGS model into account,
we observe the following confusion table using the exact same folds.
That is a classification accuracy in the high-tens — like the weather across much of the US.
It apparently thinks most things are Rock and Metal.
The mean classification accuracy reported by Bagci and Erzin
for this IGS is about 87%.
No longer are we in the neighborhood, but on the Tom Waits train to the other side of hell.

When we instead make the classifier decide based upon only those frames that have a higher posterior in the IGS model, we observe the following confusions.
This is indeed a statistically significant improvement, but it is confusing why this should work better than before.

So, on the basis of what number of frames in each class
is this classifier making its decision?
Let’s look at the distribution of the percentage of frames the classifier uses to make classifications that are incorrect.
Here we see that for Blues excerpts the classifier is looking at only about 17% of the frames when it mislabels them. For Classical, it is looking at fewer than about 10% of the frames (277 frames of 25 ms each). For the most part, we would expect wrong classifications to occur on the basis of too few reliable frames. So this makes sense.

Now let’s look at the distribution for when classifications are correct —
which, as we see above, this classifier rarely is.
This is quite a different picture.
Except for Country, Disco, Hip hop, Pop, and Rock,
most of the correct decisions are made using a majority of frames.
Particularly worrisome is that most of the frames of excerpts in Country, Hip hop, and Rock, and all of the frames of Pop, are too well-described by the IGS model.

I am now going to verify that things are working correctly in PRTools.
One problem might be the fact that I am estimating full covariance matrices, whereas Bagci and Erzin estimate diagonal covariance matrices.
The much larger number of parameters to estimate in my model may be contributing to the bad results.
Another problem might be the features. I will reduce to MFCCs without the delta and delta delta coefficients to see if that makes a difference.
Yet another problem is that too many frames appear to end up in the IGS model. Maybe those need to be capped somehow.
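I have not settled on how to cap the IGS class, but one hypothetical way is to keep only a random subset of the misclassified frames as IGS training data and drop the rest. The sketch below is plain MATLAB on made-up labels; the cap maxIGS and the choice to discard the surplus frames (rather than return them to their genres) are my assumptions, not anything from the paper.

```matlab
% Hypothetical labels after relabeling: 0 marks the IGS class
newlabels = [0 0 0 0 0 0 1 1 2 2]';
maxIGS = 3;   % assumed cap on the number of IGS training frames

idxIGS = find(newlabels == 0);

keep = true(size(newlabels));
if numel(idxIGS) > maxIGS
  % randomly choose which surplus IGS frames to drop from training
  shuffled = idxIGS(randperm(numel(idxIGS)));
  keep(shuffled(maxIGS+1:end)) = false;
end
cappedlabels = newlabels(keep);
```

After capping, 3 IGS frames remain alongside the 4 genre frames; the same keep mask would have to be applied to the feature matrix before retraining.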

