Appearing at ISMIR 2011 was the following intriguing paper:
C. Marques, I. R. Guilherme, R. Y. M. Nakamura, and J. P. Papa, “New Trends in Musical Genre Classification Using Optimum-Path Forest,” Proc. ISMIR, 2011.
As it reports classification accuracies in GTZAN above 98.8%,
it certainly caught my attention.
The image below places this result, that of the optimum-path forest, in the context of the classification accuracies in GTZAN reported in 94 other works:
João has filled in a critical detail missing from the paper:
their results come from classifying every feature vector (each computed from a 23 ms window),
not the 30 s excerpts.
This is even more curious to me, since experience shows such frame-level classification
should perform very poorly … unless the partitioning of the dataset into training and test sets distributes features from the same excerpt across both sets, instead of keeping them separated.
Looking at the code behind the “opf_split” program confirms that it takes no care to avoid a biased partition.
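To see why this matters, here is a small sketch (my own, not the authors' code) contrasting a frame-level random split with an excerpt-level split. The data are hypothetical: frames of one excerpt cluster tightly around that excerpt's centroid, as consecutive 23 ms MFCC frames tend to do, and genre labels are assigned per excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 "excerpts", 40 frames each, 10 "genres".
n_excerpts, n_frames, dim = 50, 40, 8
centroids = rng.normal(size=(n_excerpts, dim))
labels = rng.integers(0, 10, size=n_excerpts)
X = (centroids[:, None, :]
     + 0.05 * rng.normal(size=(n_excerpts, n_frames, dim))).reshape(-1, dim)
y = np.repeat(labels, n_frames)
excerpt_id = np.repeat(np.arange(n_excerpts), n_frames)

def knn1(Xtr, ytr, Xte):
    # brute-force 1-nearest-neighbour classifier
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(1)]

# Frame-level random split: frames of every excerpt land in both sets.
idx = rng.permutation(len(X))
half = len(X) // 2
tr, te = idx[:half], idx[half:]
acc_frame = (knn1(X[tr], y[tr], X[te]) == y[te]).mean()

# Excerpt-level split: each excerpt's frames stay together.
eidx = rng.permutation(n_excerpts)
tr = np.isin(excerpt_id, eidx[:n_excerpts // 2])
acc_excerpt = (knn1(X[tr], y[tr], X[~tr]) == y[~tr]).mean()

print(f"frame-level split accuracy:   {acc_frame:.2f}")   # near 1.0
print(f"excerpt-level split accuracy: {acc_excerpt:.2f}") # near chance
```

In the frame-level split, nearly every test frame has a near-identical sibling from the same excerpt in the training set, so even a trivial classifier scores almost perfectly; split by excerpt, the same classifier falls to chance.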
Another curious detail in the paper is that they report having 33,618 MFCC vectors from the 1000 excerpts in GTZAN.
I get 1,291,628 MFCC vectors.
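A back-of-envelope check supports a count on the order of a million. The assumptions here are mine: GTZAN's 22,050 Hz sampling rate, and non-overlapping 512-sample windows, which last about 23 ms.

```python
# Rough frame count for GTZAN under assumed analysis parameters
sr = 22050            # Hz, GTZAN sampling rate
win = 512             # samples per window (~23.2 ms, non-overlapping)
excerpt_s = 30        # seconds per excerpt
n_excerpts = 1000

frames_per_excerpt = (excerpt_s * sr) // win
total = frames_per_excerpt * n_excerpts
print(frames_per_excerpt, total)   # 1291 frames/excerpt, 1,291,000 total
```

That lands within a fraction of a percent of my 1,291,628 (the excerpts run slightly over 30 s). By the same arithmetic, 33,618 vectors would amount to only about 34 frames per excerpt, which is hard to square with 23 ms windows over 30 s of audio.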
So, I decided to run this evaluation as I think they did:
./runOPF.sh alldata.bin 0.5 0.5 1 1
where “alldata.bin” is an OPF-formatted file of the features I compute in MATLAB,
the first two numbers specify the train/test split, and the last two denote whether feature normalization is used and how many independent trials to run.
Here is some of the output:
Training time: 23248.525391 seconds
Testing time: 30824.958984 seconds
Supervised OPF mean accuracy 74.323967
We see that after nearly 15 hours of computation,
we don’t get anywhere near the 98.8% accuracy.
And without feature normalization, the accuracy rises only to about 76.3%.
The paper reports that the training and testing times for OPF in GTZAN
are 9 and 4 seconds, respectively.
Respectfully, my computer is not so slow as to cause a 7,000-fold
increase in computation time.
I tried several other things to increase the accuracy,
but nothing was working.
Then I tried training and testing on the same fold, and got an accuracy of 99.97%.
Joao confirms that this appears to be at least part of what happened.
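The near-perfect figure is exactly what one expects when test data overlaps training data. Here is a sketch (mine, not OPF itself) using a 1-nearest-neighbour classifier as a stand-in for a prototype-based classifier like OPF: even with completely random labels, testing on the training set scores perfectly, because every test point's nearest neighbour is itself.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))        # random features
y = rng.integers(0, 10, size=500)     # random "genre" labels

# 1-NN evaluated on its own training set
d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
pred = y[d.argmin(1)]                 # nearest neighbour of each point is itself
acc = (pred == y).mean()
print(f"train-on-test accuracy: {acc:.4f}")   # 1.0000, despite random labels
```

No real learning is needed to reach such numbers; memorization suffices.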
Now, I am going to run the same experiment, but using a proper partitioning,
and the fault filtering necessary for evaluating systems with GTZAN. I predict that we will see the classification accuracy drop from about 74% to 55% or below.