Finally published is my article, Classification accuracy is not enough: On the evaluation of music genre recognition systems. I made it completely open access and free for anyone.
Some background: In my paper Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?,
I perform three different experiments to determine how well two state-of-the-art systems for music genre recognition are recognizing genre. In the first experiment, I find the two systems are consistently making extremely bad misclassifications. In the second experiment, I find the two systems can be fooled by such simple transformations that they cannot possibly be listening to the music. In the third experiment, I find their internal models of the genres do not match how humans think the genres sound.
Hence, it appears that the systems are not recognizing genre in the least.
However, this seems to contradict the fact that they achieve extremely good classification accuracies, and have been touted as superior solutions in the literature.
Turns out, Classification accuracy is not enough!
In this article, I show why this is the case, and attempt to torpedo “business as usual”,
i.e., using evaluation approaches that are standard in machine learning.
I step through the evaluation of three different music genre recognition systems, pointing out published works making arguments at each level of evaluation.
I start with classification accuracy, and move deeper into precision, recall, F-scores, and confusion tables. Most published work stops with those, invalidly concluding that some amount of genre recognition is evident, or that an improvement is clear, and so on.
However, “business as usual” provides no evidence for these claims.
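To make concrete what these figures of merit are built from, here is a minimal sketch in Python that derives accuracy, precision, recall, and F-score from a confusion table. The function name and the three-class counts are made up for illustration and are not taken from the article. Everything is computed from counts alone: nothing in the calculation looks at which excerpts were labeled, or why.

```python
def metrics_from_confusion(conf):
    """conf[i][j] = number of excerpts of true class i labeled as class j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    # Accuracy: fraction of excerpts on the diagonal.
    accuracy = sum(conf[i][i] for i in range(n)) / total
    per_class = []
    for k in range(n):
        tp = conf[k][k]
        labeled_k = sum(conf[i][k] for i in range(n))  # column sum
        true_k = sum(conf[k])                          # row sum
        precision = tp / labeled_k if labeled_k else 0.0
        recall = tp / true_k if true_k else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f_score))
    return accuracy, per_class


# Hypothetical counts for three genres (illustration only).
toy = [
    [8, 2, 0],
    [1, 7, 2],
    [1, 1, 8],
]
accuracy, per_class = metrics_from_confusion(toy)
print(round(accuracy, 3), [tuple(round(v, 3) for v in c) for c in per_class])
```

Two systems that produce the same table receive identical scores, even if one of them is relying on factors that have nothing to do with the music.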
So, I move deeper to evaluate the behaviors of the three systems,
and to actually consider the content of the excerpts, i.e., the music.
I look closely at what kinds of mistakes the systems make,
and find they all make very poor yet “confident” mistakes.
I demonstrate the latter by looking at the decision statistics of the systems.
For each system, the decision statistics differ little between a correct classification
and an incorrect one.
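One simple way to look for this kind of behavior is to compare the decision statistic on correctly and incorrectly labeled excerpts. The sketch below assumes each system exposes a single confidence value per excerpt (for example, the largest class posterior). That assumption, the function name, and the numbers are all hypothetical and do not come from the article.

```python
from statistics import mean


def confidence_by_outcome(records):
    """records: list of (confidence, is_correct) pairs, one per excerpt.

    Returns the mean confidence on correct and on incorrect decisions.
    """
    correct = [c for c, ok in records if ok]
    wrong = [c for c, ok in records if not ok]
    return mean(correct), mean(wrong)


# Hypothetical confidences (illustration only).
records = [
    (0.91, True), (0.88, False),
    (0.93, True), (0.90, False),
    (0.86, True), (0.89, False),
]
mean_correct, mean_wrong = confidence_by_outcome(records)
gap = mean_correct - mean_wrong
print(round(mean_correct, 3), round(mean_wrong, 3), round(gap, 3))
```

A small gap means the system is about as sure of itself when it is wrong as when it is right.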
To judge how poor the mistakes are, I test with humans whether the labels selected by the classifiers describe the music.
Test subjects listen to a music excerpt and choose which of two labels they think was given by a human.
Not one of the systems fooled anyone.
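A listening test of this kind can be read against a chance baseline: if a system's labels were indistinguishable from human ones, subjects would pick the human label about half the time. The sketch below computes a one-sided binomial tail probability under that assumption. The choice of test, the function name, and the trial counts are illustrative assumptions, not the analysis or data reported in the article.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical outcome: subjects pick the human label in 18 of 20 trials.
p_value = binomial_tail(18, 20)
print(p_value)
```

A very small tail probability indicates that subjects can reliably tell the system's label from the human's, which is to say the system fooled no one.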
Hence, while all the systems had good classification accuracies,
good precisions, recalls, and F-scores, and confusion matrices that appeared to make sense, a deeper evaluation shows that none of them are recognizing genre,
and thus that none of them are even addressing the problem.
(They are all horses, making decisions based on irrelevant but confounded factors.)
The evaluation approaches of “business as usual” are entirely inadequate to address not only music genre recognition,
but also music emotion recognition (see my ICME paper), and music autotagging (forthcoming submission).
With all the work I have done in the time since submitting this work,
I now immediately reject any submitted work proposing a system in these application areas that has an evaluation going no deeper than classification accuracies and confusion matrices. “Business as usual” is a waste of time, and a considerable amount of such business has been performed.
Will “business as usual” continue?