Classification accuracy is not enough

Finally published is my article, Classification accuracy is not enough: On the evaluation of music genre recognition systems. I made it completely open access and free for anyone.

Some background: In my paper Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?,
I perform three different experiments to determine how well two state-of-the-art systems for music genre recognition are recognizing genre. In the first experiment, I find the two systems are consistently making extremely bad misclassifications. In the second experiment, I find the two systems can be fooled by such simple transformations that they cannot possibly be listening to the music. In the third experiment, I find their internal models of the genres do not match how humans think the genres sound.
Hence, it appears that the systems are not recognizing genre in the least.
However, this seems to contradict the fact that they achieve extremely good classification accuracies, and have been touted as superior solutions in the literature.
Turns out, Classification accuracy is not enough!

In this article, I show why this is the case, and attempt to torpedo “business as usual”,
i.e., using evaluation approaches that are standard in machine learning.
I step through the evaluation of three different music genre recognition systems, pointing out published works making arguments at each level of evaluation.
I start with classification accuracy, and move deeper into precision, recall, F-scores, and confusion tables. Most published work stops with those, invalidly concluding that some amount of genre recognition is evident, or that an improvement is clear, and so on.
However, “business as usual” provides no evidence for these claims.
So, I move deeper to evaluate the behaviors of the three systems,
and to actually consider the content of the excerpts, i.e., the music.
I look closely at what kinds of mistakes the systems make,
and find they all make very poor yet “confident” mistakes.
I demonstrate the latter by looking at the decision statistics of the systems.
For each system, the decision statistics of a correct classification differ little from those of an incorrect one.
To judge how poor the mistakes are, I test with humans whether the labels selected by the classifiers describe the music.
Test subjects listen to a music excerpt and choose which of two labels they think was given by a human.
Not one of the systems fooled anyone.
Hence, while all the systems had good classification accuracies,
good precisions, recalls, and F-scores, and confusion matrices that appeared to make sense, a deeper evaluation shows that none of them are recognizing genre,
and thus that none of them are even addressing the problem.
(They are all horses, making decisions based on irrelevant but confounded factors.)
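For readers who want the "business as usual" figures of merit spelled out, here is a minimal sketch of how accuracy, precision, recall, and F-score fall out of a confusion table. The three genre labels and all the counts are made up for illustration; they are not results from the article, and the function name `summarize` is hypothetical.

```python
def summarize(confusion, labels):
    """Compute standard figures of merit from a confusion table.

    confusion[i][j] is the number of excerpts whose true label is
    labels[i] and which the system labeled labels[j].
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(labels)))
    report = {"accuracy": correct / total}
    for k, name in enumerate(labels):
        true_pos = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        report[name] = {
            "precision": precision,
            "recall": recall,
            "f_score": f_score,
        }
    return report


# Hypothetical counts: 10 excerpts per genre, rows are true labels.
labels = ["blues", "classical", "metal"]
confusion = [
    [8, 1, 1],
    [0, 10, 0],
    [2, 0, 8],
]

report = summarize(confusion, labels)
print(f"accuracy: {report['accuracy']:.3f}")
for name in labels:
    scores = report[name]
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

Every number this produces is a summary of label counts. None of them says which factors drove a decision, which is why the deeper steps described above (inspecting the mistakes, the decision statistics, and the music itself) are needed.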

The evaluation approaches of “business as usual” are entirely inadequate to address not only music genre recognition,
but also music emotion recognition (see my ICME paper), and music autotagging (forthcoming submission).
With all the work I have done in the time since submitting this work,
I now immediately reject any submitted work proposing a system in these application areas that has an evaluation going no deeper than classification accuracies and confusion matrices. “Business as usual” is a waste of time, and a considerable amount of such business has been performed.

Will “business as usual” continue?


2 thoughts on “Classification accuracy is not enough”

  1. Given that classification measures will not be rooted out any time soon, can we conceive of database constructions in which classification *can* be a meaningful measure?
    After all, at some point accuracy will be high enough for usage in applications, no matter the non-sensibility of the remaining errors…


  2. Hi Jort. Hope your summer is going well!
    I give several alternatives in my article, especially sections 5 and 6. These all grow out of classification in some dataset. I take great pains to not say, “Classification accuracy is not useful”. It is just not enough. The behavior of the MGR system is of interest, i.e., whether it is recognizing genre by factors relevant to genre, and classification accuracy, confusion tables, etc., do not capture that. Even with 100% classification accuracy measured in some dataset, one still cannot claim with any validity that the system is recognizing genre, and that it will be useful in the real world for some use case. The answer is not in constructing an ideal dataset (and I don’t know how to do that with real music), but in evaluating MGR systems in ways that have the validity to address the relevant claim.

