Hello, and welcome to the Paper of the Day (Po’D): Evaluating music emotion recognition: Lessons from music genre recognition? edition. Today’s paper is my third accepted for presentation at the 2013 IEEE Int. Conf. on Multimedia and Expo: B. L. Sturm, “Evaluating music emotion recognition: Lessons from music genre recognition?”
The one-line summary of this paper, for those in a hurry: Meaningful conclusions about music genre/emotion recognition systems do not follow from standard approaches to evaluation. Here is why, and what to do about it.
In this paper, we finally identify a major and fundamental problem with most research in music genre/emotion recognition. Starting with my paper “Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?”, I knew something was not right: why do state-of-the-art, high-performing music genre recognition systems behave so strangely?
Surely, I thought, someone else had remarked on this behavior, and taken different approaches to evaluating systems designed to address this extremely complex problem.
So, I looked at how genre recognition systems have been evaluated, reading papers and cataloging their approaches to evaluation: “A Survey of Evaluation in Music Genre Recognition.” This revealed that hardly anyone has thought much about evaluation: the vast majority use the standard machine-learning approach to evaluating supervised learning, i.e., comparing predicted labels against those of a ground truth, and reporting classification accuracy as the figure of merit.
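That standard procedure can be sketched in a few lines. This is a minimal, self-contained illustration with made-up labels, not code or data from any of the works surveyed:

```python
# A minimal sketch of the standard evaluation approach: compare a system's
# predicted labels against ground-truth labels, and report classification
# accuracy as the single figure of merit. All labels are hypothetical.

def classification_accuracy(predicted, ground_truth):
    """Fraction of test excerpts whose predicted label matches the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have equal length")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

ground_truth = ["blues", "metal", "disco", "blues", "jazz"]
predicted    = ["blues", "metal", "blues", "blues", "jazz"]

print(classification_accuracy(predicted, ground_truth))  # prints 0.8
```

The number this produces says how often labels agree; by itself it says nothing about *why* they agree, which is exactly the problem taken up next.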
So, in “Classification Accuracy Is Not Enough: On the Evaluation of Music Genre Recognition Systems”, we show why this evaluation approach, used in 91% of the published work we review in our survey, is incapable of measuring the depth to which a music genre recognition system recognizes genre. In short, a richer kind of evaluation is necessary to determine which proposed systems are promising for solving the problem.
Then, in “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use” (about to be resubmitted), we show that the results of 96 works evaluating classification accuracy in the very same dataset cannot be meaningfully compared in any useful sense. (We also show that, when taking into account all the faults of GTZAN, classification accuracies of systems estimated to be around 80% decay to 50% or lower.)
Now, in today’s Po’D, we identify the principal goals of music genre/emotion recognition, and show why the most widely used approach to evaluating these systems provides nothing relevant. (We do not argue whether genre/emotion recognition is a good idea, whether it is well-posed, and so on. We address only the fundamental problem of evaluating whether a system can recognize music genre or emotion.) In the words of Richard Hamming: “There is a confusion between what is reliably measured, and what is relevant. … Just because a form of measurement is popular has nothing to do with its relevance.”
When a genre recognition system is tested by comparing its labels against those of test data in which many independent variables are uncontrolled (e.g., dynamic compression, dynamic range, loudness, and so on), one cannot logically conclude that its performance is due to a capacity to recognize genre/emotion in music, even when one sees 100% classification accuracy! Classification accuracy in this case, while easy to compute, is irrelevant for reliably measuring whether a system is recognizing genre/emotion. The conclusion does not validly follow unless every independent variable except the one of interest is controlled. This is basic experimental design, and it appears to have been rarely considered in music genre/emotion recognition.
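To make the confound concrete, here is a deliberately silly sketch. Suppose (hypothetically, with made-up numbers) that in some test set every “metal” excerpt happens to be louder than every “jazz” excerpt. A “classifier” that thresholds loudness alone then scores perfect accuracy without attending to anything recognizable as genre:

```python
# Hypothetical illustration of a confounded evaluation: loudness, an
# uncontrolled independent variable, happens to separate the two labels
# in this (entirely made-up) test set.

test_set = [
    {"loudness_db": -8.0,  "genre": "metal"},
    {"loudness_db": -7.5,  "genre": "metal"},
    {"loudness_db": -20.0, "genre": "jazz"},
    {"loudness_db": -18.5, "genre": "jazz"},
]

def loudness_only_classifier(excerpt, threshold_db=-14.0):
    # Decides by loudness alone; knows nothing of rhythm, harmony, or timbre.
    return "metal" if excerpt["loudness_db"] > threshold_db else "jazz"

correct = sum(loudness_only_classifier(x) == x["genre"] for x in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # prints 1.0, yet the system recognizes loudness, not genre
```

The 100% figure is reliably measured, but it measures the wrong thing: only an evaluation that controls loudness (and every other confound) could attribute the performance to genre recognition.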
In short: When Clever Hans trots into town, do not insist on asking more questions of the same kind.
Now, watch this and tell me, why does Maggie look to her handler when it was Oprah who asked the question?