Hello, and welcome to Paper of the Day (Po’D): Horses and more horses edition. Today’s paper is: B. L. Sturm, “A Simple Method to Determine if a Music Information Retrieval System is a ‘Horse’”, IEEE Trans. Multimedia, 2014 (in press). This double-header of a Po’D also includes this paper: B. L. Sturm, C. Kereliuk, and A. Pikrakis, “A Closer Look at Deep Learning Neural Networks with Low-level Spectral Periodicity Features”, Proc. 4th International Workshop on Cognitive Information Processing, June 2014.
The one-line precis of these papers is:
For some use cases, it is important to ensure Music Information Retrieval (MIR) systems are reproducing “ground truth” for the right reasons: Here’s how.
The seed that sprouted my “horse” article was an experiment in my Three Experiments paper. One of the reviewers of that paper suggested I carve out that piece, develop it, and submit it as a journal article. So, it now appears after two years of thought, and my whole-hearted adoption of Clever Hans as an allegory of the current state of evaluation in much of MIR (not to mention many other research domains applying machine learning).
A “horse” is a system that obtains a good performance in some evaluation by relying on “tricks”, such as by paying attention to irrelevant criteria confounded with the “correct” answers. Clever Hans was such a “horse” — as well as a real horse — because he was “solving” mathematical problems by closely watching and responding to the unintentional cues of a questioner, and not by using abstract thinking. In the public demonstrations of Clever Hans, he arrived at the “right” answers for the “wrong” reasons. Employing Hans for computing actuarial tables will not pay off, even though his performance in some evaluation is stellar. Similarly, a “horse” of an MIR system has learned and uses criteria that are irrelevant to address the problem for which it is designed. Employing such an MIR system for describing music is not going to pay off, even though its performance on GTZAN beats the competition. In this article, I propose and demonstrate an approach to address a question that is rarely explicitly asked in MIR evaluation: Why does system X appear to perform as it does? I seek to explain the results of an evaluation with regards to the problem a system supposedly addresses. More specifically, I test whether the system is a “horse”.
The idea, motivated by the experiments of Pfungst on Clever Hans, is very simple. If a system is working by using characteristics relevant to solving its intended problem, then “irrelevant transformations” of “questions” posed to it will have no effect on its behavior. In the case of Clever Hans, such irrelevant transformations of mathematical questions include having the questioner stand out of the field of view of Hans, or having Hans answer a mathematical question to which the questioner does not know the correct answer. These changes to the conditions of the question should not affect any system that can actually solve mathematical questions; but for Hans, they had a massive negative impact. This eventually led to the discovery that Hans was only responding to visual but unintentional cues, and the coup de grâce of eliciting any response from Hans to any question by intentionally controlling those cues.
In my “horse” article, I consider two music genre recognition (MGR) systems, and a music emotion recognition (MER) system. Whatever MGR and MER mean for these systems, I don’t know; but according to the standard evaluation approach using three datasets (two of which are benchmark datasets in MGR research), these three systems obtain excellent (competitive with state of the art) figures of merit (FoM). One MGR system scores an accuracy of over 0.77 in GTZAN, which is better than about half of the best accuracies reported in GTZAN over the past ten years. Another system has an accuracy of nearly 0.84 in the validation set of the ISMIR2004 genre dataset. When it comes to such results (artificial, well-defined systems appearing capable of addressing ill-defined problems entirely dependent upon human culture and all its peculiarities), the default position is that “something fishy is going on here”, or “They are ‘horses’.” Their results must be explained. The hypothesis I thus test is that these systems achieve such good FoM not by listening to the music, but by relying on confounds as Clever Hans did.
To do this, I propose and employ the “method of irrelevant transformations.” This method makes subtle changes to audio signals such that the music is not affected in a way that is relevant to the problem. In my “horse” article, I attempt to make subtle adjustments to the spectral characteristics of the recordings of music in the testing dataset such that the FoM of a system either inflates to perfect, or deflates to random. If this is possible, then the system is not paying attention to the music, just as Clever Hans is not paying attention to the question. For each of the three systems above, I am able to elicit significant changes in their FoM using the same music. The FoM of one MGR system goes from an accuracy of 0.77, up to above 0.96, and down to below 0.1. It is similar for the other systems. Clearly, the systems are not paying attention to music, and are responding correctly due to spectral characteristics confounded with the ground truth in these datasets. I then perform a second experiment showing that each system can be made to classify the same music in different ways by these irrelevant changes. These examples can be heard here: http://imi.aau.dk/~bst/research/TM_expt2/.
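The logic of inflating and deflating a FoM can be sketched on toy data. This is only an illustration under stated assumptions, not the procedure or the systems from the article: each "recording" is a hypothetical pair (relevant feature, confounded feature), the transformations change only the confounded feature, and the names `horse`, `solution`, `shift_confound` and `inflate_deflate` are made up for this sketch. For each item it asks whether some irrelevant transformation yields the right label (inflation) and whether every one does (what is left after deflation).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy "recording": (relevant_feature, confounded_feature).
Item = Tuple[float, float]
Classifier = Callable[[Item], int]
Transform = Callable[[Item], Item]


def shift_confound(amount: float) -> Transform:
    """Build a transform that alters only the confounded feature."""
    return lambda item: (item[0], item[1] + amount)


def inflate_deflate(
    classify: Classifier,
    items: Sequence[Item],
    labels: Sequence[int],
    transforms: Sequence[Transform],
) -> Tuple[float, float]:
    """Return (inflated accuracy, deflated accuracy) over the transforms.

    Inflated: fraction of items some transform gets classified correctly.
    Deflated: fraction of items no transform can push to a wrong label.
    """
    inflated = 0
    deflated = 0
    for item, label in zip(items, labels):
        predictions = [classify(t(item)) for t in transforms]
        inflated += any(p == label for p in predictions)
        deflated += all(p == label for p in predictions)
    return inflated / len(items), deflated / len(items)


def horse(item: Item) -> int:
    """Looks only at the confounded feature."""
    return int(item[1] > 0)


def solution(item: Item) -> int:
    """Looks only at the relevant feature."""
    return int(item[0] > 0)


# Made-up data: the label follows the relevant feature, and the
# confounded feature agrees with it on four of the five items.
ITEMS = [(1.0, 0.5), (0.8, 0.7), (-1.0, -0.6), (-0.9, -0.4), (0.7, -0.2)]
LABELS = [1, 1, 0, 0, 1]
IDENTITY = [shift_confound(0.0)]
TRANSFORMS = [shift_confound(0.0), shift_confound(2.0), shift_confound(-2.0)]

if __name__ == "__main__":
    print("horse, untransformed:", inflate_deflate(horse, ITEMS, LABELS, IDENTITY))
    print("horse, transformed:  ", inflate_deflate(horse, ITEMS, LABELS, TRANSFORMS))
    print("solution, transformed:", inflate_deflate(solution, ITEMS, LABELS, TRANSFORMS))
```

In this toy setting the "horse" looks respectable on the untransformed items, yet its accuracy can be pushed to either extreme, while the classifier that uses the relevant feature does not move at all.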
In our deep learning paper, we apply the method of irrelevant transformations again to explain the FoM of the system that “won” the Audio Latin Music Genre Classification contest in MIREX last year (ALGC). Since the dataset used in that evaluation is not public, we instead used the BALLROOM dataset.
We built an MGR system using that dataset and the same algorithms and settings used in the “winning” ALGC system. Again, we find exceptional FoM, an accuracy of nearly 0.89.
Using irrelevant transformations by subtle time-stretching (increasing/decreasing the length of a 30s excerpt by no more than 1.8s), we can deflate this FoM to 0.11, or inflate to 1.0. We are also able to make the system choose many classes for the same music. These examples can be heard here: http://imi.aau.dk/~bst/research/DeSPerFtable/exp.html.
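To get a feel for how small this time-stretch is, here is a back-of-the-envelope sketch. It assumes a uniform stretch, so that tempo scales by the ratio of original to new duration; the function name `stretched_tempo` and the 120 BPM example are hypothetical and not taken from the paper.

```python
def stretched_tempo(tempo_bpm: float, duration_s: float, change_s: float) -> float:
    """Tempo after uniformly changing an excerpt's duration by change_s seconds.

    Assumes a uniform time-stretch: the same beats fill the new duration,
    so tempo scales by duration_s / (duration_s + change_s).
    """
    new_duration = duration_s + change_s
    if new_duration <= 0:
        raise ValueError("the stretched excerpt must have positive duration")
    return tempo_bpm * duration_s / new_duration


if __name__ == "__main__":
    # Hypothetical 120 BPM excerpt of 30 s, changed by at most 1.8 s.
    slower = stretched_tempo(120.0, 30.0, +1.8)   # excerpt lengthened
    faster = stretched_tempo(120.0, 30.0, -1.8)   # excerpt shortened
    print(f"lengthened: {slower:.1f} BPM, shortened: {faster:.1f} BPM")
```

A change of 1.8 s in 30 s is 6% of the duration, which under this assumption moves a 120 BPM excerpt to somewhere between roughly 113 and 128 BPM.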
(A reproducible research package for this paper is here.)
That such subtle changes to the input radically affect the FoM of this system shows its reliance on some temporal characteristic confounded with the “ground truth” of BALLROOM. To prove such a characteristic exists, the figure below shows for BALLROOM how its genre labels are confounded with tempo.
Indeed, one might argue that an MGR system considering tempo is not really a “horse” because tempo might be relevant to discriminating between or identifying at least some genres. If we employ a 3-nearest neighbor classifier using only the tempo of the BALLROOM excerpts, its accuracy is over 0.78 in reproducing the “ground truth”. Such a system clearly has no musicological knowledge such as instrumentation, meter, rhythm, structure, etc., necessary for discriminating between and recognizing the “genres” in BALLROOM. Yet, the system still obtains a high FoM in BALLROOM (which would fall apart quickly with the same irrelevant transformation of time-stretching) simply because tempo is so strongly correlated with the “ground truth” of BALLROOM (probably because many of its excerpts come from pedagogical materials) — a correlation that likely doesn’t exist in the real world. Were we to redefine the labels of BALLROOM to be instead “tempo”, then the system would not be a “horse” because the problem is no longer MGR, but tempo recognition. The point is that the appearance of good performance in BALLROOM comes mainly from the existence of a confound, and a system learning to exploit it.
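A tempo-only nearest-neighbor classifier of this kind can be sketched in a few lines. The tempi and the class names "A", "B" and "C" below are made up for illustration and are not BALLROOM data; the helpers `knn_predict` and `accuracy` are hypothetical names, and the 6% tempo change corresponds to shortening a 30 s excerpt by about 1.8 s under a uniform stretch.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[float, str]  # (tempo in BPM, class label)


def knn_predict(train: Sequence[Example], tempo: float, k: int = 3) -> str:
    """Label a tempo by majority vote among its k nearest training tempi."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - tempo))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train: Sequence[Example], test: Sequence[Example], k: int = 3) -> float:
    """Fraction of test examples whose predicted label matches their label."""
    hits = sum(knn_predict(train, tempo, k) == label for tempo, label in test)
    return hits / len(test)


# Made-up training tempi: each class occupies its own narrow tempo band,
# mimicking a dataset where labels are confounded with tempo.
TRAIN: List[Example] = [
    (100.0, "A"), (102.0, "A"), (104.0, "A"),
    (108.0, "B"), (110.0, "B"), (112.0, "B"),
    (116.0, "C"), (118.0, "C"), (120.0, "C"),
]
TEST: List[Example] = [(101.0, "A"), (109.0, "B"), (117.0, "C")]

# The same "music" played 6% faster; the labels are unchanged.
STRETCHED: List[Example] = [(tempo * 1.06, label) for tempo, label in TEST]

if __name__ == "__main__":
    print("original tempi: ", accuracy(TRAIN, TEST))
    print("6% faster tempi:", accuracy(TRAIN, STRETCHED))
```

On this toy data the classifier is perfect on the original tempi and gets only one of the three stretched excerpts right, which is the kind of collapse one would expect from a system that has learned the confound and nothing else.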
Of course, work must be done surrounding the definition of what makes a transformation “irrelevant” or not, which depends entirely on the scientific question to be answered. Spectral equalization, and time-stretching, are not irrelevant transformations for all possible questions of an MIR system. Also to be shown is whether the dataset used in ALGC (LMD) has this confound of tempo with genre labels — as of now that is our prediction from our results.
Clearly, all these MIR systems appear able to perform remarkable human feats, but only because the standard evaluation methods used in MIR do not discriminate between “horses” and solutions. Their excellent FoM come not from their capacity to address the problem of MGR or MER (whatever those are), but from a lack of control over independent variables in the evaluation (giving rise to confounds). “Horses” aren’t always bad; but when a “horse” is insufficient to achieve the success criteria of a use case, it is imperative to design solutions, and employ evaluation methods sensitive to detecting a “horse”.