Happy last day of the Year of the Horse!

Tomorrow marks Chinese New Year for 2015, which means we move to the Year of the Goat from the Year of the Horse. It is thus fitting to briefly review the discussion of “horses” I have brought to music information retrieval research.

Just before my unconference session at ISMIR 2014 someone asked, “What is a horse?” That is a great place to start.

What is a “horse”?

I find all of the following descriptions in my papers and blog posts:

  • a system that appears to be capable of a remarkable human feat, e.g., music genre recognition, but actually works by using irrelevant characteristics (confounds)
  • a system that obtains a good performance in some evaluation by relying on “tricks”, such as by paying attention to irrelevant criteria confounded with the “correct” answers
  • a system that gives the “right” answer for the “wrong” reason
  • a system that has learned and uses criteria that are irrelevant to address the problem for which it is designed
  • a system that is making decisions based on irrelevant but confounded factors
  • a system that is not actually addressing the problem it appears to be [or is claimed to be] solving
  • a system from which you can elicit any response by changing irrelevant factors.

Though I have not been precise, I hope one thing is clear: a “horse” is a system that fools people into believing it is solving the problem it appears to be solving when it actually is not. A concrete example in MIR is a music genre recognition system. Because the system reproduces a large amount of the ground truth in a commonly used dataset, one is persuaded to believe it is recognising music genre. In reality, such an experiment has no validity for supporting that conclusion. With further experimentation, the system can be revealed to be “solving” the problem of music genre recognition not by learning to consider general musical characteristics but, much more simply, by exploiting characteristic quirks shared between the train and test datasets, quirks that result from a poor sampling of the universe of music (confounding).
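To make the idea concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions, not a model of any real MIR system or dataset: the data are synthetic, the two features (a weak but genuine cue and an irrelevant one), the noise levels, and every function name are hypothetical, and the "system" is a single-threshold classifier I chose for brevity. The irrelevant feature tracks the label only because of how the dataset was sampled. In the controlled test set that link is broken, in the spirit of Pfungst's experiments.

```python
import random

def make_data(n, confounded, rng):
    """Synthetic items with two features: (relevant, irrelevant)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Weak but genuine cue: related to the label, very noisy.
        relevant = label + rng.gauss(0.0, 2.0)
        # Irrelevant factor: follows the label only when sampling is confounded.
        source = label if confounded else rng.randint(0, 1)
        irrelevant = source + rng.gauss(0.0, 0.2)
        X.append((relevant, irrelevant))
        y.append(label)
    return X, y

def predict(stump, x):
    j, threshold, sign = stump
    return 1 if sign * (x[j] - threshold) > 0 else 0

def accuracy(stump, X, y):
    hits = sum(predict(stump, x) == t for x, t in zip(X, y))
    return hits / len(y)

def fit_stump(X, y):
    """Pick the single feature and threshold with the best training accuracy."""
    best = None
    for j in range(len(X[0])):
        ones = [x[j] for x, t in zip(X, y) if t == 1]
        zeros = [x[j] for x, t in zip(X, y) if t == 0]
        m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        stump = (j, (m0 + m1) / 2.0, 1.0 if m1 >= m0 else -1.0)
        score = accuracy(stump, X, y)
        if best is None or score > best[0]:
            best = (score, stump)
    return best[1]

rng = random.Random(0)
X_train, y_train = make_data(500, True, rng)   # confounded sampling
X_same, y_same = make_data(500, True, rng)     # test set with the same confound
X_ctrl, y_ctrl = make_data(500, False, rng)    # controlled test: confound broken

stump = fit_stump(X_train, y_train)
feature = stump[0]                             # 0 = relevant, 1 = irrelevant
acc_same = accuracy(stump, X_same, y_same)
acc_controlled = accuracy(stump, X_ctrl, y_ctrl)
print("feature used:", feature)
print("accuracy, confounded test set:", acc_same)
print("accuracy, controlled test set:", acc_controlled)
```

In this setup the classifier latches onto the irrelevant feature, scores very well on the test set that shares the confound, and falls to roughly chance once the confound is removed. The first number alone would suggest the problem is solved; only the controlled test reveals the "horse".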

But, why “horse”?

It was serendipitous that my article was published in the Year of the Horse, but the motivation for my use of “horse” is Clever Hans, the real horse that appeared capable of abstract thinking. Until properly controlled experiments were conducted, many believed from his handler’s demonstrations that Hans was actually solving arithmetical problems, that he could read, and that he understood the Gregorian calendar and music theory. The controlled experiments of Pfungst showed Hans was quite incapable of arithmetic, and identified the cues he was exploiting that made him appear capable of these feats when he actually was not.

Who cares?

Just as Pfungst did for Hans, the designation of a system as a “horse” comes from showing that an experimental outcome is explained by the system’s exploitation of factors inadvertently introduced into the experiment. Finding a “horse” is not a horrible outcome! In fact, it should be celebrated, because it gives us two very positive things. First, we learn something about how the system is working, and about some of the conditions in which it will fail or succeed in the real world. This provides a way to improve the system, or to adapt it to particular use cases. Second, we have identified flaws in the experimental design, which we can then fix. In summary, I find Pfungst’s work with Hans an excellent model for thinking clearly about experiments with MIR systems.

What’s next?

Obviously, goats are now on my list. Funny enough, an author of that work is at QMUL! It’s a sign.


