Hello, and welcome to the Paper of the Day (Po’D): The “Horse” Inside Edition. Today’s article is my own, accepted for a journal special issue in 2015 but yet to appear: B. L. Sturm, “The ‘Horse’ Inside: Seeking Causes Behind the Behaviours of Music Content Analysis Systems”, http://arxiv.org/abs/1606.03044.
The central questions of this work are these: what is the “winning” system of the 2013 MIREX challenge Audio Train/Test: Genre Classification (Latin) actually doing to reproduce dataset ground truth? How does it appear capable of identifying and discriminating between recorded music labeled by and large according to extrinsic factors, e.g., how people dance to it and geographic origin? These are very different from the questions most prevalent in this domain: What is the accuracy, precision, F-score, etc. of my system? Which are the most and least predictive of my features? What combination of features and machine learning methods gives me the highest accuracy?
My article encompasses and extends the past work of two specific papers. In B. L. Sturm, C. Kereliuk, and A. Pikrakis, “A closer look at deep learning neural networks with low-level spectral periodicity features,” in Proc. Int. Workshop on Cognitive Info. Process., pp. 1–6, 2014, we take a first look at the system in question (Po’D here). Through two intervention experiments, we find that the system (trained in BALLROOM) is highly sensitive to slight changes in tempo. We can make the system choose many different labels for the same music just by changing the tempo. Results can be auditioned here. We also re-discover the high correlation between music tempo and label in the dataset, which leads a 3-nearest neighbor classifier to achieve a very high accuracy.
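To make the tempo shortcut concrete, here is a minimal sketch of a 3-nearest neighbor classifier that sees nothing but one tempo value per recording. Everything in it is an assumption for illustration: the data are synthetic, the label names (`label_A`, `label_B`, `label_C`) and their tempo centres are invented rather than taken from BALLROOM, and the helper names are hypothetical. It is not the system studied in the papers, only a toy showing how a label set that correlates with tempo can be reproduced without any other musical knowledge.

```python
import random
from collections import Counter


def knn_predict(train, tempo, k=3):
    """Return the majority label among the k training items closest in tempo."""
    nearest = sorted(train, key=lambda item: abs(item[0] - tempo))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Classify each item using all the other items, and report the hit rate."""
    correct = 0
    for i, (tempo, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, tempo, k) == label:
            correct += 1
    return correct / len(data)


def make_toy_data(seed=0, per_label=20):
    """Synthetic (tempo in BPM, label) pairs; the centres are invented values."""
    rng = random.Random(seed)
    centres = {"label_A": 90.0, "label_B": 120.0, "label_C": 180.0}
    return [
        (rng.gauss(centre, 3.0), name)
        for name, centre in centres.items()
        for _ in range(per_label)
    ]


data = make_toy_data()
accuracy = leave_one_out_accuracy(data)
print(f"leave-one-out accuracy from tempo alone: {accuracy:.2f}")

# A crude stand-in for the tempo intervention: same "music", different tempo.
original_tempo = 121.0
for factor in (1.0, 0.75, 1.5):
    print(factor, knn_predict(data, original_tempo * factor))
```

In this toy setting, scaling the tempo of a single query is enough to move it from one label to another, which mirrors in spirit (not in detail) the intervention described above: if the label changes while the music is otherwise the same, tempo is a plausible cause of the system's behaviour.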
In B. L. Sturm, C. Kereliuk, and J. Larsen, “¿El Caballo Viejo? Latin genre recognition with deep learning and spectral periodicity,” in Proc. Int. Conf. on Mathematics and Computation in Music, pp. 335–346, 2015, we return to the system in question, but now trained and tested in the Latin Music Database (discussion and slides here). We perform the same tempo intervention experiment, and find the system to again be highly sensitive, but not to the extent of our previous experiments in BALLROOM. Results can be auditioned here.
In my new article, I bring this work together under the umbrella of meaningfully evaluating machine music listening systems (and how metacreation can provide useful tools). I envision this article (and my ISMIR 2016 paper, Rodríguez, Sturm and Maruri, “Analysing Scattering-based Music Content Analysis Systems: Where’s the Music?”) as a “how-to” for analysing a system and explaining the results of a classification test. This involves dissecting a system into its components and ascertaining the kinds of knowledge to which it is and is not sensitive. This involves looking closely at the materials used to teach it and test it, and how they comport with the intended task of the system. This involves performing different intervention experiments to tease out the causes behind the system’s behaviours. Last, but not least, this also involves hard work, creativity, and a willingness to accept that a high-performing system may actually be useless.
If you are interested in these issues, think about contributing to HORSE 2016!