I am known as “that guy”, but really I am “that other guy”

My colleague Bryan Pardo is not the only one who has told me that I am known as “that guy”. You know, the one who says, “Do not use GTZAN.” I can understand why many believe that, but I have never said to not use GTZAN. I addressed this misconception about a year ago, but I should do it yearly because I am receiving similar comments from anonymous reviewers of my work, e.g., “Why use GTZAN, given that you were the person who pointed out all the flaws in that dataset?”

First, there was this paper: “An analysis of the GTZAN music genre dataset,” in Proc. ACM MIRUM Workshop (Nara, Japan), pp. 7–12, Nov. 2012, along with the accompanying presentation. In the discussion of that paper, I write:

That so much work in the past decade has used GTZAN to train and test music genre recognition systems raises the question of the extent to which we should believe any conclusions drawn from the results. Of course, since all this work has had to face the same problems in GTZAN, it can be argued their results are still comparable. This, however, makes the false assumption that all machine learning approaches so far used are affected in the same ways by these problems.

Growing significantly from that work was this paper: “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use,” arXiv 2013, the content of which ended up published in my article, “The state of the art ten years after a state of the art: Future research in music information retrieval,” J. New Music Research, vol. 43, no. 2, pp. 147–172, 2014. In this “State of the Art”, “GTZAN” appears 220 times, which presents many opportunities to say something like, “Do not use GTZAN.” In the discussion of that article, however, I write the contrary:

One might hold little hope that GTZAN could ever be useful for tasks such as evaluating systems for MGR, audio similarity, autotagging, and the like. Some might call for GTZAN to be ‘banished’. There are, however, many ways to evaluate an MGR system using GTZAN. Indeed, its faults are representative of data in the real world, and they themselves can be used in the service of evaluation (Sturm 2013a, 2013g). For instance, by using the GTZAN dataset, we (Sturm 2012c, 2013a, 2013g) perform several different experiments to illuminate the (in)sanity of a system’s internal model of music genre. In one experiment (Sturm 2012c, 2013a), we look at the kinds of pathological errors of an MGR system rather than what it labels ‘correctly’. We design a reduced Turing test to measure how well the ‘wrong’ labels it selects imitate human choices. In another experiment (Sturm 2012c, 2013g), we attempt to fool a system into selecting any genre label for the same piece of music by changing factors that are irrelevant to genre, e.g. by subtle time-invariant filtering. In another experiment (Sturm 2012c), we have an MGR system compose music excerpts it hears as highly representative of the ‘genres of GTZAN’, and then we perform a formal listening test to determine if those genres are recognizable. The lesson is not to banish GTZAN, but to use it with full consideration of its musical content. It is currently not clear what makes a dataset ‘better’ than another for MGR, and whether any are free of the kinds of faults in GTZAN; but at least now with GTZAN, one has a manageable, public, and finally well-studied dataset.
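The filtering experiment mentioned above is easy to picture in code. Below is a minimal sketch, not the code from those papers: it applies one fixed, gentle low-pass filter to an excerpt and checks whether a classifier’s label changes, whereas the actual experiments deliberately search for subtle time-invariant filters that flip labels. The `predict_genre` function is a hypothetical stand-in (here a toy rule on high-frequency energy, only so the sketch runs end to end), and the audio is synthetic.

```python
# Toy sketch (assumptions noted above): does a subtle, genre-irrelevant
# filtering change the label a system assigns to the same excerpt?
import numpy as np
from scipy.signal import butter, lfilter

def subtle_filter(x, sr, cutoff_hz=8000.0):
    """A gentle second-order low-pass: the music barely changes audibly."""
    b, a = butter(2, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, x)

def predict_genre(x, sr):
    """Hypothetical stand-in for the system under test: a crude rule on
    high-frequency energy, used here only so the sketch runs."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    hf_ratio = spectrum[freqs > 4000.0].sum() / spectrum.sum()
    return "metal" if hf_ratio > 0.5 else "classical"

sr = 22050
x = np.random.default_rng(0).normal(size=30 * sr)  # placeholder 30 s "excerpt"

label_before = predict_genre(x, sr)
label_after = predict_genre(subtle_filter(x, sr), sr)
print("label before:", label_before, "| after filtering:", label_after)
```

If a transformation that is irrelevant to genre can change (or be made to change) the label, the system is reproducing ground truth by means other than the ones we care about.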

That passage references my use of GTZAN in several papers: “Two systems for automatic music genre recognition: What are they really recognizing?,” in Proc. ACM MIRUM Workshop, pp. 69–74, Nov. 2012. Here is the accompanying presentation. I also use GTZAN in these articles: “Classification accuracy is not enough: On the evaluation of music genre recognition systems,” J. Intell. Info. Systems, vol. 41, no. 3, pp. 371–406, 2013; “A simple method to determine if a music information retrieval system is a “horse”,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1636–1644, 2014.

We even use GTZAN in this paper: B. L. Sturm and F. Gouyon, “Revisiting inter-genre similarity,” IEEE Sig. Process. Letts., vol. 20, pp. 1050–1053, Nov. 2013.

Stay tuned for more forthcoming papers in which we use GTZAN!

So, if I am not “that guy”, then what guy am I? Some have said I am “that guy who misunderstands the fundamentals of machine learning”. Fair enough, but that is another post. :)

In my State of the Art article, I address the assumption that “all machine learning approaches are similarly affected by the faults in GTZAN”, and show that it is false. Here is the relevant figure from that article.

[Figure: ground truth reproduced by five systems trained and tested with four different partitionings of GTZAN]

Five different systems (x-axis), trained and tested the same way with each of four different partitioning strategies (legend) of GTZAN; left and right markers are results on different folds. “ST” is “business as usual”: 10 repetitions of stratified 2-fold cross-validation using random splits. “ST'” is 10 repetitions of a nearly stratified 2-fold cross-validation using random splits, but after removing from GTZAN 67 excerpts that are exact replicas, recording repetitions, or highly distorted recordings. The simple nearest neighbor (NN) classifier shows a significant decrease in the amount of ground truth it reproduces when we remove these exact and recording replicas. The minimum (Mahalanobis) distance classifiers (MD, MMD) are not so affected, because their Gaussian models benefit less from exact replicas than NN does. “AF” is artist filtering, where we split the dataset into approximately two halves such that no “artist” appears in both; “AF'” is artist filtering with the exact and recording replicas removed.

Now we see that the five machine learning methods are affected to different degrees when we account for the replication of artists, excerpts, and recordings. In other words, whereas in ST one would say MAPsCAT is the “second best”, in AF it is not significantly different from simple NN and minimum distance. Whereas NN is in the middle in ST, it is the “worst” in AF. SRCAM remains the “best”, but why? How does that relate to system quality? How does it relate to music genre, whatever that is?
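To make the partitioning strategies concrete, here is a minimal sketch, not the article’s code, of “ST”-style random stratified 2-fold cross-validation versus an “AF”-style artist-filtered split, using scikit-learn. The features, genre labels, and artist identifiers below are random placeholders; with real GTZAN features, the gap between the two numbers is the kind of effect the figure shows.

```python
# Minimal sketch: random stratified 2-fold CV ("ST") vs. an artist-filtered
# 2-fold split ("AF") in which no artist appears in both halves.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 20))           # placeholder features (one row per excerpt)
y = rng.integers(0, 10, size=n)        # placeholder genre labels (10 "genres")
artists = rng.integers(0, 40, size=n)  # placeholder artist id for each excerpt

def mean_accuracy(splitter, groups=None):
    """Train and test a simple 1-NN classifier over the splits of `splitter`."""
    accs = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))

acc_st = mean_accuracy(StratifiedKFold(n_splits=2, shuffle=True, random_state=0))
acc_af = mean_accuracy(GroupKFold(n_splits=2), groups=artists)
print(f"ST (random stratified): {acc_st:.3f} | AF (artist filtered): {acc_af:.3f}")
```

Removing the exact and recording replicas (“ST'”, “AF'”) amounts to simply dropping those 67 excerpts before splitting.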

This should not be controversial, and only reiterates similar effects noted before for other systems and other datasets:

  • E. Pampalk, A. Flexer, and G. Widmer, “Improvements of audio-based music similarity and genre classification,” in Proc. ISMIR, pp. 628–633, Sep. 2005.
  • Y. E. Kim, D. S. Williamson, and S. Pilli, “Towards quantifying the “album effect” in artist identification,” in Proc. ISMIR, pp. 393–394, 2006.
  • A. Flexer, “A closer look on artist filters for musical genre classification,” in Proc. ISMIR, pp. 341–344, Sep. 2007.
  • A. Flexer and D. Schnitzer, “Effects of album and artist filters in audio similarity computed for very large music databases,” Computer Music J., vol. 34, no. 3, pp. 20–28, 2010.

However, the most unfortunate conclusion from all of this is that one has very shaky ground upon which to conclude anything from the 100+ published results derived from using GTZAN in a specific way (Fig. 4 in my article “The GTZAN dataset”, reproduced below):

[Figure: classification accuracies reported in the 100+ published works using GTZAN (Fig. 4 of “The GTZAN dataset”)]

One might claim, “I basically believe that results on GTZAN are at least broadly correlated with relevance and quality of systems, at least for the majority of papers that have used it.” But belief guarantees neither correlation nor relevance, neither experimental validity nor sound scientific conclusions. Even armed with that belief, one is still left to judge which majority of results in the figures above reflect “system quality”, not to mention even defining “system quality.” It is a tough position to be in, but I think we can do much better than proceeding with belief in business as usual.

So, I am the guy saying things like,

To conclude, one colleague commented at my recent presentation, ¿El Caballo Viejo?, “At some point you are going to have to start coming up with solutions.” It may appear that all I am doing is complaining and tearing up people’s work; but that is not the case. Here is a brief menu of the constructive directions I have tried to contribute in my work (much of the work is reproducible, with code here):

“Are we asking the right questions? Here are some new questions to ask music genre recognition systems to understand what they are doing.” (B. L. Sturm, “Two systems for automatic music genre recognition: What are they really recognizing?,” in Proc. ACM MIRUM Workshop, pp. 69–74, Nov. 2012.)

“What are the current and prevailing practices of evaluation in music genre recognition research?” (B. L. Sturm, “A survey of evaluation in music genre recognition,” in Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation (A. Nürnberger, S. Stober, B. Larsen, and M. Detyniecki, eds.), LNCS vol. 8382, pp. 29–66, Oct. 2014.)

“Here is how one might incorporate non-uniform risk in music genre classification.” (B. L. Sturm, “Music genre recognition with risk and rejection,” in Proc. ICME, pp. 1–6, 2013.) A toy sketch of this idea appears below, after this menu.

“Here are several other ways one can look into music content analysis systems than just comparing the amount of ground truth reproduced.” (B. L. Sturm, “Classification accuracy is not enough: On the evaluation of music genre recognition systems,” J. Intell. Info. Systems, vol. 41, no. 3, pp. 371–406, 2013.)

“Here is a machine learning system that comes from a nice and persuasive idea, but with a little analysis we see it turns out to be a bad idea.” (B. L. Sturm and F. Gouyon, “Revisiting inter-genre similarity,” IEEE Sig. Process. Letts., vol. 20, pp. 1050–1053, Nov. 2013.)

“Here are five maxims to guide future research in music information retrieval.” (B. L. Sturm, “The state of the art ten years after a state of the art: Future research in music information retrieval,” J. New Music Research, vol. 43, no. 2, pp. 147–172, 2014.)

“Here is a very simple test you can perform to determine whether your system is a ‘horse’.” (B. L. Sturm, “A simple method to determine if a music information retrieval system is a “horse”,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1636–1644, 2014.)

“Evaluation in MIR is essentially an experiment. Can we begin to fold into this pursuit the formalised design and analysis of experiments?” (B. L. Sturm, “Making explicit the formalism underlying evaluation in music information retrieval research: A look at the MIREX automatic mood classification task,” in Post-proc. Computer Music Modeling and Research, pp. 89–104, 2014.)

“Yes, we can; but we first have to figure out what problem we are trying to solve. Let’s try and formalise this pursuit of music information retrieval so that we can start to attack this very hard problem of evaluation.” (B. L. Sturm, R. Bardeli, T. Langlois, and V. Emiya, “Formalizing the problem of music description,” in Proc. ISMIR, pp. 89–94, 2014.)

“If music genre recognition is without the definition necessary to demonstrably address it with current approaches, can we design a similar problem that is within reach?” (B. L. Sturm and N. Collins, “The kiki-bouba challenge: Algorithmic composition for content-based MIR research & development,” in Proc. ISMIR, pp. 21–26, 2014.)
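As promised above, here is the toy sketch of classification with non-uniform risk and a reject option. It is a generic minimum-expected-risk rule, not necessarily the formulation in the ICME paper, and the loss matrix, rejection cost, and posteriors are made up for illustration.

```python
# Toy sketch: choose the label (or refuse to label) that minimises expected
# loss, given class posteriors and a non-uniform loss matrix. All numbers
# below are invented for illustration.
import numpy as np

genres = ["blues", "classical", "metal"]
# loss[i, j] = cost of predicting genre j when the true genre is i
loss = np.array([
    [0.0, 1.0, 2.0],   # calling blues "metal" is costlier than calling it "classical"
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])
reject_cost = 0.4      # flat cost of refusing to assign a label

def decide(posterior):
    """Return the genre (or 'reject') with the smallest expected loss."""
    expected = posterior @ loss   # expected loss of predicting each genre
    best = int(np.argmin(expected))
    return "reject" if expected[best] > reject_cost else genres[best]

print(decide(np.array([0.80, 0.15, 0.05])))  # low expected loss -> "blues"
print(decide(np.array([0.40, 0.35, 0.25])))  # every label is too risky -> "reject"
```

With a uniform loss matrix and no rejection, this reduces to picking the most probable class, i.e., business as usual.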

And now, to appear at the coming “Machine Learning for Music Discovery Workshop” at the International Conference on Machine Learning (ICML) 2015: B. L. Sturm, H. Maruri-Aguilar, B. Parker and H. Grossmann, “The Scientific Evaluation of Music Content Analysis Systems: Valid Empirical Foundations for Future Real-World Impact”.

That title really is too grand for a two-page abstract, but this is the direction we are heading. After four hours of meetings, we finally have an idea of what the experimental unit can be, which depends entirely on the question one is asking. Our forthcoming work will hopefully offer new and valid experimental paradigms from which we can all benefit.
