Hello, and welcome to the Paper of the Day (Po’D): Cross-collection evaluation for music classification tasks Edition. Today’s paper is D. Bogdanov, A. Porter, P. Herrera, and X. Serra, “Cross-collection evaluation for music classification tasks,” in Proc. ISMIR, 2016. This paper is another from this year devoted to evaluation, and to questioning the resulting numbers and the importance given to them. I hope that there could be a special session at ISMIR 2017 devoted to evaluation!
My one-line précis of this paper is: evaluate music classification systems across collections independent of their training data.
Bogdanov et al. presents a compelling use of the continuously growing AcousticBrainz repository — a crowd-sourced collection of features extracted from millions of audio recordings. From such a large collection, a user can produce “validation sets”, which can then be used to test any music X-recognition system. Many different sets have so far been created through the “Dataset creation challenge”.
Bogdanov et al. identifies a variety of challenges. One is reconciling a possible difference in semantic spaces. A system might map to one semantic space that only partially overlaps that of a dataset derived from AcousticBrainz. For instance, in music genre recognition, a system trained using GTZAN maps audio features to one of 10 labels, including “Blues” and “Disco”. A validation dataset might include observations labeled “Blues”, “Folk”, “Electronic” but not “Disco”. Hence, in testing the system with this validation dataset one could ignore all observations not labeled “Blues”, or consider “Folk” as “Blues”, and so on. The researcher must take care in deciding how the semantic space of the system relates to the validation dataset. Bogdanov et al. proposes and studies different strategies for this challenge.
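As a concrete illustration of these reconciliation strategies, here is a minimal sketch (my own, not code from the paper) of the two options mentioned above: ignoring validation observations whose labels have no counterpart in the system’s semantic space, and merging a validation label (e.g., “Folk”) into a related system label (e.g., “Blues”). The label sets and the `reconcile` helper are hypothetical.

```python
# Hypothetical sketch of reconciling two semantic spaces.
# SYSTEM_LABELS stands in for a GTZAN-style label space; MERGE encodes an
# optional decision such as "consider 'Folk' as 'Blues'".

SYSTEM_LABELS = {"Blues", "Disco", "Jazz"}
MERGE = {"Folk": "Blues"}

def reconcile(validation_items, merge=None):
    """Map (observation, label) pairs into the system's label space.

    Observations whose label has no counterpart in SYSTEM_LABELS are
    ignored; labels listed in `merge` are first renamed.
    """
    merge = merge or {}
    kept = []
    for obs, label in validation_items:
        label = merge.get(label, label)  # e.g. treat "Folk" as "Blues"
        if label in SYSTEM_LABELS:       # drop non-overlapping labels
            kept.append((obs, label))
    return kept

items = [("a", "Blues"), ("b", "Folk"), ("c", "Electronic")]
print(reconcile(items))              # ignore "Folk" and "Electronic"
print(reconcile(items, merge=MERGE)) # additionally count "Folk" as "Blues"
```

Each choice of mapping produces a different validation dataset, and hence a different measurement, which is why the researcher must be explicit about the strategy used.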
Another challenge is producing the ground truth for a dataset derived from AcousticBrainz. One major benefit of AcousticBrainz is that its observations are connected to unique MusicBrainz IDs. These can be used to collect all kinds of information from crowd-sourced collections, such as Wikipedia and last.fm. Bogdanov et al. reports experiments using ground truth collected from last.fm.
Bogdanov et al. demonstrates the proposed methodology in the case of music genre recognition. Systems are trained on five different datasets, some of which are often used in MIR. Each of these systems maps to a different semantic space. Cross-validation in the original datasets shows very good accuracy, e.g., 75% in GTZAN. Several validation datasets are constructed with a variety of mappings to the semantic spaces of the systems, with numbers of observations ranging from 240,000 to over 1,000,000. The results show, for instance, that the system trained on all of GTZAN — the learning methods of which show a cross-validation accuracy in GTZAN of 75% — has an accuracy as low as 6.93% in a validation dataset of over 1,200,000 observations, and only as high as 13.7% in a validation dataset of over 270,000 observations. These results clearly show some of the difficulties in interpreting the 75% accuracy originally observed. Bogdanov et al. observes similar results for other trained systems.
I appreciate that this paper begins by echoing the observation that “Many studies in music classification are concerned with obtaining the highest possible cross-validation result.” I can’t agree more when the paper adds that this concern has limited ability to reliably assess the “practical values of these classifier models for annotation” and “the real capacity of a system to recognize music categories”. These things need to be said more often! And I like the fact that the paper is pushing the idea that the world is far bigger and more varied than what your training data might lead your machine to believe.
What I am wondering about is this: in what ways does the approach proposed in Bogdanov et al. provide a relevant means to measure or infer the “practical value” of a music classification system, or its “real capacity” to recognize music categories? Won’t this approach just lead to music classification studies that are concerned with obtaining the highest possible cross-collection-validation result? In that case, can we say how that relates to the aims of MIR research, i.e., connecting users with music and information about music?
As I argue in several papers and in my recent Po’D, a hard part of evaluating systems, but a source of some of the most useful information, comes from explaining their behaviours, that is, explaining what they have actually learned to do. By explaining I mean using experiments to demonstrate clear connections between system input and output, and not anthropomorphising the algorithm, e.g., by claiming its confusions “make sense.” I know explaining system behaviour is not the point of the cross-collection approach proposed in Bogdanov et al., but the paper’s motivation does lie in trying to measure the “practical value” of music classification systems, and their “real capacity” to recognize music categories. To answer these questions, I think explaining system behaviour is essential.
Certainly, size is a benefit to the approach in Bogdanov et al. Having orders of magnitude more observations presents a very persuasive argument that the resulting measurements will be more reliable, or consistent, statistically speaking. However, reliability or consistency does not imply relevance. Putting size aside, the heart of the proposed approach entails having a system label each input and then comparing those answers to a “ground truth” — which is classification. Though classification experiments are widely established, easy to do, and easy to compare, they are extremely easy to misinterpret and sometimes irrelevant for what is desired. (Asking a horse more arithmetic questions of the same kind does not increase the test’s relevance to measure its capacity to do arithmetic. We have to control factors in smart ways.)
It is really interesting in Bogdanov et al. to see these optimistic figures of merit absolutely demolished in larger scale classification experiments, but I am not sure what is happening. Let’s look at two reported cases specifically. With strategy S2-ONLY-D1 (which Bogdanov et al. says “reflects a real world evaluation on a variety of genres, while being relatively conservative in what it accepts as a correct result”), we see the GTZAN-trained system has a normalised accuracy of only about 6% in 292,840 observations. This is in spite of the same learning methods showing a cross-validation accuracy in GTZAN of 75%. For another dataset (ROS), the resulting system shows a 23% normalised accuracy in 296,112 observations. This is in spite of the same method showing a cross-validation accuracy of over 85% in ROS. Without a doubt, whatever the GTZAN-trained system has learned to do in GTZAN, it is not useful for reproducing the ground truth of the S2-ONLY-D1 validation dataset. And whatever the ROS-trained system has learned to do in ROS, it is also not useful for reproducing the ground truth of the S2-ONLY-D1 validation dataset. Now, why is this happening?
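To make concrete why a “normalised” accuracy can sit so far below a raw one, here is a small sketch (my own, under the common assumption that normalised accuracy means the mean of per-class recalls) showing how a system that dumps most inputs into one large class can score well on raw accuracy yet poorly once classes are weighted equally. The function name and the toy data are illustrative, not from the paper.

```python
# Assumed definition: normalised accuracy = mean of per-class recalls,
# so that a huge class cannot dominate the score.
from collections import defaultdict

def normalised_accuracy(truths, preds):
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(truths, preds):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example: a system that answers "Jazz" for everything.
truths = ["Jazz"] * 8 + ["Blues"] * 2
preds = ["Jazz"] * 10
# Raw accuracy is 8/10 = 80%, but per-class recalls are 1.0 and 0.0,
# so the normalised accuracy is only 0.5.
print(normalised_accuracy(truths, preds))  # 0.5
```

On a validation set with hundreds of thousands of observations and a skewed label distribution, the gap between the two figures can be enormous, which is part of what makes the reported drops so striking.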
Evaluating across collections brings the spectre of “concept drift” (especially for datasets as old as GTZAN). So one might argue that it is not fair to test a system trained on music data from 2002 and before on music data coming after. “Pop music” has certainly changed since then. Then again, one might argue it is fair because, generalisation. There is also the problem of the ground truth coming from poorly defined terms wielded by Internet-connected people describing music for reasons that are not entirely clear or controlled — which also brings to mind “covariate shift.” So one might argue that it is not fair to test a system trained with expert-annotated observations on a dataset that is not so annotated. Then there is the problem that these systems use track-level feature statistics, which have very unclear relationships to the music in an audio signal. So, how do these classification results relate to music category recognition?
All of this is to say Bogdanov et al. raises many interesting questions! What exactly have these systems learned to do? How do those things relate to the music that they appear to be classifying? Since the Essentia features might include statistics of spectral magnitudes below 20 Hz, has the GTZAN-trained system learned to look for such information to recognise music categories? Is that why it performs so poorly in the larger dataset? Another interesting finding in Bogdanov et al. is that the GTZAN-trained system labels over 70% of the 292,840 observations as “Jazz.” Why is it so biased toward that category? We can get 80% recall in GTZAN “Jazz” just by using sub-20 Hz information. Is that contributing to this bias? Or is it just that “Jazz” has “infected” a majority of music?
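One cheap diagnostic along these lines, independent of any ground truth, is simply to look at the distribution of labels a system assigns over a large collection. This sketch (mine, with toy data) shows the kind of tally that would reveal a bias like the reported 70% “Jazz” rate.

```python
# Hypothetical bias check: tally a system's predicted labels over a
# validation set. A system assigning one label to ~70% of inputs is
# suspect regardless of its cross-validation accuracy.
from collections import Counter

def prediction_distribution(preds):
    """Return the fraction of predictions given to each label."""
    counts = Counter(preds)
    n = len(preds)
    return {label: count / n for label, count in counts.items()}

# Toy predictions standing in for a system's output on a large set.
preds = ["Jazz"] * 7 + ["Blues"] * 2 + ["Disco"]
print(prediction_distribution(preds))  # {'Jazz': 0.7, 'Blues': 0.2, 'Disco': 0.1}
```

Such a tally does not explain *why* the bias exists — for that one needs controlled experiments connecting input to output — but it is a first step past a single accuracy number.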
Machines learn the darndest things!