# Five challenges for music information retrieval researchers

Tomorrow, at the ninth edition of the Digital Music Research Network at QMUL, I will be presenting work done with Nick Collins at Durham University: Five challenges for music information retrieval researchers.

We propose five challenges that turn on its head the problem of music description as it has been pursued in MIR research for some time, i.e., take a labeled dataset of recorded music, then combine features with machine learning and reproduce as many labels as possible by any means. We claim this “engineering approach” is deficient in many regards: formally and explicitly defining problems and use cases; identifying and testing underlying assumptions; and using evaluation with the validity to address relevant hypotheses. We thus propose some challenges to encourage new approaches that address real-world problems having to do with music content, however that is defined.

A brief aside:
Looking through the MIR literature, we find the term “music content” is often used, but very rarely defined. In one work from almost 20 years ago (E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based classification, search and retrieval of audio,” IEEE Multimedia, vol. 3, pp. 27-36, Fall 1996), we find a user-centered almost-definition: “… properties that users might wish to specify” for retrieving sound and music. This is echoed in a more recent work (M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, “Content-based music information retrieval: Current directions and future challenges,” Proc. IEEE, vol. 96, pp. 668-696, Apr. 2008):
“Content-based MIR is engaged in intelligent, automated processing of music. The goal is to make music, or information about music, easier to find.” Hence, “music content” it seems can be anything from the sample values themselves (an example given in Wold et al.), to whatever high-level concepts “users might wish to specify” in their search.

Back to the program:
So, Nick and I propose some challenges to motivate problem-based thinking in the description of music by listening machines. We express these challenges using the formalism developed in a recent ISMIR 2014 paper: Formalizing the Problem of Music Description.

Briefly, the problem of music description is defined by a use case, which consists of specifications of the following four components: a music universe ${\Omega}$, a music recording universe ${\mathcal{R}_\Omega}$, a semantic universe ${\mathcal{S}_{\mathcal{V},A}}$, and success criteria ${\{P_i\}}$ (a set of Boolean predicates). ${\mathcal{S}_{\mathcal{V},A}}$ is a set of sequences built from the vocabulary ${\mathcal{V}}$ (a set of tokens, e.g., instrument names) according to the semantic rule ${A}$ (a Boolean predicate on sequences, e.g., permitting only unary sequences). A music description system is a map from ${\mathcal{R}_\Omega}$ to ${\mathcal{S}_{\mathcal{V},A}}$. This is done by two intermediate maps: one from ${\mathcal{R}_\Omega}$ to the semantic feature universe ${\mathcal{S}_{\mathbb{F},A'}}$ (a set of sequences built from the feature vocabulary ${\mathbb{F}}$ according to the semantic rule ${A'}$); and then one from ${\mathcal{S}_{\mathbb{F},A'}}$ to ${\mathcal{S}_{\mathcal{V},A}}$. The problem of music description is to make the map from ${\mathcal{R}_\Omega}$ to ${\mathcal{S}_{\mathcal{V},A}}$ (the music description system) “acceptable”, i.e., to satisfy the success criteria of the use case.
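To make the pieces of the formalism concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class and function names are hypothetical, the "feature" is a single mean absolute amplitude, the vocabulary is two made-up tokens, and the one success criterion only checks that outputs land in the semantic universe. It is not the formalism of the ISMIR paper, only one way to read it.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Tuple

Recording = Tuple[float, ...]    # element of R_Omega (toy: raw samples)
Features = Tuple[float, ...]     # element of S_{F,A'}
Description = Tuple[str, ...]    # element of S_{V,A}: a sequence of tokens
System = Callable[[Recording], Description]


@dataclass(frozen=True)
class UseCase:
    music_universe: str                            # Omega, specified in words
    recording_universe: str                        # R_Omega, specified in words
    vocabulary: FrozenSet[str]                     # V, a set of tokens
    semantic_rule: Callable[[Description], bool]   # A, a Boolean predicate
    success_criteria: Sequence[Callable[["UseCase", System], bool]]  # {P_i}

    def in_semantic_universe(self, s: Description) -> bool:
        """True if s is built from V and satisfies the semantic rule A."""
        return all(tok in self.vocabulary for tok in s) and self.semantic_rule(s)

    def is_acceptable(self, system: System) -> bool:
        """A system is acceptable when every success criterion holds."""
        return all(p(self, system) for p in self.success_criteria)


def compose(extract, classify) -> System:
    """Build the description system as classification after feature extraction."""
    return lambda r: classify(extract(r))


# --- Toy instances (hypothetical, for illustration only) ---
def extract(r: Recording) -> Features:
    return (sum(abs(x) for x in r) / len(r),)       # mean absolute amplitude


def classify(f: Features) -> Description:
    return ("loud",) if f[0] > 0.5 else ("quiet",)


TEST_RECORDINGS = [(0.9, -0.8, 0.7), (0.1, -0.05, 0.0)]


def outputs_are_valid(use_case: UseCase, system: System) -> bool:
    return all(use_case.in_semantic_universe(system(r)) for r in TEST_RECORDINGS)


use_case = UseCase(
    music_universe="music for solo clarinet",
    recording_universe="30 second excerpts of CD-quality recordings",
    vocabulary=frozenset({"loud", "quiet"}),
    semantic_rule=lambda s: len(s) == 1,            # only unary sequences
    success_criteria=[outputs_are_valid],
)
system = compose(extract, classify)
```

Calling `use_case.is_acceptable(system)` then asks the question the formalism poses: does this particular map satisfy the criteria of this particular use case? A real criterion would of course say much more than "the output is well-formed".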

The image above gives an overview of the problem of music description. First, on the far left, we have the music universe ${\Omega}$. This is the intangible stuff where “works” of music reside, whatever those are (see L. Goehr “The Imaginary Museum of Musical Works: An Essay in the Philosophy of Music,” Oxford University Press, 1992). Anyhow, a specification of ${\Omega}$ in the use case provides an indication of the problem domain, e.g., music for solo clarinet. An element of ${\Omega}$ is mapped in some way, e.g., recorded human performance or transcription, to an element in ${\mathcal{R}_\Omega}$. This is the universe of tangible observations, such as the bits of a CD track, the grooves on a record, the pressure fluctuations at an ear, or the notes on a printed score. Which of these it is depends on the use case. Unlike the elements of ${\Omega}$, the elements of ${\mathcal{R}_\Omega}$ are “sensed” by a description system. That means they are input to the system by its operator. A specification of ${\mathcal{R}_\Omega}$ describes this material, e.g., 30 second excerpts of CD-quality live recorded performance. Finally, a music description system maps ${\mathcal{R}_\Omega}$ to ${\mathcal{S}_{\mathbb{F},A'}}$ by a feature extraction algorithm ${\mathscr{E}}$; and then maps ${\mathcal{S}_{\mathbb{F},A'}}$ to ${\mathcal{S}_{\mathcal{V},A}}$ by a classification algorithm ${\mathscr{C}}$.

So, that all sets the stage for our five challenges.

One challenge is the replication detection challenge (RDC). Essentially, the abstract challenge in RDC is to find the duplicated music in a dataset ${\mathcal{D}} = \{({\mathbf r},s)\} \subset {\mathcal{R}_\Omega} \times {\mathcal{S}_{\mathcal{V},A}}$ by looking only at elements of ${\mathcal{S}_{\mathbb{F},A'}}$ or of ${\mathcal{R}_\Omega}$. The tasks of RDC can be highly specific, e.g.,

• find exact replicas: ${{\mathbf r}_x = \alpha {\mathbf r}_y}$, ${\alpha \ne 0}$, ${{\mathbf r}_x, {\mathbf r}_y \in \mathcal{D}}$
• find exact replicas up to a time shift: ${{\mathbf I}_x{\mathbf r}_x = \alpha {\mathbf I}_y{\mathbf r}_y}$, where ${{\mathbf I}_x, {\mathbf I}_y}$ are fat matrices extracting particular segments
• find exact replicas up to a time shift and linear time-invariant (LTI) transform ${T(\cdot)}$: ${{\mathbf I}_x{\mathbf r}_x = T({\mathbf I}_y{\mathbf r}_y)}$
• find exact replicas up to a time shift, LTI transform, and time dilation ${D(\cdot)}$: ${{\mathbf I}_xD({\mathbf r}_x) = T({\mathbf I}_y{\mathbf r}_y)}$.
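The first two tasks in this list can be sketched with a brute-force search. This is only an illustration under stated assumptions: recordings are plain Python lists of samples, the function names are hypothetical, slicing plays the role of the segment-extracting matrices ${{\mathbf I}_x, {\mathbf I}_y}$, and the tolerance `tol` is an arbitrary choice. A practical system would use something far cheaper than an exhaustive search over offsets.

```python
def is_scaled_replica(x, y, tol=1e-9):
    """True if x = alpha * y for some alpha != 0 (within a relative tolerance)."""
    if len(x) != len(y):
        return False
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    if xx == 0.0 or yy == 0.0:
        return False                      # an all-zero segment carries no evidence
    alpha = sum(a * b for a, b in zip(x, y)) / yy   # least-squares scale factor
    residual = sum((a - alpha * b) ** 2 for a, b in zip(x, y))
    return alpha != 0.0 and residual <= tol * xx


def find_shifted_replica(x, y, seg_len, tol=1e-9):
    """Return the first offsets (i, j) such that x[i:i+seg_len] is a scaled
    replica of y[j:j+seg_len], or None if no such pair of segments exists."""
    for i in range(len(x) - seg_len + 1):
        for j in range(len(y) - seg_len + 1):
            if is_scaled_replica(x[i:i + seg_len], y[j:j + seg_len], tol):
                return (i, j)
    return None


# Toy example: x contains four samples of y, scaled by -2 and shifted in time.
y = [0.0, 1.0, -2.0, 3.0, 0.5, -1.5]
x = [9.0, 9.0] + [-2.0 * v for v in y[1:5]]
match = find_shifted_replica(x, y, seg_len=4)
```

The later tasks (LTI transforms, time dilation) do not reduce to a sample-wise comparison like this one, which is part of what makes them a challenge.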

The tasks can also be less specific, seeking instead to identify elements of a dataset that are not independent observations, e.g.,

• find elements in ${\mathcal{D}}$ that are from the same superset recording
• find elements in ${\mathcal{D}}$ that are from the same artist
• find elements in ${\mathcal{D}}$ that are from the same album.

Why is this challenge needed? Because clean data is rare, and several well-used datasets in MIR have significant problems with replicas, e.g., GTZAN and Latin Music Dataset.

Another challenge is the feature inversion challenge (FIC), which seeks to appraise the relevance of features proposed for addressing the problem of music description. The abstract challenge is to infer from an element in ${\mathcal{S}_{\mathbb{F},A'}}$ the music content in ${\mathcal{R}_\Omega}$ that produced it. For instance, if ${\mathcal{V}}$ consists of instrument descriptors, then one hopes any element in ${\mathcal{S}_{\mathbb{F},A'}}$ is directly relatable to instruments specified or heard in the element of ${\mathcal{R}_\Omega}$ corresponding to it. How is the feature vocabulary related to the vocabulary? As a specific example, one cannot reliably infer the music content from the bags of frames of features used in the “unacceptable solution” of the realisation of the kiki-bouba challenge discussed here (and below). This is despite that system producing all “right” answers. So, what is the relationship between ${\mathcal{S}_{\mathbb{F},A'}}$ and content in ${\mathcal{R}_\Omega}$? Note too that this challenge can involve building a music similarity system that attempts to find music recordings in ${\mathcal{R}_\Omega}$ from an element of ${\mathcal{S}_{\mathbb{F},A'}}$, and then a listener tries to describe the characteristics of the “original” music recording.
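One reason a bag of frames can resist inversion is easy to show with a toy: the summary discards the order of the frames, so many different recordings map to the same feature. The sketch below assumes a made-up per-frame feature (frame energy) and a made-up summary (mean and variance over frames); it is not the feature set of the kiki-bouba "unacceptable solution", only an illustration of the general point.

```python
import random
from statistics import fmean, pvariance


def split_frames(signal, frame_len):
    """Cut a signal into non-overlapping frames, dropping any remainder."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def bag_of_frames(signal, frame_len):
    """Toy bag-of-frames feature: mean and variance of per-frame energy.
    Energies are sorted first so the result does not depend, even in
    floating point, on the order in which the frames occurred."""
    energies = sorted(sum(v * v for v in f) / frame_len
                      for f in split_frames(signal, frame_len))
    return (fmean(energies), pvariance(energies))


rng = random.Random(0)
frame_len = 4
original = [rng.uniform(-1.0, 1.0) for _ in range(32)]

# Play the frames of the same recording in reverse order.
reordered = [v for f in reversed(split_frames(original, frame_len)) for v in f]

same_feature = bag_of_frames(original, frame_len) == bag_of_frames(reordered, frame_len)
same_signal = original == reordered
```

Here `same_feature` is true while `same_signal` is false: given only the feature, there is no way to tell which of the two recordings produced it, let alone what either sounds like.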

Why is this challenge needed? Because too often it seems features are argued as being “relevant” because a system that uses them reproduces a lot of ground truth in a dataset. Relevance isn’t the only explanation of such an outcome.

A third challenge is the description inversion challenge (DIC): infer from the output of a music description system the music content it is describing. That is, go from ${\mathcal{S}_{\mathcal{V},A}}$ back to ${\mathcal{R}_\Omega}$ or ${\Omega}$ by describing the content of the music the system has described. Examples of this challenge are given by my recent AcousticBrainz reviews.

Why is this challenge needed? If the focus of content-based MIR is to make it easier to find music, then any proposed system should be tested against that goal. Would such a music recording be an acceptable result for a query consisting of the description given by the system? Does the description bear resemblance to the music content?

A fourth challenge is the use case challenge (UCC). We design this challenge to reflect on the applicability of a music description system for addressing useful problems of music description. The abstract challenge in UCC is to define the use case ${\{\Omega, \mathcal{R}_\Omega, \mathcal{S}_{\mathcal{V},A}, \{P_i\}\}}$ of the problem of music description given only the music description system, i.e., the map $\mathcal{R}_\Omega \to \mathcal{S}_{\mathcal{V},A}$, or the feature extraction and classification algorithms. For instance, given a feature extraction algorithm (e.g., MFCCs from 43ms 50% overlapped windows), and classification algorithm (e.g., single nearest neighbour), specify a use case such that a realisation of the system is useful to address music information needs of a musicologist.
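To see how little such a system specification says on its own, here is a sketch of what "43ms 50% overlapped windows" and "single nearest neighbour" pin down. The sampling rate of 44.1 kHz, the 30 second excerpt length, the function names, and the two-dimensional toy feature vectors are all assumptions for illustration; computing actual MFCCs is left out.

```python
import math


def frame_params(sample_rate, window_ms=43.0, overlap=0.5):
    """Window and hop sizes in samples for the stated analysis settings."""
    window = round(sample_rate * window_ms / 1000.0)
    hop = round(window * (1.0 - overlap))
    return window, hop


def count_frames(n_samples, window, hop):
    """Number of complete windows that fit in a recording of n_samples."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


def classify_1nn(query, training):
    """Single nearest neighbour: return the label of the closest training
    feature vector under Euclidean distance."""
    _, label = min(training, key=lambda pair: math.dist(query, pair[0]))
    return label


# Assumed: 44.1 kHz audio and a 30 second excerpt.
window, hop = frame_params(44100)
n_frames = count_frames(30 * 44100, window, hop)

# Hypothetical labelled feature vectors standing in for summarised MFCCs.
training = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((5.0, 5.0), "C")]
label = classify_1nn((0.9, 1.2), training)
```

Nothing in these numbers says what the labels mean, what music the recordings come from, or what would count as success. Those are exactly the components of the use case that the UCC asks one to supply.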

Why is this challenge needed? Though centered on the analysis of music recordings, the involvement of musicologists/users in MIR research is so far the exception, not the rule. This challenge brings into the fold real users with real information needs.

The last challenge is the kiki-bouba challenge (KBC), which I describe in more detail here. The abstract KBC is to build a system that can discriminate, identify, recognise, and imitate Aristotelian categories of “music.” These categories, named “Kiki” and “Bouba”, are completely specified by algorithms. For each of these four tasks, the use cases are different. For instance, the discrimination task specifies $\Omega$ as the kiki-bouba universe, $\mathcal{R}_\Omega$ as full kiki or bouba pieces (or excerpts), and ${\mathcal{S}_{\mathcal{V},A}}$ as a set of different labels, and its success criteria specify that discrimination be caused by the high-level content that distinguishes kiki from bouba. For the identification task, $\Omega$ might be the kiki-bouba universe, but could be something else.
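What "completely specified by algorithms" buys can be sketched with a toy generator. The two rules below are invented for this illustration and are not the actual Kiki and Bouba definitions: here a hypothetical "kiki" piece has short notes and wide leaps, and a hypothetical "bouba" piece has long notes and stepwise motion. Pieces are lists of (MIDI pitch, duration in seconds) pairs, and all names and parameter values are assumptions.

```python
import random

# Hypothetical category rules: (allowed durations, smallest step, largest step).
RULES = {
    "kiki": ((0.1, 0.2), 7, 12),    # short notes, wide leaps (in semitones)
    "bouba": ((0.8, 1.6), 1, 2),    # long notes, stepwise motion
}


def generate(category, n_notes, seed):
    """Generate one piece of the given category, reproducibly from a seed."""
    if category not in RULES:
        raise ValueError(f"unknown category: {category!r}")
    durations, min_step, max_step = RULES[category]
    rng = random.Random(seed)
    pitch = 60                                   # start on middle C
    piece = []
    for _ in range(n_notes):
        step = rng.randint(min_step, max_step) * rng.choice((-1, 1))
        pitch = min(96, max(24, pitch + step))   # keep within a playable range
        piece.append((pitch, rng.choice(durations)))
    return piece


def dataset(n_per_class, n_notes=16, seed=0):
    """As many labelled pieces as requested; the label is true by construction."""
    return [(generate(category, n_notes, seed + i), category)
            for category in ("kiki", "bouba")
            for i in range(n_per_class)]


data = dataset(n_per_class=3)
```

Because the categories are defined by the generator, the "ground truth" is not an annotation that can be wrong, anyone can regenerate the same data from the same seeds, and there is nothing to license.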

Why is this challenge needed? We design KBC to address significant problems faced by MIR, including insufficient data for training and testing; the quality, ambiguity and questionable relevance of “ground truth”; copyrights restricting data collection, use and sharing; and research irreproducibility. KBC constructively addresses all of these problems by using algorithmic music composition to pose a simple problem that can be defined explicitly and formally (more so than the problem of “music genre recognition”, for instance), has limitless data with perfect ground truth unencumbered by copyrights, that requires valid evaluation to solve, and also facilitates reproducibility.

Notably, I think most of these five challenges with my still-funny formalism are not yet any better defined than the “engineering approach” discussed above. Much work remains to be done in order to make them clearer and thus implementable in the real world. However, I think laying out the problem of music description in such a way exposes the underlying assumptions. Don’t forget the user! Don’t forget the sanity of your features! Verify your models! Is your data as “clean” as you think? Where is the music? What do you mean, “music”?