Paper of the Day (Po’D): A Plan for Sustainable MIR Evaluation Edition

Hello, and welcome to the Paper of the Day (Po’D): A Plan for Sustainable MIR Evaluation Edition. Today’s paper is B. McFee, E. J. Humphrey and J. Urbano, “A Plan for Sustainable MIR Evaluation,” in Proc. ISMIR, 2016. I was happy this year to see several papers devoted entirely to evaluation, and I really wish there could be more, maybe even a special session on evaluation at ISMIR 2017!

My one-line précis of this paper is: Here is a realistic plan to address many problems of MIREX, including cost, scalability, sustainability, transparency, and data obsolescence.

McFee et al. make a strong argument that MIREX has many shortcomings, and that these can be overcome. First, MIREX is very costly because it requires a centralised system with dedicated personnel to debug and run submitted code, and then to compute and assemble all the results. If the number of submissions increases, the human cost will become too great, so the current model does not scale. Even if the number of submissions stays the same, funding is unstable enough that action needs to be taken now. With regard to its benefits, MIREX typically provides very little feedback to researchers, instead summarising results with rough metrics. Where full results are given, the testing data might be hidden, which means one cannot inspect the original data that led to a particular result. Hence, MIREX participants receive little information about how they could improve their systems. Finally, though the code for testing MIREX submissions is open and available, most of the data remains private and is becoming outdated and overused.

McFee et al. propose a new approach that shifts a significant portion of the computational effort onto the participant. This “distributed computation” will free considerable human resources from debugging submissions and running them. In short, participants would be responsible for running their systems on an evaluation subset taken from a larger music dataset that is open, available, and used for many evaluation tasks. Participants would then upload the resulting system outputs to a central and possibly automated system, which would just compile and compare all the results. I think this pipeline is completely practical now that a lot of music data is available without legal jeopardy, thanks to Creative Commons licenses and the like.
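
To make this concrete, here is a minimal sketch of what the central “compile and compare” step could look like, assuming each participant uploads a CSV of (track_id, prediction) pairs for the shared evaluation subset. The file layout, directory names, and accuracy scoring below are my own illustration, not anything specified in the paper.

```python
# A rough sketch of the central aggregation step in a distributed evaluation,
# assuming each participant uploads a CSV of (track_id, prediction) pairs.
# The file format and the scoring are hypothetical, not from McFee et al.
import csv
import glob
import os

def load_predictions(path):
    """Read one participant's submission: track_id -> predicted label."""
    with open(path, newline="") as f:
        return {row["track_id"]: row["prediction"] for row in csv.DictReader(f)}

def load_ground_truth(path):
    """Read whatever ground truth has been collected so far (it may be partial)."""
    with open(path, newline="") as f:
        return {row["track_id"]: row["label"] for row in csv.DictReader(f)}

def accuracy(preds, truth):
    """Score a submission on only those tracks for which ground truth exists."""
    scored = [t for t in truth if t in preds]
    if not scored:
        return float("nan")
    return sum(preds[t] == truth[t] for t in scored) / len(scored)

if __name__ == "__main__":
    truth = load_ground_truth("ground_truth.csv")
    results = {}
    for submission in glob.glob("submissions/*.csv"):
        name = os.path.splitext(os.path.basename(submission))[0]
        results[name] = accuracy(load_predictions(submission), truth)
    for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {score:.3f}")
```

The point is only that, once the heavy lifting of running systems is pushed to participants, the central infrastructure reduces to bookkeeping of roughly this kind.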

Another piece of the puzzle, with an ever-growing collection of music data, is how to produce the “ground truth” for all running tasks in order to facilitate evaluation. McFee et al. argue that this need not be available at the outset, and propose instead to adopt an approach known as “incremental evaluation.” This approach posits that the “most informative” data points are likely those on which competing systems disagree: two systems choosing the same output for some input cannot be distinguished, regardless of whether they are both right or both wrong, so there may be little value in obtaining the ground truth for that input. Hence, incremental evaluation directs human effort toward determining the ground truth of the data points that produce disagreements. Combined with evaluating on subsets of a larger and growing data collection, several iterations of incremental evaluation would gradually yield a good portion of open, labeled data with which to train and evaluate systems. (It is a kind of “cold start” problem.)
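
As a toy illustration of the selection step at the heart of incremental evaluation, the sketch below ranks unlabeled tracks by how much the submitted systems disagree, and hands the most contentious ones to human annotators first. The disagreement measure (label entropy) and the data layout are my own choices here, not ones prescribed by the paper.

```python
# A toy sketch of disagreement-driven selection for incremental evaluation.
# The label-entropy disagreement score is an illustrative choice, not one
# prescribed by McFee et al.
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy of the predicted labels for one track (0 = full agreement)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_for_annotation(predictions, already_labeled, budget):
    """Return the `budget` unlabeled tracks with the highest disagreement.

    predictions: dict mapping track_id -> list of labels, one per system.
    already_labeled: set of track_ids that already have ground truth.
    """
    candidates = {t: p for t, p in predictions.items() if t not in already_labeled}
    ranked = sorted(candidates, key=lambda t: label_entropy(candidates[t]), reverse=True)
    return ranked[:budget]

# Example: three systems, four tracks; only the contested tracks go to annotators.
preds = {
    "track_a": ["rock", "rock", "rock"],  # full agreement: low priority
    "track_b": ["rock", "pop", "jazz"],   # full disagreement: high priority
    "track_c": ["pop", "pop", "rock"],
    "track_d": ["jazz", "jazz", "jazz"],
}
print(rank_for_annotation(preds, already_labeled=set(), budget=2))
# -> ['track_b', 'track_c']
```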

Finally, McFee et al. describe a “trial run” of their proposed plan for 2017: music instrument recognition. The authors are currently soliciting volunteers for this grand experiment, Project OMEC (complete the webform if you wish to participate).

By and large, I think the plan proposed by McFee et al. is feasible and attractive. I especially appreciate how it shifts a large burden of effort onto the participant. Of course, this shift could give rise to problems of integrity, e.g., participants “cheating” by changing their system predictions post hoc, or by training on the testing data; but I don’t see this as a major problem.

While I believe the shift in effort proposed by McFee et al. is in the right direction for improving evaluation practices in MIR, I think there must eventually be a far greater shift of responsibility. Furthermore, though the evaluation plan proposed by McFee et al. is more scalable and sustainable than MIREX, I wonder how the evaluation results it produces can be made meaningful rather than just transparent. Let me explain these two points in greater detail.

One key to any progressive research endeavour (science or engineering) is the responsibility of the researcher. The literature of MIR is replete with claims that systems successfully solve specific key problems. For instance, just from the one-time flagship MIR problem of music genre recognition:

  1. Since 61.1% accuracy is significantly greater than the 10% expected of random guessing, our system/features/learning algorithm is learning to solve, or is useful for solving, the problem.
  2. We see that our system confuses Rock and Pop, and so it is making sensible confusions.
  3. Since 91% accuracy is significantly greater than the state of the art of 61.1%, our system is better at solving this problem.
  4. Over the past several years, accuracies have apparently hit a ceiling, and so our progress on solving the problem has stalled.

Scratching a bit below the surface of this evidence (accuracies, confusions, recall, precision, F-score, … whatever figure of merit), however, reveals many worrying signs that this “evidence” may not actually be evidence of success at all. For instance, here are two real-life examples:

  1. a state-of-the-art “music genre recognition” system exploiting inaudible information to reproduce the ground truth of GTZAN;
  2. a state-of-the-art “music rhythm recognition” system using only tempo to reproduce the ground truth of BALLROOM.

What I am arguing (which is not original) is that the MIR researcher should take much more responsibility and care in interpreting their experimental results and in making subsequent claims of success. This is of course outside the purview of the plan proposed by McFee et al., resting instead with the daily practices of, and training in, the research discipline.

Related to this, though, is my point about the meaningfulness of the evaluation results produced by MIREX and by the plan proposed by McFee et al. I think it’s great to provide participants with fine-grained information, which could be useful for adjusting their submissions. However, I am highly critical of some situations in which the evaluation methodology amounts to comparing sets of answers: just see the two examples above, as well as Clever Hans the horse. (I am not claiming that this methodology is always bad.) Regardless of whether a system has correctly reproduced the ground truth of some observation, or whether two systems differ in their predictions for specific observations, I am arguing that the researcher has a responsibility to _explain_ the outcome. This practice is nearly completely missing in MIR, yet it can provide some of the best information for designing and improving machine music listening systems.

Let me give a real and recent example. The ISMIR 2016 Best Oral Presentation was won by Jan Schlüter of OFAI for his presentation of his vocal pinpointing systems. His paper details several experiments and comparisons of figures of merit that clearly demonstrate his proposed systems have learned something incredible from weakly labeled data. In his presentation, Jan probed what the systems have actually learned to do. He built a very nice online demo, and showed how his high-performing system can be made to confidently predict vocals where there are none just by adding a single frequency-sweeping sinusoid. It appears, then, that his “vocal pinpointing” system has not learned to pinpoint vocals, but rather has developed a sensitivity to particular time-frequency energy structures that are often, but not always, present in vocal signals. Hence, Jan’s creative experiment provides insight into the behaviour of his systems, the situations in which they could fail (e.g., singing voice with minor vibrato, or no singing voice but instruments that slide), and opportunities for improving both the systems and the training methods for solving the intended problem. How could such meaningful information come from a list of music tracks on which a system disagreed with other systems?
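
For readers who want to try this kind of probing on their own detectors, here is a rough sketch of the basic move: add a quiet frequency-sweeping sinusoid to a clip that contains no singing, and compare the detector’s outputs before and after. The sweep parameters are arbitrary and the detector is a placeholder; this is not Jan’s actual system or demo.

```python
# A rough sketch of probing a vocal detector with a frequency-sweeping sinusoid.
# `detector` is a placeholder for whatever system you want to interrogate; the
# sweep parameters are arbitrary, not those from Jan Schlüter's demo.
import numpy as np

def linear_chirp(n_samples, f0, f1, sr=22050, amplitude=0.05):
    """Synthesize a quiet linear sine sweep from f0 to f1 Hz across n_samples samples."""
    t = np.arange(n_samples) / sr
    duration = n_samples / sr
    phase = 2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * duration))
    return amplitude * np.sin(phase)

def probe(detector, audio, sr=22050):
    """Compare detector output on the clean clip and on the clip plus a sweep."""
    sweep = linear_chirp(len(audio), f0=200.0, f1=2000.0, sr=sr)
    return detector(audio, sr), detector(audio + sweep, sr)

# Usage with a hypothetical detector that returns P(voice present):
#   clean_score, perturbed_score = probe(detect_vocals, instrumental_clip, sr)
# A large jump in the second score suggests the detector keys on sweeping
# time-frequency energy rather than on singing voice per se.
```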

Machines learn the darndest things!

Of course, there are many problems in MIR for which this does not matter; we don’t always need to interrogate our machines. These are the problems where I see the plan proposed by McFee et al. making a substantial and lasting contribution. I am looking forward to seeing where it goes!


2 thoughts on “Paper of the Day (Po’D): A Plan for Sustainable MIR Evaluation Edition”

  1. Pingback: Paper of the Day (Po’D): Cross-collection evaluation for music classification tasks Edition | High Noon GMT

  2. Hey Bob, thanks for the great writeup!

    You’re of course quite correct about the importance of transparency in automatic annotations, and the plan outlined in our paper doesn’t directly address that issue. (For my money, this plan was all about efficiently gathering training data, and open eval is a nice side-effect — my coauthors may disagree, but that’s okay. :))

    I’d like to chime in on a couple of points that we couldn’t fully develop in the paper due to space constraints.

    First, although the paper describes incremental eval as a means of collecting “ground truth” against which estimates are compared for scoring, it can also be used by human judges to directly score estimates. This is essentially how the AMS task in MIREX works (see: Urbano et al., 2012): there is no ground truth for offline evaluation, but disagreement is used to prioritize which examples are scored by judges. In principle, this should safeguard against Hansing — given a sufficiently varied data set — since humans are directly observing the system outputs. If the data set is not sufficiently varied to distinguish causal factors from spurious correlates, then there’s probably not much we can do short of sampling more data. (Growing the data set over time is in our plan already, and this is exactly why.)

    Second, if all the predictions *and* audio are publicly available, any outside party is free to perform meta-analysis on the results. We briefly touch on this point in the paper, but there’s much more to be said. Raw disagreement would not be enough to do proper Hans-detection, but it would facilitate things like your analysis of the Ballroom dataset — provided someone is clever enough to find the spurious correlates. To do proper causal inference here would require interventions, and thus access to the estimator implementations, but we at least drop one barrier (access to data) that currently exists.

    Finally, to your point:

    > How could such meaningful information come from a list of music tracks on which a system disagreed with other systems?

    I think this isn’t so far-fetched. One of the points that has been brought up repeatedly offline (and I can’t recall if this came up during the Q&A after the presentation) is that we might not want to do pure incremental eval, but instead something like 80/20 incremental/uniform sampling. The broader benefit of this approach is that it could reveal failure cases common to all participating algorithms (eg, your minor vibrato example), at the loss of some short-term statistical efficiency in differentiating between systems. The hard part, of course, is abstracting from failure cases to failure modes, which is exactly our job as a scientists and engineers.
