It’s an honor to be invited to speak as part of the 2017 SPARS Summer School. I started my European research career at the 2009 edition of SPARS, so it is nice to have the opportunity to reconnect with this community!
Here are my PowerPoint slides, for those who wish to follow along. (It was a two-hour talk, so I thought 200 slides would be just perfect, but we finished early, and with a 10-minute break in between. Will add more later :)
Preparing for this lecture gave me the opportunity to think about what I have done since arriving at QMUL in Dec. 2014. I arrived in London eager to integrate the formal design and analysis of experiments (DOE) into the music informatics research pipeline. So far, that “integration” turns out to be like fitting a square peg into a round hole.
DOE is a wonderfully developed and beautiful discipline. I have really enjoyed digging into Rosemary Bailey’s work, and my many conversations with the DOE experts around the UK. However, the kinds of experiments we do in music informatics specifically, and machine learning in general, are not really like those done in medicine and agriculture. There is enough ambiguity about what the treatments and experimental units are in machine learning experiments, and how those things relate to populations, that making progress becomes more like playing a game of “whack-a-mole”.
In lieu of a complete solution, I present glimpses of new ways of working. In developing machine learning systems, we all have big questions about what metrics to use, how to compare results from repeated trials of K-fold cross-validation, and so on (I try to identify all of these big questions in my introduction). But these questions are meaningless if our resulting system appears to be solving the problem but is actually not (it’s a “horse”: like Clever Hans, it exploits irrelevant cues to appear successful). So, instead, I propose that the two bigger and better questions we should ask are:
- What has my system actually learned to do? (Does it actually address the problem?)
- How can I improve it? (How can I make it actually address the problem?)
I present a series of cautionary tales from my research adventures, which not only demonstrate many of the failings of current practices, but also reveal some of the ways to answer the two bigger and better questions above. I talk not only about other people’s horses, but also about a horse of my own! I argue that horses aren’t all bad, and that they can be useful — but it is always good to know what is and is not going on “behind the reins”.
The conclusion is important:
The default position for any machine learning system should always be,
“horse until proven otherwise.”
Don’t stop at the cross-validation accuracy.
Work to explain what the system has actually learned to do.
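To make the point concrete, here is a toy sketch (my own illustration, not from the talk; all names and numbers are hypothetical) of how a confound can produce a perfect accuracy score. The training data contains a “confound” feature (think of a production artifact that co-varies with the class in one dataset) alongside a genuinely informative but noisy “signal” feature. A naive learner latches onto the confound, scores near-perfectly on data drawn the same way, and collapses to chance once the confound is broken:

```python
import random

random.seed(0)

def make_data(n, confounded=True):
    """Toy samples ((signal, confound), y). 'signal' is a weak, noisy cue
    for the label y; 'confound' equals y exactly in the confounded set
    (e.g. a recording artifact) but is random noise otherwise."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        signal = y + random.gauss(0, 2.0)
        confound = float(y) if confounded else float(random.randint(0, 1))
        data.append(((signal, confound), y))
    return data

def fit(train):
    """Pick the single feature whose 0.5-threshold rule best fits training."""
    accs = [sum((x[f] >= 0.5) == (y == 1) for x, y in train) / len(train)
            for f in (0, 1)]
    return accs.index(max(accs))

def accuracy(f, data):
    return sum((x[f] >= 0.5) == (y == 1) for x, y in data) / len(data)

train = make_data(400, confounded=True)
f = fit(train)  # the learned rule latches onto the confound, not the signal
print(accuracy(f, make_data(400, confounded=True)))   # near-perfect score
print(accuracy(f, make_data(400, confounded=False)))  # roughly chance
```

The first evaluation looks like a solved problem; only the second, which deliberately severs the confound from the label, exposes the horse. No amount of cross-validation on the first kind of data would have revealed it.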