Horses in Umeå!

I’m happy to be speaking at the 6th Swedish Workshops on Data Science at Umeå University, Nov. 20-21 2018.

Title: Be a responsible data scientist: Identify and tame your “horses”

Abstract: A “horse” is a system that is not actually addressing the problem it appears to be solving. The inspiration for the metaphor is the real-life example of Clever Hans, a horse that appeared to have great skill in mathematics but had actually learned to respond to a prosaic cue confounded with the correct answer. Similarly, a model created through the statistical treatment of a large dataset and wielded by a data scientist can appear successful at solving a complex problem without actually being so. In this talk, I take a critical look at past applications of data science that exemplify contemporary practices, and identify where issues arise that affect the validity of conclusions. I argue that the onus is on the data scientist not to stop at describing how well a model performs on a given dataset (no matter how big it may be), but to go further and explain what they and their models are actually doing. I provide some examples of how researchers have identified and tamed “horses” in my research domain, music informatics.

Po’D (Paper of the day): Explaining explanations edition

Hello, and welcome to my Po’D (Paper of the day): Explaining explanations edition. Today’s paper contributes a survey of explainable AI (XAI): L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, “Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning”, eprint arXiv:1806.00069, May 2018. My interest in this area stems from: 1) horses; 2) ethics; and 3) the development of new methods of research and development in my field.

My one line precis of this paper: To address mistrust of AI, XAI should focus on interpretability and explainability.

Explainable AI (XAI) is exactly what it says: building systems that “explain” their learned behaviors, or at least are “transparent” enough to “interpret”, and thereby address a lack of public trust in AI. A major impediment here, as in much of engineering, is one of definition: what is meant by “transparent”, “to interpret”, and “to explain”? And to whom: an experienced engineer or a lay person? In a sense, we know these things when we see them in the context of some use case; but there is a need for standardising these criteria and systematising their evaluation in developing XAI.

Gilpin et al. argue that there is a need for both interpretation and explanation. For Gilpin et al., the difference between the two terms seems to be in who is acting. “Interpretation” involves someone/thing trying to link the behavior of a system with its input (e.g., activation maximisation in a deep image content recognition system). “Explanation” involves the system giving reasons for its behavior (e.g., a medical diagnostics system presenting a diagnosis and highlighting reasons for it).

Gilpin et al. do not explicitly define “an explanation”, but they propose two qualities of one: its “interpretability” and its “completeness”. The first is measured along lines of human comprehension. The second is measured along lines of inferring future behaviours of the system. An example they provide is an explanation that consists of all parameters in a deep neural network. It is complete because one can infer how the system will behave under any other input, but it is not interpretable because one cannot quickly comprehend how the behavior arises from the parameters. Gilpin et al. suggest that the challenge of XAI is to create explanations that are both interpretable and complete (which they see as a tradeoff), and which increase the public trust in that system.

Gilpin et al. identify three different approaches to XAI in research using deep neural networks (which they propose as a taxonomy of XAI): 1) explaining the processing of data by the network (e.g., LIME, salience mapping); 2) explaining the representation of data in the network (e.g., looking at the filters in each layer, testing layers in other tasks); and 3) making the system explainable from the start (e.g., disentangled representations, generated explanations). They also review some past work on interpretability and XAI.

Gilpin et al. conclude with a look at types of evaluation within each of the three approaches to XAI they identify. These include completeness for data processing; detecting biases for data representation; and human grading of produced explanations.

There are lots of good references in this article, some of which I would like to read now:

  1. B. Herman, “The promise and peril of human evaluation for model interpretability,” arXiv preprint arXiv:1711.07414, 2017.
  2. Q.-s. Zhang and S.-C. Zhu, “Visual interpretability for deep learning: a survey,” Frontiers of Information Technology & Electronic Engineering 19(1):27–39, 2018.
  3. A. S. Ross, M. C. Hughes, and F. Doshi-Velez, “Right for the right reasons: Training differentiable models by constraining their explanations,” arXiv preprint arXiv:1703.03717, 2017.
  4. F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv preprint: arXiv:1702.08608, 2017.
  5. A. Abdul, J. Vermeulen, D. Wang, B. Y. Lim, and M. Kankanhalli, “Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda,” in Proc. CHI Conf. on Human Factors in Computing Systems, 2018.
  6. R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, and F. Giannotti, “A survey of methods for explaining black box models,” arXiv preprint arXiv:1802.01933, 2018.

I have only three criticisms of this article. The first is that its title could better reflect its contents, e.g., “A survey of XAI”. It doesn’t propose an approach to evaluating interpretability of machine learning. (Or maybe its “approach” is via the taxonomy?) The second criticism is that the suitcase words “explanation” and “interpretability” remain just as vague by the end, and are even accompanied by another suitcase word, “completeness”. Definitions of these things are difficult to pin down. It’s nice to see Gilpin et al. go to some philosophical works to help with the discussion, but by the end I’m not sure we have anything more solid than, “I know it when I see it.” Finally, the discussion of the evaluation could be made more concrete by appealing to specific use cases of AI. I think notions of “explainability” make the most sense when put in a context of use, whether it’s by the machine learning researcher, or a bank loan officer working with a system. Anyhow, this article was well worth the time I spent reading it.

=== Update: Nov. 1 2018 (after reading group discussion)

In their review of past work explaining deep network representation, Gilpin et al. suggest transfer learning as a way of understanding what layers are doing. I don’t agree with this. This is like answering the question, “Why did my barber do a good job cutting my hair? Look, he also cooks good food.” Inferring what a layer must be doing by looking at its performance on a different task is just guessing.

No attention is paid to the harms of interpretability. For instance, spammers can adjust their strategy by adapting content to subvert explainable spam detection systems.

Going to use the Nottingham Music Database?

The “Nottingham Music Database” (NMD) has been appearing more and more in applied machine learning research and teaching over the past few years. It’s been used in tutorials on machine learning, and even educational books on deep learning projects. It’s been fun to generate music with computers for a very long time.

The music generation start-up company Jukedeck put some effort into cleaning an ABC-converted version of the database, offering it on GitHub. Most recently, NMD appears in this submission to ICLR 2019: HAPPIER: Hierarchical Polyphonic Music Generative RNN. Seeing how that paper uses NMD, and the conclusions it draws from the music generated by the models it creates, I am motivated to look more closely at the NMD, and to propose some guidelines for using it in machine learning research.

Here is the source page of the “Nottingham Folk Music Database” by Eric Foxley, which “contains about 1200 folk melodies, mostly British & American. They mostly come from the repertoire over the years of Fred Folks Ceilidh Band, and are intended as music for dancing.” It is a very personal collection, as Foxley describes: “Most tunes have been collected over a lifetime of playing (which started when I sat in at the back of many bands in the London area and elsewhere from the age of 12 onwards), and the sources from whom I learnt the tunes are acknowledged. These are all collected “by ear”, and details change over time. The arrangements, harmonies, simplifications are entirely mine. Where there is a known printed source, that is included. I apologise for any unknowing omissions of sources, and would be happy to add them.” Based on the date of Foxley’s website, this collection seems to have been assembled before 2001.

Foxley provides a description of the contents here:

  • “Jigs. This directory contains about 350 6/8 single (mostly “crochet-quaver” per half bar) and double jigs (mostly quavers).
  • Reels. 2/4 and 4/4. This includes about 460 marches, polkas, rants etc.
  • Hornpipes. These are played (but not written) dotted. We include about 70 hornpipes, schottisches and strathspeys. See the “Playing for Dancing” document for the distinction.
  • Waltzes. About 50 tunes with 3/4 time signature.
  • Slip jigs. These are jigs in 9/8 time.
  • Miscellaneous. This directory contains just a few tunes, which we play mainly for listening to, when dancers need a breather.
  • Morris. Just a sample few, about 30. They include some chosen for listening to, and some from the Foresters Morris Men’s repertoire.
  • Some Christmas ones (15).
  • About 45 tunes from the Ashover collection, provided by Mick Peat.”

Not listed there, but included in The Tunes, are tunes taken from Playford’s 1651 book, The Dancing Master.

Foxley provides a note on the distribution of the database: “We are happy for others to use tunes from our repertoire; after all, the tunes [we] use were picked up from others, and the traditional tunes are best! We just hope that you play them properly and carefully, not as streams of notes but as phrased music making folks want to dance.”

Foxley also provides a warning: “The melodies as stored are my interpretation of the essence of the tune. Obviously no respectable folk musician actually plays anything remotely like what is written; it is the ornamentation and variation that gives the tune its lilt and style.”

Foxley appears to have assembled his collection for at least two purposes: 1) as a collection for his group’s own music practice, playing for dances and other events (see this page of tunes for specific weddings); 2) as material for research on music analysis and on search and retrieval by computers.

NMD is thus a personal collection of an English folk music enthusiast and computer scientist with decades of experience in playing and dancing to this kind of music. Much of the collection is focused on dance music (jigs, reels, hornpipes, waltzes, slip jigs, Morris, Playford), but some of it is specialised (Miscellaneous, Christmas). A small portion of the collection comes from another person (Mick Peat). While it is an extensive collection for a single person, it is not extensive for a tradition (compare to the Morris music collection at The Morris Ring). What Foxley says should be emphasised: NMD is his collection of rough transcriptions of tunes that should never be performed as written, but when performed well should make “folks want to dance”.

Here are the first three guidelines for using the NMD:

1. Do not believe that, when you train a model on sequences from the NMD, your model is learning about music. Your trained model may show a good fit to held-out sequences in NMD. Do not believe that this means it has learned about the music represented by the NMD. Your model is learning about sequences in the NMD. Those sequences are not music, but impoverished, coarse and arbitrary representations of what one experiences when this particular kind of music is performed. Also, the music represented in NMD is not “polyphonic”. Each sequence of NMD provides a sketch of the melody (which all melody instruments play), and harmonic accompaniment (which is not always present).

2. If you are working with a generative model, your trained model may produce sequences that appear to you like the sequences in NMD. Do not convert those sequences to MIDI and then listen to an artificial performance of them to judge their success. Do not submit those synthetic examples together with synthetic examples of tunes from NMD to a listening test and ask people to rate how pleasant each is. Do not assume that someone with a high degree of musical training knows about the kind of music represented in the NMD.

3. Find an expert in the kind of music represented in the NMD and work with them to determine the success of your model. That means you should submit sequences generated by your model trained on NMD to these experts so that they can evaluate them according to performability and danceability.

Let’s have a look at a real example from NMD. I choose one at random among those I have experience playing. Here’s Foxley’s transcription of “Princess Royal” from what he says is the Abingdon Morris tradition:

.MS
title = "\f3Princess Royal\fP";
ctitle = "AABCBCB";
rtitle = "\f2Abingdon\fP";
timesig = 4 4;
key = g;
autobeam = 2;
chords;
bars = 33.

d^<'A' c^< |
b"G" a"D" g"G" d^< c^< |
b"G" a"D" g"G" g^ |
e^."C" d^< c^ e^ |
d^."G" c^< b d^ |
c^ "Am" b "g" a "f+" g "e" |
f<"D7" g< a< "c+" f< d "b" d^< "a" c^< |
b<"G" a< b< g< a"D7" f | g>"G" g :| \endstave.

e^.'B'"C" e^< e^ d^ | e^"C" f^"d" g^>"e" |
g^"C/e" f^"d" e^"c" d^"b" |
b<"G/d" a< g< b< a >"D7" |
g"G" g a."D7" a< |
b<"G" a< g g^. f^< | g^"G" d^ e^>"C" |
d^"G" b c^>"C" | \endstave.
\5,8 |! \continue.

d^ 'C' c^ |
b>"G" a>"D" |
g>"Em" d^"D7" c^ |
\-2 |
g>"Em" g^> | \endstave.
e^>."C" d^ |
c^>"C" e^> |
d^>."G" c^ |
\timesig = 2 4. b."G" d^< |
\timesig = 4 4. c^"Am" b "g" a "f+" g "e"|
\6,8 |! \endstave.

.ME

Here’s the ABC conversion from the Sourceforge NMD:

X: 20
T:Princess Royal
% Nottingham Music Database
P:AABCBCB
S:Abingdon
M:4/4
L:1/4
K:G
P:A
d/2c/2|"G"B"D"A "G"Gd/2c/2|"G"B"D"A "G"Gg|"C"e3/2d/2 ce|"G"d3/2c/2 Bd|
"Am"c"g"B "f#"A"e"G|"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G:|
P:B
"C"e3/2e/2 ed|"C"e"d"f "e"g2|"C/e"g"d"f "c"e"b"d|"G/d"B/2A/2G/2B/2 "D7"A2|\
"G"GG "D7"A3/2A/2|"G"B/2A/2G g3/2f/2|
"G"gd "C"e2|"G"dB "C"c2|"Am"c"g"B "f#"A"e"G|\
"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G||
P:C
dc |"G"B2 "D"A2|"Em"G2 "D7"dc|"G"B2 "D"A2|"Em"G2 g2|"C"e3d|"C"c2 e2|"G"d3c|
M:2/4
"G"B3/2d/2|\
M:4/4
"Am"c"g"B "f#"A"e"G|"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G||

Here’s the ABC from the Jukedeck NMD cleaned collection:

X: 20
T:Princess Royal
% Nottingham Music Database
Y:AAFBCBCB
S:Abingdon
M:4/4
L:1/4
K:G
P:A
d/2c/2|"G"B"D"A "G"Gd/2c/2|"G"B"D"A "G"Gg|"C"e3/2d/2 ce|"G"d3/2c/2 Bd|
"Am"cB AG|"D7"F/2G/2A/2F/2 Dd/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G:|
P:F
G|
P:B
"C"e3/2e/2 ed|"C"ef g2|"C/e"gf ed|"G/d"B/2A/2G/2B/2 "D7"A2|\
"G"GG "D7"A3/2A/2|"G"B/2A/2G g3/2f/2|
"G"gd "C"e2|"G"dB "C"c2|"Am"cB AG|\
"D7"F/2G/2A/2F/2 Dd/2c/2|"G"B/2A/2B/2G/2 "D7"AF|"G"G4||
P:C
zz dc |"G"B2 "D"A2|"Em"G2 "D7"dc|"G"B2 "D"A2|"Em"G2 g2|"C"e3d|"C"c2 e2|"G"d3c|
M:2/4
"G"B3/2d/2|\
M:4/4
"Am"cB AG|"D7"F/2G/2A/2F/2 Dd/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G4||

There’s something unusual in the Jukedeck processing. First, there is an F section that does not appear in the others, but just acts to balance the 3-beat bar before. Second, many of the bass notes (specified by a lower case letter) have been stripped out. Anyhow, by and large Foxley’s version and the Sourceforge NMD appear the same.
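
The second observation can be checked by deleting the bass-note annotations from a Sourceforge line and comparing the result with the corresponding Jukedeck line. Here is a minimal Python sketch for one line of the A part. The function name `strip_bass_notes` and the rule it applies (a quoted annotation made of a single lower-case letter a to g, optionally followed by # or b) are my assumptions about what was stripped, not a description of Jukedeck’s actual cleaning code.

```python
import re

# Match one complete quoted annotation at a time, so that the closing quote
# of one chord is never paired with the opening quote of the next.
ANNOTATION = re.compile(r'"([^"]*)"')

# Assumption: a bass note is one lower-case letter a-g with an optional
# sharp (#) or flat (b) sign, as in "g", "f#" or "c#".
BASS_NOTE = re.compile(r'[a-g][#b]?')

def strip_bass_notes(abc_line):
    """Remove quoted bass-note annotations but keep chords such as "Am" or "C/e"."""
    def keep_or_drop(match):
        return "" if BASS_NOTE.fullmatch(match.group(1)) else match.group(0)
    return ANNOTATION.sub(keep_or_drop, abc_line)

# Music copied from the two ABC versions above (line-continuation backslash omitted).
sourceforge = '"Am"c"g"B "f#"A"e"G|"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|'
jukedeck = '"Am"cB AG|"D7"F/2G/2A/2F/2 Dd/2c/2|'

print(strip_bass_notes(sourceforge) == jukedeck)
```

Slash chords such as "C/e" survive this rule, which agrees with what the Jukedeck version keeps in the B part.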

Let’s get a feeling for how this sequence becomes music, and how that functions together with a dancer. Below is the staff notation of the Abingdon version of Princess Royal (Foxley’s PDF resulting from his transcription) along with a video of a performance.

[Image: staff notation of the Abingdon version of “Princess Royal”, from Foxley’s PDF]

There are several important things to notice here. 1) The written and performed melodies deviate in many places, just as Foxley says they should; 2) The accompanying harmony here is sometimes not what is notated; 3) The musician closely follows the dancer, allowing enough time for them to complete the steps (hops and such).

When it comes to the notated version of the sequence, look at how the parts are structured and how they relate to one another. In the A part, bars 5-8 relate to bars 1-4. Patterns in bars 3 and 4 mimic those in bars 2 and 3. The B part contrasts with A, but its conclusion echoes that of A. The first 7 bars of part C are the first four bars of part A with doubled note lengths; and its last four bars are the last four bars of part A. There’s a lot of structure there! And these kinds of structures and patterns exist throughout the sequences in NMD.
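
The relation between parts A and C can be checked mechanically. The Python sketch below parses the melody of the opening of each part (copied from the Sourceforge ABC above) and counts how many leading notes of C are the notes of A at twice the length. The parser and the names `parse_melody` and `doubled_prefix_length` are my own illustration, and the parser is deliberately simplified: it assumes no accidentals, octave marks, rests or tuplets, which holds for these bars but not for ABC in general.

```python
import re
from fractions import Fraction

CHORD = re.compile(r'"[^"]*"')                      # quoted chord and bass symbols
NOTE = re.compile(r'([A-Ga-g])(\d+)?(?:/(\d+))?')   # pitch letter, optional length

def parse_melody(abc):
    """Return (pitch, length) pairs, with lengths in units of L:1/4."""
    melody = CHORD.sub("", abc)
    return [(pitch, Fraction(int(num or 1), int(den or 1)))
            for pitch, num, den in NOTE.findall(melody)]

def doubled_prefix_length(original, augmented):
    """Count leading notes of `augmented` that equal `original` at twice the length."""
    count = 0
    for (p1, d1), (p2, d2) in zip(original, augmented):
        if p1 != p2 or 2 * d1 != d2:
            break
        count += 1
    return count

# Pickup and first four bars of part A.
part_a = parse_melody('d/2c/2|"G"B"D"A "G"Gd/2c/2|"G"B"D"A "G"Gg|'
                      '"C"e3/2d/2 ce|"G"d3/2c/2 Bd|')
# Pickup, first seven bars and the 2/4 bar of part C.
part_c = parse_melody('dc |"G"B2 "D"A2|"Em"G2 "D7"dc|"G"B2 "D"A2|"Em"G2 g2|'
                      '"C"e3d|"C"c2 e2|"G"d3c|"G"B3/2d/2|')

print(doubled_prefix_length(part_a, part_c), "of", len(part_a), "notes match")
```

For these bars the match covers 17 of the 19 notes, which is everything up to the end of the seventh bar of part C. It breaks at the 2/4 bar, where part C keeps a dotted rhythm instead of doubling the last two notes of the fourth bar of A.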

Here are some more guidelines.

4. Look at whether the sequences generated by your model trained on the NMD exhibit the same kinds of structures and patterns as the sequences in the NMD. Are there similar kinds of repetitions and variations? How do the sections relate to one another? If you don’t see any of these kinds of things, your model is not working. If you don’t know what to look for, see guideline 3.
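
As a very rough first screening for guideline 4, one can count how many bars of a sequence exactly repeat an earlier bar. The Python sketch below does this for parts A and C of the Sourceforge ABC above. The function names and the exact-match criterion are my own assumptions; exact repetition misses variation and transposition entirely, so this is no substitute for guideline 3.

```python
import re

def melody_bars(abc_body):
    """Split an ABC tune body into bars of melody text.

    Simplifying assumptions: field lines look like "P:A" or "M:2/4", chords
    and bass notes are quoted, and spacing carries no musical meaning.
    """
    lines = [ln for ln in abc_body.splitlines() if not re.match(r"[A-Za-z]:", ln)]
    text = "".join(ln.rstrip("\\") for ln in lines)   # join continued lines
    text = re.sub(r'"[^"]*"', "", text)               # drop chord symbols
    text = re.sub(r"\s+", "", text)                   # drop spacing
    return [bar for bar in re.split(r"[|:]+", text) if bar]

def count_repeated_bars(bars):
    """Count bars that exactly repeat an earlier bar."""
    seen, repeats = set(), 0
    for bar in bars:
        if bar in seen:
            repeats += 1
        seen.add(bar)
    return repeats

tune = r'''P:A
d/2c/2|"G"B"D"A "G"Gd/2c/2|"G"B"D"A "G"Gg|"C"e3/2d/2 ce|"G"d3/2c/2 Bd|
"Am"c"g"B "f#"A"e"G|"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G:|
P:C
dc |"G"B2 "D"A2|"Em"G2 "D7"dc|"G"B2 "D"A2|"Em"G2 g2|"C"e3d|"C"c2 e2|"G"d3c|
M:2/4
"G"B3/2d/2|\
M:4/4
"Am"c"g"B "f#"A"e"G|"D7"F/2G/2"c#"A/2F/2 "b"D"a"d/2c/2|\
"G"B/2A/2B/2G/2 "D7"AF|"G"G2 G||
'''

bars = melody_bars(tune)
print(count_repeated_bars(bars), "of", len(bars), "bars repeat an earlier bar")
```

Here 5 of the 22 bars (counting the two pickups) are exact repeats: one inside part C, and the four closing bars that part C shares with part A. A generated sequence with no such repeats at all would be a warning sign worth raising with an expert.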

5. Do not train your sequence model on a MIDI conversion of the NMD. They are not the same. (The MIDI file created by Jukedeck from the tune above also has the wrong structure: AAAABCBCB instead of AABCBCB. Other MIDI files there are sure to have similar problems.) Training on MIDI conversions of the NMD will also add a lot more complexity to your model, and make training less effective. The ABC notation makes sequences that are quite terse, so why not take advantage of that?

Now let’s have a look at one of the examples generated by the HAPPIER model:

[Image: one of the generated examples shown in the HAPPIER paper]

The very first event shows something is very wrong. Overall, the chord progression makes no sense, the melody is very strange, and the two do not relate. There is none of the repetition and variation we would expect given the NMD. None of the four examples presented in the HAPPIER paper look anything like music from the NMD. There is some stepwise motion, so the HAPPIER model has that going for it; but it is clearly not working as claimed.

The HAPPIER paper claims the new model “generates polyphonic music with long-term dependencies compared to the state-of-the-art methods.” The paper says the HAPPIER models “perform better for melody track generation than the LSTM Baseline in the prediction setting” because their negative log likelihoods on sequences from NMD are lower. The paper also claims that the HAPPIER model “performs better in listening tests compared to the state-of-the-art methods”, and that “the generated samples from HAPPIER can be hardly distinguished from samples from the Nottingham dataset.” None of these claims are supported by the evidence.

That brings up the final guideline.

6. If you are going to train a model on the NMD, or on this kind of melody-focused music, compare your results with folk-rnn. The code is freely available, it’s easy to train, and it works exceptionally well on this kind of music (when it is represented compactly, and not as MIDI). I have yet to see any model produce results that are better than folk-rnn in the context of this kind of music.

An experimental album of Irish traditional music and computer-generated tunes

[Image: album cover]

For the past 6 months, the music album “Let’s Have Another Gan Ainm” has been distributed to reviewers and listeners in Europe and the USA as a new release of Irish traditional music. We are now publicly revealing that each track on the album includes computer-generated material, specifically material generated by our deep neural network folk-rnn.

Reviews of the album, both published and private, have been very positive. The album even received radio play. More information about our experiment and the music on the album (e.g., how each track came to be) can be found in our technical report. We show exactly what the computer generated and the changes that were made. More details about the reception of the album will be provided at a later time.

In the meantime, enjoy the album!