Reading List for FDT3303: Critical Perspectives on Data Science and Machine Learning (2019)

The course is fully booked, with 23 students and a few auditors. We have a very good crop of papers for this inaugural edition of my PhD course. Some of these are classic papers (bold). Some are very new ones (italic). All deserve to be read and critically examined!

Nov. 1: Questions of Ethics
J. Bryson and A. Winfield, “Standardizing ethical design for artificial intelligence and autonomous systems,” Computer, vol. 50, pp. 116–119, May 2017.

Nov. 13: Questions of Performance
E. Law, “The problem of accuracy as an evaluation criterion,” in Proc. Int. Conf. Machine Learning: Workshop on Evaluation Methods for Machine Learning, 2008.

F. Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, and J. Hernández-Orallo, “Making sense of item response theory in machine learning,” in Proc. ECAI, 2016.

Nov. 15: Questions of Learning
D. J. Hand, “Classifier technology and the illusion of progress,” Statistical Science, vol. 21, no. 1, pp. 1–15, 2006.

E. R. Dougherty and L. A. Dalton, “Scientific knowledge is possible with small-sample classification,” EURASIP J. Bioinformatics and Systems Biology, vol. 2013:10, 2013.

Nov. 20: Questions of Sanity
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. ICLR, 2015.

S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, “Unmasking Clever Hans predictors and assessing what machines really learn,” Nature Communications, vol. 10, 2019.

Nov. 22: Questions of Statistics
C. Drummond and N. Japkowicz, “Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning,” J. Experimental & Theoretical Artificial Intell., vol. 22, pp. 67–80, 2010.

S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “Statistical and Machine Learning forecasting methods: Concerns and ways forward,” PLOS ONE, 2018.

S. Goodman, “A Dirty Dozen: Twelve P-Value Misconceptions”, Seminars in Hematology, 2008.

Nov. 27: Questions of Experimental Design
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “The reusable holdout: Preserving validity in adaptive data analysis,” Science, vol. 349, no. 6248, pp. 636–638, 2015.

T. Hothorn, F. Leisch, A. Zeileis, and K. Hornik, “The design and analysis of benchmark experiments,” Journal of Computational and Graphical Statistics, vol. 14, no. 3, pp. 675–699, 2005.

Nov. 29: Questions of Data
M. J. Eugster, F. Leisch, and C. Strobl, “(Psycho-)analysis of benchmark experiments: A formal framework for investigating the relationship between data sets and learning algorithms,” Computational Statistics & Data Analysis, vol. 71, pp. 986–1000, 2014.

Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, Christopher Ré, “Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging” arXiv 2019

S. Tolan, “Fair and unbiased algorithmic decision making: Current state and future challenges,” JRC Technical Reports, JRC Digital Economy Working Paper 2018-10, arXiv 2018.

Dec. 4: Questions of Sabotage
J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” arXiv:1710.08864, 2017.

S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane, “Adversarial attacks on medical machine learning,” Science, 2019.

Dec. 6: Questions of Interpretability
Z. Lipton, “The mythos of model interpretability,” in Proc. ICML Workshop on Human Interpretability in Machine Learning, 2016.

Malvina Nissim, Rik van Noord, Rob van der Goot, “Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor” arXiv 2019.

Dec. 11: Questions of Methodology
Z. C. Lipton and J. Steinhardt, “Troubling trends in machine learning scholarship,” in Proc. ICML, 2018.

Meyer, Michelle N., “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation” 13 Colo. Tech. L.J. 273 (2015).

Dec. 13: Questions of Application
K. L. Wagstaff, “Machine learning that matters,” in Proc. Int. Conf. Machine Learning, pp. 529–536, 2012.

Cynthia Rudin, David Carlson, “The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis” arXiv 2019.

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?,” Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014.

Machine Folk from GPT-2!

It was just a matter of time before someone retrained the massive GPT-2 language model (up to 1.5 billion parameters) on folk music. That’s what Gwern Branwen has done, described in this comment thread. I’m not sure which size model he used, but for training he seems to have used a concatenation of four folk-rnn datasets. In this post I want to analyze some of the samples generated by the resulting model, to help me determine whether there’s much difference in quality compared with transcriptions generated by folk-rnn models (100,000 examples here).

The file Gwern links to contains 12929 lines. I will generate 5 random numbers in [1, 12929] and then analyze the transcriptions closest to those line numbers.
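
In case anyone wants to reproduce the selection, here is a minimal sketch of the procedure in Python. The file name is a placeholder for Gwern’s dump, and I assume each transcription begins with an “X:” header line; both are assumptions, not part of his setup.

import random

with open("gpt2_folk_samples.txt") as f:    # placeholder name for Gwern's dump
    lines = f.read().splitlines()           # 12929 lines

# 0-based indices of lines that begin a transcription (assumed "X:" headers).
starts = [i for i, line in enumerate(lines) if line.startswith("X:")]

random.seed(0)                              # any seed; 5 draws from [1, 12929]
picks = [random.randint(1, len(lines)) for _ in range(5)]

for p in picks:
    # The transcription whose header is closest to the drawn line number.
    nearest = min(starts, key=lambda s: abs(s - (p - 1)))
    end = next((s for s in starts if s > nearest), len(lines))
    print("\n".join(lines[nearest:end]), end="\n\n")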

Here’s the transcription closest to line number 2881:

X: 47015
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
BA|GFGA BGGB|dBGB AGFA|GFGA BGBc|deed B2 BA|
GFGA BGGB|dBGB AGFA|GFGA BGBc|defd B2 Bc||
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 Bc|
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 dc||

Here is the tune notated:

[Image: the generated tune in staff notation]

The transcription appears to be a novel one. The training data has a tune and settings with the same name, but these are quite different from the above. This generated tune has a conventional AABB structure (assuming implicit repeat signs). Within each part there is some repetition and variation consistent with the style. The B part goes higher in register than the A part, which is also stylistically consistent. Both parts lack a cadence, however, sticking to the dominant. It is very easy to play on the accordion.

Overall, this is a boring tune since there is too much repetition and not enough variation. But I would say it is plausible since there is nothing clearly wrong with it.

The next two transcriptions generated by the model are given the same title:

X: 47016
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bd d2 |
|g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d2 (3Bcd |
g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d3c ||

X: 47017
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bcdf |
g2 gf gdBd | efge fgaf | gagf gabg | afge d2 Bc |
g2 gf gdBd | efge fgaf | gagf gabg | afge d3c ||

This behavior likely comes from the fact that in the dataset we created from thesession.org many tunes have “settings”, which are variations of a tune. So the titles will repeat and the tune is varied. Let’s look at how these three transcriptions relate by notating them as three voices:

[Image: the three transcriptions notated as three voices]

We can see the model has actually generated two reasonable variations of the tune. This is very impressive behavior! Neither setting, however, improves on the first.

Here’s the generated transcription closest to line number 8623:

X: 33166
T: Breakfast Epiphanies
M: 4/4
L: 1/8
K: Dmaj
(3def|:gecA G2FE|DFAF GBAG|FDAD BDAB|FAAG (3ABA (3def|
gecA G2FE|DFAF GBAG|FDAD BDAB|1FGFD D2 (3def:|2FGFD D2EF||
Geed cAA2|dedA BAGB|Ad~d2 ed (3Bcd|(3efe (3dcB A2cd|
ea~a2 bgaf|gedB cAGB|Ad~d2 ed (3Bcd|1(3efe d2 d2 (3ABc:|2(3efe d2 d2|

I love that title, but it actually appears in the training data. Here’s the dots for the generated transcription:

[Image: the generated transcription in staff notation]

This tune again has an AABB structure, with the register of the B part going higher than the A part. We see variation and repetition. The A part starts in a very interesting way, going V to IV to I, but it has too much repetition. The B part frankly doesn’t make much sense, and doesn’t relate to the A part. I find this part so strange that I’m not going to bother learning to play this tune. I would say this transcription is not plausible.

Here’s the generated transcription closest to line number 7798:

X: 47628
T: Shoe The Donkey
M: 2/4
L: 1/8
K: Gmaj
|: G/ |B>c BA | AG E/F/G/A/ | B>c BA | d/e/d/B/ G>G |
B>c BA | AG E/F/G/A/ | Bd ef | gB G3/ :|
|: d/ |Bd dB | ce ed/c/ | Bd c/B/A/G/ | AG E2 |
Bd dB | ce e/f/e/d/ | Bd ef | gB G3/ :|

The model has again plagiarized this name from tunes in the training data, but it has created a good polka with coherence.

[Image: the generated polka in staff notation]

The B part resembles a part in The Banks of Inverness. The sixteenth-note pickup is kind of strange. That the A and B parts end in the same way gives the tune a nice coherence. And it’s pretty easy to play on the accordion.

The only changes I would make are to the pickups: raising the G to an A before bar 1 and at the end of bar 4, and lowering the D to a C before bar 9.

Here’s the generated transcription closest to line number 1187:

X: 97541
T:The Victory
M:4/4
K:Cmaj
|: E G | c G E G F A d c | B d g f e c G E | F E D C B, D G, B, | 
C E G c B G F E | D E F G A c B c | d c B A G F E D | 
C E D F E G c e | (3 d e d c B c 2 :| 
|: B c | d G e G f G e G | d G e d c B A G | e c f d g e c A | 
B c d e d 2 B c | d G e G f G e G | d G e d c B A G | 
f e d c B d c A | (3 B A G (3 F E D C 2 :|

Whereas all the transcriptions we have seen so far resemble the training data we used to create the folkrnn v1 model, this one resembles the dataset we used to create the 2nd version — where we transposed and tokenized the transcriptions. Removing the spaces, rebeaming, and transposing to D produces this transcription:

[Image: the transcription with spaces removed, rebeamed, and transposed to D]

This is different from The Victory in the training data. Again we see an AABB structure. There are cadences, but the one in the A part is unexpected because the part is by and large aimless. The B part is better, and is plausible. The two parts do not relate. I don’t want to spend time learning to play this.

Finally, here’s the transcription at line 7929:

[Image: the generated transcription at line 7929, in staff notation]

This one is a total failure (and again the model has plagiarized the name). The counting is wrong in all bars except one. The melody doesn’t make a lick of sense.

So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven’t found much plagiarism in the tunes themselves.

In a future post I will select at random five transcriptions created by folk-rnn (v2) and perform the same kind of analysis. Will the quality of those transcriptions be as good as the ones created by GPT-2? What is gained by increasing the number of model parameters from millions to hundreds of millions, and by using a model pretrained on written English text?

Making sense of the folk-rnn v2 model, part 12

This is part 12 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. In this part, I continue looking at beam search for sampling from the model. We will now investigate restricting the search of the sample space.

Let’s limit the number of branches we explore at each depth when computing the joint probability mass function, searching only a fraction of them. First, consider a depth n=2 and plot the amount of probability mass we miss at the leaves as a function of transcription step and depth-dependent beam width. In this case, we set the beam width at a particular depth to be a fraction of the total number of branches from that depth.

[Figure: missing probability mass at the leaves vs. transcription step, depth n=2]

Each line above shows the probability mass we miss at each generation step when we compute P_{X_{t},X_{t+1}|X_{t-1}, \ldots}(x_t,x_{t+1}| x_{t-1}, \ldots ) for only a fraction of the sample space (shown in legend). In these realizations, when only considering those 5% of outcomes having the most probability mass (7 tokens at each depth) at each transcription step, we miss on average about 1% of the total probability mass. When only considering the top 10% (14 tokens at each depth), the average mass we miss is 0.1%. Considering only the top 20% (28 tokens at each depth), the average mass we miss drops another order of magnitude to 0.01%. So it seems we can get by reasonably well only searching a few dozen branches from each branch.
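
Here is a minimal sketch of how these missing-mass numbers can be computed for depth n=2. The helper next_token_probs, which returns the model’s 137-dimensional conditional distribution given a token history, is an assumption standing in for a call into folkrnn v2, not an actual function of the released code.

import numpy as np

V = 137                           # vocabulary size
frac = 0.05                       # depth-dependent beam width as a fraction
k = max(1, round(frac * V))       # e.g. 7 tokens at each depth for 5%

def missing_mass_depth2(history, next_token_probs):
    p1 = next_token_probs(history)                  # P(X_t | history)
    top1 = np.argsort(p1)[::-1][:k]                 # keep the k most probable branches
    kept = 0.0
    for x1 in top1:
        p2 = next_token_probs(history + [int(x1)])  # P(X_{t+1} | history, x_t)
        top2 = np.argsort(p2)[::-1][:k]
        kept += p1[x1] * p2[top2].sum()             # mass reaching the kept leaves
    return 1.0 - kept                               # probability mass we miss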

Here’s the missing probability mass at the leaves of a tree with depth 3:

[Figure: missing probability mass at the leaves vs. transcription step, depth n=3]

The average amount of mass we miss when considering only the top 5% of outcomes is about 2% of the total probability mass. When only considering the top 10%, the average mass we miss is 0.1%. Considering only the top 20%, the average mass we miss drops another order of magnitude to 0.01%. We also see in some cases that the transcriptions generated using at least 15% of the vocabulary are identical.

Finally, let’s consider a depth of 4. This time, we also restrict the sample space to those 4-tuples having a joint probability greater than 1e-6. (Otherwise, a beam width of 28 results in over 3,000,000 outcomes at each step.)

[Figure: missing probability mass at the leaves vs. transcription step, depth n=4]

Now we are beginning to see an increase in missing probability mass. For a beam width of 5% at each depth we miss about 3%. For 10% we miss 0.4%, and for 15% about 0.3%. We also observe, as expected, that the time it takes to produce a transcription increases. The normal folkrnn v2 model takes about 100 milliseconds to generate a transcription. For a beam width of 10% at each depth, a depth of two takes about 10 seconds. The same width for depth three takes about 30 seconds. And for a depth of 4 it is about 5 minutes. The algorithm can be parallelized at the leaves to help reduce this. The beam width can also be restricted to a total number of branches in the entire tree (which we explore below), or adapted to explore only those branches with the largest probability mass that sum to 0.9 in each layer, and so on.
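
That last adaptive alternative, keeping only those branches whose probability mass sums to 0.9 in a layer, could look like the following. This is just an illustrative numpy sketch; the function name is my choice and nothing like it exists in folkrnn.

import numpy as np

def adaptive_beam(probs, mass=0.9):
    # Indices of the most probable branches whose cumulative mass reaches `mass`.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, mass)) + 1   # smallest prefix reaching `mass`
    return order[:cutoff]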

Let’s look at the transcriptions generated by beam search with a beam width of 10% in trees of depth n.

[Image: a transcription generated with a 10% beam width]

Henrik Norbeck, who is a walking and talking catalogue of Irish traditional dance music, and who runs the weekly session in Stockholm, remarks:

This is very reminiscent of the first part of The Fermoy Lasses (played by The Dubliners here), to the point where I would even call it a variation on that tune – especially the first part. In fact Seamus Egan’s rendition of The Fermoy Lasses is probably further removed from the original tune than the first part of the generated tune. Another thing I observe is that it feels like it needs a third part. The first two parts are rather similar, but a third part that “goes somewhere else” would make it a good tune. At the moment with two parts it feels rather boring.

[Image: another transcription generated with a 10% beam width]

[Image: another transcription generated with a 10% beam width]

Henrik remarks on this one:

This tune sounds like it could have been composed by Paddy Fahy or Sean Ryan. There are already two tunes by them that are similar to each other – so much that in my mind they are connected – and this generated one becomes a third tune in the same class, but still a distinct tune.

Let’s see what happens with a tree of depth 7 initialized with 6/8 meter and major key tokens, and a beam width of 7% at each depth. This roughly corresponds to the model generating a joint probability distribution over entire bars. After about 16 hours of computation, here’s the resulting transcription:

[Image: the transcription generated with a tree of depth 7]

The end of each part gives this the feeling of a slide rather than a jig. The second part of this tune is more interesting than the first, but I do like how the cadences at the end of both parts are in contrary motion.

The computation time would have been even longer had I not restricted the sample space at each step to be far smaller than 137\times 10^7. For instance, I only evaluated the leaves from branches with P_{X_7| X_1, \ldots, X_6}(x_7|x_1,\ldots,x_6) > 0.01. Even so, and for this small beam width, the only step in which the probability mass was less than 0.95 was the first step (generating the “|:” token), where it was about 0.83.

Though these previous four transcriptions come from the same random seed initialization as “Why are you?” (shown at the very top), each is quite different from the others. I especially like the tune produced with a tree of depth 4. I can’t yet make any solid claim as to whether the quality of the generated transcriptions improves with deeper trees, but my feeling is that they seem more coherent, with parts that make more sense, than when folkrnn v2 generates one token at a time.

Now let’s look at what happens when we set the beam width to be a static number \beta. This means that we build a tree from the root to the leaves keeping only \beta branches at each depth. Now we are missing a major amount of probability mass.

[Figure: missing probability mass at the leaves with a static beam width]

Here’s a lovely hornpipe generated from a tree with only 7 branches:

[Image: the generated hornpipe, \beta = 7]

Doubling the number of branches but using the same random seed produces a rather poor tune:

[Image: the generated tune, \beta = 14]

Increasing the width to 21 but using the same random seed gives this excellent reel:

[Image: the generated reel, \beta = 21]

That bars 12 and 16 quote the main idea of the first part gives the tune coherence. That first measure does not appear anywhere in the training data.

With this strict approach to beam width, the generation of entire measures at once becomes very fast: generating an entire transcription now takes only a few seconds with a beam width of 10. Here are some example outputs created with the same beam width of \beta=10:

[Image: example transcription, \beta = 10]

[Image: example transcription, \beta = 10]

[Image: example transcription, \beta = 10]

One thing we notice is that when we seed folkrnn v2 with a particular meter and mode and sample with small beam widths, it generates quite similar transcriptions each time even though the random seed is different. Here are the results from four different random seeds using 2/4 meter and major mode:

[Image: transcriptions from four random seeds, 2/4 meter and major mode]

Here’s the same for a 4/4 meter and major mode:

[Image: transcriptions from four random seeds, 4/4 meter and major mode]

It’s interesting that the melodic contours are very similar. Increasing the beam width introduces more variety in the outputs as the random seed initialization changes.

Other variations of beam search are possible. For instance, we can restrict searching branches from the root that are within particular pitch classes.

Making sense of the folk-rnn v2 model, part 11

This is part 11 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10! Findings from these have appeared in my MuMu 2018 paper, “What do these 5,599,881 parameters mean? An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of the softmax layer”, and my presentation at the Joint Workshop on Machine Learning for Music ICML 2018: “How Stuff Works: LSTM Model of Folk Music Transcriptions” (slides here). There’s still a lot that can be done here to pick apart and understand what is going on in the model. It’s a fascinating exercise, trying to interpret this complex model. In this part, I start looking at beam search as an alternative approach to sampling from the model.

Let’s briefly return to what folkrnn v2 really is: a model of a joint probability mass function P_{X_1,X_2, \ldots}(x_1,x_2, \ldots), where X_t: \mathcal{V} \to [1,137] is a random variable at step t \in [1, 2, \ldots], and the vocabulary \mathcal{V} is a set of ABC-like tokens. This function can be equivalently factored by the chain rule:

P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) P_{X_3|X_1,X_2}(x_3|x_1, x_2) P_{X_4|X_1,X_2,X_3}(x_4|x_1, x_2, x_3)\ldots

Training a folkrnn model involves adjusting its parameters to maximize each one of these conditional probabilities for real sequences sampled from a dataset. When generating new sequences, we merely sample from each estimate of P_{X_t|X_1,X_2,\ldots, X_{t-1}}(x_t|x_1, x_2, \ldots, x_{t-1}), until we have sampled the stop token. This produces, for instance, the following transcription:

[Image: the generated transcription “Why are you …”]

Another random sampling will give a different sequence. This is how folkrnn.org is implemented, one token sampled at a time from a dynamic posterior distribution over the vocabulary.

We can factor the joint probability distribution another way, however:

P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1,X_2}(x_1,x_2) P_{X_3,X_4|X_1,X_2}(x_3,x_4|x_1, x_2) P_{X_5,X_6|X_1,X_2,X_3,X_4}(x_5,x_6|x_1, x_2, x_3,x_4)\ldots

These are distributions of pairs of random variables. While folkrnn v2 was not explicitly trained to maximize these for real sequences from the training data, I believe that the use of teacher forcing means it was equivalently trained to maximize these joint probabilities. This factorization also shows a different way to generate sequences using folkrnn v2, and one that can be argued to consider more context in generation:

  1. compute P_{X_1}(x_1)
  2. compute P_{X_2|X_1}(x_2|x_1) for each x_1 \in \mathcal{V}
  3. compute P_{X_1,X_2}(x_1,x_2) = P_{X_2|X_1}(x_2|x_1)P_{X_1}(x_1) for each (x_1,x_2) \in \mathcal{V}\times\mathcal{V}
  4. sample a pair of tokens from this distribution
  5. update the states in the model with the sampled tokens, and repeat the above until one of the sampled tokens is the stop token.

In this situation, we are computing probability distributions over the sample space \mathcal{V} \times \mathcal{V}. For a vocabulary of 137 tokens, this has 18,769 outcomes. We don’t need to stop at pairs of tokens, because we can also factor the joint probability distribution using tuples of size 3, which leads to a sample space of 2,571,353 outcomes. Pretty quickly, however, our sample space grows to trillions of outcomes … so there’s a limit here given the time I have left to live. We can, however, bring things under control by approximating these joint probability distributions.
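
To make steps 1–5 above concrete, here is a minimal sketch of sampling one pair of tokens from the joint distribution over \mathcal{V} \times \mathcal{V}. As before, next_token_probs is an assumed stand-in for the model’s conditional distribution given the token history, not part of the actual folkrnn code.

import numpy as np

V = 137

def sample_pair(history, next_token_probs, rng=None):
    rng = rng or np.random.default_rng()
    p1 = next_token_probs(history)                     # step 1: P(X_1 | history)
    joint = np.empty((V, V))
    for x1 in range(V):
        p2 = next_token_probs(history + [x1])          # step 2: P(X_2 | history, x_1)
        joint[x1] = p1[x1] * p2                        # step 3: chain rule
    flat = joint.ravel()                               # 18,769 outcomes
    idx = rng.choice(flat.size, p=flat / flat.sum())   # step 4: sample a pair
    return divmod(int(idx), V)                         # (x_1, x_2); step 5 happens outside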

Let’s think of this computational procedure as building a tree. For the general case of n-tuples of tokens, we create the first |\mathcal{V}| branches by computing P_{X_1}(x_1). From each of those branches we create |\mathcal{V}| more branches by computing P_{X_2|X_1}(x_2|x_1). We continue in this way to a depth of n and create all the leaves of the tree by computing the product of all values on the branches above it:

P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) \ldots P_{X_n|X_1,X_2,\ldots, X_{n-1}}(x_n|x_1, x_2, \ldots,x_{n-1}) = P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n)

Thinking of the procedure in this way motivates a question: how well can we estimate P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n) by searching fewer than |\mathcal{V}|^n leaves? We know from the expression above that we are multiplying conditional probabilities, so if at some depth one of those is zero or very small compared to others, we might as well trim that branch then and there. Another possibility is more strict: search only the \beta branches at each depth that have the greatest probability. In this case, computing the probability distribution for tuples of size 3 requires computing \beta^3 leaves instead of |\mathcal{V}|^3 = 2,571,353. Then after having computed those leaves, we ignore all others and scale the joint probability mass distribution to sum to unity. That strategy is known as beam search. Using a beam width of \beta = \infty computes all of the leaves at any depth. Making the width smaller saves computation, but at the price of a poorer estimate of the joint probability distribution. Let’s see how sampling from folkrnn v2 in this way performs.
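
Here is a minimal sketch of this truncated search, under the same assumed next_token_probs stand-in: expand only the \beta most probable branches from each node, giving \beta^n leaves after n depths, then scale the surviving joint mass to sum to unity and sample an n-tuple.

import numpy as np

def beam_joint_sample(history, next_token_probs, n, beta, rng=None):
    rng = rng or np.random.default_rng()
    leaves = [([], 1.0)]                                   # (partial tuple, its probability)
    for _ in range(n):
        expanded = []
        for prefix, prob in leaves:
            p = next_token_probs(history + prefix)
            for tok in np.argsort(p)[::-1][:beta]:         # the beta best branches
                expanded.append((prefix + [int(tok)], prob * p[tok]))
        leaves = expanded                                  # beta**depth partial tuples
    probs = np.array([prob for _, prob in leaves])
    idx = rng.choice(len(leaves), p=probs / probs.sum())   # renormalise and sample
    return leaves[idx][0]                                  # the sampled n-tuple of tokens

Setting beta to the full vocabulary size recovers the exact joint distribution; smaller values trade accuracy of the estimate for speed.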

The following transcription was generated using tuples of size n=2, and a beam width of \beta = \infty (meaning we compute the probability distribution over the entire sample space \mathcal{V}\times\mathcal{V}):

[Image: the generated transcription]

Though we use the same random initialization as “Why are you …” above, this produces a dramatically different transcription. The first pair of tokens selected are meter and mode, which are both different from before. Repeat bar lines are missing at the end of the two parts, but as a whole the tune holds together with repetition and variation of simple ideas, and cadences at the right places. Here I play it at a slow pace on the Mean Green Machine Folk Machine, imposing an AABB structure.

Here’s another transcription produced with a different random seed initialization.

[Image: the generated hornpipe]

There are no obvious mistakes and it doesn’t appear to be copied from the training data. And it’s a nice tune, playing well as a hornpipe. Here it is on Mean Green Machine Folk Machine.

Here’s a jig created by seeding the network with a 6/8 meter token:

[Image: the generated jig]

Another good tune that appears to be novel. Here it is on Mean Green Machine Folk Machine.

In the next part we will look at making \beta < \infty.