On folk-rnn cheating and music generation

The other day we posted the Endless magenta Traditional Music Session, and added 17,000+ tunes to the Endless folk-rnn Traditional Music Session. Listening to the differences between the results generated by the two systems shows just how important the representation can be for statistical machine learning. This is not surprising, but it kicked up an interesting argument I hadn’t heard before.

[Embedded tweets from the thread in question]

There are two implications in that thread. The first is that folk-rnn cheats to generate transcriptions. The second is that using ABC notation as a representation for music generation is cheating. I want to address both, and clear up some confusion.

First, of course folk-rnn is cheating. I have devoted a lot of time to raising awareness of cheating in machine learning systems, specifically in music informatics. I am the organiser of HORSE 2016 and the upcoming 2017 edition. I just delivered a 2-hour summer school session about how machine learning systems can appear to solve a problem without actually solving it (slides here). In that presentation I called folk-rnn a “horse” and demonstrated why: it appears able to compose music, but in fact it cannot.

Simply put, folk-rnn has an extremely brittle understanding of what it’s doing. It appears able to repeat and vary material, count time, create functional cadences, and create long-term structure. It clearly does extremely well in generating novel transcriptions that reflect many characteristics of the transcriptions on which it is trained, and does so without copying the training data. But those abilities evaporate when folk-rnn is pushed even a skosh outside its domain of expertise (whatever that is). This is discussed more fully in our forthcoming article, B. L. Sturm and O. Ben-Tal, “Bringing the models back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation”, Journal of Creative Music Systems.

Is it reasonable to expect anything else of folk-rnn? Is it reasonable to expect anything else of a far more complex artificial model that is necessarily completely divorced from where music actually happens? folk-rnn is essentially trained under duress, being forced to reproduce with high likelihood, one symbol at a time, real transcription sequences crowd-sourced on thesession.com. That’s not teaching music. (One of my favourite comments from a user at thesession.com is that we need to teach our models to dance before we can teach them to generate music.) Machine learning can seem like magic, sometimes black magic, but appearances are deceiving: machine learning produces systems that can be extraordinarily successful, but only in very limited contexts.
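
To spell out that training objective (this is the standard next-step formulation, stated generically here rather than quoted from our papers): given the symbols of a transcription so far, the network is adjusted to assign high probability to whichever symbol actually comes next, i.e., to maximise

    Σ_t log p(x_t | x_1, …, x_{t−1})

summed over all training transcriptions. Nothing in that objective refers to dancing, playing, or listening.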

To be blunt, folk-rnn is a computational idiot savant and the Endless folk-rnn Traditional Music Session is a parlour trick. However, folk-rnn is not so idiotic that it is just reproducing material from the training data. The manner in which folk-rnn cheats is not plagiarism. In spite of all this, our co-creation with folk-rnn and our collaboration with other professional musicians show just how useful an idiot folk-rnn actually is, both in the tradition from which its training data comes and outside it.

The second implicit claim from the thread above is that using ABC notation is cheating in music generation. I find this problematic for a few reasons. First, what exactly is the problem we are trying to solve, and what are the rules of the game? Second, ABC notation was developed as a shorthand to help remember Western European folk tunes. It’s a language for crib notes, but why should it be out of bounds for machine learning, especially when we aim to model folk tune transcriptions? Implicit in such a restriction is the notion that “real” music generation systems should use raw audio samples, or MIDI messages. But how are those representations more relevant to music than ABC or staff notation? I would argue that they are even further from where music happens. As representations of music, they are certainly far more arbitrary.
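
For readers who have never seen it, here is the flavour of ABC. This fragment is made up purely for illustration and is not from any dataset:

    X: 1
    T: An Illustrative Jig
    M: 6/8
    K: Gmaj
    |:GAB c2d | e2d d2B | GAB c2d | e2d B3:|

A handful of ASCII characters encode metre, mode, pitch and duration. The notation is compact, human-writable, and close to how many players of this music actually share tunes.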

We do not aim to design folk-rnn to “conquer” music. We are interested in building tools for creating music, and in how creative use can inform the design of such tools. As magenta’s Doug Eck has said many times, machine learning today is like the technology that made the electric guitar possible, and thus spurred new kinds of music. I really like that analogy. It would be silly to suggest the electric guitar cheats as a musical instrument because it uses an amplifier.

In our work with folk-rnn, we intentionally focus on modelling transcriptions; hence the titles of our two papers on the system: “Music transcription modelling and composition using deep learning”, and “Bringing the models back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation” (in press). We also focus on a particular kind of music (which we currently describe as “traditional dance music found in Ireland and the UK”). This is for a mix of practical and personal reasons:

  1. We have access to a lot of relevant data through thesession.com
  2. The characteristics of this music are such that they unfold over time scales that are amenable to current limits of computational modelling
  3. The tradition is alive with many practitioners around us in the UK, many of whom are open to the idea of involving computers
  4. I like to play and listen to this kind of music

The data at our disposal is over 23,000 textual transcriptions collected on thesession.com. The data sets we use are derived from the 13 MB file “sessions_data_clean.txt” on the folk-rnn data page. The first version of folk-rnn was just char-rnn trained on that big text file. The second version was developed in theano and lasagne, and trained on the tokenised data in “allabcwrepeats_parsed”. The third version involves only a change in the tokenisation of the training data, and a fuller coverage of the chromatic pitches by transposition. From our exhaustive review of all past applications of neural networks to music modeling and generation, it appears that folk-rnn is unique in that it models ABC transcriptions themselves, rather than converting them to other forms, like quantised MIDI events, or distributed pitch-time representations. There’s nothing else special about folk-rnn. What folk-rnn shows is that even with simple machine learning models — but ones more complex than Markov models — a good representation can make a huge difference in the usefulness of the resulting model.
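
To make the tokenisation point concrete, here is a minimal Python sketch of the idea of symbol-level tokenisation of ABC. It is not the folk-rnn tokeniser, whose rules and vocabulary differ; it only illustrates treating ABC symbols, rather than single characters, as the model’s vocabulary:

    import re

    # A rough, illustrative inventory of ABC symbol tokens. The real
    # folk-rnn tokeniser differs in its details and coverage.
    TOKEN_RE = re.compile(
        r"[MK]:\S+"              # information fields, e.g. M:6/8, K:Gmaj
        r"|\|:|:\||\|\||\|"      # bar lines and repeat signs
        r"|\[[12]"               # first/second ending markers [1 and [2
        r"|[=^_]?[a-gA-G][,']*"  # pitch, with optional accidental and octave marks
        r"|\d+/\d+|/\d+|\d+|/"   # duration tokens, e.g. 2, 3/2, /2
        r"|z"                    # rest
    )

    def tokenise(abc):
        """Split a one-line ABC transcription into symbol tokens."""
        return TOKEN_RE.findall(abc)

    # A made-up fragment, just to show the shape of the output:
    print(tokenise("M:6/8 K:Gmaj |:GAB c2d|e2d d2B:|"))
    # ['M:6/8', 'K:Gmaj', '|:', 'G', 'A', 'B', 'c', '2', 'd', '|', ...]

The difference from char-rnn is exactly here: where the first version saw “M”, “:”, “6”, “/” and “8” as five separate characters, the later versions see the single token “M:6/8”.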

The concession that ABC notation is permissible for music generation as long as “repeats are unrolled” is very peculiar. (I assume “unrolled” means that all symbols appearing between “|:” and “:|” are explicitly repeated.) Contrary to the thread above, none of the three versions of folk-rnn does this. They all treat repeat symbols, and first and second ending symbols, as elements of the transcription language. Why? Because we design folk-rnn to model transcriptions and not music.
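
For clarity, here is what “unrolling” amounts to, as a toy Python sketch under the simplifying assumption of plain |: ... :| repeats (a real unroller must also expand first and second endings into their two alternatives):

    import re

    def unroll_repeats(abc):
        """Write out each |: ... :| section twice so that the repeat
        signs disappear from the symbol sequence."""
        return re.sub(r"\|:(.*?):\|",
                      lambda m: m.group(1) + "|" + m.group(1),
                      abc)

    print(unroll_repeats("|:GAB c2d|e2d d2B:|"))
    # GAB c2d|e2d d2B|GAB c2d|e2d d2B

After unrolling, the repeat signs vanish and the sequence roughly doubles in length: the model must learn to literally restate material instead of emitting a two-character sign.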

It is critical to note that we do not manually add repeats where we think they should appear in generated transcriptions. We do not post-process the 47,000+ transcriptions on the Endless folk-rnn Traditional Music Session or the 3000 transcriptions in the folk-rnn Session Book Vol. 1 of 10 other than to transpose them to an acceptable pitch range or mode. Those collections are presented, warts and all. There is no curation in there. There is no fixing. When we do edit transcriptions, we make that clear, and show what we have changed, e.g., see my Aug 2015 blog post, Deep learning for assisting the process of music composition (part 4) and the preceding parts.

In section 5 of our 2016 paper, “Music transcription modelling and composition using deep learning”, we describe the results of an experiment where we trained a folk-rnn model using all the ABC transcriptions with unrolled repeats. (This unrolled ABC data can be found on the folk-rnn data page, in the file “allabcworepeats_parsed”.) We found that many of the generated transcriptions still have, by and large, the AABB form typical of this kind of music. (Fig. 4 in that paper shows an example.) So why, then, do we keep training on ABC with repeat signs? Three reasons:

  1. We aim to model transcriptions expressed using ABC notation, and repeat symbols are a part of the language
  2. The kind of music we are focusing on involves repetition
  3. We see no value in making the problem harder to solve

The bottom line is this: our research aims to create systems that are useful for making music, to gauge their usefulness, to investigate how to tune them for other kinds of music and ways of working, and to leverage that information to improve the design and use of these systems. More broadly, our research is concerned with notions of musicality in machines, and how to measure them. We recognise that these concepts and questions are not new, and that there exists a wealth of good research addressing them dating back many decades. Informed by good engineering practice, we are starting from an accessible and restricted domain of music practice that should be, and we find is, within reach of current methods. By bringing the models back to the practice and seeing what kinds of things they enable, we are trying to do machine learning that matters. The ultimate proof is in the use. Since we released it in 2015, folk-rnn has been used to create more than 23 pieces of new music; the growing list of related compositions is on the folk-rnn project page. Whether or not they are all compelling is debatable, but our recent concert of some of these works provides good proof that at least some are.
