Benchmarking “music generation systems”?

A recent conversation on the Google Magenta project is about benchmarking “music generation systems” like David Cope’s Experiments in Musical Intelligence, Magenta’s own recurrent neural networks, our folk-rnn system, the Iamus system, Pachet et al.’s Flow Machines, and so on. This is a very good question, because we who design and use these systems want to know whether some change has produced some benefit, or in what ways our system is better or worse than others.

A Google search of “Evaluating music generation systems” returns a lot of links. One of my past Paper of the Day blog posts discusses the article, C. Ariza, “The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems”, Computer Music Journal 33(2): 48–70, 2009. It’s a good article: easy to read and hard to forget. A recent article in a special journal issue devoted to music metacreation focuses on evaluating such systems: K. Agres, J. Forth, and G. A. Wiggins, “Evaluation of Musical Creativity and Musical Metacreation Systems”, ACM Computers in Entertainment, Fall 2016. My collaborator Oded Ben-Tal and I will have an article appearing in the second issue of the Journal of Creative Music Systems titled “Back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation.” In this post, I summarise our upcoming article, and add some new material.

There are many approaches to evaluating “music generation systems”, but the first step is to understand that music is a human-centred activity steeped in rules and conventions that change with function, use, time and place. We try to stay consistent in our description of our folk-rnn system as a music transcription generation system. It is merely generating something that one can use to make music happen. Such a sentiment is echoed in Simon Colton’s 2016 NIPS critique of style transfer (here is a description). One cannot mistake the artifact for the art. Music is a process centred upon humans operating within culture(s).

This means that evaluating “music generation systems” in any meaningful way must take into account human behaviour and modern culture. No doubt this can be seen as inconvenient, because it doesn’t immediately lead to computer-based experiments, cross-validation and the like, which produce with little effort a lot of numbers that can be averaged and compared in objective ways. But the phenomenon of music is not so poor that it can be described in such a reduced way. So, when it comes to evaluating “music generation systems”, human or artificial, we need to use a set of methods that are richer and more relevant than comparing training loss curves, log probabilities of sequences, average test errors, and so on.

The second step to evaluating “music generation systems” is to be clear about what “evaluation” or “benchmarking” means. What question is one trying to answer through an evaluation? What is the latent quality one is trying to gauge or compare? How do we know that a significant result in the lab is a practical result in the real world? As Kiri Wagstaff so nicely argues in her provocative 2012 keynote at ICML, doing machine learning that matters requires taking these tools back to practitioners to measure their real-world impacts and failings.

Along these lines, we have thought of a variety of approaches to evaluate our folk-rnn system, and compare it with other generative approaches, each motivated by a different question and application.

  1. First-order sanity check: How do basic descriptive statistics compare between the training material and the generated material? This is not about music at this level, but about seeing the ways in which two datasets are similar or different.
  2. Music analysis: How well does a generated transcription work as a composition? How well does it exhibit structure? How coherent is it? How well do its functional elements work, such as melodic ideas, tension and resolution, repetition and variation, etc.? What should be changed, and how, to improve the piece?
  3. Performance analysis: How well does a generated transcription play? Are there awkward transitions, or unconventional figures? Something may “sound right” but not “play right”.
  4. Listening analysis: How plausible does the generated material sound when played? Listening to synthesized recordings is one thing, but hearing it played is another.
  5. Nefarious testing: How does the system behave when we push it outside its “comfort zone”? How fragile/general is its “music knowledge”? Our folk-rnn system seems able to count in order to correctly place measure lines, but this ability evaporates with minor modifications of an initialisation seed.
  6. Assisted composition: How well does the system contribute to the music composition “pipeline”, both within the conventions of the training data and outside them? Is it useful for composition? In what ways does it frustrate composition?
  7. Cherry picking: How hard is it to find something of interest, something good, something really good, in a bunch of material generated by a system? Is one in every 20 generated transcriptions good?
  8. Inspiration and stimulation: How does the generated material inspire and stimulate a composer/musician to create music?
  9. Paradigm shifting: To what extent does the output, or the thought of a machine generating such material, provoke anxiety in a practitioner? To what extent does it challenge assumptions about creativity and the music making process?
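The first approach in the list above lends itself to a quick illustration. Here is a minimal Python sketch of such a first-order sanity check: it compares normalised pitch-letter histograms of two sets of ABC transcriptions using total variation distance. The toy corpora are placeholders, not our actual data, and counting pitch letters alone ignores accidentals, octave marks and durations, so this is the crudest possible comparison.

```python
from collections import Counter
import re

def pitch_histogram(transcriptions):
    """Normalised frequency of pitch letters (A-G, a-g) across ABC transcriptions."""
    counts = Counter()
    for t in transcriptions:
        counts.update(re.findall(r"[A-Ga-g]", t))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two histograms: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Toy corpora standing in for the training and generated material
training = ["GABc dedB|dedB dcBA", "D2FA dAFA"]
generated = ["GABc dBAG|ABcd efga", "DFAd fdec"]
print(total_variation(pitch_histogram(training), pitch_histogram(generated)))
```

One could extend the same idea to histograms of note durations, intervals, or n-grams of tokens; the point is only to flag gross mismatches before any musical judgement is attempted.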

The organisation of our coming workshop and concert provides excellent opportunities for evaluation by bringing our system back to the practitioners. In a very real sense, the workshop and concert are just what’s visible to the public. Each event is a vehicle of opportunities for working together with experts and professional music makers to evaluate dimensions of folk-rnn and other generative systems (we have been testing the magenta basic-rnn model recently, as well as simple Markov chains). In the end, we are not going to have numbers and statistics, but qualitative observations that we will somehow have to translate into engineering decision making!

Partnerships Tickets, Tue, 23 May 2017 at 19:00 | Eventbrite

This unique concert will feature works created with computers as creative partners drawing on a uniquely human tradition: instrumental folk music. We aren’t so interested in whether a computer can compose a piece of music as well as a human, but instead in how we composers and musicians can use artificial intelligence to explore creative domains we hadn’t thought of before. This follows on from recent sensational stories of artificial intelligence making both remarkable achievements — a computer beating humans at Jeopardy! — and unintended consequences — a chatbot mimicking racist tropes. We are now living in an age, for better or worse, when artificial intelligence is seamlessly integrated into the daily life of many. It is easy to feel surrounded and threatened, but at the same time empowered by these new tools.

Our concert centres on a computer program we have trained with over 23,000 “Celtic” tunes — typically played in pubs and festivals around Ireland, France and the UK. We will showcase works involving composers and musicians co-creating music with our program, drawing upon the features it has learned from this tradition, and combining it with human imagination. A trio of traditional musicians led by master Irish musician Daren Banarsë will play one set of computer-generated “Celtic” tunes. Ensemble x.y will perform a work by Oded Ben-Tal, a 21st-century homage to folk-song arrangements from composers such as Brahms, Britten and Berio. They will also perform a work by Bob L. Sturm created from material the computer program has self-titled “Chicken.” Another piece you will hear involves two computer programs co-creating music together: our system generates a melody and another system harmonises it in various styles it has learned, e.g., a Bach chorale. Our concert will provide an exciting glimpse into how new musical opportunities are enabled by partnerships: between musicians from different traditions; between scientists and artists; and last, but not least, between humans and computers.


Two upcoming events!


  1. Saturday March 25, 12-14h, at the London Southbank University, there will be a special workshop as part of the Inside Out Festival where a trio of master Irish musicians will play traditional music generated by our system, folk-rnn. Participants are invited to bring their own instruments and learn one of these new tunes. Registration will be announced shortly.
  2. Tuesday May 23, 19-20h30, at QMUL, there will be a unique concert featuring works created with computers as creative partners drawing on a uniquely human tradition: instrumental folk music. We aren’t so interested in whether a computer can compose a piece of music as well as a human, but instead in how we can use artificial intelligence to explore creative domains we hadn’t thought of before. Tickets will be available shortly.

The Drunken Pint, a folk-rnn original

Now, here’s a wild tune composed by our folk-rnn system:

T: Drunken Pint, The
M: 2/4
L: 1/8
K: Gmaj
|: G |B/c/d =c>B | AB/G/G B/A/ | G/A/G/F/ ED | Be/d/ dB/A/ |
G>D Bd | c2 B2 | A/B/A/G/ F/G/A/c/ | BG G :|
|: A/B/ |c2 E/F/G | ed- de | c>B AG | G>^F GA |
BG E>D | EG FE | D/^c/d/^c/ d^c | d3- :|

Here is a slightly cleaned version in common practice notation for those playing at home.
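Point 5 in the list above mentions the system’s apparent ability to count beats so as to place measure lines correctly. For those who want to check the bar lines of this tune themselves, here is a minimal Python sketch that tallies the duration of a bar in eighth-note units, assuming the tune’s M:2/4 and L:1/8 headers (so four units per full bar). It handles only the subset of ABC appearing in this transcription — duration digits, slashes, and the broken-rhythm markers `>` and `<` — and is in no way a general ABC parser.

```python
import re
from fractions import Fraction

# One ABC note/rest token (optional accidental, pitch letter, octave marks,
# duration digits and slashes), or a broken-rhythm marker.
TOKEN = re.compile(r"[=^_]?[A-Ga-gz][,']*\d*/*\d*|[><]")
NOTE = re.compile(r"[=^_]?([A-Ga-gz])[,']*(\d*)(/+)?(\d*)")

def note_length(token):
    """Length of one note token in units of L: (digits multiply, slashes divide)."""
    m = NOTE.match(token)
    length = Fraction(int(m.group(2)) if m.group(2) else 1)
    if m.group(3):  # '/' halves, '//' quarters, '/3' divides by 3, etc.
        length /= int(m.group(4)) if m.group(4) else 2 ** len(m.group(3))
    return length

def bar_length(bar):
    """Total duration of one bar, in L: units (4 per full bar for M:2/4, L:1/8)."""
    lengths, broken = [], None
    for tok in TOKEN.findall(bar):
        if tok in "><":
            broken = tok
            continue
        length = note_length(tok)
        if broken == ">":    # previous note dotted, this one halved
            lengths[-1] *= Fraction(3, 2)
            length /= 2
        elif broken == "<":  # previous note halved, this one dotted
            lengths[-1] /= 2
            length *= Fraction(3, 2)
        broken = None
        lengths.append(length)
    return sum(lengths)

# Every full bar of "The Drunken Pint" should total 4 eighth-note units
for bar in ["B/c/d =c>B", "AB/G/G B/A/", "G/A/G/F/ ED", "c2 B2"]:
    assert bar_length(bar) == 4
```

Run over the whole transcription, every full bar sums to 4, and the short bars (the one-unit pickups and the truncated bars before the repeats) pair up to complete each other — exactly the counting behaviour that, as noted above, evaporates under minor changes to the initialisation seed.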


I found this tune recently among the 70,000+ transcriptions we had the system generate in August 2015. (Actually, this tune comes from a model I built using char-rnn applied to transcriptions I culled from ….) Anyhow, the title is what caught my eye at first: a title created entirely by the system. Then I was happy to see that the tune has an AABB structure, and that the system was smart enough to deal with those two odd quaver pickups. It wasn’t until I learned to play it that I really began to appreciate it. What a fun drunken riot this little system has crafted!

Now, who wants to create the drunken dance that this piece should accompany?