A recent conversation on the Google magenta project is about benchmarking “music generation systems” like David Cope’s Experiments in Music Intelligence, magenta’s own recurrent neural networks, our folk-rnn system, the Iamus system, Pachet et al.’s flow machines, etc. This is a very good question because we who design and use these systems want to know whether some change has caused some benefit, or in what ways our system is better or worse than others.
A Google search for “Evaluating music generation systems” returns a lot of links. One of my past Paper of the Day blog posts discusses the article, C. Ariza, “The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems”, Computer Music Journal 33(2): 48–70, 2009. It’s a good article: easy to read and hard to forget. A recent article in a special journal issue devoted to music metacreation focuses on evaluating such systems: K. Agres, J. Forth, and G. A. Wiggins, “Evaluation of Musical Creativity and Musical Metacreation Systems”, ACM Computers in Entertainment, Fall 2016. My collaborator Oded Ben-Tal and I will have an article appearing in the second issue of the Journal of Creative Music Systems titled “Back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation.” In this post, I summarise our upcoming article and add some new material.
There are many approaches to evaluating “music generation systems”, but the first step is to understand that music is a human-centred activity steeped in rules and conventions that change with function, use, time and place. We try to stay consistent in our description of our folk-rnn system as a music transcription generation system. It is merely generating something that one can use to make music happen. Such a sentiment is echoed in Simon Colton’s 2016 NIPS critique of style transfer (here is a description). One cannot mistake the artifact for the art. Music is a process centred upon humans operating within culture(s).
This means evaluating “music generation systems” in any meaningful way must take into account human behavior and modern culture. No doubt this can be seen as inconvenient because it doesn’t immediately lead to computer-based experiments, cross-validation and the like, which produce with little effort a lot of numbers that can be averaged and compared in objective ways. The phenomenon of music is not so poor that it can be described in such a reduced way. So, when it comes to evaluating “music generation systems”, human or artificial, we need to use a set of methods that are richer and more relevant than comparing training loss curves, log probabilities of sequences, average test errors, and so on.
The second step to evaluating “music generation systems” is to be clear about what “evaluation” or “benchmarking” means. What is the question one is trying to ask through an evaluation? What is the latent quality one is trying to gauge or compare? How do we know a significant result in the lab is a practical result in the real world? As Kiri Wagstaff so nicely argues in her provocative 2012 keynote to ICML, doing machine learning that matters requires taking these tools back to practitioners to measure their real world impacts and failings.
Along these lines, we have thought of a variety of approaches to evaluate our folk-rnn system, and compare it with other generative approaches, each motivated by a different question and application.
- First-order sanity check: How do basic descriptive statistics compare between the training material and generated material? This is not about music at this level, but seeing the ways two datasets are similar or different.
- Music analysis: How well does a generated transcription work as a composition? How well does it exhibit structure? How coherent is it? How well do its functional elements work, such as melodic ideas, tension and resolution, repetition and variation, etc.? What should be changed, and how, to improve the piece?
- Performance analysis: How well does a generated transcription play? Are there awkward transitions, or unconventional figures? Something may “sound right” but not “play right”.
- Listening analysis: How plausible does the generated material sound when played? Listening to synthesized recordings is one thing, but hearing it played is another.
- Nefarious testing: How does the system behave when we push it outside its “comfort zone”? How fragile/general is its “music knowledge”? Our folk-rnn system seems able to count in order to correctly place measure lines, but this ability evaporates with minor modifications of an initialisation seed.
- Assisted composition: How well does the system contribute to the music composition “pipeline”, both within the conventions of the training data and outside them? Is it useful for composition? In what ways does it frustrate composition?
- Cherry picking: How hard is it to find something of interest, something good, something really good, in a bunch of material generated by a system? Is one in every 20 generated transcriptions good?
- Inspiration and stimulation: How does the generated material inspire and stimulate a composer/musician to create music?
- Paradigm shifting: To what extent does the output, or the thought of a machine generating such material, provoke anxiety in a practitioner? To what extent does it challenge assumptions about creativity and the music making process?
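Of the approaches above, the first-order sanity check is the one that translates most directly into code. Here is a minimal Python sketch of what such a check could look like. It is only an illustration, not the analysis in our article: it assumes each transcription is a string of whitespace-separated tokens, the two tiny corpora are made up, and the function names (`token_distribution`, `total_variation`, `mean_length`) are hypothetical. It compares mean transcription length and the token frequency distributions of two corpora using total variation distance.

```python
from collections import Counter


def token_distribution(transcriptions):
    """Relative frequency of each whitespace-separated token in a corpus."""
    counts = Counter(tok for t in transcriptions for tok in t.split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they share no tokens.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)


def mean_length(transcriptions):
    """Average number of tokens per transcription (corpus must be non-empty)."""
    return sum(len(t.split()) for t in transcriptions) / len(transcriptions)


# Made-up toy corpora, only to show the calls; real corpora would be
# thousands of transcriptions.
training = [
    "M:4/4 K:Cmaj C D E F | G A B c |",
    "M:6/8 K:Gmaj G A B d B G |",
]
generated = [
    "M:4/4 K:Cmaj C E G c | c G E C |",
]

if __name__ == "__main__":
    p = token_distribution(training)
    q = token_distribution(generated)
    print("mean length (training): ", mean_length(training))
    print("mean length (generated):", mean_length(generated))
    print("total variation distance:", total_variation(p, q))
```

A small distance here says only that the two datasets use tokens in similar proportions. It says nothing about whether a generated transcription works as music, which is what the remaining approaches are for.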
The organisation of our coming workshop and concert provides excellent opportunities for evaluation by bringing our system back to the practitioners. In a very real sense, the workshop and concert are just what’s visible to the public. Each event is a vehicle of opportunities for working together with experts and professional music makers to evaluate dimensions of folk-rnn and other generative systems (we have been testing the magenta basic-rnn model recently, as well as simple Markov chains). In the end, we are not going to have numbers and statistics, but qualitative observations that we will somehow have to translate into engineering decision making!