Our journal article has now appeared: Sturm and Ben-Tal, “Taking the Models back to Music Practice: Evaluating Generative Transcription Models built using Deep Learning”, Journal of Creative Music Systems 2(1), 2017.
My one-line precis: Here are five ways to evaluate a music generation model that are far more meaningful and insightful than the daft “Turing test”.
The contents of this article formed my introduction at the panel, “Issues in the Evaluation of Creative Music Systems”, at the 2nd Conference on the Simulation of Music Creativity. The panel was organised by Róisín Loughran, who also has an article about evaluation in the same journal volume. So, I include below an adaptation of my panel notes.
The topic of evaluation seems to be mentioned quite frequently in music generation as an extremely difficult thing to do, and I wonder why. There are a number of different ways to go about evaluating a creative music system, but where I think people see the difficulty is in arriving at some kind of quantitative result that holds up to scrutiny. Researchers tend to place a lot of weight on numbers. For instance, they use Likert scales with vague terms about musicality, creativity, appreciation; or they count the number of experts (having more than N years of musical experience) who are fooled into thinking something computer-generated was composed by Bach. This is understandable since numbers are easy to compare; and with 30 or more of them, standard textbook statistical tests can be applied, and peer-reviewers can be satisfied that something has been shown worthy of publication. There’s not much meaning in that pursuit: how does any of it matter to music practice? Instead, the researcher needs to answer some fundamental questions:
- Why did you create your system?
- How does your system contribute to music practice?
Or, Who is the user and what’s the use?
What then matters is what I will term “The Wagstaff Principles” (inspired by Wagstaff, “Machine learning that matters,” in Proc. Int. Conf. Machine Learning, pp. 529–536, 2012):
- measure the concrete impact of an application of machine learning with practitioners in the originating problem domain;
- with the results from the first principle, improve the particular application of machine learning, the definition of the problem, and the domain of machine learning in general.
This is what we are trying to do in our work with folk-rnn. Our application of a basic off-the-shelf machine learning method is unique in that:
- it is not trying to “solve music”, i.e., it does not aim to create one general system that can represent and generate many kinds of different music;
- it is focused on modeling homophonic music transcriptions of a specific tradition of music;
- it is looking more broadly at what such a model can bring to music practices in and out of the tradition.
A meaningful evaluation of folk-rnn models requires interaction with practitioners in and out of the tradition. In our article, we consider several different users and dimensions of music practice, though not all are explicitly or clearly defined. We describe in the article five different ways we have evaluated models created using folk-rnn:
- First-order sanity check: compare the statistics of 30,000 generated transcriptions with those of over 23,000 training transcriptions. How do the statistics compare between the training material and generated material? This is not really about music at this level, but seeing the ways two datasets are similar or different. The user here is the machine learning engineer, interested in gauging basic degrees of success with model training.
- Nefarious testing: seek the music knowledge limits of the model. How does a system behave when we push it little by little outside its “comfort zone”? How fragile/general is its “music knowledge”? For instance, the folk-rnn models seem at first able to count in order to correctly place measure lines, but this ability evaporates with minor modifications of an initialisation seed. One user here is the machine learning engineer, locating “holes” of the model training. Another user is the composer, exploiting the unique boundaries of a model for generating interesting material — as done by Ben-Tal in his composition “Bastard Tunes”.
- Music analysis: examine the ways in which specific generated transcriptions are successful as music compositions. How well does a generated transcription work as a composition? Does it exhibit acceptable structure? How coherent is it? How well do its elements function, such as melodic ideas, tension and resolution, repetition and variation, etc.? What should be changed in a transcription to improve it as a piece? One user here is the music machine learning engineer, determining higher-level degrees of success in music generation. Another user is the music teacher, generating melodic material for in-class discussion and assignments.
- Performance analysis: take the models to real-world music practitioners. How plausible does the generated material sound when played? How well does a generated transcription play? Are there awkward transitions, or unconventional figures? Something may “sound right” but not “play right”. One practitioner we have worked with has come up with a term to describe some of the generated transcriptions: “QBW (it has quirks but it works)”. One user here is the music machine learning engineer, determining higher-level degrees of success in music generation.
- Assisted composition: in the context of assisted music composition, use the models to create music within the conventions of the training data, and also outside the conventions. How well does the system contribute to the music composition “pipeline”, both in the conventions of the training data and outside? Is it even useful for composition? In what ways does it frustrate composition? One user here is an amateur composer, like myself in composing The Millennial Whoop Jig and The Millennial Whoop Reel (both provided different perspectives on the limitations of the “autocomplete” model). Another is the designer of a commercial music composition tool. Another user is a professional composer who wants to generate material with which to begin composing by curation.
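As a minimal sketch of the first-order sanity check in the list above, the following compares token statistics between a training set and a generated set. It is not the procedure from the article: the whitespace tokenisation, the toy transcriptions, and the choice of the Jensen-Shannon divergence as a summary number are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
import math


def token_distribution(transcriptions):
    """Relative frequency of each whitespace-separated token (assumed tokenisation)."""
    counts = Counter(tok for t in transcriptions for tok in t.split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    tokens = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tokens}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy corpora standing in for the real transcription sets.
training = ["M:6/8 K:Cmaj G 2 A B c d | e 3 d 3 |", "M:4/4 K:Cmaj c 2 d e | f 4 |"]
generated = ["M:6/8 K:Cmaj G 2 A B 2 c | d 3 e 3 |", "M:4/4 K:Cmaj c d e f | g 4 |"]

p_train = token_distribution(training)
p_gen = token_distribution(generated)
print(f"JS divergence between token distributions: {js_divergence(p_train, p_gen):.4f}")

# Tokens that appear in only one of the two sets are often the quickest thing to inspect.
print("only in training:", sorted(set(p_train) - set(p_gen)))
print("only in generated:", sorted(set(p_gen) - set(p_train)))
```

A single number like this only says how similar two datasets are at the token level. It says nothing about music, which is why this check is aimed at the machine learning engineer rather than the practitioner.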
Since we wrote and submitted our article, we have thought of three other approaches to, or motivations for, evaluating folk-rnn:
- Cherry picking: How hard is it to find something of interest, something good, something really good, in a bunch of material generated by a model? We have used the same training data, same architecture, different representations, and same synthesis procedure, to create the Endless folk-rnn traditional music session and the Endless magenta traditional music session. I am having a very hard time finding anything good in the latter, where “good” means material I am moved to work with, and not necessarily something that sounds like traditional Irish music. The few promising ones that I have found (181942_236_13084, 180830_497_4888) require major changes to become something good, e.g., I have to determine the meter, I have to place repetitions, I have to work to make it sound more like a melody and less like noodling. Related to this, see my previous post on “cheating”.
- Inspiration and stimulation: How does the generated material inspire and stimulate a composer/musician to create music? As for cherry picking, I don’t find much from the Endless magenta traditional music session that inspires me. Perhaps this is due to “having a horse in the race”, so our current work aims to explore this with other people.
- Paradigm shifting: To what extent does the output, or thought of the machine generating such material, provoke anxiety in a practitioner? To what extent does it challenge assumptions of creativity and the music making process? This is probably more woo than anything, but is kind of interesting from a broad perspective of computational creativity. More thought is needed here.
None of the eight approaches above include the daft “Turing test”. I put “Turing test” in quotes because, most of the time it is invoked in the context of music generation, it is a distortion of the original. See this Po’D: C. Ariza, “The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems”, Computer Music Journal 33(2): 48–70, 2009. I use the term “daft” because I don’t know why one is attempting to fool a listener. What does one gain by measuring the number of people fooled? Can one do that in a meaningful way that is independent of their music experience? Who is an expert? So, you’ve created a system that has generated a chorale that a majority of your experimental subjects ascribe to J. S. Bach. So what? Who is the user and what’s the use? What exactly are its concrete impacts in the originating problem domain? What does it contribute to music practice? How does that information help improve the model?
Of course, this isn’t the end of the road for our research. We are still working to address the Wagstaff Principles to satisfactory extents for folk-rnn. How can we use what we have learned from a music and performance analysis to improve a model? How can we make a model fail in the right kind of way to be useful for composition? These are not matters that can be addressed by throwing more data at models, using different architectures, computing cross-entropies and the like. The translation from the practice to the engineering decisions we can make is non-trivial. That is what makes this research interesting.