Hello, and welcome to Paper of the Day (Po’D): The Turing Test and the Evaluation of Generative Music Systems Edition. Today’s paper is C. Ariza, “The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems”, Computer Music Journal 33(2): 48–70, 2009. (Sidenote: Apologies for the massive delay in posts here, due to grant writing, publishing, university administrating, composing, and course teaching.)
A “generative music system”, broadly speaking, is a set of formal procedures (algorithms) that produces a sequence of symbols that a human (or other system) might be inclined to realise as “music”. One of my favorite such systems is David Cope’s Experiments in Musical Intelligence (EMI). (Here is a realisation of some of its output in the style of Mahler. It’s not entirely convincing to me, but there are some moments when I find the allusion to Mahler’s conventions strong.) Another example is the one we have created using LSTMs with my simple performance scripts. For anyone in the business of generative music systems, it is natural to ask how well they do, and to answer what YouTube commenter Oscar asks: “What exactly are you trying to achieve by teaching a machine how to compose ?because I think its really stupid” [sic].
In this paper, Ariza does an excellent job elucidating the “Turing Test”, and why many evaluations of generative music systems, though described as musical Turing Tests, are in fact simple listener surveys (and not Turing Tests). For instance, many of the informal evaluations of Cope’s EMI have involved testing whether an audience can tell its “computer-generated” music from “authentic” music. This is essentially what Ariza terms a “discrimination test,” and it provides no test of “intelligence” or “artificial thought.” Straight to brass tacks: “The medium of the [Turing test] cannot be altered[: it is language].”
Two models of Turing-test-like approaches for evaluating generative music systems are the “Musical Directive Toy Test” (MDtT) and the “Musical Output Toy Test” (MOtT). (These are “toy” tests because the interrogator is just a critic of outputs instead of an active participant in a two-way natural language discourse.) In the MDtT, the interrogator directs two agents to compose (e.g., harmonise a melody, melodize a harmony), and then determines which of the agents is the computer. In the MOtT, the interrogator inspects the output of two agents, and then determines which of the agents is the computer. Neither test provides a measure of thought or of musical understanding.
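To make the structure of the MOtT concrete, here is a minimal sketch of a single blind trial in Python. Everything in it is my own illustration, not anything from Ariza's paper: the function names, the representation of outputs as strings, and the toy interrogator `longest_is_machine` are all hypothetical.

```python
import random

def run_mott_trial(human_piece, machine_piece, interrogator, rng):
    """Run one blind MOtT-style trial; return True if the machine was identified."""
    # Shuffle the presentation order so that position carries no information.
    pieces = [("human", human_piece), ("machine", machine_piece)]
    rng.shuffle(pieces)
    # The interrogator sees only the two outputs and returns index 0 or 1
    # for the one it believes the computer produced.
    guess = interrogator(pieces[0][1], pieces[1][1])
    return pieces[guess][0] == "machine"

def longest_is_machine(first, second):
    """Hypothetical interrogator: guesses that the longer output is the machine's."""
    return 0 if len(first) >= len(second) else 1

rng = random.Random(0)
results = [
    run_mott_trial("C D E", "C D E F G A B", longest_is_machine, rng)
    for _ in range(20)
]
```

Note what the trial records: only whether the guess was right. Nothing in it involves a two-way exchange with the agents, which is why such a test says nothing about thought.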
The discrimination test, encapsulated by a listener survey, is another kind of test of a music generation system, but it has no relevance to what the Turing Test is meant to measure. Furthermore, its implementation can suffer from various problems: the role of subjectivity in judgements, ambiguity of task, and so on. Another problem – and one I have been thinking about – is what output to test. Everything produced by a generative system? Some things produced by a human? Which ones? Why those? A random selection? Why a random selection? The “curator” factor cannot be so easily excused from the experiment. Hence, the discrimination test may not be evaluating the music generation system, but the curated output of the music generation system. (Ariza identifies this problem with Cope’s tests of EMI.)
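For what it is worth, here is a rough sketch of how one might take the curator out of a discrimination test and then score the result. It assumes a setup of my own invention, not one described by Ariza: outputs are drawn uniformly at random with a recorded seed, each listener trial is a two-way guess with chance level 0.5, and trials are treated as independent. The names `sample_outputs` and `binomial_p_value` are hypothetical.

```python
import random
from math import comb

def sample_outputs(all_outputs, k, seed):
    """Draw k outputs uniformly at random with a recorded seed, so nobody hand-picks them."""
    rng = random.Random(seed)
    return rng.sample(all_outputs, k)

def binomial_p_value(correct, trials, chance=0.5):
    """One-sided exact p-value: probability of at least `correct` hits by pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical pool of 100 generated outputs, of which 10 are tested.
outputs = [f"output_{i}" for i in range(100)]
test_set = sample_outputs(outputs, k=10, seed=2009)

# Hypothetical result: listeners were right on 14 of 20 trials.
p = binomial_p_value(correct=14, trials=20)
```

With 14 correct out of 20, `p` comes to about 0.058, so even 70% accuracy on so few trials is weak evidence that listeners can tell the difference. And whatever the number, it measures discrimination of these sampled outputs only; it is still not a Turing Test.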
I like that Ariza references the Eliza Effect: “the susceptibility of people to read far more understanding than is warranted into strings of symbols strung together by computers” (from Hofstadter). (I have observed this kind of anthropomorphism in music genre recognition research.)
Ariza ends with a nice and strong point: “Generative music systems gain nothing from associating their output with the Turing Test; worse, overestimation may devalue the real creativity in the design and interface of these systems.” This recalls an observation in his introduction: “The nature of [the success of a music generation system] is rarely questioned.” This points to Oscar’s YouTube comment: “What exactly are you trying to achieve by teaching a machine how to compose ?because I think its really stupid” [sic]. And here is a link to Eight Short Outputs. My experience in using this LSTM to generate new music (and my frustration with using it to generate variations of Christmas carols) provides me (a population of 1) with all the evaluation I need relevant to my specific use case.
Sidenote: I learned from this paper that “CAPTCHA” is an acronym of “Completely Automated Public Turing test to tell Computers and Humans Apart.”