The AI Music Generation Challenge 2020: Summary and Results

The AI Music Generation Challenge 2020 has now finished! The task for competitors was to build an artificial system that generates the most plausible double jigs, as judged against the 365 double jigs published in “The Dance Music of Ireland: O’Neill’s 1001” (1907). This collection was selected for a variety of reasons. First, it is recognized and well studied, and many of its tunes are still played today. Second, the structure of these jigs is clear and consistent, and thus relatively well defined. Finally, computer encodings of these tunes exist: a cleaned and corrected version of these jigs in ABC notation is available here: http://www.norbeck.nu/abc/book. There really is no reason an AI system cannot learn to create plausible double jigs in this style.
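For readers unfamiliar with ABC notation, here is an illustrative fragment in that format. It is my own invention, not a tune from the collection, and shows the shape of one eight-bar part of a double jig (two groups of three eighth notes per bar of 6/8):

    X: 1
    T: An Illustrative Double Jig
    M: 6/8
    L: 1/8
    K: Gmaj
    |: GBd gfg | ege dBG | GBd gfg | aga bge |
    GBd gfg | ege dBA | GAB dBA | BGG G3 :|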

There were six competitors in total. We will name the systems of these competitors after places in Ireland: Carrick, Connacht, Glendart, Killashandra, Shandon, and Tralibane. In addition, the benchmark was folk-rnn (v2), which was only seeded with the “M:6/8” token to produce each transcription.
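As background for how the benchmark was run, here is a generic sketch of seeded autoregressive sampling. The model interface below is hypothetical, not folk-rnn's actual code:

    import numpy as np

    def sample_transcription(model, vocab, rng, max_len=500):
        # Seed with only the meter token, then draw one token at a time
        # from the model's predicted next-token distribution until an
        # end-of-tune token (or a length limit) is reached.
        tokens = ["M:6/8"]
        while tokens[-1] != "</s>" and len(tokens) < max_len:
            p = model.next_token_distribution(tokens)  # hypothetical interface
            tokens.append(str(rng.choice(vocab, p=p)))
        return " ".join(tokens)

    # e.g.: sample_transcription(model, vocab, np.random.default_rng())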

Each competitor had to submit 10,000 transcriptions generated by their system in ABC or staff notation, MIDI, or MP3-compressed audio files. (An exception was made for Tralibane, who submitted only 1,000 transcriptions due to a slow sampling procedure.) Two systems (Carrick and Killashandra) produced ABC notation, which I rendered as staff notation in PDF. Two systems (Connacht, Tralibane) produced staff notation already rendered in PDF. The benchmark also produced staff notation in PDF. Two systems (Glendart, Shandon) produced MIDI, which I rendered as MP3 (timidity, piano, 120 bpm).
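The MIDI-to-MP3 conversion can be done with something like the sketch below, using mido, timidity, and lame. This is one way to do it, not a record of the exact commands I used:

    import subprocess
    import mido

    # Force every program change to piano and every tempo to 120 bpm.
    mid = mido.MidiFile("transcription.mid")
    for track in mid.tracks:
        for i, msg in enumerate(track):
            if msg.type == "program_change":
                track[i] = msg.copy(program=0)  # 0 = acoustic grand piano
            elif msg.type == "set_tempo":
                track[i] = mido.MetaMessage(
                    "set_tempo", tempo=mido.bpm2tempo(120), time=msg.time)
    mid.save("transcription_piano.mid")

    # Synthesize to WAV with timidity, then encode to MP3 with lame.
    subprocess.run(["timidity", "transcription_piano.mid", "-Ow", "-o", "tune.wav"], check=True)
    subprocess.run(["lame", "tune.wav", "tune.mp3"], check=True)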

Here are 10,001 tunes generated by the benchmark. (Each tune was titled by a seq2seq network I trained for this purpose, but these titles were not included in the materials given to the judges.)

Initially, I expected many more systems, and so devised an experimental design that spread the work over four judges: I would select five transcriptions at random from those generated by each system, and then assign two of each to every judge, such that each transcription was assessed by at least two judges. Since only seven systems needed to be evaluated, I instead had every judge evaluate every transcription.

The four judges were Paudie O’Connor, Kevin Glackin, Jennikel Andersson, and Henrik Norbeck – all experts in Irish traditional dance music and its performance. Each judge was aware of the competition, and that they would be evaluating transcriptions generated by AI systems. Each judge received 35 transcriptions to evaluate, and an evaluation sheet to use for each. The transcription file names were the random numbers used to select them from the collections. The judges received no explicit information about the systems that generated the transcriptions, though it was clear which transcriptions were generated by the same system by their appearance (e.g., MIDI or idiosyncrasies in the staff notation).

In judging, a transcription is rejected from further review if it meets any of the following criteria:

  1. It is plagiarized (P) [this means that the generated tune is an existing tune, or a very close variant];
  2. Its rhythm is not close to that of a double jig, or cannot be played with such a rhythm (R);
  3. Its pitch range is not characteristic, and cannot be made so by transposition (T);
  4. Its mode and accidentals are not characteristic, and cannot be made so by transposition (M).

Transcriptions that pass these four criteria are then assessed by each judge on a scale of 1-5 (strongly disagree to strongly agree) along the following dimensions:

  1. The melody is characteristic of the double jigs in “1001”;
  2. The structure and coherence are characteristic of the double jigs in “1001”;
  3. The tune is playable on an Irish traditional instrument (fiddle, whistle, flute, accordion);
  4. The tune is memorable;
  5. The tune is interesting.
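Putting the two lists together, here is a minimal sketch of one judge's evaluation record for one transcription. The names and structure are my own formalization, not the judges' actual sheet:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    REJECTION_CODES = {"P": "plagiarized", "R": "rhythm",
                       "T": "pitch range", "M": "mode/accidentals"}
    DIMENSIONS = ("melody", "structure", "playable", "memorable", "interesting")

    @dataclass
    class Evaluation:
        transcription_id: int
        rejection: Optional[str] = None  # a key of REJECTION_CODES, or None if passed
        scores: Dict[str, int] = field(default_factory=dict)  # dimension -> 1..5

        def total(self) -> Optional[int]:
            # Some judges summed their five scores into a total.
            if self.rejection is not None:
                return None
            return sum(self.scores[d] for d in DIMENSIONS)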

Here’s what a completed evaluation form looks like for transcription #6021:

(Some judges added up the scores to make a total.)

All selected transcriptions generated by Glendart and Shandon were rejected by all judges due to the rhythm criterion. This was not an artifact of my conversion from MIDI to audio. One judge remarked of the transcriptions from Shandon: “just random notes, have no rhythm at all”. Another judge remarked of the transcriptions generated by Glendart: “[they] have a rare rhythm that can at some parts be interpreted as a jig, with much goodwill, but as a whole they are not close to the rhythm of a double jig. For example, there are unnatural breaks, plus the parts are uneven and don’t make the double jig sense”. All other judges agreed with these assessments.

Four of the five selected transcriptions generated by Carrick were rejected due to plagiarism. The remaining transcription scored the following:

Scores for a transcription generated by Carrick

The small tally at the right shows the counts of each of the five possible scores, pooled over all judges and all five dimensions (20 scores in total). This transcription received only 5 fives and 2 fours across all the judges.
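Concretely, the tally pools the four judges' scores on all five dimensions and counts each value from 1 to 5. The per-judge scores below are hypothetical, chosen only to be consistent with the counts reported above:

    from collections import Counter

    judge_scores = [  # hypothetical: one list of five dimension scores per judge
        [5, 4, 5, 3, 2], [4, 5, 3, 2, 2], [5, 5, 3, 2, 1], [3, 2, 2, 1, 1],
    ]
    tally = Counter(s for scores in judge_scores for s in scores)
    print(sorted(tally.items()))  # [(1, 3), (2, 6), (3, 4), (4, 2), (5, 5)]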

Two of the five transcriptions generated by Killashandra were rejected due to plagiarism. The scores of the remaining three transcriptions are shown below:

Scores for transcriptions generated by Killashandra

Three judges rejected transcription #5102 for its rhythm (being closer to a slide than a double jig). The other two tunes had a diversity of scores.

All randomly selected transcriptions generated by Tralibane and Connacht passed the four rejection criteria. Here are the resulting scores for each:

Scores for transcriptions generated by Tralibane
Scores for transcriptions generated by Connacht

Finally, here are the results of the five randomly selected transcriptions generated by the benchmark:

Scores for transcriptions generated by the benchmark (folk-rnn v2)

Tune #1641 was rejected due to plagiarism. Judge D rejected tune #4101 due to its pitch range not being characteristic; the other judges rated it very low.

After the judges had completed their scoring of all transcriptions, we all met to discuss favorites and decide which transcriptions to award first and second prize. At this stage, the judges did not know which systems had generated which transcriptions, and did not know how the others had rated them.

Judge A said #8091 (Connacht) was their favorite. Judge B had three favorites: #8091 (Connacht), #7983 (benchmark), and #6021 (benchmark). The favorite of Judge C was #8091 (Connacht). Judge D mentioned all their favorites were unfortunately the plagiarized ones, but if they had to choose it would be #7983 (benchmark).

The judges agreed that the clear first-place winner was transcription #8091 (Connacht). From the scores above, we see that this tune received 16 top scores out of a possible 20. Here’s the transcription as the judges received it:

First Place: Transcription #8091 generated by Connacht

In the video below Paudie O’Connor plays the tune and gives it the title, “The AI Man”:

One minor change Paudie made to this transcription was to drop the second A in the first and fifth measures down to a G. A slightly more dramatic change was his dropping of the C-sharp in the B part, making it C-natural throughout. The judges’ discussion of this change highlighted that either would be fine. (In fact, some styles involve substituting C-sharp for C-natural, or even playing in between.)

The judges agreed that the clear second-place winner was transcription #7983 (benchmark). Here is that transcription as the judges received it:

Second Place: Transcription #7983 generated by benchmark (folk-rnn v2)

In the video below Jennikel Andersson plays the tune and gives it the title, “The Lonesome Fairy”:

The system Connacht was folk-rnn (v2), but sampled using beam search (at each step, a pair of tokens is selected from among the at most 20×137 pairs with the largest probability mass), with the generated transcriptions then filtered by an artificial “critic”. The critic first rejects any transcription that does not have an appropriate structure with respect to O’Neill’s 365 double jigs; this test was built from features and criteria I researched earlier this year. The critic then rejects any transcription that is too close to any in the folk-rnn v2 training data or O’Neill’s collection. Finally, the critic assembles a collection by comparing each accepted transcription to all others selected to that point, so that no two are too similar. Here are the resulting 10,001 transcriptions generated by Connacht.
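A sketch of how such a three-stage critic can be structured is below. The similarity measure and threshold are my own placeholder assumptions, standing in for the actual features and criteria:

    import difflib

    def too_close(a: str, b: str, threshold: float = 0.85) -> bool:
        # Placeholder similarity on token strings; the real critic used
        # researched features, not difflib.
        return difflib.SequenceMatcher(None, a, b).ratio() > threshold

    def curate(candidates, training_data, oneills, has_structure):
        collection = []
        for tune in candidates:
            if not has_structure(tune):  # stage 1: structure w.r.t. O'Neill's jigs
                continue
            if any(too_close(tune, t) for t in training_data + oneills):
                continue                 # stage 2: too close to known tunes
            if any(too_close(tune, t) for t in collection):
                continue                 # stage 3: too similar to an accepted tune
            collection.append(tune)
        return collection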

There was a second stage of evaluation, in which the judges assessed consistency of quality. Four transcriptions selected at random from each of the four best-performing systems (benchmark, Connacht, Killashandra, and Tralibane) were presented to all judges one at a time, and the judges rated the quality of each with respect to O’Neill’s collection. The three possible choices were low, mid, and high; it worked best at this stage for the judges to give a thumbs up, a thumbs down, or a sideways thumb to denote their judgement. It took about an hour to rate all sixteen without discussion.

To summarize the results for each system, I count the number of its transcriptions that are awarded at least one thumbs up. Of the remainder, I count the transcriptions with at least one sideways thumb. Here are the totals:

Two transcriptions by Connacht received at least one thumbs up. At least two transcriptions generated by each system were rated unanimously as low quality. Combined with the five transcriptions of each system evaluated in more detail by the judges, this shows that the consistency of each of these systems is low – with possibly Connacht at the high end of that low.
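Expressed as code, the tallying rule above is simply the following. The rating lists in the example are hypothetical:

    def tally(ratings):
        # ratings: transcription id -> the four judges' "up"/"mid"/"down" ratings
        ups = [t for t, r in ratings.items() if "up" in r]
        mids = [t for t, r in ratings.items() if t not in ups and "mid" in r]
        return len(ups), len(mids)

    # e.g. tally({"a": ["up", "mid", "down", "down"],
    #             "b": ["mid", "down", "down", "down"],
    #             "c": ["down"] * 4, "d": ["down"] * 4})  ->  (1, 1)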

These results show just how far the field has yet to go, even for a challenge as highly constrained as this one. For Irish traditional double jigs, even though one might believe the task is modeling a simple monophonic melodic line (because that is how some notate it), there are all kinds of implicit characteristics that should be considered, including harmonic motion (some say this kind of music is decorated chord progressions), opportunities for ornamentation (including double stops), playability, variation, etc.

One criticism from some of the judges was that the submissions would have been much easier to evaluate as audio files. This would be natural, since Irish traditional music is an aural tradition. However, it would have entailed making very poor and inauthentic syntheses of the transcriptions, missing most of the qualities of a performance of this music.

Another fair criticism from some of the judges is that the challenge should have started with a tutorial for participants about what Irish traditional music is, and what a double jig is. This would have made participants more aware of the qualities their systems should achieve to meet the challenge.

The AI Music Generation Challenge 2021 will be formally launched soon. It will consist of two different tasks:

  1. Build an artificial system that generates the most plausible Scandinavian polskas.
  2. Build an artificial system that generates a plausible second line to a given polska.

Participants can submit to either or both tasks. There will again be judges. We will record an introductory video about what a polska is, and what judges will be looking for in their evaluation. Awards will be given in both tasks. There will also be a final event where people will dance to selected polskas and vote for a winner!

Acknowledgments: The AI Music Generation Challenge 2020 was supported by the project Human Behaviour and Machine Intelligence (HUMAINT), and the project MUSAiC (ERC-2019-COG No. 864189).