folk-rnn composition competition!

Spread the news far and wide!

The aim of this competition is to explore an application of machine learning to music — in particular the online tool at

This model is an example of artificial intelligence trained on traditional tunes, mainly Irish and English. The web interface allows users to generate new melodies using a few parameters (there’s a video on the website explaining how it works). We are seeking works that make creative use of this tool to compose new pieces, which need not adhere to the idiom of the training material.

Submissions will be judged on their musical quality and their utilisation of outputs from the tool. The winning piece will be performed by a professional ensemble at a public concert in London, UK, in early October 2018. Professional recordings of the performance will be provided to composers and, with the permission of musicians and composers, made available on the project YouTube channel:

We welcome submissions from any composer without restriction of age or nationality. Attendance at the concert is not mandatory. There is no cost for submitting a work.

Rules for submitted works:

1. scored for any combination of the following instruments: flute, clarinet, violin, cello, and piano (only 1 each); no use of amplification or electronic instruments is allowed;

2. no longer than 10 minutes in duration;

3. must be derived in some way from material generated by the application;

4. must be accompanied by a written explanation of how the work arose from the use of artificial intelligence through the website. Composers may also accompany the text with illustrations (e.g., staff notation);

5. no restrictions on style, or on the way the outputs are used.

Important dates:
– August 31 2018: Submission of PDF score and required accompanying material by email to
– September 15 2018: Notification
– September 25 2018: Performance materials due
– October 9 2018: Concert (London UK)

For more information about the technology, see the following:

If you have questions or comments, contact Dr. Oded Ben-Tal:


Fully funded Media & Arts Technology PhD scholarships at QMUL for UK residents. Apply today!

Applications are now open for a second round of PhD scholarships for UK residents only to start in September 2018. 

Closing date Friday 15th June midnight.


Apply here:


Recording of the April machine folk session


As part of an outreach day at my university QMUL, I organised a group of musicians to play 7 sets of machine folk music. We played 14 tunes (5 of which are real traditional tunes). Here’s the set list:

  1. March to the mainframe (X:488, folk-rnn v2) with The Glas Herry Comment (folk-rnn v1)
  2. The Mal’s Copporim (folk-rnn v1) with Off to California (traditional)
  3. X:1166 (folk-rnn v2) with Rochdale Coconut Dance (traditional)
  4. Oats and Beans (traditional) with Optoly Louden (folk-rnn v1)
  5. Why are you and your 5,599,881 parameters so hard to understand? (folk-rnn v2) with Why are you still singing even when reduced to a 30-dimensional subspace? (folk-rnn v2 with dimension reduction of softmax layer parameters)
  6. The Portobello Hornpipe (traditional) with The 2714 Hornpipe (folk-rnn v2)
  7. Two Burner Brew No. 1 (folk-rnn v2) with The Hairpin Bend (traditional)
Musicians included: Bob Sturm (button accordion), Sandy Rogers (fiddle), Luca Turchet (mandolin), Emmanouil Benetos (piano accordion), Michael Mcloughlin (tin whistle), Dan Stowell (melodica and bones), and Cornelia Metzig (guitar).

Other sounds courtesy of East London Ambulance service and the QMUL clock tower (donging at 16h).

Looking at some Panorama data: spectral dissonance

Background info here.

At first, we looked at madmom tempo estimates of the field recordings in our collection and found impressive agreement with the “ground truth”. Then we looked at contrasts in dynamics. Now we look at quantitative features describing the “spectral dissonance” of the music recordings in our collection.

We use the “dissonance” feature in the Essentia feature extraction library. This is not a musical descriptor, but rather a perceptual description of the relationships between spectral components in a 46 ms (at 44100 Hz sampling rate) or 43 ms (at 48000 Hz) frame of the recording. This feature is a number between 0 and 1, where 0 means the spectrum is totally “consonant”, and 1 means the spectrum is totally “dissonant”. The C++ code is here.
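For the curious, the underlying computation can be sketched in a few lines. The sketch below is an illustrative reimplementation of a Plomp-Levelt-style roughness measure (using Sethares’ parameterisation), not Essentia’s exact code: given the frequencies and amplitudes of the spectral peaks in a frame, it sums the roughness contributed by every pair of peaks, weighted by their amplitudes.

```python
import numpy as np

# Maximum of exp(-a*x) - exp(-b*x); it is independent of the critical-band
# scale s, so we can use it to normalise each pair's roughness into [0, 1].
_A, _B = 3.5, 5.75
_X = np.log(_B / _A) / (_B - _A)
_PEAK = np.exp(-_A * _X) - np.exp(-_B * _X)

def pair_dissonance(f1, f2):
    """Roughness of a pair of partials at frequencies f1, f2 (Hz),
    following Sethares' parameterisation of the Plomp-Levelt curves."""
    f_lo, f_hi = min(f1, f2), max(f1, f2)
    # The frequency difference of maximal roughness scales with the
    # critical bandwidth around the lower partial.
    s = 0.24 / (0.021 * f_lo + 19.0)
    d = f_hi - f_lo
    return (np.exp(-_A * s * d) - np.exp(-_B * s * d)) / _PEAK

def spectral_dissonance(freqs, amps):
    """Amplitude-weighted mean roughness over all pairs of spectral peaks
    in one frame; 0 for a single (or perfectly coincident) peak."""
    total, weight = 0.0, 0.0
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            w = amps[i] * amps[j]
            total += w * pair_dissonance(freqs[i], freqs[j])
            weight += w
    return total / weight if weight > 0 else 0.0
```

Fed with the peaks of a 46 ms frame (e.g., from Essentia’s SpectralPeaks), this returns a value near 0 for widely spaced partials such as an octave, and near 1 for partials about a semitone apart, which sit close to the roughness maximum.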

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10 minute recording is analysed below.


The grey line is the time-domain waveform, and the black line is the spectral dissonance feature. We see the spectral dissonance is pretty much distributed around 0.4. There’s a brief decrease of dissonance around 345 seconds (s), which is when someone in the crowd whoops.

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).


Here we see a mean spectral dissonance value a bit higher than in the previous recording. There are moments where the dissonance decreases, which seem to coincide with the moments when the treble drops out leaving the bass line, e.g., 355 s and 463 s. That is no surprise: the fewer instruments playing in a mixture, the more “consonant” we expect its spectrum to be.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


Not really interesting. Those little drops in dissonance correspond to brief moments in the recording where the audio drops out.

Here’s the spectral dissonance features for Phase II Pan Groove playing “More Love” at the 2013 competition:


This one is a little more interesting, but again we see that drops in this feature coincide with regions where fewer instruments are playing.

From all of these observations, one thing is clear: this feature does not seem relevant for our context, or really informative of anything else; what’s more, the term invites misunderstanding. It does not refer to “musical dissonance”, which is why I keep calling it “spectral dissonance”.

Looking at some Panorama data: loudness and contrast in dynamics

Background info here.

Last time, we looked at madmom tempo estimates of the field recordings in our collection and found impressive agreement with the “ground truth”.

Now, let’s look at dynamics. An important aspect of these performances, at least in recent years, is the contrast in dynamics, e.g., the band suddenly playing quietly followed by a crescendo. That about a hundred people can be so coordinated is impressive, and it’s clear from listening to many of these recorded performances that such a device excites the audience.

Can we find these moments automatically from loudness features extracted from the recordings? We extract quantitative loudness features using the ITU-R BS.1770-4 standard. This involves computing the mean square of samples from filtered input blocks of 400 ms duration (hopped by 100 ms). One filter takes into consideration the effects of the human head (boosting frequencies above 1.1 kHz by more than 3 dB), and the other is a high-pass filter, which takes something else into account (what, exactly, is not clear from the specs). Anyhow, there seems to be good agreement between this feature and the perceived loudness (of broadcast material). Let’s look at these loudness features for some of our recordings.
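The per-block recipe is simple enough to sketch. Below is a minimal implementation for a mono signal at 48 kHz; the two-stage “K-weighting” filter coefficients are the ones tabulated in BS.1770-4 for that sampling rate (other rates need re-derived coefficients), while the function name and defaults are my own:

```python
import numpy as np
from scipy.signal import lfilter

# BS.1770-4 K-weighting coefficients for fs = 48 kHz.
# Stage 1: shelving filter modelling acoustic effects of the head.
SHELF_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
SHELF_A = [1.0, -1.69065929318241, 0.73248077421585]
# Stage 2: simple high-pass (the "RLB" weighting).
HIPASS_B = [1.0, -2.0, 1.0]
HIPASS_A = [1.0, -1.99004745483398, 0.99007225036621]

def block_loudness(x, fs=48000, block=0.4, hop=0.1):
    """Loudness (LKFS) of each 400 ms block, hopped by 100 ms, for a
    mono signal x, following the BS.1770-4 recipe (no gating)."""
    y = lfilter(SHELF_B, SHELF_A, x)     # head-effect shelf
    y = lfilter(HIPASS_B, HIPASS_A, y)   # high-pass weighting
    n, h = int(block * fs), int(hop * fs)
    loud = []
    for start in range(0, len(y) - n + 1, h):
        ms = np.mean(y[start:start + n] ** 2)  # mean square per block
        loud.append(-0.691 + 10.0 * np.log10(ms))
    return np.array(loud)
```

As a sanity check, a full-scale 997 Hz sine should read about −3.01 LKFS, and halving the amplitude should lower the reading by about 6.02 dB. (The full standard adds channel weighting and gating for an integrated programme loudness, which we do not need for per-block features.)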

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10 minute recording is analysed below.

[Figure: CNRSMH_I_2011_023_001_01_sonogram_loudness.png]

Below the spectrogram we see the sound waveform (grey), and on top in black is the loudness feature. To the left and right are red dashed lines, which denote the “silent” regions of the waveform computed by the aubio silence detector vamp plugin. It is no surprise that the envelope of the audio waveform resembles the loudness feature. Some interesting things we see here: 1) the aubio silence detector is not a good music/speech discriminator, because the first 40 seconds of this recording is the announcer introducing the band; and 2) we see peaks in the loudness at around 60 seconds (s), 190 s, 340 s, and after 500 s. Those correspond to concentrations of energy around 600 Hz, which is cheering from the audience close to the recording equipment. These changes in loudness are not changes in the musical dynamics. (Or are the audience’s screams part of the music?) In this particular performance, I don’t hear much change in the dynamics of the performance.

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).
[Figure: CNRSMH_I_2011_043_001_01_sonogram_loudness.png]

As in the previous recording, the first 14 seconds are speech and the last 10 seconds applause, neither of which is picked up by the aubio plugin. Unlike the previous recording, this performance does feature one dynamic contrast: over 400-402 s. At 194 s, 352 s, and 463 s the treble drops out leaving the bass line. That is a timbral contrast rather than a dynamic one (or maybe the two should be considered related?). Anyhow, looking at the loudness data, we clearly cannot pick out which of those dips are related to real changes in dynamics.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


In this performance we find three contrasts in dynamics: 60-69 s, 192-200 s, and 577-584 s. All three of these are visible in the loudness, but perhaps only because we know what we are looking for.

Here’s the loudness features for Phase II Pan Groove playing “More Love” at the 2013 competition:


This performance features several huge crescendi: 214-221 s, 373-378 s, 380-386 s, 441-449 s, 449-456 s, and 501-507 s. The first is visible in the loudness feature only because we know it’s supposed to be there, but the other five are clearly visible. Here we see that the usefulness of this feature for automatically detecting changes in dynamics like crescendi depends on the crescendi being sufficiently long in duration, and on there being no interference from the audience… but what fun is that?

From all of these observations then, two things are clear:

  1. We need a reliable way to demarcate the announcement and applause from the music performance so we do not analyse features from the wrong content.
  2. The ITU-R BS.1770-4 loudness standard is not a reliable feature for automatically detecting the kinds of contrasts in dynamics that we are interested in.