Upcoming Machine Folk Performance

The nice people with the camera brought me down from the mountain to describe what I do.

A group of us will be playing machine folk on the QMUL lawn starting around 15h30.


Looking at some Panorama data: spectral dissonance

Background info here.

At first, we looked at madmom tempo estimates of the field recordings in our collection and found impressive agreement with the “ground truth”. Then we looked at dynamic contrasts. Now we look at quantitative features describing the “spectral dissonance” of the music recordings in our collection.

We use the “dissonance” feature in the Essentia feature extraction library. This is not a musical descriptor, but rather a perceptual description of the relationships between spectral components in a 46 ms (at 44100 Hz sampling rate) or 43 ms (at 48000 Hz) frame of the recording. This feature is a number between 0 and 1, where 0 means the spectrum is totally “consonant”, and 1 means the spectrum is totally “dissonant”. The C++ code is here.
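To give a feel for what a feature of this kind measures, here is a minimal sketch of pairwise sensory dissonance between spectral peaks, using a Plomp-Levelt-style roughness curve in a Sethares-like parameterisation. It is not Essentia's implementation: the constants, the normalisation and the function names are all assumptions for illustration. The helper at the top reproduces the frame durations quoted above, assuming a 2048-sample frame (the frame size is my assumption, not something stated here).

```python
import math
from itertools import combinations


def frame_duration_ms(frame_size, sample_rate):
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * frame_size / sample_rate


def pair_dissonance(f1, a1, f2, a2):
    """Roughness of two partials with frequencies in Hz and linear amplitudes.

    The constants are commonly quoted ones for this curve; treat them
    as assumptions, not as the values any particular library uses.
    """
    f_lo, f_hi = min(f1, f2), max(f1, f2)
    s = 0.24 / (0.021 * f_lo + 19.0)  # scales the curve with register
    x = s * (f_hi - f_lo)
    return a1 * a2 * (math.exp(-3.5 * x) - math.exp(-5.75 * x))


def spectral_dissonance(peaks):
    """Sum pairwise roughness over all peak pairs in one frame.

    `peaks` is a list of (frequency_hz, amplitude) tuples. The result is
    divided by the summed amplitude products (a hypothetical normalisation).
    """
    pairs = list(combinations(peaks, 2))
    if not pairs:
        return 0.0
    total = sum(pair_dissonance(f1, a1, f2, a2) for (f1, a1), (f2, a2) in pairs)
    weight = sum(a1 * a2 for (_, a1), (_, a2) in pairs)
    return total / weight if weight > 0 else 0.0


# Two partials a semitone apart are rougher than two a fifth apart.
semitone = spectral_dissonance([(440.0, 1.0), (466.16, 1.0)])
fifth = spectral_dissonance([(440.0, 1.0), (660.0, 1.0)])
print(semitone > fifth)
```

Note that nothing here knows about notes, scales or harmony: the value depends only on how close the spectral peaks of a single frame are to one another.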

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10 minute recording is analysed below.


The grey line is the time-domain waveform, and the black line is the spectral dissonance feature. We see the spectral dissonance is pretty much distributed around 0.4. There’s a brief decrease of dissonance around 345 seconds (s), which is when someone in the crowd whoops.

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).


Here we see a mean spectral dissonance value a bit higher than in the previous recording. There are moments where the dissonance decreases, which seem to coincide with the moments when the treble drops out leaving the bass line, e.g., 355 s and 463 s. That is no surprise: the fewer instruments playing in a mixture, the more “consonant” we expect the spectrum to be.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


Not really interesting. Those little drops in dissonance correspond to brief moments in the recording where the audio drops out.

Here are the spectral dissonance features for Phase II Pan Groove playing “More Love” at the 2013 competition:


This one is a little more interesting, but we see again that this feature coincides with regions where fewer instruments are playing.

From all of these observations then, one thing is clear: this feature does not seem relevant for our context, or really informative of anything else; and what’s more, the term is easily misunderstood. It does not refer to “musical dissonance”, which is why I keep calling it “spectral dissonance”.

Looking at some Panorama data: loudness and contrast in dynamics

Background info here.

Last time, we looked at madmom tempo estimates of the field recordings in our collection and found impressive agreement with the “ground truth”.

Now, let’s look at dynamics. An important aspect of these performances, at least in recent years, is the contrast in dynamic, e.g., the band suddenly playing quietly followed by a crescendo. That about a hundred people can be so coordinated is impressive, and it’s clear from listening to many of these recorded performances that such a device excites the audience.

Can we find these moments automatically from loudness features extracted from the recordings? We extract quantitative loudness features using the ITU-R BS.1770-4 standard. This involves computing the mean square of samples from filtered input blocks of 400 ms duration (hopped by 100 ms). The input samples pass through two filters: one takes into consideration the effects of the human head (boosting frequencies above 1.1 kHz by more than 3 dB), and the other is a high-pass filter, which takes something else into account (that is not clear from the specs). Anyhow, it seems there is good agreement between this feature and perceived loudness (of broadcast material). Let’s look at these loudness features for some of our recordings.
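The block computation described above can be sketched as follows. This is a simplified, single-channel illustration and not a conforming BS.1770 meter: the two pre-filters are left out (the input is assumed to be filtered already), and the function name and defaults are my own.

```python
import math


def block_loudness(samples, sample_rate, block_s=0.4, hop_s=0.1):
    """Loudness of each 400 ms block, hopped by 100 ms, for one channel.

    ASSUMPTION: `samples` has already been through the two pre-filters;
    no filtering is done here.
    """
    block = round(block_s * sample_rate)
    hop = round(hop_s * sample_rate)
    values = []
    for start in range(0, len(samples) - block + 1, hop):
        frame = samples[start:start + block]
        mean_square = sum(x * x for x in frame) / block
        if mean_square > 0:
            # -0.691 is the constant offset used in the standard's formula
            values.append(-0.691 + 10.0 * math.log10(mean_square))
        else:
            values.append(float("-inf"))  # digital silence
    return values


# One second of a constant signal at a toy sample rate of 1000 Hz
print(block_loudness([1.0] * 1000, 1000))
```

With a 100 ms hop there are ten loudness values per second of audio, so a 10 minute recording gives a curve of roughly 6000 points.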

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10 minute recording is analysed below.

CNRSMH_I_2011_023_001_01_sonogram_loudness.png

Below the spectrogram we see the sound waveform (grey), and on top in black is the loudness feature. To the left and right are red dashed lines, which denote the “silent” regions of the waveform computed by the aubio silence detector vamp plugin. It is no surprise that the envelope of the audio waveform resembles the loudness feature. Some interesting things we see here are: 1) the aubio silence detector is not a good music/speech discriminator because the first 40 seconds of this recording is the announcer introducing the band; and 2) we see peaks in the loudness at around 60 seconds (s), 190 s, 340 s, and after 500 s. Those correspond to concentrations of energy around 600 Hz, which is cheering from the audience close to the recording equipment. These changes in loudness are not changes in the music dynamics. (Or are the audience’s screams a part of the music?) In this particular performance, I don’t hear much change in the dynamics of the performance.

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).

CNRSMH_I_2011_043_001_01_sonogram_loudness.png

As in the previous recording, the first 14 seconds are speech, and the last 10 seconds applause, which are not picked up by the aubio plugin. Unlike the previous recording, this performance does feature one dynamic contrast, at 400-402 s. At times 194 s, 352 s, and 463 s the treble drops out leaving the bass line. It’s a nice timbral contrast rather than a dynamic one (or maybe the two should be considered related?). Anyhow, looking at the loudness data, we clearly cannot pick out which of those dips are related to real changes in dynamics.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


In this performance we find three contrasts in dynamics: 60-69 s, 192-200 s, and 577-584 s. All three of these are visible in the loudness, but perhaps only because we know what we are looking for.

Here are the loudness features for Phase II Pan Groove playing “More Love” at the 2013 competition:


This performance features several huge crescendi: 214-221 s, 373-378 s, 380-386 s, 441-449 s, 449-456 s, and 501-507 s. The first one is visible in the loudness feature only since we know it’s supposed to be there, but the other five are clearly visible. Here we see that the usefulness of this feature for automatically detecting changes in dynamics like crescendi depends on the crescendi being sufficiently long in duration, and on there being no interference from the audience… but what fun is that?
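For illustration, here is the kind of naive detector one might try on such a loudness curve: flag any stretch where the loudness climbs by some margin across a fixed window. It is only a sketch, not what was used for the figures; the function name, the 5 s window and the 8 dB threshold are hypothetical and untuned.

```python
def find_rises(loudness, hop_s=0.1, window_s=5.0, min_rise_db=8.0):
    """Return (start_s, end_s) spans where loudness climbs by at least
    `min_rise_db` between the start and the end of a sliding window.

    `loudness` holds one value per hop. All thresholds are assumptions.
    """
    w = round(window_s / hop_s)  # window length in hops
    spans = []
    run_start = None
    for i in range(len(loudness) - w):
        rising = loudness[i + w] - loudness[i] >= min_rise_db
        if rising and run_start is None:
            run_start = i
        elif not rising and run_start is not None:
            # the last rising window started at i - 1 and ended at i - 1 + w
            spans.append((run_start * hop_s, (i - 1 + w) * hop_s))
            run_start = None
    if run_start is not None:
        spans.append((run_start * hop_s, (len(loudness) - 1) * hop_s))
    return spans


# Synthetic curve: 5 s quiet, a 6 s ramp from -30 to -18, then 5 s loud
curve = [-30.0] * 50 + [-30.0 + 0.2 * k for k in range(60)] + [-18.0] * 50
print(find_rises(curve))
```

A detector like this only compares the two ends of the window, so a loud whoop that happens to land at the end of a window gets flagged just like a crescendo, and a crescendo much shorter than the window may be missed if the level falls back quickly. Both failure modes match what we see in the recordings.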

From all of these observations then, two things are clear:

  1. We need a reliable way to demarcate the announcement and applause from the music performance so we do not analyse features from the wrong content.
  2. The ITU-R BS.1770-4 loudness standard is not a reliable feature for automatically detecting the kinds of contrasts in dynamics that we are interested in.

Looking at some Panorama data: tempo

Last time, we looked at some of the recordings in our dataset and identified several peculiarities: spoken announcements and introductions at the beginning of recordings, sometimes lasting tens of seconds; crowd noises throughout, and sometimes much more perceivable than the music; differences in pitch shifting between recordings; recording effects like warble. Furthermore, there are not very many well-defined markers of tempo save the countoff at the beginning of a tune. I find it very hard to tap to the beat when I select random starting positions in a recording. How will our feature extraction algorithms handle this with our recording collection?

Let’s look at tempo at this time. We use two tempo description algorithms. One is the QMUL Tempo and Beat Tracker Vamp plugin, which gives a tempo estimate whenever a change is sensed. The other is madmom tempo, which gives a tempo estimate for the entire piece.

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10 minute recording is analysed. The black line is the tempo estimates from the QMUL Tempo and Beat Tracker Vamp plugin, and the blue line is from madmom tempo.


Using Tempo Tap, I estimate a tempo of about 140 beats per minute (bpm). madmom says it’s 139 bpm. Let’s go with madmom. For this recording, the first 40 seconds is an announcement. From then to just about the end is the music. Still, the QMUL tempo tracker is making an octave error for the majority of the recording.
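An octave error here means the tracker reports about half or double the reference tempo. A minimal sketch of how one could label an estimate against a tapped reference follows; the function name and the 4% tolerance are assumptions for illustration.

```python
def tempo_relation(estimate_bpm, reference_bpm, tolerance=0.04):
    """Classify a tempo estimate against a reference tempo.

    `tolerance` is a hypothetical relative margin around each target ratio.
    """
    ratio = estimate_bpm / reference_bpm
    targets = (
        (1.0, "agree"),
        (0.5, "half-tempo octave error"),
        (2.0, "double-tempo octave error"),
    )
    for factor, label in targets:
        if abs(ratio - factor) <= tolerance * factor:
            return label
    return "other disagreement"


print(tempo_relation(139, 140))  # madmom against my tapped tempo
print(tempo_relation(70, 140))   # what a half-tempo estimate would look like
```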

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).


Again we see octave errors in the QMUL tracker. madmom estimates a tempo of 136 bpm. I estimate it to be around 135 bpm.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


Here I estimate 131 bpm but madmom says 136.

Here’s Phase II Pan Groove playing “More Love” at the 2013 competition:

CNRSMH_E_2016_004_194_003_06_sonogram_tempo.png

I estimate 121 bpm. madmom says 122 bpm.

The oldest recording in our collection is from the first Panorama in 1963. It features the Pan Am North Stars playing an arrangement of “Dan Is The Man”:

Our recording is old enough that it can be auditioned at CREM. Here’s the tempo:

Nice and calm for both of them at 113 bpm. Which is what I count.

From what we have seen, it seems the madmom tempo is actually a reliable estimate of the tempo. Let’s look at the entire collection of tempo estimates:


Nearly all of the tempo estimates of our 93 recordings are between 115 and 140 bpm, but there are some that are suspiciously slow or fast. The slowest is the recording of the 1982 performance of Amoco Renegades playing “Pan Explosion”:

According to my tapulations, this is more like 137 bpm (our recording has a slightly slower speed and lower pitch than the video above).

The fastest tempo estimate of madmom is of the 1985 recording of the Trinidad All Stars playing “Soucouyant.” Here is a video where they start playing at a tempo of around 140 bpm but end at a tempo of around 145 bpm.

The performance in our recording is faster! I tapstimate it starts around 147 bpm and ends around 154 bpm. So, it seems madmom is not entirely incorrect with our recording, but our recording may not accurately reflect the performance.

For the other seven supposedly slow performances I find four tempo estimates that are clearly wrong:

CNRSMH_I_2011_042_001_02 madmom: 100 bpm, me: 136 bpm
CNRSMH_I_2011_045_001_02 madmom: 102 bpm, me: 138 bpm
CNRSMH_I_2011_041_001_02 madmom: 102 bpm, me: 137 bpm
CNRSMH_I_2011_042_001_03 madmom: 105 bpm, me: 143 bpm
CNRSMH_E_2016_004_193_001_03 madmom: 105 bpm, me: 106 bpm
CNRSMH_E_2016_004_193_001_01 madmom: 114 bpm, me: 114 bpm
CNRSMH_E_2016_004_194_001_06 madmom: 111 bpm, me: 110 bpm
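The comparison in the list above can be reproduced with a few lines. The numbers are copied from the list; the 5% tolerance and the function name are assumptions.

```python
# (recording id, madmom bpm, tapped bpm), copied from the list above
estimates = [
    ("CNRSMH_I_2011_042_001_02", 100, 136),
    ("CNRSMH_I_2011_045_001_02", 102, 138),
    ("CNRSMH_I_2011_041_001_02", 102, 137),
    ("CNRSMH_I_2011_042_001_03", 105, 143),
    ("CNRSMH_E_2016_004_193_001_03", 105, 106),
    ("CNRSMH_E_2016_004_193_001_01", 114, 114),
    ("CNRSMH_E_2016_004_194_001_06", 111, 110),
]


def flag_disagreements(rows, tolerance=0.05):
    """Return (id, madmom/tapped ratio) for rows that differ by more
    than `tolerance` (a hypothetical relative margin)."""
    flagged = []
    for rec_id, madmom_bpm, tapped_bpm in rows:
        ratio = madmom_bpm / tapped_bpm
        if abs(ratio - 1.0) > tolerance:
            flagged.append((rec_id, round(ratio, 3)))
    return flagged


for rec_id, ratio in flag_disagreements(estimates):
    print(rec_id, ratio)
```

With these numbers the four flagged rows all have a ratio between about 0.73 and 0.75, so in each case madmom reports roughly three quarters of the tapped tempo rather than half of it.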

For the other two supposedly fast performances I find the tempo estimates are ok:

CNRSMH_E_2016_004_193_002_05 madmom: 146 bpm, me: 148 bpm
CNRSMH_E_2016_004_193_002_02 madmom: 143 bpm, me: 143 bpm

What about all the ones in the middle range? Should we verify all of them? Even so, what conclusions can we make about the tempo conventions considering that our recordings may not accurately reflect the practice?

Looking at some Panorama data

Every year in Trinidad and Tobago since 1963 (save one), the Panorama competition brings together steel bands in the country to compete for the title of being the Champions that year. As part of the DaCaRyH project, we assembled a collection of 93 recordings featuring the top one, two or three ranked Panorama performances since 1963. We are looking at this smallish corpus, which has a duration of about 14 hours, through the lens of automated feature extraction, followed by human verification. There are several things about this collection of which we must be aware.

Here’s the 1988 performance of Phase II playing “Woman is Boss”.

The video above starts around 62 seconds into our recording. The figure below shows the first 20 seconds of the waveform (mean across stereo channels) and sonogram of our recording (scaled to -80 to 0 dB). The first 12 seconds feature the announcer talking about the group. The countdown of the tune starts around 12.5 seconds. We see the waveform has a significant DC bias. We also see that the recording is bandlimited to 0–9 kHz. And there’s a strange varying notch around 1.8 kHz. Another thing we find is that our recording is slightly higher pitched than the YouTube video, by around 20 cents.
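The waveform shown is the mean across the stereo channels, and the DC bias is simply a non-zero average of that waveform. A minimal sketch of both steps, with hypothetical function names and plain mean subtraction standing in for whatever DC removal a real pipeline would use:

```python
def to_mono(left, right):
    """Mean across the two stereo channels, sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def dc_offset(samples):
    """Average sample value; close to zero for an unbiased waveform."""
    return sum(samples) / len(samples) if samples else 0.0


def remove_dc(samples):
    """Subtract the average so the waveform is centred on zero.
    This is the simplest possible approach, assumed for illustration."""
    offset = dc_offset(samples)
    return [x - offset for x in samples]


mono = to_mono([0.2, 0.4, 0.2], [0.0, 0.2, 0.0])
print(dc_offset(mono), dc_offset(remove_dc(mono)))
```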


So, our feature extraction pipeline should consider that the beginning of a recording could have narration. Since we are looking at recordings made over 50 years, we have to consider differences in recording technology and their impacts on the feature extraction. There’s also the problem of which recording version to trust. If we are going to look at tuning of the pans, we need a trustworthy reference. A difference of 20 cents is quite large, and casts doubt on the idea that we can extract tuning conventions from these recordings.
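To put 20 cents in perspective, the conversion between cents and a frequency ratio is a one-liner each way. The function names below are my own, and the tempo figure in the comment assumes the pitch difference comes entirely from a difference in playback speed, which we have not verified.

```python
import math


def cents_between(f_ref, f_other):
    """Interval from f_ref to f_other in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_other / f_ref)


def ratio_from_cents(cents):
    """Frequency ratio corresponding to an interval in cents."""
    return 2.0 ** (cents / 1200.0)


ratio = ratio_from_cents(20)
print(round(ratio, 4))        # about 1.0116, i.e. a bit over 1% sharp
# ASSUMPTION: if the shift is purely a speed difference, the tempo
# scales by the same ratio, so 140 bpm would read as about 141.6 bpm.
print(round(140 * ratio, 1))
```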

In 1980, the Trinidad All Stars won with their performance of “Woman on the Bass”. The video above shows the winning performance. Below is a portion of the waveform and sonogram (scaled to -60 to 0 dB) of our recording of it. The YouTube recording starts around 41 seconds into our recording (the countdown can be seen at the left of the sonogram). The first 41 seconds of our recording features an announcer introducing the band.


There doesn’t seem to be any major tuning discrepancy between these two recordings, but it is clear they were made in different locations at the competition. On the sonogram you can see at 60 seconds a rising chromatic pattern (around 15 seconds into the YouTube video). That dark frequency component that follows at around 900 Hz is someone “whoooooing” in the crowd close to the microphone of our recording. In fact, the sound of the crowd is much more present in this recording than the music of the band. I don’t hear any whooooooing in the YouTube video.

So, our analysis of the extracted features should take care in discriminating the effects of the crowd and the band. The sounds of the crowd are an important part of this live music experience, but they will have an impact on extracted music features.

In 1994, the band Desperadoes won with the performance of “Fire Coming Down”, which you can see above. Below is the first 30 seconds of our recording.


We seem to have a warble in the sound, which also exists in the video recording above. Furthermore, the recordings of the second and third place performances for that year feature the same warble. So it appears the same problems can occur over all recordings of a competition. This means a chronological analysis of this dataset will have to take care in separating the effects of the year’s recording setup from the year’s representation of the music practice.

What does the best of the 93 recordings look like? Here is the 2013 Panorama Champions Phase II Pan Groove performing “More Love” at the 2013 competition:

Here is a sonogram of the conclusion of our recording of the performance.


There’s a lovely moment of contrast in the dynamics around 502 seconds, crescendoing into a percussive conclusion at around 512 seconds. The crowd screams and whistles after that point. This recording sounds professionally made, but even so this kind of music is extremely noisy, and naturally “tinny.” It will be a challenge to make sense of the output of feature extraction routines that were tested on clean studio recordings of a few well-balanced instruments.

Next, we will take a look at some of the features we extract from the signals above.




The bicycle horn is no longer available

When we were about to move from Copenhagen to London, we downsized our things. One thing I wanted to give away was a bicycle horn, so I made a video advertisement:

I left the video up even after the horn was taken because I like it. Since October 2014, the video has been viewed 10,678 times. For some reason, 82% of the total viewing time of this 73-second video comes by way of YouTube’s “Suggested videos”.

Screen Shot 2018-01-30 at 13.58.14.png

How is this video being suggested? And why is the video so popular in India?

Screen Shot 2018-01-30 at 14.28.57.png

And who are the 26 people who gave the video a thumbs down?