Looking at some Panorama data: loudness and contrast in dynamics

Background info here.

Last time, we looked at madmom tempo estimates of the field recordings in our collection and found impressive agreement with the “ground truth”.

Now, let’s look at dynamics. An important aspect of these performances, at least in recent years, is contrast in dynamics, e.g., the band suddenly playing quietly and then building through a crescendo. That about a hundred people can be so coordinated is impressive, and it’s clear from listening to many of these recorded performances that such a device excites the audience.

Can we find these moments automatically from loudness features extracted from the recordings? We extract quantitative loudness features using the ITU-R BS.1770-4 standard. This involves computing the mean square of samples from filtered input blocks of 400 ms duration (hopped by 100 ms). The input samples are filtered in two stages: the first takes into consideration the effects of the human head (boosting frequencies above 1.1 kHz by more than 3 dB), and the second is a high-pass filter, which takes something else into account (exactly what is not clear from the specs). Anyhow, it seems there is good agreement between this feature and perceived loudness (of broadcast material). Let’s look at these loudness features for some of our recordings.
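To make the block computation concrete, here is a minimal sketch in plain Python of the per-block loudness described above. It is not the code used to produce the figures in this post. It assumes a mono signal sampled at 48 kHz (the rate for which the standard tabulates its filter coefficients), and the function names `biquad` and `block_loudness` are made up for illustration. It also skips the gating step that the standard uses when computing a single integrated loudness value.

```python
import math

# Two-stage pre-filter coefficients for 48 kHz input, as tabulated in ITU-R BS.1770.
# Assumption: the audio has been resampled to 48 kHz; other rates need other coefficients.
SHELF_B = (1.53512485958697, -2.69169618940638, 1.19839281085285)
SHELF_A = (1.0, -1.69065929318241, 0.73248077421585)
HIGHPASS_B = (1.0, -2.0, 1.0)
HIGHPASS_A = (1.0, -1.99004745483398, 0.99007225036621)


def biquad(x, b, a):
    """Apply a second-order IIR filter (direct form I) to a list of samples."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def block_loudness(samples, fs=48000, block_s=0.4, hop_s=0.1):
    """Return a list of (block start time in s, loudness in LKFS) for a mono signal."""
    # Stage 1: high-frequency shelf (head effects). Stage 2: high-pass filter.
    weighted = biquad(biquad(samples, SHELF_B, SHELF_A), HIGHPASS_B, HIGHPASS_A)
    block = int(round(block_s * fs))
    hop = int(round(hop_s * fs))
    out = []
    for start in range(0, len(weighted) - block + 1, hop):
        segment = weighted[start:start + block]
        mean_square = sum(v * v for v in segment) / block
        if mean_square > 0:
            loudness = -0.691 + 10.0 * math.log10(mean_square)
        else:
            loudness = float("-inf")  # digital silence
        out.append((start / fs, loudness))
    return out
```

A quick sanity check is a full-scale sine near 1 kHz, which the standard says should read about -3.01 LKFS; the first block or so will deviate slightly while the filters settle.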

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole 10-minute recording is analysed below.

[Figure: CNRSMH_I_2011_023_001_01_sonogram_loudness.png]

Below the spectrogram we see the sound waveform (grey), and on top in black is the loudness feature. To the left and right are red dashed lines, which denote the “silent” regions of the waveform computed by the aubio silence detector vamp plugin. It is no surprise that the envelope of the audio waveform resembles the loudness feature. Some interesting things we see here are: 1) the aubio silence detector is not a good music/speech discriminator, because the first 40 seconds of this recording are the announcer introducing the band; and 2) we see peaks in the loudness at around 60 seconds (s), 190 s, 340 s, and after 500 s. Those correspond to concentrations of energy around 600 Hz, which come from the audience cheering close to the recording equipment. These changes in loudness are not changes in the music dynamics. (Or are the audience screams a part of the music?) In this particular performance, I don’t hear much change in the dynamics of the performance.

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).

[Figure: CNRSMH_I_2011_043_001_01_sonogram_loudness.png]

As in the previous recording, the first 14 seconds are speech, and the last 10 seconds are applause, neither of which is picked up by the aubio plugin. Unlike the previous recording, this performance does feature one dynamic contrast, at 400-402 s. At times 194 s, 352 s, and 463 s the treble drops out, leaving the bass line. It’s a nice timbral contrast rather than a dynamic one (or maybe the two should be considered related?). Anyhow, looking at the loudness data, we clearly cannot pick out which of the dips are related to real changes in dynamics.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.


In this performance we find three contrasts in dynamics: 60-69 s, 192-200 s, and 577-584 s. All three of these are visible in the loudness, but perhaps only because we know what we are looking for.

Here are the loudness features for Phase II Pan Groove playing “More Love” at the 2013 competition:


This performance features several huge crescendi: 214-221 s, 373-378 s, 380-386 s, 441-449 s, 449-456 s, and 501-507 s. The first one is visible in the loudness feature only because we know it’s supposed to be there, but the other five are clearly visible. Here we see that the usefulness of this feature for automatically detecting changes in dynamics like crescendi depends on the crescendi being sufficiently long in duration, and on there being no interference from the audience… but what fun is that?
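To illustrate why duration matters, here is a rough sketch of one way a crescendo detector could be built on top of a loudness track: fit a straight line to the loudness in a sliding window and keep the windows where the fitted slope is steep enough. This is not something done for this post. The function names `window_slope` and `find_crescendi`, the 5 s window, and the 1 dB/s threshold are all hypothetical choices, and the sketch assumes uniformly spaced, finite loudness values (no silent blocks).

```python
def window_slope(t, y):
    """Least-squares slope of y against t (dB per second for a loudness track)."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    num = sum((a - mean_t) * (b - mean_y) for a, b in zip(t, y))
    den = sum((a - mean_t) ** 2 for a in t)
    return num / den


def find_crescendi(times, loudness, window_s=5.0, min_slope=1.0):
    """Return merged (start, end) spans where the windowed slope is >= min_slope.

    Assumption: window_s and min_slope are illustrative guesses, not tuned values.
    """
    hop = times[1] - times[0]
    n = max(2, int(round(window_s / hop)))
    spans = []
    for i in range(len(times) - n + 1):
        if window_slope(times[i:i + n], loudness[i:i + n]) >= min_slope:
            start, end = times[i], times[i + n - 1]
            if spans and start <= spans[-1][1]:
                # Overlaps the previous span, so extend it instead of adding a new one.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans
```

A crescendo much shorter than the window barely moves the fitted slope, which matches what we see with the first crescendo above. A burst of cheering that builds over several seconds would be flagged just the same, so a sketch like this inherits the audience problem.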

From all of these observations then, two things are clear:

  1. We need a reliable way to demarcate the announcement and applause from the music performance so we do not analyse features from the wrong content.
  2. The ITU-R BS.1770-4 loudness feature is not, on its own, a reliable basis for automatically detecting the kinds of contrasts in dynamics that we are interested in.
