# Looking at some Panorama data: tempo

Last time, we looked at some of the recordings in our dataset and identified several peculiarities: spoken announcements and introductions at the beginnings of recordings, sometimes lasting tens of seconds; crowd noise throughout, sometimes more audible than the music; differences in pitch shifting between recordings; and recording effects like warble. Furthermore, there are not many well-defined markers of tempo save the countoff at the beginning of a tune. I find it very hard to tap along to the beat when I select random starting positions in a recording. How will our feature extraction algorithms handle our recording collection?

This time, let’s look at tempo. We use two tempo estimation algorithms. One is the QMUL Tempo and Beat Tracker Vamp plugin, which gives a new tempo estimate whenever it detects a change. The other is madmom tempo, which gives a single tempo estimate for the entire piece.
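
For reference, here is a minimal sketch of how one might obtain a global tempo estimate with madmom (the file name is hypothetical): madmom first computes a beat activation function with a recurrent neural network, then ranks candidate tempi over the whole piece.

```python
from madmom.features.beats import RNNBeatProcessor
from madmom.features.tempo import TempoEstimationProcessor

# Beat activation function (100 frames per second) from madmom's RNN
act = RNNBeatProcessor()('recording.wav')  # hypothetical file name

# Rank candidate tempi by strength over the whole piece
tempo_proc = TempoEstimationProcessor(fps=100)
tempi = tempo_proc(act)   # rows of (tempo in bpm, strength)
print(tempi[0][0])        # the strongest candidate: our global estimate
```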

We start with our recording of the 1988 performance of Phase II playing “Woman is Boss”. The whole ten-minute recording is analysed. The black line shows the tempo estimates from the QMUL Tempo and Beat Tracker Vamp plugin, and the blue line shows the estimate from madmom tempo.

Using Tempo Tap, I estimate a tempo of about 140 beats per minute (bpm). madmom says it’s 139 bpm. Let’s go with madmom. For this recording, the first 40 seconds is an announcement; from then until just about the end is music. Still, the QMUL tempo tracker makes an octave error for the majority of the recording.
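
An octave error means the tracker reports double or half the perceived tempo. A common post-hoc heuristic (not something either tracker does here; just a sketch) is to fold estimates into an expected range:

```python
def fold_to_range(bpm, lo=90.0, hi=180.0):
    """Fold a tempo estimate into [lo, hi) by octave doubling/halving."""
    while bpm < lo:
        bpm *= 2.0
    while bpm >= hi:
        bpm /= 2.0
    return bpm

print(fold_to_range(69.5))   # 139.0: a half-tempo octave error, corrected
```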

Now let’s look at our recording of the 1980 performance of the Trinidad All Stars (playing “Woman on the Bass”).

Again we see octave errors in the QMUL tracker. madmom estimates a tempo of 136 bpm. I estimate it to be around 135 bpm.

Now let’s look at our recording of the 1994 performance of Desperadoes playing “Fire Coming Down”.

Here I estimate 131 bpm, but madmom says 136 bpm.

Here’s Phase II Pan Groove playing “More Love” at the 2013 competition:

I estimate 121 bpm. madmom says 122 bpm.

The oldest recording in our collection is from the first Panorama in 1963. It features the Pan Am North Stars playing an arrangement of “Dan Is The Man”:

Our recording is old enough that it can be auditioned at CREM. Here’s the tempo:

Nice and calm for both of them at 113 bpm, which is what I count.

From what we have seen, it seems madmom gives a reliable estimate of the tempo. Let’s look at the tempo estimates over the entire collection:
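
(Here’s a minimal sketch of how such a batch of estimates might be computed and plotted, assuming the 93 recordings sit as WAV files in a local directory; the paths are hypothetical.)

```python
import glob
import matplotlib.pyplot as plt
from madmom.features.beats import RNNBeatProcessor
from madmom.features.tempo import TempoEstimationProcessor

beat_proc = RNNBeatProcessor()
tempo_proc = TempoEstimationProcessor(fps=100)

estimates = []
for path in sorted(glob.glob('recordings/*.wav')):  # hypothetical directory
    act = beat_proc(path)                   # beat activation function
    estimates.append(tempo_proc(act)[0][0]) # strongest tempo candidate (bpm)

plt.hist(estimates, bins=20)
plt.xlabel('tempo (bpm)')
plt.ylabel('number of recordings')
plt.show()
```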

Nearly all of the tempo estimates of our 93 recordings are between 115 and 140 bpm, but some are suspiciously slow or fast. The slowest is the recording of the 1982 performance of Amoco Renegades playing “Pan Explosion”:

According to my tapulations, this is more like 137 bpm (our recording has a slightly slower speed and lower pitch than the video above).

The fastest madmom tempo estimate is of the 1985 recording of the Trinidad All Stars playing “Soucouyant.” Here’s a video where they start playing at a tempo of around 140 bpm and end around 145 bpm.

The performance in our recording is faster! I tapstimate that it starts around 147 bpm and ends around 154 bpm. So madmom is not entirely incorrect about our recording, but our recording may not accurately reflect the performance.

For the other seven supposedly slow performances, I find four tempo estimates that are clearly wrong:

| recording | madmom (bpm) | me (bpm) |
|---|---|---|
| CNRSMH_I_2011_042_001_02 | 100 | 136 |
| CNRSMH_I_2011_045_001_02 | 102 | 138 |
| CNRSMH_I_2011_041_001_02 | 102 | 137 |
| CNRSMH_I_2011_042_001_03 | 105 | 143 |
| CNRSMH_E_2016_004_193_001_03 | 105 | 106 |
| CNRSMH_E_2016_004_193_001_01 | 114 | 114 |
| CNRSMH_E_2016_004_194_001_06 | 111 | 110 |

For the other two supposedly fast performances, I find the tempo estimates are OK:

| recording | madmom (bpm) | me (bpm) |
|---|---|---|
| CNRSMH_E_2016_004_193_002_05 | 146 | 148 |
| CNRSMH_E_2016_004_193_002_02 | 143 | 143 |

What about all the recordings in the middle range? Should we verify all of them? Even so, what conclusions can we draw about tempo conventions, considering that our recordings may not accurately reflect the practice?

# Looking at some Panorama data

Every year since 1963 (save one), the Panorama competition in Trinidad and Tobago brings together steel bands from across the country to compete for the title of Panorama Champions. As part of the DaCaRyH project, we assembled a collection of 93 recordings featuring the top one, two or three ranked Panorama performances since 1963. We are looking at this smallish corpus, about 14 hours in duration, through the lens of automated feature extraction, followed by human verification. There are several things about this collection of which we must be aware.

Here’s the 1988 performance of Phase II playing “Woman is Boss”.

The video above starts around 62 seconds into our recording. The figure below shows the first 20 seconds of the waveform (mean across stereo channels) and sonogram of our recording (scaled to -80 to 0 dB). The first 12 seconds feature the announcer talking about the group. The countdown of the tune starts around 12.5 seconds. We see the waveform has a significant DC bias. We also see that the recording is bandlimited to 0–9 kHz. And there’s a strange varying notch around 1.8 kHz. Another thing we find is that our recording is slightly higher pitched than the YouTube video, by around 20 cents.
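
(For the curious, here’s a rough sketch of how such a figure can be produced with librosa; the file name is hypothetical, and the analysis parameters are one plausible choice, not necessarily those used here.)

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load without resampling, keeping both channels; hypothetical path
y, sr = librosa.load('our_recording.wav', sr=None, mono=False)
y = y.mean(axis=0)            # mean across the stereo channels
y = y[: 20 * sr]              # first 20 seconds

print('DC bias:', y.mean())   # a value far from zero indicates DC bias

# Sonogram: STFT magnitude in dB, spanning an 80 dB range below the peak
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max, top_db=80)
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.show()
```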

So, our feature extraction pipeline should consider that the beginning of a recording could have narration. Since we are looking at recordings made over 50 years, we have to consider differences in recording technology and their impacts on the feature extraction. There’s also the problem of which recording version to trust. If we are going to look at tuning of the pans, we need a trustworthy reference. A difference of 20 cents is quite large, and casts doubt on the idea that we can extract tuning conventions from these recordings.
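
(A quick aside on magnitudes: if a pitch discrepancy of c cents comes purely from playback speed, the implied speed factor is 2^(c/1200), and the same factor scales any tempo measured from the recording. A sketch with purely illustrative numbers:)

```python
import math

# Speed factor implied by a 20-cent pitch shift due to playback speed alone
r = 2 ** (20 / 1200)
print(round(r, 4))          # 1.0116, i.e. about 1.2% too fast

# The same factor would inflate a measured tempo, e.g. a true 139 bpm:
print(round(139 * r, 1))    # about 140.6 bpm
```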

In 1980, the Trinidad All Stars won with their performance of “Woman on the Bass”. The video above shows the winning performance. Below is a portion of the waveform and sonogram (scaled to -60 to 0 dB) of our recording of it. The YouTube recording starts around 41 seconds into our recording (the countdown can be seen at the left of the sonogram). The first 41 seconds of our recording features an announcer introducing the band.

There doesn’t seem to be any major tuning discrepancy between these two recordings, but it is clear they were made at different locations at the competition. On the sonogram, at 60 seconds, you can see a rising chromatic pattern (around 15 seconds into the YouTube video). The dark frequency component that follows at around 900 Hz is someone “whoooooing” in the crowd close to the microphone of our recording. In fact, the sound of the crowd is much more present in this recording than the music of the band. I don’t hear any whooooooing in the YouTube video.

So, our analysis of the extracted features should take care in discriminating the effects of the crowd and the band. The sounds of the crowd are an important part of this live music experience, but they will have an impact on extracted music features.

In 1994, the band Desperadoes won with their performance of “Fire Coming Down”, which you can see above. Below is the first 30 seconds of our recording.

We seem to have a warble in the sound, which also exists in the video recording above. Furthermore, the recordings of the second- and third-place performances from that year feature the same warble. So it appears the same problems can occur across all recordings of a competition. This means a chronological analysis of this dataset will have to take care in separating the effects of a year’s recording setup from that year’s representation of the music practice.

What does the best of the 93 recordings look like? Here are the 2013 Panorama Champions, Phase II Pan Groove, performing “More Love” at that year’s competition:

Here is a sonogram of the conclusion of our recording of the performance.

There’s a lovely moment of contrast in the dynamics around 502 seconds, crescendoing into a percussive conclusion at around 512 seconds. The crowd screams and whistles after that point. This recording sounds professionally made, but even so this kind of music is extremely noisy, and naturally “tinny.” It will be a challenge to make sense of feature extraction routines tested on clean studio recordings of a few well-balanced instruments.

Next, we will take a look at some of the features we extract from the signals above.

# The bicycle horn is no longer available

When we were about to move from Copenhagen to London, we downsized our things. One thing I wanted to give away was a bicycle horn, so I made a video advertisement:

I left the video up even after the horn was taken because I like it. Since October 2014, the video has been viewed 10,678 times. For some reason, 82% of the total viewing time of this 73-second video comes by way of YouTube’s “Suggested videos”.

How is this video being suggested? And why is the video so popular in India?

# Making sense of the folk-rnn v2 model, part 8

This is part 8 of what is turning out to be a raft ride on a stream-of-consciousness analysis of the folk-rnn v2 model. After parts 1, 2, 3, 4, 5, 6 and 7 of this series, I don’t feel much closer to answering how it’s working its magic. It’s almost as if this little folk-rnn v2 model doesn’t want me to understand it. Feeling a bit lost, not knowing where to go next, I asked folk-rnn v2 to generate me a tune:

What a quaint and catchy tune! It sounds quite home-y and sentimental to me. It’s exactly the tune I want to play when I’m deep in the muck of inconclusive numerical analyses of computer systems far from a home of understanding. It makes me love you even more, folk-rnn.

OK. Let’s summarise the most interesting things so far.

1. From part 1, we see how each element of the model’s vocabulary relates to the dimensions of the model input and output.
2. From part 1, we see that the model isn’t overfit, but its training could have been halted after about 50 epochs.
3. From part 2, we see that the parameters in the softmax layer — ${\bf V}$ and ${\bf v}$ — are relatively easy to understand with respect to the vocabulary. This also means we can interpret the contributions of each dimension of the third LSTM layer, ${\bf h}_t^{(3)}$.
4. From part 2, we see that the columns of the (full rank) matrix ${\bf V}$ in the softmax layer are by and large highly correlated. (The magnitude of their correlation with the bias vector is no more than about 0.8, with a mean of 0.4.) Some columns seem to be very similar to others save a change in polarity. This suggests a lot of redundancy in the encoding of subspaces of $\mathbb{R}^{136}$ in $[-1,1]^{512}$.
5. From part 2, we derive the numerical relationship between a change in a dimension of the input to the softmax, and the change in the probability mass of the associated token. This will be useful for interacting with the model to change its generation disposition, e.g., we can push it to favor or avoid particular pitches, modes, durations, etc.
6. From part 3, we see through the singular value decomposition of ${\bf V}$ that one singular value is an order of magnitude larger than all others. This means the last LSTM layer can cause a very large change in the posterior distribution in one specific direction. This left singular vector seems to significantly increase the probability mass of two tokens in particular: “|:” and “</s>”. (A sketch of this inspection appears after this list.)
7. From part 3, we see that other left singular vectors have similar interpretations. The second one is spikey on meter and mode tokens. The third one is spikey on a broken rhythm token. (We don’t know whether the model is exploiting these directions, so we should look at projections of ${\bf h}_t^{(3)}$ onto these directions over the course of a transcription generation.)
8. From part 4, we see the parameters of the gates in the first LSTM layer have very different appearances. Since they are close to the input layer, we can interpret them in terms of the vocabulary. For instance, the norms of the columns of the forget gate matrix ${\bf W}_{xf}^{(1)}$ corresponding to meter and mode tokens are very small. (But does this mean that the forget gate is paying more attention to other kinds of tokens?)
9. From part 4, we see the Gramian of the cell gate matrix, ${\bf W}_{xc}^{(1)}$, shows some possible encoding of enharmonic relationships. We see a similar relationship in the Gramian of the output gate matrix, ${\bf W}_{xo}^{(1)}$.
10. From part 4, we see for the first layer LSTM the inner products between the columns of the matrices multiplying the hidden state vector — ${\bf W}_{hi}^{(1)}, {\bf W}_{hf}^{(1)}, {\bf W}_{hc}^{(1)}, {\bf W}_{ho}^{(1)}$ — and the matrices multiplying the input vector — ${\bf W}_{xi}^{(1)}, {\bf W}_{xf}^{(1)}, {\bf W}_{xc}^{(1)}, {\bf W}_{xo}^{(1)}$ — are quite small. This suggests there are two very distinct subspaces of $\mathbb{R}^{512}$ where the network is storing complementary information. (What is that information, and how does it relate to the music transcription generation, and the vocabulary? Do the matrices of the other two LSTM layers have the same property?)
11. From part 5, we see curious things in the model’s posterior distribution over the generation of a tune: regular spiking of probability mass of measure tokens; a decay of probability mass in broken rhythm tokens; and “echoes” of probability mass in the pitch tokens.
12. From part 5 and part 6, we see that the third layer LSTM hidden state activations are very sparse, and show a high degree of periodicity. We see in particular a periodicity with harmonics of about 9 steps. This seems to have to do with the measure tokens.
13. From part 6, we see that when we cut the connections of the nearly 20% of third layer LSTM units having 5% of their power around a period of 9 steps, the model can still generate measure tokens where it should. When we increase the number of cut connections to 27%, we see the model is losing its “counting” ability. When we increase to 33%, the model can no longer “count”. If we snip 33% of the 512 final layer connections at random, we see that it can still “count”. (These snips essentially restrict the subspaces in $\mathbb{R}^{136}$ that the model can reach, but how can we interpret these in terms of the vocabulary?)
14. From part 6, we see that when we cut the connections of the third layer LSTM units, the model’s melodic “composition” ability disappears quite quickly. This suggests that the model’s ability to “count” and to “compose” a melody have different sensitivities. Its “counting” ability seems more robust than its melodic “sensibility”. (Does this mean melodic information is encoded in a larger subspace of $\mathbb{R}^{512}$ than the timing information?)
15. From part 7, we see that hidden state activations are very sparse at each LSTM layer, but they become more saturated with depth. We also see that hidden state activations at the first LSTM layer do not have as strong a periodic appearance as those of the following two LSTM layers.
16. From part 7, we see that when we snip 33% of the connections in the first LSTM layer having power in a period of around 9 steps, the model can still generate measure tokens where needed. Its melodic ability does not seem to remain, however.
17. From part 7, we see, across all hidden state activations of the three LSTM layers, sequences with strong periods at DC, 2.25, 3, 4.5, 6, 8.8, 11.5, 18, and 36 steps. We also see evidence that the periods we observe in the first LSTM layer are due to the input of particular tokens, whereas the periods we observe in the last LSTM layer cause the output of particular tokens.

Now, looking forward, what do we have yet to do (in addition to all those things in parens above)? Here are a few things:

1. How can we interpret the parameters of the middle LSTM layer? If we can relate the subspaces of $[-1,1]^{512}$ of the first LSTM layer to the vocabulary, then we will be able to interpret the transformation performed by the second LSTM layer. Similarly, if we can relate the subspaces of $[-1,1]^{512}$ of the third LSTM layer to the vocabulary, then we will also be able to interpret the transformation performed by the second LSTM layer.
2. What exactly are the gates responsible for in each layer? We have looked at hidden state activations of each layer over the generation of transcriptions, but what about those in the cell gates? Those seem to be unbounded.
3. Has dropout in the training of this model obscured the functions of its elements? If we train the same but without dropout, how does the interpretation change? Does the analysis of the model become more immediate? What if we used instead gated recurrent networks?
4. Is there a way to uncover function in a more automated way, rather than having to look for activations that appear to correspond to particular behaviors? (Like the unit that saturates between quotes in a text LSTM.)
5. What will happen if we suppress the output/input of measure tokens for a period of time? Or the end-of-transcription token? Will the model only be able to “count” for 16 measures or so?

Regardless, folk-rnn brings families together. Here’s my wife and me playing “Why are you and your 6 million parameters so hard to understand?”