# Making sense of the folk-rnn v2 model, part 5

This holiday break gives me time to dig into the music transcription models that we trained and used in 2016-2017, and figure out what the ~6,000,000 parameters actually mean. In particular, we have been looking at the “version 2” model (see parts 1, 2, 3 and 4 of this series), which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including this transcription just now:

M:4/4
K:Cmaj
B, | G, C C 2 G, C E C | G, C E G F D B, D | G, C C 2 G, C E G | F E D E C E A, B, |
G, C C 2 G, C E C | G, C E F G 3 A | B c d B c B G F | D E F D C 3 :|
|: D | E C C 2 D B, G, A, | B, C D E F 2 D F | E C C 2 D E F D | G ^F G A B 2 A B |
G c c B c 2 B c | d e d c B d c B | G E F D G F D F | E C B, D C 3 :|


In common practice notation, this transcription becomes this:

The model has generated a transcription in which every bar is correctly counted once the two pickups are accounted for. This transcription features two repeated 8-bar parts (“tune” and “turn”). The transcription is missing a repeat sign at the beginning of the first part, but human transcribers would probably make that omission too. The last bars of each part resolve to the tonic (“C”). Each part features some repetition and variation. There is also some resemblance between the tune and the turn. The highest note appears in the turn, which is typically the case in Irish traditional music. And the second part features a “^F”, which functions as a strong pull to the dominant key (G), making a V-I cadence right in the middle of the turn.

Here’s a recording of me “interpreting” this transcription in G major in the Octagon at QMUL:

Now let’s see what the network is “thinking” as it generates this output.

First, is the network merely copying its training transcriptions? The first figure, “G, C C 2” appears in 217 of 23,636 transcriptions. Here’s a setting of the tune “The Belharbour”, in which that figure appears in the same position:

We also see it appearing in the same position in the transcriptions “The Big Reel Of Ballynacally”, “Bonnie Anne”, “Bowe’s Favourite”, “The Braes Of Busby”, “The Coalminer’s”, “The Copper Beech”, and others.

What about the first bar, “G, C C 2 G, C E C”? We find this as the first bar of 3 transcriptions: “Dry Bread And Pullit”, “The Premier”, “Swing Swang”. Here’s the transcription of “Swing Swang”:

Here’s the fantastic trad group Na Draiodoiri playing “Swing Swang” in 1979 as the second in a two-tune set:

https://open.spotify.com/embed/track/4IRTPLHH1t7I7hvD7AxL9J

That motivated me to title the folk-rnn v2 transcription, “Swing Swang Swung”.

Searching the training data for just the second bar, “G, C E G F D B, D”, we find it appears in only 3 transcriptions, but never as the second bar. We don’t find any occurrence of bars 4 or 6. So, it appears that the model is not reproducing a single tune, but to some extent is using small bits and pieces of a few other tunes.
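A search like this can be sketched as a token n-gram match. This is only a minimal illustration: the toy `corpus`, the whitespace tokenisation, and the function names are assumptions made here, not the actual training data or search tool.

```python
def find_figure(tokens, figure):
    """Return every start index at which `figure` occurs in `tokens`."""
    n = len(figure)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == figure]


def count_transcriptions_with(corpus, figure):
    """Count the transcriptions that contain `figure` at least once."""
    return sum(1 for tokens in corpus if find_figure(tokens, figure))


# Hypothetical three-tune corpus, one whitespace-separated token list per tune.
corpus = [
    "M:4/4 K:Cmaj G, C C 2 G, C E C | G, C E G F D B, D |".split(),
    "M:4/4 K:Cmaj E C C 2 D B, G, A, | G, C C 2 G, C E G |".split(),
    "M:6/8 K:Cmaj G, C E G F D | C 3 C 3 |".split(),
]

figure_count = count_transcriptions_with(corpus, "G, C C 2".split())
first_bar_count = count_transcriptions_with(corpus, "G, C C 2 G, C E C".split())
print(figure_count, first_bar_count)
```

Matching on whole tokens rather than on raw strings avoids false hits such as “C 2” matching inside “C 20”. Checking the position of a match (for example, whether it opens the tune) only needs the indices that `find_figure` returns.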

Let’s now look at the softmax layer output at each step in the generation. This shows the multinomial probability mass distribution over the tokens.
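As a reminder of what that distribution is, here is a minimal sketch of a softmax over a toy vocabulary followed by one multinomial draw. The six-token `vocab` and the `logits` values are made up for illustration; the real model has a much larger vocabulary and its logits come from the third LSTM layer.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()


rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and logits (not taken from the model).
vocab = ["|", "C", "E", "G", "2", "^F"]
logits = np.array([0.5, 2.0, 1.0, 1.5, 0.0, -3.0])

probs = softmax(logits)  # probability mass over the tokens
token = vocab[rng.choice(len(vocab), p=probs)]  # one sampled token
print(dict(zip(vocab, np.round(probs, 3))), token)
```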

Here I plot the log10 probability distribution over the tokens (x-axis) for each sequence step (y-axis). I’m only showing log10 probabilities greater than -10. We see a lot of probability mass in the pitch tokens subset, which is to be expected because the majority of tokens in the training transcriptions are of this type. We see a very regular spike of probability mass for the measure tokens “|”. This is being output by the system every 9 steps, but in a few instances it occurs after only 8. We also see some probability mass fluctuating around 0.01 for both the triplet grouping token “(3” and the semiquaver duration token “/2”. These never appear in the transcription, however.

Another interesting thing is how the probability mass on the broken rhythm tokens “>” and “<” decays as the sequence grows. A broken rhythm appearing only after a few measures of regular durations would be strange. We also see that the “^F” token appears likely at several points, as well as “_B”. Of these, only the “^F” appears in the generated transcription. We see some other interesting things. The pitch tokens with accidentals appear to have probability mass shadowing that seen in the contour of the melody. And, the distribution of probability mass at the starting points of the two melodies, steps 3 and 74, is spread across a broader range of pitches than at later steps.
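The spacing of the measure token can be checked directly on the first part of the transcription shown at the top of the post. The sketch below assumes that each whitespace-separated symbol is one token and that the two header tokens are steps 1 and 2; the variable names are mine.

```python
# First part of the generated transcription, copied from the top of the post.
first_part = (
    "M:4/4 K:Cmaj B, | G, C C 2 G, C E C | G, C E G F D B, D | "
    "G, C C 2 G, C E G | F E D E C E A, B, | "
    "G, C C 2 G, C E C | G, C E F G 3 A | B c d B c B G F | D E F D C 3 :|"
).split()

# 1-indexed steps at which the plain measure token "|" is emitted.
bar_steps = [step for step, tok in enumerate(first_part, start=1) if tok == "|"]

# Number of steps between consecutive measure tokens.
gaps = [b - a for a, b in zip(bar_steps, bar_steps[1:])]
print(bar_steps)
print(gaps)
```

Under this tokenisation the one shorter gap comes from the bar “G, C E F G 3 A”, where the dotted-length token “3” lets seven tokens fill the bar instead of eight.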

Let’s now move a step before the softmax, to the output of the third LSTM layer, i.e., ${\bf h}_t^{(3)}$. As above, I’m going to visualise the output of this layer as the sequence step $t$ increases.

The sequence proceeds from bottom to top, and the dimension of ${\bf h}_t^{(3)}$ is numbered along the x-axis. The first thing to notice is the lovely pastel color of the image. Second, we see a very sparse field. The values appear to follow a Laplacian distribution, but with spikes at the extremes:

To get an even better impression of the sparsity of these outputs, let’s sort the activations from smallest (-1, left) to largest (+1, right) for each step and plot their values over the sequence step:

A vast majority of these values are very small. About 80% of all activations have a magnitude less than 0.07. About 2% of all activations at any step have a magnitude greater than 0.8. With 512 units in the layer, this means that ${\bf h}_t^{(3)}$ is strongly activating on average about 10 columns of ${\bf V}$ in the softmax layer. That bodes well for interpretability! (eventually)
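These sparsity figures are simple to compute once the layer outputs are stacked into a steps-by-units array. The sketch below uses a synthetic stand-in `H` (a squashed heavy-tailed sample), so the numbers it prints are not those of the model. The 512 units are an assumption consistent with the unit indices quoted in this post, and the 150 steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the layer output: T steps by D units, values in (-1, 1).
T, D = 150, 512  # assumed sizes, for illustration only
H = np.tanh(0.03 * rng.standard_cauchy(size=(T, D)))

# Sort each step's activations from smallest to largest (the sorted plot).
H_sorted = np.sort(H, axis=1)

mag = np.abs(H)
frac_small = np.mean(mag < 0.07)  # share of near-zero activations
frac_large = np.mean(mag > 0.8)   # share of strongly active activations
mean_large_per_step = (mag > 0.8).sum(axis=1).mean()

print(f"{frac_small:.2%} small, {frac_large:.2%} large, "
      f"{mean_large_per_step:.1f} strongly active units per step")
```

The last quantity is just `frac_large * D`, which is how a 2% share of 512 units becomes roughly 10 units per step.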

Are there any units that are basically off for the entire sequence? Here’s a plot of the maximum magnitude of each unit sorted from smallest to largest:

We see that all units are on at some point, except two (139 and 219), which have magnitudes no larger than 0.004. Eight units (445, 509, 227, 300, 169, 2, 187, 361) exhibit magnitudes above 0.9998.
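Finding such units amounts to taking the maximum magnitude down each column of the activation array. Here is a minimal sketch on a tiny hand-made array; the values and the upper threshold of 0.9 are placeholders, not the model's outputs.

```python
import numpy as np

# Hypothetical activations: 3 steps (rows) by 3 units (columns).
H = np.array([
    [0.001, -0.90, 0.20],
    [-0.003, 0.95, -0.10],
    [0.002, -0.99, 0.05],
])

max_mag = np.abs(H).max(axis=0)  # largest magnitude each unit ever reaches
order = np.argsort(max_mag)      # units sorted from quietest to loudest

nearly_off = np.flatnonzero(max_mag <= 0.004)  # units that never really turn on
saturated = np.flatnonzero(max_mag > 0.9)      # units that get close to +/-1

print(order, nearly_off, saturated)
```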

Another thing to notice about ${\bf h}_t^{(3)}$ from the image of its activations above is that units appear to be oscillating. This calls for a Fourier analysis of each dimension of ${\bf h}_t^{(3)}$ across $t$! Zero-padding each sequence to be length 512, taking their DFT, and computing the magnitude in dB (with reference to the maximum magnitude of each unit), produces the following beautiful picture (basically a close-up image of the dark side of a “death star”):

I’m cutting out all values with less energy than -6 dB (in each column). That means I am only showing, for each unit, frequency components that have at least one-quarter the power of the most powerful component. Here we see some interesting behaviours:
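The procedure can be sketched as follows, assuming the activations sit in a steps-by-units array. The function name, the use of `rfft` (only non-negative frequencies, which suffices for real signals), and the two cosine test signals are choices made for this illustration. Note that -6 dB in magnitude is a factor of about 0.5 in amplitude, hence about 0.25 in power.

```python
import numpy as np


def masked_spectrum_db(H, n_fft=512, floor_db=-6.0):
    """Magnitude spectrum of each column of H in dB re its own maximum.

    H has shape (steps, units). Each column is zero-padded to n_fft samples.
    Components below floor_db are replaced by NaN so they are not plotted.
    """
    mag = np.abs(np.fft.rfft(H, n=n_fft, axis=0))
    ref = mag.max(axis=0, keepdims=True)
    ref = np.where(ref > 0, ref, 1.0)  # guard against all-zero units
    with np.errstate(divide="ignore"):
        db = 20.0 * np.log10(mag / ref)
    return np.where(db >= floor_db, db, np.nan)


# Two synthetic "units": cosines with periods of 8 and 16 steps.
t = np.arange(128)
H = np.stack([np.cos(2 * np.pi * t / 8), np.cos(2 * np.pi * t / 16)], axis=1)

db = masked_spectrum_db(H)
peaks = np.nanargmax(db, axis=0)  # strongest frequency bin of each unit
print(db.shape, peaks)
```

With 512 bins, a period of P steps lands near bin 512/P, so the two test signals peak at bins 64 and 32.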

1. The output of two units has a nearly flat magnitude spectrum, which means these units emitted a single large spike, e.g., units 208 and 327.
2. The output of a few units has a regular and closely spaced spikey spectrum, which means these units emitted a few large spikes, e.g., units 289 and 440.
3. The output of some units has a regular and widely spaced spikey spectrum, which means these units emitted several spikes at a regular interval, e.g., units 34, 148, 188. (A pulse train in one domain is a pulse train in the other domain.)
4. We also see many units are producing outputs with concentrations of harmonic energy around a fundamental period of 9 steps, e.g., units 60, 97, 156, and 203.
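The first and third behaviours in the list above can be reproduced with synthetic signals. In this sketch a single spike gives a flat magnitude spectrum, and a spike every 9 steps gives spectral peaks near multiples of 512/9 ≈ 56.9 bins. The signal length of 144 steps and the spike positions are arbitrary choices for illustration.

```python
import numpy as np

n_fft = 512
n_steps = 144  # arbitrary sequence length, a multiple of 9

# A unit that emits one large spike.
impulse = np.zeros(n_steps)
impulse[40] = 1.0

# A unit that spikes every 9 steps.
train = np.zeros(n_steps)
train[::9] = 1.0

imp_mag = np.abs(np.fft.rfft(impulse, n=n_fft))
train_mag = np.abs(np.fft.rfft(train, n=n_fft))

flatness = imp_mag.max() - imp_mag.min()      # essentially zero: flat spectrum
peak_bin = 1 + int(np.argmax(train_mag[1:]))  # strongest bin, skipping DC

print(flatness, peak_bin, n_fft / 9)
```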

Let’s plot some of the outputs generated by the units above against the symbols sampled at each step.

Units 208 and 327 appear to be dead most of the time. Units 289 and 440 feature a bit more activity, but not much. The others are a lot more active, but of all these the last three show a regularity of about 9 steps. We can see units 156 and 203 spiking almost every time before the measure token “|” — the exceptions being the beginning of the transcription, and right after the first repeat “|:”.
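One way to quantify “spiking before the measure token” is to count how often a unit's magnitude exceeds a threshold at the step just before a “|” is sampled. The sketch below assumes that `activations[t]` is the unit's output at the step that produced `tokens[t + 1]`. The threshold of 0.5, the function name, and the toy data are placeholders.

```python
def fraction_of_barlines_preceded_by_spike(tokens, activations,
                                           threshold=0.5, bar="|"):
    """Share of `bar` tokens whose preceding step has |activation| > threshold."""
    bar_steps = [i for i, tok in enumerate(tokens) if tok == bar and i > 0]
    if not bar_steps:
        return 0.0
    hits = sum(1 for i in bar_steps if abs(activations[i - 1]) > threshold)
    return hits / len(bar_steps)


# Toy data: two bars of tokens and a unit that fires right before each barline.
tokens = "G, C C 2 G, C E C | G, C E G F D B, D |".split()
activations = [0.0] * len(tokens)
activations[7] = 0.9    # step before the first "|"
activations[16] = -0.9  # step before the second "|"

score = fraction_of_barlines_preceded_by_spike(tokens, activations)
print(score)
```

A score near 1 for a unit, together with few spikes elsewhere, would support reading that unit as a bar-position counter. A high score alone would not, since a unit that spikes at every step would also score 1.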

How will things change if we have our model generate another C-major 4/4 transcription? Will the above appear the same? What will happen when we intervene on the dimensions of ${\bf h}_t^{(3)}$ that exhibit a 9-period sequence? Will the system not be able to correctly output measure tokens?

Stay tuned!