Making sense of the folk-rnn v2 model, part 7

Over the holiday break I started to dig into the music transcription models that we have trained and used since 2016, to figure out what the ~6,000,000 parameters actually mean. In particular, we have been looking at the “version 2” model (see parts 1, 2, 3, 4, 5 and 6 of this series), which generated all of the 30,000 transcriptions compiled in ten volumes, and helped compose a large amount of new music, including this new little work which I title, “Experimental lobotomy of a deep network with subsequent stimulation (2)”:


This little piece was the outcome of stimulating the v2 model with a two-hot input (or forcing), created from the generated token and a random pitch class token. I gave a weight of 0.7 to the token generated by the model, and 0.3 to the other. The resulting piece is weird, but a good kind of weird.
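Here is a minimal sketch of how such a two-hot input could be built. The vocabulary size and token indices are hypothetical placeholders, and the function name `two_hot` is made up for illustration; only the 0.7 and 0.3 weights come from the description above.

```python
def two_hot(vocab_size, generated_idx, forced_idx, w_generated=0.7, w_forced=0.3):
    """Return an input vector mixing two one-hot encodings with the given weights."""
    x = [0.0] * vocab_size
    x[generated_idx] += w_generated  # token sampled from the model's output
    x[forced_idx] += w_forced        # randomly chosen pitch class token
    return x

# Hypothetical vocabulary size and token indices, for illustration only.
x = two_hot(vocab_size=10, generated_idx=3, forced_idx=7)
```

If the two indices happen to coincide, the vector collapses to an ordinary one-hot input, since the weights sum to one.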

Anyhow, last time we performed some Fourier analysis on the sequences generated by the last LSTM layer. We then identified units producing sequences having a significant amount of power at a period related to the measure tokens (9 steps), and then excised these from the network to see how the “counting” ability of the system is affected. This showed just how distributed the model’s “counting” ability is.
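As a reminder of the kind of measurement involved, here is a rough sketch of computing the fraction of a sequence's power that falls at periods near 9 steps. It assumes a mean-removed, zero-padded DFT with no window; the padding length `n_fft` and the function names are my own choices for illustration, not necessarily what was used in part 6. It uses a direct DFT from the standard library to stay self-contained (an FFT routine would be the usual choice).

```python
import cmath
import math

def power_spectrum(x, n_fft):
    """One-sided power spectrum of x, mean removed, evaluated on an n_fft-point grid."""
    mean = sum(x) / len(x)
    xc = [v - mean for v in x]
    spec = []
    for k in range(n_fft // 2 + 1):
        w = -2j * math.pi * k / n_fft
        spec.append(abs(sum(v * cmath.exp(w * n) for n, v in enumerate(xc))) ** 2)
    return spec

def band_power_fraction(x, lo_period, hi_period, n_fft=1024):
    """Fraction of non-DC power at periods within [lo_period, hi_period] steps."""
    spec = power_spectrum(x, n_fft)
    total = sum(spec[1:])
    band = sum(p for k, p in enumerate(spec)
               if k > 0 and lo_period <= n_fft / k <= hi_period)
    return band / total if total > 0 else 0.0

# Toy sequences of length 147: one with a 9-step period, one with a 4-step period.
periodic = [math.cos(2 * math.pi * n / 9) for n in range(147)]
other = [math.cos(2 * math.pi * n / 4) for n in range(147)]
frac_periodic = band_power_fraction(periodic, 8.7, 9.3)
frac_other = band_power_fraction(other, 8.7, 9.3)
```

Units could then be ranked by this fraction and the top ones excised; the toy 9-step cosine puts most of its power in the band, while the 4-step one puts almost none there.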

Now, what is going on in the first and second layer LSTM hidden states during the steps? Here’s the first layer hidden states:

[Image: tune_ht_Layer1.png]

Here’s the second layer hidden states:

[Image: tune_ht_Layer2.png]

Time runs from bottom to top. We see a more active scene in the second layer than the first. Here are the distributions of activation values in the hidden states of all three LSTM layers:


This shows that the activations of the hidden states in all three layers are sparse, but that the deeper we travel in the network the more likely we will see large magnitudes. This is also shown in the following picture of the sorted maximum magnitude values of the sequences of each unit.

Let’s look now at the magnitude spectra of the hidden state sequences produced by each unit in the first LSTM layer:

This is quite a different picture from what we saw in the third layer, reproduced here:

[Image: tune_ht_Layer3_dft.png]

The sequences of the first layer have less diversity of periodicities compared with those of the third layer. For instance, we can see some sequences of period 9 steps and 2.5 steps, but there aren’t many of them. We also see that most of the power in the first layer sequences seems to occur in periods longer than 8 steps.

When we snip 33% of the first layer units exhibiting a sequence with more than 0.025% of its power in periods between [8.7, 9.3], and have the model generate from the same random seed as before, it produces the following transcription:

A D | E 3 C (3 D E D C D | G E C E (3 F G F E C | 
G F D C E D F D | G, 3 E C 2 C A, | (3 G ^F G F E E D C D | 
E 2 D G F A c A | G 2 C G c B d c | B c B G F G D F | 
D E D F B D c B | A /2 B /2 c B c d c A G | 
(3 E F G C B D E G E | F F F C F E F D | E C 2 C B, G, B, A, B, | 
B, 2 E C (3 F F F F E | z C B, C D C C C | B, 2 E C E F z c | 
E F B c G F D c | B c B c g c c d | z g B a B c e G | 
B F F F c G G G | (3 A G F G c c D z B | B 2 E c B G E C | 
(3 A G F A F (3 c G F E F | A G F c G c e d | d c G d c z B G | 
B c B A c G F E | B F E F B E E E |

Here’s the dots:

[Image: Screen Shot 2018-01-11 at 23.23.41.png]

Not even a good tune, but every measure except one is correctly counted. This is quite different from what we observed when we did the same in the 3rd LSTM layer. These observations suggest that the first LSTM layer is not “counting” measures, but just reacting to the periodic input of the measure token.

Let’s now look at the magnitude spectra of the hidden state of the second LSTM layer:

[Image: tune_ht_Layer2_dft.png]

This looks more similar to what we see in the last LSTM layer.

Now, thinking about the sequences produced by all layers, how is power concentrated over the periods? Let’s look at the mean of the normalised spectral magnitudes (computed using a Hann window) for each layer (with one standard error of the mean above and below):


The x-axis is period in steps. For layers 1-3, I normalise by the maximum power in layer 1. The magnitude spectrum of the softmax output is normalised with respect to its maximum power. For layers 1-3, we see a lot of power at DC (coming from the polarity of the sequences). We see large peaks around periods of 36, 18, 9, 4.5, 3 and 2.25 steps. We also see the powers of the sequences of LSTM layer 1 are by and large less than those in the other two LSTM layers. Layer 2 seems to be producing sequences with the most power, except those with periods of about 2.25 and 9 steps. The spectrum of the softmax layer sequences is more compressed than the others, which is due to the softmax operation.

We see a high degree of similarity between all of the magnitude spectra. Taking into consideration the standard error of the mean of the sequences in each layer, there seems to be less variation in the DC component of all sequences than at any other period. For most of the peaks, we see power variations that are about 1 dB centred on the mean. Now, what do these peaks mean? Why is there so much power at specific periods, e.g., 2.25 and 8.8 steps?
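For concreteness, here is a sketch of how a mean magnitude spectrum with its standard error could be computed across the units of one layer. The toy "units" (noisy, offset cosines of period 9 with random phase) and all function names are stand-ins of my own; the real hidden state sequences are not reproduced here. The DFT is computed directly with the standard library to keep the example self-contained.

```python
import cmath
import math
import random

def hann(n_len):
    """Symmetric Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_len - 1)) for n in range(n_len)]

def magnitude_spectrum(x, window):
    """One-sided DFT magnitudes of the windowed sequence (direct DFT)."""
    n_len = len(x)
    xw = [a * w for a, w in zip(x, window)]
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_len) for n, v in enumerate(xw)))
        for k in range(n_len // 2 + 1)
    ]

def mean_and_sem(spectra):
    """Per-bin mean and standard error of the mean across units."""
    n_units = len(spectra)
    means, sems = [], []
    for column in zip(*spectra):
        m = sum(column) / n_units
        var = sum((c - m) ** 2 for c in column) / (n_units - 1)
        means.append(m)
        sems.append(math.sqrt(var / n_units))
    return means, sems

def toy_unit(rng, n_steps):
    """Hypothetical unit output: offset cosine of period 9, random phase, small noise."""
    phase = rng.uniform(0, 2 * math.pi)
    return [0.5 + math.cos(2 * math.pi * n / 9 + phase) + rng.gauss(0, 0.1)
            for n in range(n_steps)]

rng = random.Random(0)
n_steps = 147
units = [toy_unit(rng, n_steps) for _ in range(8)]
window = hann(n_steps)
means, sems = mean_and_sem([magnitude_spectrum(u, window) for u in units])

# Normalise by the largest mean magnitude and express in dB.
ref = max(means)
mean_db = [20 * math.log10(m / ref) for m in means]
upper_db = [20 * math.log10((m + s) / ref) for m, s in zip(means, sems)]

# Strongest bin away from DC, converted to a period in steps.
peak_bin = max(range(3, len(means)), key=means.__getitem__)
peak_period = n_steps / peak_bin
```

With only DFT bins at multiples of 1/147, the toy 9-step period lands on the nearest bin, so `peak_period` comes out close to, but not exactly, 9 steps.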

Before we go any deeper with this analysis, we have to make sure that what we are observing is actually in the sequences and not a product of our analysis decisions. Let’s compare the magnitude spectra of Hann and rectangular windows of length 147 (the length of our sequences):

The width of the main lobe of the Hann window (about 0.014 Hz) is a little less than twice that of the rectangular window (0.008 Hz) (I’m defining width as twice the frequency of the -6 dB point); however, the power of the first sidelobe of the Hann window is more than 30 dB lower than that of the main lobe, whereas that of the rectangular window is only about 13 dB lower. So, if we are to use a rectangular window for analysis, we will have narrower peaks in the magnitude spectrum, but more peaks coming from the sidelobes instead of any frequencies that actually exist in a sequence. Since we are using a Hann window for analysis, we will have wider peaks, and so more uncertainty about the frequencies present in the sequence, but we can be relatively sure that none of the peaks we see at sufficiently high powers (relative to nearby larger peaks) are contributed by sidelobes.
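These window properties can be checked numerically. The sketch below evaluates each window's spectrum on a fine frequency grid and reads off the main lobe width (twice the -6 dB frequency, as defined above) and the level of the first sidelobe. The grid spacing `df`, the search limit `f_max` and the function names are arbitrary choices of mine, and I assume a symmetric Hann window.

```python
import cmath
import math

def dtft_mag(window, f):
    """Magnitude of the window's spectrum at frequency f (cycles per step)."""
    return abs(sum(w * cmath.exp(-2j * math.pi * f * n) for n, w in enumerate(window)))

def lobe_stats(window, df=1e-4, f_max=0.05):
    """Return (main lobe width, first sidelobe level in dB, first sidelobe frequency)."""
    peak = dtft_mag(window, 0.0)
    freqs = [i * df for i in range(int(round(f_max / df)) + 1)]
    mags = [dtft_mag(window, f) / peak for f in freqs]
    # Lowest frequency at which the magnitude has dropped 6 dB below the peak.
    f_6db = next(f for f, m in zip(freqs, mags) if m <= 10 ** (-6 / 20))
    i = 0
    while mags[i + 1] < mags[i]:  # walk down the main lobe to the first null
        i += 1
    while mags[i + 1] > mags[i]:  # walk up to the top of the first sidelobe
        i += 1
    return 2 * f_6db, 20 * math.log10(mags[i]), freqs[i]

n_steps = 147
rect = [1.0] * n_steps
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_steps - 1)) for n in range(n_steps)]

rect_width, rect_sidelobe_db, rect_sidelobe_f = lobe_stats(rect)
hann_width, hann_sidelobe_db, hann_sidelobe_f = lobe_stats(hann)
```

The reciprocals of `rect_sidelobe_f` and `hann_sidelobe_f` give the periods at which the first sidelobes of a window centred on DC appear.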

The range of periodicities covered by the main lobes of these analysis windows (their “support”) depends on the frequency to which the window is modulated. This relationship is given in the following table for some key periodicities we are observing in the sequences.

Main lobe support (steps)

Sequence period (steps)    Rectangular     Hann
2.25                       [2.23, 2.27]    [2.21, 2.29]
4.5                        [4.4, 4.6]      [4.3, 4.7]
8.8                        [8.5, 9.1]      [8.3, 9.4]
18                         [17, 19]        [16, 21]
36                         [31, 42]        [28, 48]
Inf                        [Inf, 250]      [Inf, 143]

So, the Hann window modulated to a period of 8.8 steps will have a main lobe between 8.3 and 9.4 steps. This means that for a peak around 8.8 steps, the “true” period (or periods) lies somewhere between 8.3 and 9.4 steps. (There may be a few frequencies present contributing to the single peak.)
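The conversion from a main lobe width in frequency to a support in period is simple arithmetic: a window modulated to frequency 1/P covers frequencies within half a main lobe width on either side, and the reciprocals of those edges give the period range. The sketch below assumes half-widths of 0.004 and 0.007 (half of the 0.008 and 0.014 widths quoted earlier); the function name is hypothetical.

```python
import math

def main_lobe_support(period, half_width):
    """Period range (in steps) covered by a main lobe centred on 1/period."""
    f0 = 0.0 if math.isinf(period) else 1.0 / period
    shortest = 1.0 / (f0 + half_width)
    # If the lobe reaches down to zero frequency, the longest period is unbounded.
    longest = math.inf if f0 <= half_width else 1.0 / (f0 - half_width)
    return shortest, longest

RECT_HALF_WIDTH = 0.004  # assumed: half of the 0.008 rectangular main lobe width
HANN_HALF_WIDTH = 0.007  # assumed: half of the 0.014 Hann main lobe width

rect_support = main_lobe_support(8.8, RECT_HALF_WIDTH)
hann_support = main_lobe_support(8.8, HANN_HALF_WIDTH)
dc_support = main_lobe_support(math.inf, HANN_HALF_WIDTH)
```

Because the mapping from frequency to period is a reciprocal, the same width in frequency covers a much wider range of periods at long periods than at short ones, which is why the supports in the table grow so quickly toward 36 steps.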



The locations of the first sidelobes are important to distinguish real spectral peaks from fake ones. These are given in the following table for some of the key periods we are seeing:

First sidelobe locations (steps)

Sequence period (steps)    Rectangular    Hann
2.25                       2.20, 2.30     2.17, 2.33
4.5                        4.3, 4.7       4.2, 4.8
8.8                        8.0, 9.6       7.7, 10.2
18                         15, 22         14, 25
36                         26, 56         23, 85
Inf                        100            63

For the peak we see at 8.8 steps, the first sidelobes of the Hann window are located at 7.7 and 10.2 steps. But the power of the sidelobes of the Hann window is at least 30 dB below that of its main lobe, so these sidelobes sit at a power less than -38 dB. This is to say that the peaks we see around 7 and 11.5 steps are real, and are not coming from the peak at 8.8 steps.

Since the mean value of the sequences (the DC component) contributes the most power of the sequences from all layers, we need to consider its sidelobes when interpreting long periods. The first sidelobe of the Hann window centred on DC occurs at a period of 63 steps. This means that the peak we see at a periodicity of 36 steps, about 8 dB lower than DC, is not coming from a sidelobe of the window. It is real.

So, now that we are sure which peaks in the mean magnitude are real, how can we interpret them? What is contributing to the peaks at 2.25, 3, 4.5, 6, 8.8, 11.5, 18, 36 steps and to DC? Knowing how the system operates, the magnitude spectra of any sequence produced by our LSTM layers must arise from the sequence input to the model, which itself arises from the model’s output sequence (recurrence!). First, the peak we see at DC comes from the input of tokens that appear every step or only once, which includes any token, and in particular “M:4/4”, “K:Cmaj”, “(3” and “4”. Second, the peaks we see at around 8.8 steps in the magnitude spectra of the three layers are coming from, at least, the input of tokens that appear every 8-9 steps, e.g., the measure token “|”. (Review the transcription generated by the model from these sequences.) The input of “|” occurring every 8-9 steps might also be contributing to the peaks we see at multiples of that period, e.g., 18 and 36 steps, if the sequence shows modulation (which we do see in some of the sequences). We can’t say that all of the peak at 8.8 steps is caused by “|” because there may be other things with such a period.

Third, looking back at the generated transcription, we see the repeat measure token “:|” occurring twice: at step 75 and then 69 steps later. So, we should see a peak around that period. However, we won’t see any peak there in our Hann-windowed sequences because its main lobe is too wide. Using a rectangular window might resolve this:


We see a peak around a period of 100 steps in LSTM layers 2 and 3, which is where the first sidelobe of the rectangular window occurs, so caution must be taken with that peak. But we do see peaks between 60 and 67 steps. The latter is probably related to the “:|” token output, but 60 steps seems too small a period. (Performing an analysis on the rectified sequences does not show a peak in the expected place either.)

What is causing the peak around 2.25 steps in every layer? One cause could be that every other token is a pitch token, but that would make a peak at 2 steps. Instead, it seems like every other token most of the time is a pitch token, and sometimes, like when a duration or a bar is specified, a pitch token occurs 3 steps later. Since the pitch tokens are quite frequent in this transcription, it makes sense that sequences would exhibit power at a period related to the pitch tokens occurring every other step most of the time.
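A toy example makes this plausible. Suppose, purely hypothetically, that each 9-step measure has a pitch token at steps 0, 2, 4 and 6, so that the gaps between pitch tokens are 2, 2, 2 and 3 steps. The sketch below builds that indicator sequence and finds its strongest non-DC spectral component; the pattern is invented for illustration and is not taken from the generated transcription.

```python
import cmath
import math

# Hypothetical pattern: 1 marks a pitch token, 0 anything else, over one 9-step measure.
pattern = [1, 0, 1, 0, 1, 0, 1, 0, 0]
x = pattern * 16  # 144 steps, so the 9-step period fits a whole number of times

# Remove the mean so the DC component does not dominate.
mean = sum(x) / len(x)
xc = [v - mean for v in x]

# Direct one-sided DFT magnitudes (rectangular window).
n_len = len(xc)
mags = [
    abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_len) for n, v in enumerate(xc)))
    for k in range(n_len // 2 + 1)
]

# Strongest non-DC bin, converted to a period in steps.
peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
peak_period = n_len / peak_bin
```

Four pitch tokens per 9 steps gives an average spacing of 9/4 = 2.25 steps, and the fourth harmonic of the 9-step period is indeed where this toy sequence has the most power.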

What is causing the peak at 36 steps? One cause could be the repetition of some of the transcription, e.g., “d c d e c 2 A G | A d d” in the second half of the transcription occurs 35 steps later, and “A D D” occurs 20 steps later and 38 steps later. So, the power we see in longer periodicities is relatable to larger structures.

Other clues for interpreting the magnitude spectrum of the posterior sequences come from inspecting its output during the generation of the transcription:

We see some specific tokens that turn on and off with a period of about 2: “(3”, “2”, “/2”. So, at least the peak around 2 steps that we see in the magnitude spectrum of the mean softmax sequence comes from these. These particular tokens do not appear much in the generated transcription: “/2” never appears, “(3” appears once, and “2” appears 9 times.

Now, are the periodicities we see in the sequences of the layers caused by the output of the model, or do they cause the output? Surely the answer depends on which layer we are talking about. We sample from the posterior sequence, and that means the posterior sequence causes the output. We input a token to the first layer, and that causes its sequence. But what about layers 2 and 3? Is the periodicity of 8.8 steps in LSTM layer 3 caused by the input of a measure token every 8.8 steps, or is it causing the output of a measure token every 8.8 steps?

Let’s see how some of the 9-step periodic outputs in the LSTM layers coincide with the inputting and outputting of measure tokens. Let’s take the sequence produced by a unit in each of the LSTM layers exhibiting a periodicity of 9 steps and plot them against the generated transcription.


The vertical dashed lines show those steps in which the measure token shown at bottom is generated. The next step is when it is input to the model. We can clearly see the sequence produced by LSTM layer 1 unit 247 spikes in every step but three when a measure token is the input. It doesn’t spike the two times when the input is “:|”, or the first time a “|” is input. LSTM layer 3 unit 156 spikes in every step when a measure token is output. And LSTM layer 2 unit 97 is almost zero for two steps when the output is a measure token. We also see the sequence produced by LSTM layer 2 unit 97 has a period of a little more than 2 steps.
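One crude way to put a number on this kind of coincidence is to count how often a unit's activation exceeds a threshold at the steps where a measure token is output (lag 0) or input (lag 1). The sketch below is only an illustration: the token list, the activation values, the threshold and the function name are all invented, and none of them comes from the model.

```python
def spike_hit_rate(activations, tokens, targets=("|",), lag=0, threshold=0.5):
    """Fraction of target-token steps (shifted by lag) where |activation| > threshold."""
    steps = [t + lag for t, tok in enumerate(tokens)
             if tok in targets and t + lag < len(activations)]
    if not steps:
        return 0.0
    hits = sum(1 for t in steps if abs(activations[t]) > threshold)
    return hits / len(steps)

# Invented toy data: a unit that spikes one step after each "|" is generated,
# that is, at the step where "|" is input to the model.
tokens = ["A", "D", "|", "E", "3", "C", "|", "G", ":|"]
activations = [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.8, 0.0]

rate_as_input = spike_hit_rate(activations, tokens, lag=1)
rate_as_output = spike_hit_rate(activations, tokens, lag=0)
```

A unit that reacts to the measure token as input scores high at lag 1 and low at lag 0, while a unit that anticipates the measure token would show the opposite pattern.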

We need to disambiguate temporal characteristics of the sequences of each layer that share similar periods. This will hopefully help us see relationships between the model input and output.

Stay tuned!

