# The Cunning Storm, a folk-rnn (v1) original

One thing I can say about folk-rnn as opposed to other AI composition systems is that it can bring families together, ’round the table after dinner to peck out some glorious tunes of a totally synthetic tradition. Here’s my wife and me playing a folk-rnn (v1) original, which it has titled “The Cunning Storm”:

We aren’t sure about the title, because it doesn’t fit the tune… at least with the common definition of “storm” — but maybe that’s just what makes this storm “cunning”. It’s a quaint little storm! Or maybe the system translated our last name from German, and so it should be “The Cunning Sturm”?

Either way, here’s the verbatim output of the system (which can be found on page 3,090 of the folk-rnn (v1) Session book, Volume 4 of 20):

You can see I wasn’t quite satisfied with bars 5, 7 and 8. The turn is totally unchanged, save the second ending. Anyhow, with these few changes, “The Cunning Storm” becomes one of my favourite machine folk ditties!

# Making sense of the folk-rnn v2 model, part 5

This holiday break gives me time to dig into the music transcription models that we trained and used in 2016-2017, and figure out what the ~6,000,000 parameters actually mean. In particular, we have been looking at the “version 2” model (see parts 1, 2, 3 and 4 of this series), which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including this transcription just now:

M:4/4
K:Cmaj
B, | G, C C 2 G, C E C | G, C E G F D B, D | G, C C 2 G, C E G | F E D E C E A, B, |
G, C C 2 G, C E C | G, C E F G 3 A | B c d B c B G F | D E F D C 3 :|
|: D | E C C 2 D B, G, A, | B, C D E F 2 D F | E C C 2 D E F D | G ^F G A B 2 A B |
G c c B c 2 B c | d e d c B d c B | G E F D G F D F | E C B, D C 3 :|


In common practice notation, this transcription becomes this:

The model has generated a transcription in which every bar is correctly counted once the two pickups are accounted for. This transcription features two repeated 8-bar parts (“tune” and “turn”). The transcription is missing a repeat sign at the beginning of the first part, but human transcribers would probably make that omission too. The last bars of each part resolve to the tonic (“C”). Each part features some repetition and variation. There is also some resemblance between the tune and the turn. The highest note appears in the turn, which is typically the case in Irish traditional music. And the second part features a “^F”, which functions as a strong pull to the dominant key (G), making a V-I cadence right in the middle of the turn.

Here’s a recording of me “interpreting” this transcription on my G/C Hohner Merlin II:

I would definitely make some changes to improve the tune, but our aim here is not to compose yet another machine folk tune (many can be heard here). Instead, I want to see what the network is “thinking” as it generates this output.

First, is the network merely copying its training transcriptions? The first figure, “G, C C 2”, appears in 217 of 23,636 transcriptions. Here’s a setting of the tune “The Belharbour”, in which that figure appears in the same position:

We also see it appearing in the same position in the transcriptions “The Big Reel Of Ballynacally”, “Bonnie Anne”, “Bowe’s Favourite”, “The Braes Of Busby”, “The Coalminer’s”, “The Copper Beech”, and others.

What about the first bar, “G, C C 2 G, C E C”? We find this as the first bar of 3 transcriptions: “Dry Bread And Pullit”, “The Premier”, “Swing Swang”. Here’s the transcription of “Swing Swang”:

Here’s the fantastic trad group Na Draiodoiri playing “Swing Swang” in 1979 as the second in a two-tune set:

https://open.spotify.com/embed/track/4IRTPLHH1t7I7hvD7AxL9J

Searching the training data for just the second bar, “G, C E G F D B, D”, we find it appears in only 3 transcriptions, but never as the second bar. We don’t find any occurrence of bars 4 or 6. So, it appears that the model is not reproducing a single tune, but to some extent is using small bits and pieces of a few other tunes.
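The kind of search used above can be sketched in a few lines. This is only an illustration: the two-entry `corpus` is made up (the first entry is the opening of the generated transcription, the second is a hypothetical tune), and `find_figure` is a helper name I am introducing here, not something from the folk-rnn code.

```python
def find_figure(tokens, figure):
    """Return every index at which `figure` occurs as a contiguous run in `tokens`."""
    n = len(figure)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == figure]

# Hypothetical two-tune corpus, tokenised by splitting on whitespace.
corpus = {
    "generated": "M:4/4 K:Cmaj B, | G, C C 2 G, C E C | G, C E G F D B, D |".split(),
    "toy tune": "M:4/4 K:Cmaj | E G c 2 G, C C 2 | G, C E G F D B, D |".split(),
}

figure = "G, C C 2".split()
hits = {title: find_figure(tokens, figure) for title, tokens in corpus.items()}
containing = [title for title, positions in hits.items() if positions]

print(f"{len(containing)} of {len(corpus)} transcriptions contain the figure")
for title, positions in hits.items():
    print(title, positions)
```

The returned positions matter as much as the counts: a figure that shows up in the middle of a bar is weaker evidence of copying than one that opens the tune.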

Let’s now look at the softmax layer output at each step in the generation. This shows the multinomial probability mass distribution over the tokens.

Here I plot the log10 probability distribution over the tokens (x-axis) for each sequence step (y-axis). I’m only showing log10 probabilities greater than -10. We see a lot of probability mass in the pitch tokens subset, which is to be expected because the majority of tokens in the training transcriptions are of this type. We see a very regular spike of probability mass for the measure tokens “|”. This is being output by the system every 9 steps, but in a few instances it occurs on the 8th step. We also see some probability mass fluctuating around 0.01 for both the triplet grouping token “(3” and the semiquaver duration token “/2”. These never appear in the transcription, however. Another interesting thing is how the probability mass on the broken rhythm tokens “>” and “<” decays as the sequence grows. A broken rhythm appearing only after a few measures of regular durations would be strange. We also see that the “^F” token appears likely at several points, as well as “_B”. Of these, only the “^F” appears in the generated transcription. We see some other interesting things. The pitch tokens with accidentals appear to have probability mass shadowing that seen in the contour of the melody. And, the distribution of probability mass at the starting points of the two melodies, steps 3 and 74, is spread across a broader range of pitches than at later steps.

Let’s now move a step before the softmax, to the output of the third LSTM layer, i.e., ${\bf h}_t^{(3)}$. As above, I’m going to visualise the output of this layer as the sequence step $t$ increases.

The sequence proceeds from bottom to top, and the dimension of ${\bf h}_t^{(3)}$ is numbered along the x-axis. The first thing to notice is the lovely pastel color of the image. Second, we see a very sparse field. The values appear distributed Laplacian, but with spikes at the extremes:

To get an even better impression of the sparsity of these outputs, let’s sort the activations from smallest (-1, left) to largest (+1, right) for each step and plot their values over the sequence step:

A vast majority of these values are very small. About 80% of all activations have a magnitude less than 0.07. About 2% of all activations at any step have a magnitude greater than 0.8. This means that ${\bf h}_t^{(3)}$ is activating on average 10 columns of ${\bf V}$ in the softmax layer. That bodes well for interpretability! (eventually)
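Here is a small sketch of how such sparsity figures can be computed for one step. The vector `h` below is synthetic (I am making it up to mimic the shape described above, it is not the model's actual output), and the thresholds 0.07 and 0.8 are the ones quoted in the text.

```python
import random

random.seed(0)
UNITS = 512

# Hypothetical stand-in for one step of h_t^(3): mostly tiny values plus a few near +/-1.
h = [random.uniform(-0.05, 0.05) for _ in range(UNITS)]
for unit in random.sample(range(UNITS), 10):
    h[unit] = random.choice([-1, 1]) * random.uniform(0.85, 1.0)

small = sum(abs(a) < 0.07 for a in h) / UNITS   # fraction of near-zero activations
large = sum(abs(a) > 0.8 for a in h)            # number of strongly active units

print(f"fraction with |a| < 0.07: {small:.3f}")
print(f"units with |a| > 0.8: {large}")
print(f"2% of {UNITS} units = {0.02 * UNITS:.1f}")
```

The last line is where the "about 10 columns" figure comes from: 2% of 512 units is roughly 10.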

Are there any units that are basically off for the entire sequence? Here’s a plot of the maximum magnitude of each unit sorted from smallest to largest:

We see that all units are on at some point, except for two (139 and 219), whose magnitudes never exceed 0.004. Eight units (445, 509, 227, 300, 169, 2, 187, 361) exhibit magnitudes above 0.9998.

Another thing to notice about ${\bf h}_t^{(3)}$ from the image of its activations above is that units appear to be oscillating. This calls for a Fourier analysis of each dimension of ${\bf h}_t^{(3)}$ across $t$! Zero-padding each sequence to be length 512, taking their DFT, and computing the magnitude in dB (with reference to the maximum magnitude of each unit), produces the following beautiful picture (basically a close-up image of the dark side of a “death star”):

I’m cutting out all values with less energy than -6 dB (in each column). That means I am only showing, for each unit, frequency components that have at least one-quarter the power of the most powerful component. Here we see some interesting behaviours:

1. The output of two units has a nearly flat magnitude spectrum, which means these units emitted a single large spike, e.g., units 208 and 327.
2. The output of a few units has a regular and closely spaced spikey spectrum, which means these units emitted a few large spikes, e.g., units 289 and 440.
3. The output of some units has a regular and widely spaced spikey spectrum, which means these units emitted several spikes at a regular interval, e.g., units 34, 148, 188. (A pulse train in one domain is a pulse train in the other domain.)
4. We also see many units are producing outputs with concentrations of harmonic energy around a fundamental period of 9 steps, e.g., units 60, 97, 156, and 203.
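To make the period-9 observation concrete, here is a minimal sketch of the same analysis on a synthetic signal. The signal is an idealised unit that spikes every 9 steps over a hypothetical 144-step sequence; the zero-padding to 512, the dB reference to the maximum and the -6 dB cut follow the description above. I use a naive DFT from the standard library so the example stays self-contained.

```python
import cmath
import math

N_FFT = 512      # zero-padded length, as in the text
PERIOD = 9       # steps between spikes
STEPS = 144      # hypothetical sequence length

# Hypothetical unit output: a spike every PERIOD steps, zero elsewhere.
x = [1.0 if t % PERIOD == 0 else 0.0 for t in range(STEPS)]
x += [0.0] * (N_FFT - STEPS)   # zero-pad to N_FFT samples

def dft_magnitudes(signal):
    """Magnitudes of the DFT for bins 0..N/2 (naive O(N^2) sum, fine at this size)."""
    n = len(signal)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal) if s))
            for k in range(n // 2 + 1)]

mags = dft_magnitudes(x)
peak = max(mags)
# Magnitude in dB relative to the largest component of this unit.
db = [20 * math.log10(m / peak) if m > 0 else -math.inf for m in mags]
kept = [k for k, level in enumerate(db) if level >= -6.0]

print("bins within 6 dB of the maximum:", kept)
print("expected harmonic spacing in bins:", N_FFT / PERIOD)
```

The surviving bins cluster around multiples of 512/9 ≈ 56.9, which is the regularly spaced spiky spectrum of a pulse train.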

Let’s plot some of the outputs generated by the units above against the symbols sampled at each step.

Units 208 and 327 appear to be dead most of the time. Units 289 and 440 feature a bit more activity, but not much. The others are a lot more active, but of all these the last three show a regularity of about 9 steps. We can see units 156 and 203 spiking almost every time before the measure token “|” — the exceptions being the beginning of the transcription, and right after the first repeat “|:”.

How will things change if we have our model generate another C-major 4/4 transcription? Will the above appear the same? What will happen when we intervene on the dimensions of ${\bf h}_t^{(3)}$ that exhibit a 9-period sequence? Will the system not be able to correctly output measure tokens?

Stay tuned!

# Making sense of the folk-rnn v2 model, part 4

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including the three transcriptions interpreted by Torbjörn Hultmark:

and my 2017 Xmas (re)work, “Carol of the Cells”:

As a reminder, the first LSTM layer transforms a vector ${\bf x}_t$ in $\mathbb{R}^{137}$ by the following algorithm:

${\bf i}_t^{(1)} \leftarrow \sigma({\bf W}_{xi}^{(1)}{\bf x}_t + {\bf W}_{hi}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_i^{(1)})$
${\bf f}_t^{(1)} \leftarrow \sigma({\bf W}_{xf}^{(1)}{\bf x}_t + {\bf W}_{hf}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_f^{(1)})$
${\bf c}_t^{(1)} \leftarrow {\bf f}_t^{(1)}\odot{\bf c}_{t-1}^{(1)} + {\bf i}_t^{(1)} \odot \tanh({\bf W}_{xc}^{(1)}{\bf x}_t + {\bf W}_{hc}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_c^{(1)})$
${\bf o}_t^{(1)} \leftarrow \sigma({\bf W}_{xo}^{(1)}{\bf x}_t + {\bf W}_{ho}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_o^{(1)})$
${\bf h}_t^{(1)} \leftarrow {\bf o}_t^{(1)}\odot \tanh({\bf c}_{t}^{(1)})$
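As a sanity check on the algorithm above, here is a direct transcription of one step into plain Python. The sizes (5 tokens, 4 units) and the random weights are toy values I am assuming for illustration; the real layer uses 137 and 512 and its trained parameters. The dictionary keys simply mirror the subscripts in the equations.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b, with matrices stored as lists of rows."""
    return [sum(wx * xi for wx, xi in zip(row_x, x))
            + sum(wh * hi for wh, hi in zip(row_h, h)) + bi
            for row_x, row_h, bi in zip(W_x, W_h, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One step of the first-layer equations: gates i, f, o and cell update c."""
    i = [sigmoid(z) for z in affine(p["W_xi"], x, p["W_hi"], h_prev, p["b_i"])]
    f = [sigmoid(z) for z in affine(p["W_xf"], x, p["W_hf"], h_prev, p["b_f"])]
    g = [math.tanh(z) for z in affine(p["W_xc"], x, p["W_hc"], h_prev, p["b_c"])]
    o = [sigmoid(z) for z in affine(p["W_xo"], x, p["W_ho"], h_prev, p["b_o"])]
    c = [ft * ct + it * gt for ft, ct, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

random.seed(0)
VOCAB, HIDDEN = 5, 4   # toy sizes; the model uses 137 and 512

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {}
for gate in "ifco":
    params[f"W_x{gate}"] = rand_matrix(HIDDEN, VOCAB)
    params[f"W_h{gate}"] = rand_matrix(HIDDEN, HIDDEN)
    params[f"b_{gate}"] = [random.gauss(0.0, 0.5) for _ in range(HIDDEN)]

x = [0.0] * VOCAB
x[2] = 1.0   # one-hot encoding of a (hypothetical) token
h, c = lstm_step(x, [0.0] * HIDDEN, [0.0] * HIDDEN, params)
print("h_t:", [round(v, 3) for v in h])
```

Because each entry of ${\bf o}_t^{(1)}$ lies in (0, 1) and tanh lies in (-1, 1), every entry of the output stays strictly inside (-1, 1).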

This layer is trained to map each standard basis vector of $\mathbb{R}^{137}$ to a point inside the $\ell_\infty$-unit ball of $\mathbb{R}^{512}$. Each ${\bf x}_t$ is first transformed by the 512×137 matrices ${\bf W}_{xi}^{(1)}, {\bf W}_{xf}^{(1)}, {\bf W}_{xc}^{(1)}, {\bf W}_{xo}^{(1)}$. Since our input vectors ${\bf x}_t$ have a sparsity of 1 this means we are just seeing single columns from these matrices. If ${\bf x}_t$ has a 1 in row 12, then ${\bf W}_{xi}^{(1)}{\bf x}_t$ is column 12 of ${\bf W}_{xi}^{(1)}$. Let’s look at the columns of these four matrices.

Here’s the matrix of the input gate ${\bf W}_{xi}^{(1)}$:

It appears that the token “<s>”, the first one to be encountered by the model, activates a column of negative numbers with a mean of -0.58. The second column, corresponding to token “</s>”, has a very different appearance. Here they are plotted together.

So the first token seen by the folk-rnn v2 model, “<s>”, injects a bunch of negative numbers into the system. The model has never been trained to model what comes after a “</s>” token. So, in reality, we can lop off this column of each of these first-layer matrices.

Here’s the matrix of the forget gate ${\bf W}_{xf}^{(1)}$:

The values in this matrix do not vary much. The largest variance we see in its columns corresponds to measure tokens, some duration tokens (“/2”, “>”), the triplet grouping token “(3”, and the pitch tokens “G,”, “A,”, “B,”, “C”, “D”, “E”, “F”, “G”, “A”, “B”, “c”, “d”, “e”, “f”, “g”, “a”, “b”, “c'” and “d'”.

Here’s the matrix of the cell gate ${\bf W}_{xc}^{(1)}$:

We can clearly see variety in many of its columns, particularly, meter and mode, some duration, and several pitch tokens. Finally, here’s the matrix of the output gate ${\bf W}_{xo}^{(1)}$:

Looking at the norms of the columns of these matrices shows their clear preferences for particular tokens. (I plot the negative norms of two of the matrices for readability.)

The matrix of the forget gate ${\bf W}_{xf}^{(1)}$ is not very sensitive to meter and mode tokens, but has preferences for the measure tokens, some pitch tokens, the triplet grouping token, and some specific duration tokens. The matrix of the output gate ${\bf W}_{xo}^{(1)}$ has columns with norms that can be about 2-3 times that of the other gates, and is quite sensitive to some pitch and duration tokens.

Let’s look at the Gramian matrices of each of these to see how the columns are related by the direction they point. Here’s the Gramian of the input gate matrix ${\bf W}_{xi}^{(1)}$ (with each of its columns normalised first):

We can see clear groupings of kinds of tokens. The columns corresponding to meter and mode tokens kind of point in the same directions, but a bit opposite of the pitch tokens. The pitch tokens point in the same direction, and point along directions of several duration and grouping tokens as well. The tokens of columns that are most unlike all others include “”, “M:3/2”, “^C,”, “D”, “E”, “F”, “A”, “B”, “e”, and “=f'”.
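The computation behind these Gramian images can be sketched as follows. The 3×3 matrix `W` is a toy example I am assuming for illustration (two parallel columns and one orthogonal to both); each entry of the result is the cosine of the angle between two columns.

```python
import math

def unit_columns(W):
    """Return the columns of W (a list of rows), each scaled to unit Euclidean norm."""
    cols = list(zip(*W))
    return [[v / math.sqrt(sum(a * a for a in col)) for v in col] for col in cols]

def gramian(W):
    """Cosine of the angle between every pair of columns of W."""
    cols = unit_columns(W)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# Hypothetical weight matrix: columns 0 and 1 are parallel, column 2 is orthogonal to both.
W = [[1.0, 2.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, -3.0]]

G = gramian(W)
for row in G:
    print([round(v, 3) for v in row])
```

Entries near +1 or -1 mean two token columns point along the same line in $\mathbb{R}^{512}$; entries near 0 mean they are close to orthogonal.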

Here’s the Gramian of the forget gate matrix ${\bf W}_{xf}^{(1)}$:

This one shows more correspondence between columns of pitch tokens and mode tokens than does the Gramian of the input gate matrix ${\bf W}_{xi}^{(1)}$. The columns weighting the modes and meters are by and large orthogonal.

Here’s the Gramian of the cell gate matrix ${\bf W}_{xc}^{(1)}$:

This shows some interesting diagonal structures! Let’s zoom in on some of these in the pitch tokens.

This clearly shows that the columns corresponding to “A” and “=A”, “B” and “=B”, and so on point strongly in the same directions of $\mathbb{R}^{512}$. These pitches are equivalent. We also see strong correspondence between pitches that are a half-step apart, e.g., “=D”, “_D”, and “^D”, “=E” and “_E”, etc. There are similar diagonal structures for the tokens without the natural sign “=”. It is interesting that the model has picked up on this equivalence. I don’t know how this arises, but it may be by virtue of accidentals appearing in C-major transcriptions that have accidentals, e.g.,

T:A Curious Denis Murphy
M:6/8
K:Cmaj
|: A | A B c d e f | g a g g f d | c 3 c B c | d e f d B G |
A B c d e f | g 3 g f d | c B c A d c | B G G G 2 :|
|: A | _B 2 d c A A | _B G E F G A | _B A B c =B c | d e f d B G |
_B 2 d c A A | _B G E F G A | _B d d c A A | A G G G 2 :|


I don’t know if they are so plentiful, however. Finally, here’s the Gramian of the output gate matrix ${\bf W}_{xo}^{(1)}$:

This shows some of the same diagonal structures we see in the Gramian of the cell gate matrix ${\bf W}_{xc}^{(1)}$, but also includes relationships between meter and mode tokens.

Let’s look at the bias in each one of these gates of the first layer:

The norms of these four bias vectors are quite different: input: 18.5; forget: 99.9; cell: 3.5; and output: 22.2. We can see that the bias of the forget gate is entirely positive, whereas the biases of the input and output gates are for the most part negative. It is interesting to note that of the four gates, only the forget gate has a bias with a norm that is more than twice the norm of any column in its matrix ${\bf W}_{xf}^{(1)}$. That means the bias of the forget gate will really push in its direction the values of the linear transformation ${\bf W}_{xf}^{(1)}{\bf x}_t$.
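A rough back-of-the-envelope calculation shows what these norms imply per unit. The sketch below assumes (and this is only an assumption for illustration) that each bias is spread evenly over its 512 entries, so the typical entry is the norm divided by $\sqrt{512}$; it then passes that value through the gate's nonlinearity, which is a sigmoid for the input, forget and output gates and a tanh for the cell gate.

```python
import math

DIM = 512
bias_norms = {"input": 18.5, "forget": 99.9, "cell": 3.5, "output": 22.2}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for gate, norm in bias_norms.items():
    rms = norm / math.sqrt(DIM)   # root-mean-square entry of a 512-dim vector with this norm
    squash = math.tanh if gate == "cell" else sigmoid
    print(f"{gate:>6}: RMS entry {rms:.2f}, "
          f"activation at +RMS {squash(rms):.3f}, at -RMS {squash(-rms):.3f}")

forget_rms = bias_norms["forget"] / math.sqrt(DIM)
```

Under that assumption a typical forget-gate bias entry is about 4.4, and a sigmoid at 4.4 is already above 0.98, which is in line with the near-saturation discussed in this post.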

Let’s plot the inner products between the normed bias vector and the normalised columns of each matrix to see the relationships between the directions:

We can see that in the input and output gates, the bias vectors point strongly in the direction of the columns corresponding to the start transcription token. The bias in the forget gate points mostly in the opposite direction of the column having to do with this token. This means that when the first layer encounters ${\bf x}_t$ that encodes “<s>”, then ${\bf W}_{xi}^{(1)}{\bf x}_t$ will produce a vector with a norm of about 14, and adding ${\bf b}_i^{(1)}$ to that will make its norm about double. The opposite happens in the forget gate: ${\bf W}_{xf}^{(1)}{\bf x}_t$ creates a vector that has a norm of about 2, and then adding the bias creates a vector of norm of about 98. Here’s a plot showing the result of this addition after a sigmoid operation:

We can thus see that when the forget gate encounters the token “<s>” it could be producing a vector that is close to saturation. In fact, since the norm of each column of ${\bf W}_{xf}^{(1)}$ is small for nearly all tokens, adding the bias vector is likely to result in saturation. Here is a plot of the mean value of the sigmoid of each column of ${\bf W}_{xf}^{(1)}$ added to the bias ${\bf b}_f^{(1)}$:

We see very little difference in the values across the tokens when it is saturated. The behaviour of the bias in the input gate is different:

Adding the bias pulls the mean values down. We see the same behaviour in the output gate:

Since the bias of the cell gate is so small, we see very little difference in the mean values (note the change in y-scale):

However, there are more actors here than just the biases and the input vector matrices.

The affine transformations of the gates we have analysed so far in this first layer are independent of time, or sequence position more accurately. The mapping in each gate is time-dependent, however, and this is where the previous hidden state vector ${\bf h}_{t-1}$ comes into play. Let’s now analyse the parameters ${\bf W}_{hi}^{(1)}, {\bf W}_{hf}^{(1)}, {\bf W}_{hc}^{(1)}$, and ${\bf W}_{ho}^{(1)}$.

The images of the matrices themselves do not say much. Let’s look instead at the set of inner products between the normalised columns of the input vector matrices summed with the biases, and the normalised columns of the hidden state matrices, e.g., we are looking at the cosine of the angle between $[{\bf W}_{xi}^{(1)}]_{:k} + {\bf b}_i$ and ${\bf W}_{hi}^{(1)}$. We know that each ${\bf x}_t$ will select only one column of ${\bf W}_{xi}^{(1)}$ and add it to the bias. So, how do the columns of ${\bf W}_{hi}^{(1)}$ relate?

Here’s the result for the input gate. (The tokens on the x-axis are telling us which column of ${\bf W}_{xi}^{(1)}$ we are comparing to ${\bf W}_{hi}^{(1)}$.)

Clearly, we’ve entered The Matrix.

More telling is the distribution of values in this matrix:

This shows that most of the vectors in ${\bf W}_{hi}^{(1)}$ bear little resemblance to the vectors in ${\bf W}_{xi}^{(1)}$. If the angle between two vectors is between $0.46\pi$ and $0.54\pi$ radians, then the magnitude of their inner product is no greater than 0.13 times the product of their lengths. This says the columns in ${\bf W}_{hi}^{(1)}$ point in directions that are largely orthogonal to those in ${\bf W}_{xi}^{(1)}$. However, this does not mean that every linear combination of the columns in ${\bf W}_{hi}^{(1)}$ will differ from the columns in ${\bf W}_{xi}^{(1)}$. We find ${\bf W}_{hi}^{(1)}$ has full rank, and so any point in $\mathbb{R}^{512}$ can be reached. However, we must be careful since ${\bf h}_t \in [-1,1]^{512}$.
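The 0.13 figure can be checked quickly if we read the quoted angles as fractions of $\pi$ radians (that reading is my assumption). The normalised inner product of two vectors is the cosine of the angle between them:

```python
import math

for fraction in (0.46, 0.50, 0.54):
    angle = fraction * math.pi
    print(f"angle {fraction:.2f}*pi rad -> |cos| = {abs(math.cos(angle)):.3f}")

bound = abs(math.cos(0.46 * math.pi))
```

Both ends of the range give the same magnitude, a little over 0.125, and everything in between is closer to zero.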

Now, what about the other gates? Here’s the above in the forget gate.

The histogram shows the distribution of values to be more Laplacian distributed than those of the input gate.

Here’s the above for the cell gate:

The histogram shows these values are more Student t-distributed, but still there is very little resemblance in the directions of the collection of vectors.

Finally, here’s the above for the output gate:

And the histogram:

In each gate of the first layer, we see that the contributions from the input vector summed with the bias could be quite different from that of the hidden state vector. All of the above seem to suggest that the hidden state vectors inhabit subspaces of $[-1, 1]^{512}$ that are a bit separated from the column spaces of the input vector matrices… which is kind of neat to think about: there is a region in $[-1, 1]^{512}$ where the first layer is storing information that is a bit orthogonal to the current input.

Let’s look at the singular value decomposition of these hidden state matrices and see how the subspaces spanned by their singular vectors correspond to the columns of the input vector matrices. What I am interested in measuring is how each column of ${\bf W}_{xi}^{(1)}$ influences the transformation accomplished by ${\bf W}_{hi}^{(1)}$. So, I’m going to perform a singular value decomposition of ${\bf W}_{hi}^{(1)}$, then form approximations of the matrix by summing outer products of its left and right singular vectors, and find the norms of the resulting transformations of the columns of ${\bf W}_{xi}^{(1)}$. That is, given the first left and right singular vectors, ${\bf u}_1$ and ${\bf v}_1$ respectively, and the $k$th column of ${\bf W}_{xi}^{(1)}$, ${\bf w}_k$, I will compute $\|{\bf u}_1{\bf v}_1^T {\bf w}_k\|_2/ \|{\bf w}_k\|_2$. Then I will compute $\|({\bf u}_1{\bf v}_1^T + {\bf u}_2{\bf v}_2^T){\bf w}_k\|_2/ \|{\bf w}_k\|_2$, and so on.
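Here is a minimal sketch of that procedure, following the prose description of summing outer products ${\bf u}_i{\bf v}_i^T$. To keep it dependency-free I hand-pick small orthonormal sets `U` and `V` in three dimensions rather than computing an SVD; in practice they would come from a routine such as `numpy.linalg.svd`. The vector `w_k` is a hypothetical column.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

theta = math.pi / 6
# Hypothetical orthonormal singular vectors (hand-picked, not computed from a real matrix).
V = [(math.cos(theta), math.sin(theta), 0.0),    # right singular vectors v_1, v_2, v_3
     (-math.sin(theta), math.cos(theta), 0.0),
     (0.0, 0.0, 1.0)]
U = [(0.0, 1.0, 0.0),                            # left singular vectors u_1, u_2, u_3
     (0.0, 0.0, 1.0),
     (1.0, 0.0, 0.0)]

def partial_transform(w, r):
    """Apply the sum of the first r outer products u_i v_i^T to w."""
    out = [0.0, 0.0, 0.0]
    for u, v in zip(U[:r], V[:r]):
        coeff = dot(v, w)                        # how much of w lies along v_i
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out

w_k = (1.0, 0.0, 0.0)                            # hypothetical column of W_xi
ratios = [norm(partial_transform(w_k, r)) / norm(w_k) for r in (1, 2, 3)]
print([round(r, 3) for r in ratios])
```

Because the ${\bf u}_i$ are orthonormal, each ratio is just the fraction of ${\bf w}_k$ lying in the span of the first $r$ right singular vectors, so the curve climbs to 1 as more singular vectors are included.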

First, here are the singular values for all the hidden state matrices of the first layer.

We see there are some directions that will be amplified by these matrices, but not to the extent we saw for the output matrix. Now, here’s the result of the above procedure for the input gate:

This shows how combinations of singular vectors of ${\bf W}_{hi}^{(1)}$ can begin to resemble particular directions of columns in ${\bf W}_{xi}^{(1)}$. In particular, after only the first few singular vectors, the resulting subspace contains most of the directions of the columns corresponding to tokens “<s>”, “M:4/4”, “M:6/8”, “K:Cmaj”, “|” and “2”.

Here’s the same but with the forget gate:

It takes only a few singular vectors to capture most of “/2” and “>”, but it takes many more to point along the dimensions related to meter tokens. Is this because the forget gate hidden state vector is taking care of the meter?

Here’s the same for cell gate:

It seems that we need to combine many singular vectors of ${\bf W}_{hc}^{(1)}$ for the subspace to encompass much of the columns corresponding to meter, mode, measure, and duration tokens.

Finally, here’s the result for the output gate:

By and large the directions present in the singular vectors of ${\bf W}_{ho}^{(1)}$ are very different from columns in ${\bf W}_{xo}^{(1)}$. It takes combining the rank-one matrices constituting about 50% of the cumulative singular values before the column space includes about 50% of any column in ${\bf W}_{xo}^{(1)}$. So it seems the actions of these two matrices are quite different.

And with that it seems I’ve made a mess of wires without any clearer indication of what the hell leads where in the folk-rnn v2 model. I think it’s time to perform a step-by-step simulation of the model.

Stay tuned!

# Making sense of the folk-rnn v2 model, part 3

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including #1068 appearing as the second tune in the following set played by three fantastic musicians:

and the lovely tune #4542

Last time we looked at the internal algorithms of this model (the three LSTM layers and the final softmax layer), and its training. Then we looked into the parameters of the softmax layer, which helped us get ideas of what the units of the last LSTM layer might be up to. Now it’s time to see how these units might be working together as a group.

Let’s perform a singular value decomposition of the full-rank matrix ${\bf V}$. This will give us one orthonormal basis for $\mathbb{R}^{512}$ and one for $\mathbb{R}^{137}$, ordered such that ${\bf V}{\bf w}_i = s_i {\bf u}_i$ with $s_1 > s_2 > \ldots > s_{137}$. First, let’s look at the sorted singular values.

The largest singular value is huge! About 509. This means that if the last layer of the LSTM produced the output ${\bf h}_t^{(3)} = {\bf w}_1$, then ${\bf V}{\bf h}_t^{(3)} = 509{\bf u}_1$. Since ${\bf u}_1$ has unit Euclidean norm and 137 entries, at least one of its values must have a magnitude of at least $1/\sqrt{137} \approx 0.085$, which means ${\bf V}{\bf h}_t^{(3)}$ has a value with magnitude of at least about 43 (and of at most 509). This is much larger than any of the values we see in the bias vector ${\bf v}$, and by the above relationship between $\delta$ and $\alpha$, this could pretty much mean making some tokens very, very likely.

What does this right singular vector ${\bf w}_1$ look like? And what does the corresponding left singular vector ${\bf u}_1$ look like? Here’s ${\bf w}_1$.

All of these values are within the range of the output units of the LSTM layer. Looking at the cumulative histogram of their values we can see a trimodal distribution.

If the last LSTM layer produced these values, about 70% of them are producing activations with magnitudes between 0.04 and 0.06, and the rest are producing activations between -0.04 and 0.04. Since the output units of this LSTM layer are bounded between -1 and 1, this means the model has a lot of room to vary the probability distribution it is generating.

Here’s the first left singular vector ${\bf u}_1$ (multiplied by -1 to flip so that the $\delta$ corresponding to the “<s>” token is small.)

We can see from this that the action of this first right singular vector increases the probability of particular tokens: “</s>”, “|:”, “:|”, “C,”, “D,”, “E,”, “F,”, “]”, “6” and “3/2” (among others). Let’s plot the probability distribution that results when we add $-509{\bf u}_1$ to the bias.

If we sample from this, we would see “|:” 27% of the time, “</s>” 17% of the time, and “_B,” about 10% of the time. The token “|:” signifies the start of a repeated phrase. In the training data, this occurs at the beginning of the 8 bar tune (the A part of an AABB tune) and/or at the beginning of the 8 bar turn (the B part). The token “</s>” represents the end of a transcription. This often occurs in the training data at the end of the turn. Hence, it makes sense that the model would put a high probability on both of those tokens. But why the “_B,”? (I don’t know yet.)

So, is this model exploiting the first left singular vector of the fully connected layer to start a new repeated section, or to finish a transcription? That’s something to test soon.

The second and third largest singular values are 65.76 and 63.45. Here are the corresponding right singular vectors.

These seem spikier than ${\bf w}_1$, and their amplitudes appear distributed more Gaussian. Here are the corresponding left singular vectors.

${\bf w}_2$ appears to decrease the probability of meter and key tokens, and increase the probability of some of the measure and pitch tokens. ${\bf w}_3$ doesn’t change the distribution of the meter and key tokens, but increases the probability of the measure tokens while decreasing the probability of the pitch tokens. Both increase the probability of duration tokens “2”, “/2”, “>” and “<”. When we add each to the bias with their singular values (and the negative singular values), we find the following probability distributions.

Hence, if the last layer of the LSTM were to output ${\bf w}_3$, then the model would output a “>” token 98% of the time. If it were to output $-{\bf w}_3$, then the model would produce a “b” token almost 60% of the time, and a “g” or “G,” about 30% of the time. If the last layer of the LSTM were to output $-{\bf w}_2$, then it would produce a “K:Cmaj” or “K:Cdor” about 80% of the time. Interesting… We could do this all day, but we don’t know if the folk-rnn v2 model is actually exploiting these directions in $\mathbb{R}^{512}$ to generate single tokens.

Stay tuned!

# Making sense of the folk-rnn v2 model, part 2

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including such greats as the four movements of Oded Ben-Tal’s fantastic “Bastard Tunes”:

and X:7153 harmonised by DeepBach and played on the St. Dunstan and All Saints church organ by Richard Salmon

The most immediate thing we can do is analyse the parameters of the layers closest to the token vocabulary. Let’s start with the parameters of the softmax output layer, i.e., the 137×512 matrix ${\bf V}$ and the 137-dimensional vector ${\bf v}$. These produce, by a linear combination of the columns of ${\bf V}$ plus the bias ${\bf v}$, what we treat (after the softmax) as a multinomial probability distribution over the vocabulary. The last LSTM layer gives the weights for this linear combination, i.e., ${\bf h}_t^{(3)}$. Hence, we can interpret the columns of ${\bf V}$ and ${\bf v}$ as “shapes” contributing to the probability distribution generated by the model. (The softmax merely compresses this shape since it is a monotonically increasing function.) Each dimension of a column corresponds to the probability of a particular token. Increase the value in that dimension and we increase its probability; decrease the value and we decrease its probability. Below is a plot of all columns of ${\bf V}$ (black) and the bias ${\bf v}$ (red).

We can see some interesting things. If the input to the softmax layer were all zeros the resulting shape would look like the red line. We see low values for the tokens “<s>”, “=f'”, “16”, “(7”, “^f'” and “=C,”. These tokens are rarely generated by the model, and in fact are rare in the training data, with the exception of “<s>”, which is always the first token input to the model (the v2 model never generates this token). In the training data with 4,032,490 tokens, we find “(7” appears only once, “=f'” and “^f'” appear only twice each, “16” appears only five times, and “=C,” appears only six times. The largest value of the bias corresponds to the token “C”, which is sensible since we have transposed all 23,000 transcriptions in the training data to modes with the root C. “C” appears 191,424 times in the training data.
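To see the "shapes" idea in action, here is a toy version of the output layer. Everything in it is made up for illustration: a four-token vocabulary, two hidden units, and hand-picked values for `V` and `v`. It computes softmax(${\bf V}{\bf h} + {\bf v}$) for an all-zero ${\bf h}$ (which leaves only the bias shape) and for an ${\bf h}$ that switches on one unit.

```python
import math

def softmax(z):
    m = max(z)                        # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

tokens = ["<s>", "C", "|", "(7"]      # hypothetical 4-token vocabulary
v = [-6.0, 2.0, 1.0, -4.0]            # hypothetical bias: low for rare tokens, high for "C"
V = [[0.0, 0.0],                      # hypothetical 4x2 matrix, one column per hidden unit
     [1.0, -1.0],
     [-1.0, 2.0],
     [0.0, 0.5]]

def distribution(h):
    """softmax(V h + v): the token distribution for hidden state h."""
    logits = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(V, v)]
    return softmax(logits)

p_zero = distribution([0.0, 0.0])     # only the bias contributes
p_unit = distribution([0.0, 1.0])     # unit 2 fully on: adds column 2 of V to the bias

for name, p in (("h = 0", p_zero), ("unit 2 on", p_unit)):
    print(name, {t: round(q, 3) for t, q in zip(tokens, p)})
```

With ${\bf h} = 0$ the most likely token is whichever has the largest bias; turning on a unit adds its column to the bias and can shift the mass to another token.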

Another interesting thing to observe is how the columns of ${\bf V}$ together resemble a landscape viewed over water with a mean of 0. In fact, the wide matrix ${\bf V}$ has full rank (137), so its null space has dimension $512 - 137 = 375$, and so there are ways to sum the columns to create a zero vector, or even to completely cancel the bias ${\bf v}$. With ${\bf V}$ full rank, we know that our model can potentially reach any point on the positive face of the $\ell_1$-unit ball of $\mathbb{R}^{137}$. This would not be possible if the number of output units of the last LSTM layer was smaller than 137.

How do all the columns of ${\bf V}$ relate to one another? We can get an idea by visualising the Gramian of the matrix, i.e., $\hat {\bf V}^T \hat{\bf V}$, where $[\hat{\bf V}]_{i} := [{\bf V}]_{i}/\|[{\bf V}]_{i}\|_2$. (The notation $[{\bf A}]_{i}$ means the $i$th column of ${\bf A}$.) Below we see the Gramian of ${\bf V}$, plotted with reference to the units of the last LSTM layer. We see a lot of dark red and dark blue, which means the units in the last LSTM layer resemble the unfortunately high partisanship in the USA. No, really, it means that many of the columns in ${\bf V}$ point in nearly the same directions (up to a change of sign); but there are some vectors that point in more unique ways. Here’s a scatter plot of all the values in the image above sorted according to the variance we see along each row (or column equivalently).

That looks exactly like an X-ray of an empty burrito. Let’s zoom in on those units weighting columns that are quite different from nearly everything else in ${\bf V}$:

The output units 175, 216 and 497 of the last LSTM layer are contributing in unique ways to the distribution ${\bf p}_t$. Let’s look at the corresponding columns and see if we can’t divine some interpretation.

It looks like column 175 points most strongly in the direction of tokens “]” and “5”, and in the opposite direction of “9” and “(2”. Column 216 points most strongly in the direction of tokens “=c” and “/2>”, and in the opposite direction of tokens “^g” and “_B”. And column 497 points most strongly in the direction of tokens “=B,”, “_C” and “_c”, and in the opposite direction of tokens “_A,” and “^g”. Taken together with the above, this shows that the folk-rnn v2 model can make “]” and “5” more probable and “9” and “(2” less probable by increasing the output of unit 175 of the last LSTM layer. That’s not so meaningful. However, increasing the output of unit 497 will increase the probability of “=B,”, “_C” and “_c” and decrease the probability of “_A,” and “^g”. What’s very interesting here is that “=B,”, “_C” and “_c” are the same pitch class, as are “_A,” and “^g”. Unit 497 is treating the probabilities of the tokens in those two groups in similar ways. Has this unit learned about the relationship between these tokens?
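To make the mechanism concrete, here is a toy sketch of how raising the output of a single unit shifts the distribution ${\bf p}_t = \textrm{softmax}({\bf V}{\bf h} + {\bf v})$. Everything in it is hypothetical: the four placeholder tokens, the three units, and all the numbers in `V`, `v` and `h` are invented for illustration and are not taken from the folk-rnn v2 model.

```python
import math


def softmax(w):
    """Numerically stable softmax of a list of floats."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(V, v, h):
    """Return softmax(V h + v), with V stored as a list of rows."""
    w = [sum(V_ij * h_j for V_ij, h_j in zip(row, h)) + v_i
         for row, v_i in zip(V, v)]
    return softmax(w)


# Hypothetical toy values: 4 tokens (rows of V) and 3 units (columns of V).
tokens = ["tok0", "tok1", "tok2", "tok3"]
V = [[1.5, 0.2, -0.1],
     [1.2, -0.3, 0.4],
     [-1.4, 0.1, 0.3],
     [0.0, 0.5, -0.2]]
v = [0.1, -0.2, 0.0, 0.3]   # bias
h = [0.0, 0.3, -0.2]        # outputs of the last layer

p_before = output_distribution(V, v, h)

# Raise the output of unit 0 by one; this adds column 0 of V to the
# pre-softmax values.
h_bumped = [h[0] + 1.0] + h[1:]
p_after = output_distribution(V, v, h_bumped)

ratios = [after / before for after, before in zip(p_after, p_before)]
column_0 = [row[0] for row in V]
for token, entry, ratio in zip(tokens, column_0, ratios):
    print(f"{token}: column entry {entry:+.1f}, probability x {ratio:.2f}")
```

The ratios come out in the same order as the entries of the bumped column: tokens with the largest positive entries gain the most probability, and those with the most negative entries lose the most.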

Since these values are additive inside the softmax, we can derive what they mean in terms of the change in probability. For the $n$th token with pre-softmax value $[{\bf w}]_n$ displaced by some $\delta$, we want to find $\alpha$ in the following:

$[{\bf p'}_t ]_n = \alpha [{\bf p}_t ]_n$

where ${\bf p'}_t = \textrm{softmax}({\bf w} + \delta {\bf e}_n)$, and ${\bf e}_n$ is the $n$th standard basis vector of $\mathbb{R}^{137}$. For the above to satisfy the axioms of probability, we must restrict the value of $\alpha$: $0 \le \alpha \le 1/[{\bf p}_t ]_n$.

After a bit of algebra, we find the following nice relationship:

$\alpha = e^\delta / (1-(1-e^\delta)[{\bf p}_t ]_n), -\infty < \delta < \infty$

or

$\delta = \log \alpha + \log \left [(1-[{\bf p}_t ]_n)/(1-\alpha [{\bf p}_t ]_n) \right ], 0 \le \alpha \le 1/[{\bf p}_t ]_n$.

As $\delta \to 0$, the probability of the token does not change. As $\delta \to -\infty$, the probability of the token goes to zero. And $\alpha \to 1/[{\bf p}_t ]_n$ as $\delta \to \infty$. Also, as $[{\bf p}_t ]_n \to 0$, then $\alpha \to e^\delta$.
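The relationship between $\alpha$ and $\delta$ can be checked numerically. The sketch below compares the closed form against a direct softmax computation on an arbitrary pre-softmax vector; the vector `w`, the index `n`, the value of `delta` and the function names are all assumptions chosen for illustration.

```python
import math


def softmax(w):
    """Numerically stable softmax of a list of floats."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


def alpha_from_delta(delta, p):
    """Factor by which a token's probability p changes when its
    pre-softmax value alone is displaced by delta."""
    return math.exp(delta) / (1.0 - (1.0 - math.exp(delta)) * p)


def delta_from_alpha(alpha, p):
    """Displacement needed to scale probability p by alpha (0 < alpha < 1/p)."""
    return math.log(alpha) + math.log((1.0 - p) / (1.0 - alpha * p))


# Arbitrary illustrative pre-softmax values; displace entry n by delta.
w = [0.3, -1.2, 2.0, 0.5, -0.7]
n = 3
delta = 0.8

p = softmax(w)
w_shifted = list(w)
w_shifted[n] += delta
p_shifted = softmax(w_shifted)

alpha_direct = p_shifted[n] / p[n]              # measured from the softmax
alpha_formula = alpha_from_delta(delta, p[n])   # closed form
delta_back = delta_from_alpha(alpha_formula, p[n])

print("alpha (direct): ", alpha_direct)
print("alpha (formula):", alpha_formula)
print("delta recovered:", delta_back)
```

Note that the closed form covers a displacement of one token only; a whole column of ${\bf V}$ moves many pre-softmax values at once, so applying it to a single token is an approximation.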

Applying this to the above, we can see that if the last LSTM layer unit weighting column 497 of ${\bf V}$ were to spit out a +1, then, all other things being equal, it would increase the probability of “=B,” by a factor of about 4 if it were initially about 5% probable. If it spit out a -1, then, all other things being equal, it would reduce the probability of “=B,” to about 1/4 of its value if it were about 5% probable.
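As a rough check of those numbers, we can work backwards from the quoted factor. The entry of column 497 for “=B,” is not restated here, so the value of $\delta$ below is inferred from $\alpha \approx 4$ and $[{\bf p}_t ]_n = 0.05$, and should be read as an assumption:

$e^\delta = \alpha (1-[{\bf p}_t ]_n)/(1-\alpha [{\bf p}_t ]_n) = 4(0.95)/(0.8) = 4.75, \quad \delta = \log 4.75 \approx 1.56$

Flipping the sign of the unit’s output gives $\delta \approx -1.56$, and so

$\alpha = e^{-1.56}/(1-(1-e^{-1.56})(0.05)) \approx 0.21/0.96 \approx 0.22$

which is roughly the factor of 1/4 quoted above.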

Returning to the scatter plot of the Gramian values above, we see that some columns of ${\bf V}$ are almost identical. The largest value we see is 0.9968 (columns 187 and 2) and the smallest is -0.9937 (187 and 361). Here are the six columns of ${\bf V}$ that point most in the same directions (up to a scalar):

We get the sense here that units 2, 187 and 361 represent nearly the same information, as do units 34, 156 and 203. Interestingly, the columns 2, 187 and 361 don’t seem interested in three octaves of pitch tokens from “A,” to “a”. Furthermore, they don’t seem to greatly change the probability of any specific tokens. However, that impression is misleading. Here are the normalised vectors resulting from the differences between four of these normalised vectors.

Units 2, 187 and 361 may be weighting columns that point in nearly the same direction, but this shows their sums and differences point strongly in the direction of increasing the probability of some duration tokens “/2”, “2” and “>”, and decreasing the probability of other duration tokens “4” and “6”. The sums and differences of the other three columns do not show such a strong tendency for particular tokens, but we do see their differences point in the direction of mode tokens “K:Cmaj”, “K:Cmin”, “K:Cdor” and “K:Cmix”.

Anyhow, this shows how two units may be combining vectors that point in nearly the same direction (thanks to 137-dimensional space), but their combination can produce marked effects in specific directions. It is important to interpret the columns of ${\bf V}$, and likewise the units of the last LSTM layer, as a group and not as individual embodiments of our vocabulary.
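Here is a small sketch of that effect, using two made-up 8-dimensional vectors rather than actual columns of ${\bf V}$ (the vectors and the helper names are assumptions for illustration). The two vectors are almost parallel, yet their normalised difference points entirely along the one coordinate where they disagree.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def normalise(a):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


def cosine(a, b):
    """Cosine of the angle between two vectors (a Gramian entry
    when both vectors are normalised columns)."""
    return dot(normalise(a), normalise(b))


# Two hypothetical columns that agree everywhere except for a small,
# opposite-signed entry at index 6.
a = normalise([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.05, 0.0])
b = normalise([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -0.05, 0.0])

cos_ab = cosine(a, b)                                  # close to 1
diff = normalise([x - y for x, y in zip(a, b)])        # normalised difference
cos_diff_a = cosine(diff, a)                           # close to 0

print("cosine between a and b:", round(cos_ab, 4))
print("normalised difference: ", [round(x, 3) for x in diff])
print("cosine between difference and a:", round(cos_diff_a, 4))
```

So a Gramian entry near 1 says two columns are nearly redundant on their own, but says little about what their difference can do when the two units move in opposite directions.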

Stay tuned!