Making sense of the folk-rnn v2 model, part 4

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including the three transcriptions interpreted by Torbjörn Hultmark:

and my 2017 Xmas (re)work, “Carol of the Cells”:

As a reminder, the first LSTM layer transforms a vector ${\bf x}_t$ in $\mathbb{R}^{137}$ by the following algorithm:

${\bf i}_t^{(1)} \leftarrow \sigma({\bf W}_{xi}^{(1)}{\bf x}_t + {\bf W}_{hi}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_i^{(1)})$
${\bf f}_t^{(1)} \leftarrow \sigma({\bf W}_{xf}^{(1)}{\bf x}_t + {\bf W}_{hf}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_f^{(1)})$
${\bf c}_t^{(1)} \leftarrow {\bf f}_t^{(1)}\odot{\bf c}_{t-1}^{(1)} + {\bf i}_t^{(1)} \odot \tanh({\bf W}_{xc}^{(1)}{\bf x}_t + {\bf W}_{hc}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_c^{(1)})$
${\bf o}_t^{(1)} \leftarrow \sigma({\bf W}_{xo}^{(1)}{\bf x}_t + {\bf W}_{ho}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_o^{(1)})$
${\bf h}_t^{(1)} \leftarrow {\bf o}_t^{(1)}\odot \tanh({\bf c}_{t}^{(1)})$
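The five updates above can be sketched in NumPy as follows. This is only an illustration: the parameter dictionary `params`, the helper names, and the random weights are assumptions standing in for the trained folk-rnn v2 parameters, which are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of the first LSTM layer, following the five updates above."""
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    h = o * np.tanh(c)
    return h, c

# Random stand-ins for the trained parameters (NOT the folk-rnn v2 weights).
rng = np.random.default_rng(0)
n_in, n_hid = 137, 512
params = {}
for g in "ifco":
    params[f"W_x{g}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
    params[f"W_h{g}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    params[f"b_{g}"] = np.zeros(n_hid)

# A one-hot input and zero initial states (hypothetical starting point).
x = np.zeros(n_in)
x[0] = 1.0
h, c = lstm_step(x, np.zeros(n_hid), np.zeros(n_hid), params)
```

Because ${\bf h}_t^{(1)}$ is a product of a sigmoid and a tanh, every entry of `h` lies in $[-1, 1]$ whatever the weights are.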

This layer is trained to map each standard basis vector of $\mathbb{R}^{137}$ to a point inside the $\ell_\infty$-unit ball of $\mathbb{R}^{512}$. Each ${\bf x}_t$ is first transformed by the 512×137 matrices ${\bf W}_{xi}^{(1)}, {\bf W}_{xf}^{(1)}, {\bf W}_{xc}^{(1)}, {\bf W}_{xo}^{(1)}$. Since our input vectors ${\bf x}_t$ have a sparsity of 1 this means we are just seeing single columns from these matrices. If ${\bf x}_t$ has a 1 in row 12, then ${\bf W}_{xi}^{(1)}{\bf x}_t$ is column 12 of ${\bf W}_{xi}^{(1)}$. Let’s look at the columns of these four matrices.
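A quick numerical check of the column-selection property, using a random stand-in for ${\bf W}_{xi}^{(1)}$ (the trained matrix is not reproduced here, and the index `k` is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
W_xi = rng.normal(size=(512, 137))   # stand-in, not the trained matrix

k = 12                               # index of the active token (example choice)
x = np.zeros(137)
x[k] = 1.0                           # one-hot input vector

selected = W_xi @ x                  # the matrix-vector product ...
column = W_xi[:, k]                  # ... equals this single column
```

Whether index 12 is counted from zero or one is only a matter of convention; the point is that the product reduces to a column lookup.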

Here’s the matrix of the input gate ${\bf W}_{xi}^{(1)}$:

It appears that the token “<s>”, the first one to be encountered by the model, activates a column of negative numbers with a mean of -0.58. The second column, corresponding to token “</s>”, has a very different appearance. Here they are plotted together.

So the first token seen by the folk-rnn v2 model, “<s>”, injects a bunch of negative numbers into the system. The model has never been trained to model what comes after a “</s>” token. So, in reality, we can lop off this column of each of these first-layer matrices.

Here’s the matrix of the forget gate ${\bf W}_{xf}^{(1)}$:

The values in this matrix do not vary much. The largest variances we see in its columns correspond to measure tokens, some duration tokens (“/2”, “>”), the triplet grouping token “(3”, and the pitch tokens “G,”, “A,”, “B,”, “C”, “D”, “E”, “F”, “G”, “A”, “B”, “c”, “d”, “e”, “f”, “g”, “a”, “b”, “c'” and “d'”.

Here’s the matrix of the cell gate ${\bf W}_{xc}^{(1)}$:

We can clearly see variety in many of its columns, particularly those of the meter and mode tokens, some duration tokens, and several pitch tokens. Finally, here’s the matrix of the output gate ${\bf W}_{xo}^{(1)}$:

Looking at the norms of the columns of these matrices shows their clear preferences for particular tokens. (I plot the negative norms of two of the matrices for readability.)

The matrix of the forget gate ${\bf W}_{xf}^{(1)}$ is not very sensitive to meter and mode tokens, but has preferences for the measure tokens, some pitch tokens, the triplet grouping token, and some specific duration tokens. The matrix of the output gate ${\bf W}_{xo}^{(1)}$ has columns with norms that can be about 2-3 times that of the other gates, and is quite sensitive to some pitch and duration tokens.
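The column norms discussed above are straightforward to compute. Here is a minimal sketch with a random stand-in matrix; the function name `column_norms` is hypothetical, and the numbers it produces say nothing about the trained model.

```python
import numpy as np

def column_norms(W):
    """Euclidean norm of each column of W (one value per token)."""
    return np.linalg.norm(W, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 137))          # stand-in for one input vector matrix

norms = column_norms(W)
largest = np.argsort(norms)[::-1][:10]   # indices of the ten largest-norm columns
```

With the real matrices, mapping `largest` back through the vocabulary would list the tokens each gate responds to most strongly.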

Let’s look at the Gramian matrices of each of these to see how the columns are related by the direction they point. Here’s the Gramian of the input gate matrix ${\bf W}_{xi}^{(1)}$ (with each of its columns normalised first):

We can see clear groupings of kinds of tokens. The columns corresponding to meter and mode tokens kind of point in the same directions, but a bit opposite of the pitch tokens. The pitch tokens point in the same direction, and point along directions of several duration and grouping tokens as well. The tokens of columns that are most unlike all others include “”, “M:3/2”, “^C,”, “D”, “E”, “F”, “A”, “B”, “e”, and “=f'”.
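A Gramian of normalised columns is just the matrix of cosines between every pair of columns. A minimal sketch, again with a random stand-in matrix and a hypothetical helper name:

```python
import numpy as np

def normalised_gramian(W):
    """Cosine of the angle between every pair of columns of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm columns
    return Wn.T @ Wn

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 137))   # stand-in, not the trained matrix

G = normalised_gramian(W)         # shape (137, 137)
```

Entry `G[j, k]` near 1 means columns `j` and `k` point the same way, near -1 means opposite, and near 0 means they are close to orthogonal.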

Here’s the Gramian of the forget gate matrix ${\bf W}_{xf}^{(1)}$:

This one shows more correspondence between columns of pitch tokens and mode tokens than does the Gramian of the input gate matrix ${\bf W}_{xi}^{(1)}$. The columns weighting the modes and meters are by and large orthogonal.

Here’s the Gramian of the cell gate matrix ${\bf W}_{xc}^{(1)}$:

This shows some interesting diagonal structures! Let’s zoom in on some of these in the pitch tokens.

This clearly shows that the columns corresponding to “A” and “=A”, “B” and “=B”, and so on point strongly in the same directions of $\mathbb{R}^{512}$. These pitches are equivalent. We also see strong correspondence between pitches that are a half-step apart, e.g., “=D”, “_D”, and “^D”, “=E” and “_E”, etc. There are similar diagonal structures for the tokens without the natural sign “=”. It is interesting that the model has picked up on this equivalence. I don’t know how this arises, but it may be by virtue of C-major transcriptions that contain accidentals, e.g.,

T:A Curious Denis Murphy
M:6/8
K:Cmaj
|: A | A B c d e f | g a g g f d | c 3 c B c | d e f d B G |
A B c d e f | g 3 g f d | c B c A d c | B G G G 2 :|
|: A | _B 2 d c A A | _B G E F G A | _B A B c =B c | d e f d B G |
_B 2 d c A A | _B G E F G A | _B d d c A A | A G G G 2 :|


I don’t know if they are so plentiful, however. Finally, here’s the Gramian of the output gate matrix ${\bf W}_{xo}^{(1)}$:

This shows some of the same diagonal structures we see in the Gramian of the cell gate matrix ${\bf W}_{xc}^{(1)}$, but also includes relationships between meter and mode tokens.

Let’s look at the bias in each one of these gates of the first layer:

The norms of these four bias vectors are quite different: input: 18.5; forget: 99.9; cell: 3.5; and output: 22.2. We can see that the bias of the forget gate is entirely positive, whereas the biases of the input and output gates are for the most part negative. It is interesting to note that, of the four gates, the forget gate has a bias with a norm that is more than twice the norm of any column in the forget gate matrix ${\bf W}_{xf}^{(1)}$. That means the bias of the forget gate will really push in its direction the values of the linear transformation ${\bf W}_{xf}^{(1)}{\bf x}_t$.

Let’s plot the inner products between the normalised bias vector and the normalised columns of each matrix to see the relationships between the directions:

We can see that in the input and output gates, the bias vectors point strongly in the direction of the columns corresponding to the start transcription token. The bias in the forget gate points mostly in the opposite direction of the column having to do with this token. This means that when the first layer encounters an ${\bf x}_t$ that encodes “<s>”, then ${\bf W}_{xi}^{(1)}{\bf x}_t$ will produce a vector with a norm of about 14, and adding ${\bf b}_i^{(1)}$ to that will about double its norm. The opposite happens in the forget gate: ${\bf W}_{xf}^{(1)}{\bf x}_t$ creates a vector that has a norm of about 2, and then adding the bias creates a vector with a norm of about 98. Here’s a plot showing the result of this addition after a sigmoid operation:

We can thus see that when the forget gate encounters the token “<s>” it could be producing a vector that is close to saturation. In fact, since the norm of each column of ${\bf W}_{xf}^{(1)}$ is small for nearly all tokens, adding the bias vector is likely to result in saturation. Here is a plot of the mean value of the sigmoid of each column of ${\bf W}_{xf}^{(1)}$ added to the bias ${\bf b}_f^{(1)}$:

We see very little difference in the values across the tokens when it is saturated. The behaviour of the bias in the input gate is different:

Adding the bias pulls the mean values down. We see the same behaviour in the output gate:

Since the bias of the cell gate is so small, we see very little difference in the mean values (note the change in y-scale):

However, there are more actors here than just the biases and the input vector matrices.
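Before turning to those, the bias analysis above can be summarised in a short sketch. Everything numeric here is a stand-in: the forget gate bias is assumed to be uniform and positive with norm 99.9, and the matrix columns are random with norms of about 2, which mimics the reported magnitudes but not the trained values. The helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bias_column_cosines(W, b):
    """Cosine between the bias and each column of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return (b / np.linalg.norm(b)) @ Wn

def mean_gate_activation(W, b, activation=sigmoid):
    """Mean over the 512 units of activation(column + bias), one value per token."""
    return activation(W + b[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
n_hid, n_in = 512, 137

# Stand-ins mimicking the reported magnitudes (assumptions, not trained values).
W_xf = rng.normal(scale=2.0 / np.sqrt(n_hid), size=(n_hid, n_in))  # column norms near 2
b_f = np.full(n_hid, 99.9 / np.sqrt(n_hid))                        # positive, norm 99.9

cosines = bias_column_cosines(W_xf, b_f)
means = mean_gate_activation(W_xf, b_f)
```

With a bias this much larger than any column, `means` is nearly constant across tokens and close to 1, which is the saturation described above. For the cell gate one would pass `activation=np.tanh` instead.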

The affine transformations of the gates we have analysed so far in this first layer are independent of time, or more accurately, of sequence position. The mapping in each gate is time-dependent, however, and this is where the previous hidden state vector ${\bf h}_{t-1}^{(1)}$ comes into play. Let’s now analyse the parameters ${\bf W}_{hi}^{(1)}, {\bf W}_{hf}^{(1)}, {\bf W}_{hc}^{(1)}$, and ${\bf W}_{ho}^{(1)}$.

The images of the matrices themselves do not say much. Let’s look instead at the set of inner products between the normalised columns of the input vector matrices summed with the biases, and the normalised columns of the hidden state matrices. For example, for the input gate we are looking at the cosine of the angle between $[{\bf W}_{xi}^{(1)}]_{:k} + {\bf b}_i^{(1)}$ and each column of ${\bf W}_{hi}^{(1)}$. We know that each ${\bf x}_t$ will select only one column of ${\bf W}_{xi}^{(1)}$ and add it to the bias. So, how do the columns of ${\bf W}_{hi}^{(1)}$ relate?

Here’s the result for the input gate. (The tokens on the x-axis are telling us which column of ${\bf W}_{xi}^{(1)}$ we are comparing to ${\bf W}_{hi}^{(1)}$.)

Clearly, we’ve entered The Matrix.

More telling is the distribution of values in this matrix:

This shows that most of the vectors in ${\bf W}_{hi}^{(1)}$ bear little resemblance to the vectors in ${\bf W}_{xi}^{(1)}$. If the angle between two vectors is between $0.46\pi$ and $0.54\pi$ radians, then the magnitude of their inner product is no greater than 0.13 times the product of their lengths. This says the columns in ${\bf W}_{hi}^{(1)}$ point in directions that are largely orthogonal to those in ${\bf W}_{xi}^{(1)}$. However, this does not mean that every linear combination of the columns in ${\bf W}_{hi}^{(1)}$ will be different from the columns in ${\bf W}_{xi}^{(1)}$. We find ${\bf W}_{hi}^{(1)}$ has full rank, and so any point in $\mathbb{R}^{512}$ can be reached. However, we must be careful since ${\bf h}_t \in [-1,1]^{512}$.
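The comparison above can be sketched as follows, assuming the angles are reported as fractions of $\pi$ (so that 0.5 means orthogonal). The matrices are random stand-ins and the helper name is hypothetical, so the resulting numbers only illustrate the computation.

```python
import numpy as np

def cosines_to_hidden_columns(W_x, b, W_h):
    """Cosines between each (column of W_x + bias) and each column of W_h."""
    A = W_x + b[:, None]
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    B = W_h / np.linalg.norm(W_h, axis=0, keepdims=True)
    return B.T @ A   # rows: columns of W_h; columns: tokens

rng = np.random.default_rng(0)
W_xi = rng.normal(size=(512, 137))   # stand-ins, not the trained parameters
b_i = rng.normal(size=512)
W_hi = rng.normal(size=(512, 512))

C = cosines_to_hidden_columns(W_xi, b_i, W_hi)
angles = np.arccos(np.clip(C, -1.0, 1.0)) / np.pi          # in units of pi
near_orthogonal = np.mean((angles > 0.46) & (angles < 0.54))

bound = np.cos(0.46 * np.pi)          # largest |cosine| inside that band
rank = np.linalg.matrix_rank(W_hi)    # 512 would mean full rank
```

The value of `bound` is where the 0.13 figure comes from, and `near_orthogonal` is the fraction of column pairs that fall inside the band.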

Now, what about the other gates? Here’s the above in the forget gate.

The histogram shows the distribution of values to be more Laplacian distributed than those of the input gate.

Here’s the above for the cell gate:

The histogram shows these values are more Student t-distributed, but still there is very little resemblance in the directions of the collection of vectors.

Finally, here’s the above for the output gate:

And the histogram:

In each gate of the first layer, we see that the contributions from the input vector summed with the bias could be quite different from that of the hidden state vector. All of the above seem to suggest that the hidden state vectors inhabit subspaces of $[-1, 1]^{512}$ that are a bit separated from the column spaces of the input vector matrices… which is kind of neat to think about: there is a region in $[-1, 1]^{512}$ where the first layer is storing information that is a bit orthogonal to the current input.

Let’s look at the singular value decomposition of these hidden state matrices and see how the subspaces spanned by their singular vectors correspond to the columns of the input vector matrices. What I am interested in measuring is how each column of ${\bf W}_{xi}^{(1)}$ influences the transformation accomplished by ${\bf W}_{hi}^{(1)}$. So, I’m going to perform a singular value decomposition of ${\bf W}_{hi}^{(1)}$, then form approximations of the matrix by summing outer products of its left and right singular vectors, and find the norms of the resulting transformations of the columns of ${\bf W}_{xi}^{(1)}$. That is, given the first left and right singular vectors, ${\bf u}_1$ and ${\bf v}_1$ respectively, and the $k$th column of ${\bf W}_{xi}^{(1)}$, ${\bf w}_k$, I will compute $\|{\bf u}_1{\bf v}_1^T {\bf w}_k\|_2/ \|{\bf w}_k\|_2$. Then I will compute $\|({\bf u}_1{\bf v}_1^T + {\bf u}_2{\bf v}_2^T){\bf w}_k\|_2/ \|{\bf w}_k\|_2$, and so on.
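Here is a minimal sketch of that procedure, reading "summing outer products" as accumulating the matching pairs ${\bf u}_i{\bf v}_i^T$ without the singular values. Since the ${\bf u}_i$ are orthonormal, the norm of the transformed column equals the norm of its coefficients on the first $j$ right singular vectors, which avoids building each partial matrix. The matrices are random stand-ins and the function name is hypothetical.

```python
import numpy as np

def captured_fractions(W_h, W_x):
    """Row j-1 holds ||sum_{i<=j} u_i v_i^T w_k|| / ||w_k|| for each column k of W_x."""
    _, _, Vt = np.linalg.svd(W_h)
    coeffs = Vt @ W_x                                   # coefficient on each v_i
    partial = np.sqrt(np.cumsum(coeffs ** 2, axis=0))   # norm after j terms
    return partial / np.linalg.norm(W_x, axis=0, keepdims=True)

rng = np.random.default_rng(0)
W_hi = rng.normal(size=(512, 512))   # stand-ins, not the trained matrices
W_xi = rng.normal(size=(512, 137))

fractions = captured_fractions(W_hi, W_xi)   # shape (512, 137)
```

Each column of `fractions` starts near 0 and rises to 1 once all 512 singular vectors are included; how quickly it rises shows how well the leading singular directions capture that token's column.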

First, here are the singular values for all the hidden state matrices of the first layer.

We see there are some directions that will be amplified by these matrices, but not to the extent we saw for the output matrix. Now, here’s the result of the above procedure for the input gate:

This shows how combinations of singular vectors of ${\bf W}_{hi}^{(1)}$ can begin to resemble particular directions of columns in ${\bf W}_{xi}^{(1)}$. In particular, after only the first few singular vectors, the resulting subspace contains most of the directions of the columns corresponding to tokens “<s>”, “M:4/4”, “M:6/8”, “K:Cmaj”, “|” and “2”.

Here’s the same but with the forget gate:

It takes only a few singular vectors to capture most of “/2” and “>”, but it takes many more to point along the dimensions related to meter tokens. Is this because the forget gate hidden state vector is taking care of the meter?

Here’s the same for the cell gate:

It seems that we need to combine many singular vectors of ${\bf W}_{hc}^{(1)}$ for the subspace to encompass much of the columns corresponding to meter, mode, measure, and duration tokens.

Finally, here’s the result for the output gate:

By and large the directions present in the singular vectors of ${\bf W}_{ho}^{(1)}$ are very different from columns in ${\bf W}_{xo}^{(1)}$. It takes combining the rank-one matrices constituting about 50% of the cumulative singular values before the column space includes about 50% of any column in ${\bf W}_{xo}^{(1)}$. So it seems the actions of these two matrices are quite different.
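One way to make the "50% of the cumulative singular values" measurement concrete is sketched below. It assumes the fraction is the running sum of singular values divided by their total, and it uses random stand-in matrices, so the resulting numbers say nothing about the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
W_ho = rng.normal(size=(512, 512))   # stand-ins, not the trained matrices
W_xo = rng.normal(size=(512, 137))

_, s, Vt = np.linalg.svd(W_ho)
cumulative = np.cumsum(s) / np.sum(s)

# Smallest number of rank-one terms whose singular values reach half the total.
j_half = int(np.searchsorted(cumulative, 0.5)) + 1

# Fraction of each column of W_xo captured by those j_half terms.
captured = np.linalg.norm(Vt[:j_half] @ W_xo, axis=0) / np.linalg.norm(W_xo, axis=0)
best_captured = captured.max()
```

Comparing `best_captured` against 0.5 is the check described above: if even the best-captured column sits near one half at that point, the leading directions of the hidden state matrix have little to do with the input vector matrix.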

And with that it seems I’ve made a mess of wires without any clearer indication of what the hell leads where in the folk-rnn v2 model. I think it’s time to perform a step-by-step simulation of the model.

Stay tuned!