Making sense of the folk-rnn v2 model, part 3

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, and which helped compose a large amount of new music, including #1068 appearing as the second tune in the following set played by three fantastic musicians:

and the lovely tune #4542

Last time we looked at the internal algorithms of this model (the three LSTM layers and the final softmax layer), and its training. Then we looked into the parameters of the softmax layer, which helped us get ideas of what the units of the last LSTM layer might be up to. Now it’s time to see how these units might be working together as a group.

Let’s perform a singular value decomposition of the full-rank matrix ${\bf V}$. This will give us one orthonormal basis for $\mathbb{R}^{512}$ and one for $\mathbb{R}^{137}$, ordered such that ${\bf V}{\bf w}_i = s_i {\bf u}_i$ with $s_1 > s_2 > \ldots > s_{137}$. First, let’s look at the sorted singular values.
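For anyone who wants to follow along, here is a minimal NumPy sketch of this step. It does not load the trained folk-rnn weights: `V` below is a random stand-in with the same 137×512 shape, and the names `U`, `s` and `W` are my own choices for illustration.

```python
import numpy as np

# Stand-in for the softmax-layer weight matrix (137 tokens x 512 LSTM units).
# The real values would come from the trained model; these are random.
rng = np.random.default_rng(0)
V = rng.standard_normal((137, 512))

# Thin SVD: V = U @ diag(s) @ W.T, with singular values sorted in descending order.
U, s, Wt = np.linalg.svd(V, full_matrices=False)
W = Wt.T  # columns are the right singular vectors w_i in R^512

# Check the defining relation V w_i = s_i u_i for the first pair.
assert np.allclose(V @ W[:, 0], s[0] * U[:, 0])
print(U.shape, s.shape, W.shape)
```

Note that the thin SVD returns only the 137 right singular vectors associated with the singular values; passing `full_matrices=True` would complete them to a full orthonormal basis of $\mathbb{R}^{512}$.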

The largest singular value is huge! About 509. This means that if the last layer of the LSTM produced the output ${\bf h}_t^{(3)} = {\bf w}_1$, then ${\bf V}{\bf h}_t^{(3)} = 509{\bf u}_1$. Since ${\bf u}_1$ has unit Euclidean norm and 137 entries, at least one of its values must have magnitude of at least $1/\sqrt{137}$, which means $509{\bf u}_1$ has a value with magnitude of at least about 43 (and of at most 509). This is much larger than any of the values we see in the bias vector ${\bf v}$, and by the above relationship between $\delta$ and $\alpha$, this could make some tokens very, very likely.
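The bound on the largest entry follows from the unit norm alone: if every entry of ${\bf u}_1$ were smaller in magnitude than $1/\sqrt{137}$, the squared entries would sum to less than 1. Taking the singular value as roughly 509, and writing $u_{1,k}$ for the $k$-th entry of ${\bf u}_1$ (my notation), the arithmetic works out as:

```latex
\sum_{k=1}^{137} u_{1,k}^2 = 1
\;\Rightarrow\;
\max_k |u_{1,k}| \ge \frac{1}{\sqrt{137}} \approx 0.0854,
\qquad
\frac{509}{\sqrt{137}} \approx 43.5 \;\le\; \max_k |509\, u_{1,k}| \;\le\; 509 .
```

The upper bound is attained only if ${\bf u}_1$ has a single nonzero entry, and the lower bound only if all 137 entries have equal magnitude.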

What does this right singular vector ${\bf w}_1$ look like? And what does the corresponding left singular vector ${\bf u}_1$ look like? Here’s ${\bf w}_1$.

All of these values are within the range of the output units of the LSTM layer. Looking at the cumulative histogram of their values we can see a trimodal distribution.

If the last LSTM layer produced these values, about 70% of them are producing activations with magnitudes between 0.04 and 0.06, and the rest are producing activations between -0.04 and 0.04. Since the output units of this LSTM layer are bounded between -1 and 1, this means the model has a lot of room to vary the probability distribution it is generating.
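The scale of these values is what we should expect: a unit-norm vector in $\mathbb{R}^{512}$ has root-mean-square entry $1/\sqrt{512} \approx 0.044$. Here is a small sketch of how the cumulative histogram and the fractions quoted above could be computed. The vector `w1` is a random unit vector standing in for the real ${\bf w}_1$, and `fraction_in_band` is a helper name I made up, so the numbers it prints will not match the ones in this post.

```python
import numpy as np

# Stand-in for w_1: a random unit vector in R^512 (not the real singular vector).
rng = np.random.default_rng(1)
w1 = rng.standard_normal(512)
w1 /= np.linalg.norm(w1)

# Empirical cumulative distribution: fraction of entries <= each sorted value.
sorted_vals = np.sort(w1)
cum_frac = np.arange(1, w1.size + 1) / w1.size

def fraction_in_band(x, lo, hi):
    """Fraction of entries of x whose magnitude lies in [lo, hi)."""
    mag = np.abs(x)
    return float(np.mean((mag >= lo) & (mag < hi)))

print(fraction_in_band(w1, 0.04, 0.06), fraction_in_band(w1, 0.0, 0.04))
```

Plotting `cum_frac` against `sorted_vals` gives the cumulative histogram; steep segments correspond to the modes of the distribution.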

Here’s the first left singular vector ${\bf u}_1$ (multiplied by -1 to flip so that the $\delta$ corresponding to the “<s>” token is small.)

We can see from this that the action of this first right singular vector increases the probability of particular tokens: “</s>”, “|:”, “:|”, “C,”, “D,”, “E,”, “F,”, “]”, “6” and “3/2” (among others). Let’s plot the probability distribution that results when we add $-509{\bf u}_1$ to the bias.

If we sample from this, we would see “|:” 27% of the time, “</s>” 17% of the time, and “_B,” about 10% of the time. The token “|:” signifies the start of a repeated phrase. In the training data, this occurs at the beginning of the 8-bar tune (the A part of an AABB tune) and/or at the beginning of the 8-bar turn (the B part). The token “</s>” represents the end of a transcription. This often occurs in the training data at the end of the turn. Hence, it makes sense that the model would put a high probability on both of those tokens. But why the “_B,”? (I don’t know yet.)
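The distribution discussed above is just the softmax of the bias plus a scaled left singular vector. A minimal sketch of that computation follows. The bias and singular vector here are random stand-ins rather than the trained parameters, and `shifted_distribution` is a hypothetical helper name, so the resulting probabilities are illustrative only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def shifted_distribution(bias, u, scale):
    """Token probabilities when the logits are bias + scale * u."""
    return softmax(bias + scale * u)

# Random stand-ins (NOT the trained parameters): 137-token bias and a unit vector.
rng = np.random.default_rng(2)
bias = rng.standard_normal(137)
u1 = rng.standard_normal(137)
u1 /= np.linalg.norm(u1)

p = shifted_distribution(bias, u1, -509.0)  # sign flipped, as in the post
top = np.argsort(p)[::-1][:3]               # indices of the three likeliest tokens
print(top, p[top])
```

With the real bias, singular vector and vocabulary, mapping `top` back to token strings would give the kind of ranking quoted above.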

So, is this model exploiting the first left singular vector of the fully connected layer to start a new repeated section, or to finish a transcription? That’s something to test soon.

The second and third largest singular values are 65.76 and 63.45. Here are the corresponding right singular vectors.

These seem spikier than ${\bf w}_1$, and their amplitudes appear closer to Gaussian in distribution. Here are the corresponding left singular vectors.

${\bf w}_2$ appears to decrease the probability of meter and key tokens, and increase the probability of some of the measure and pitch tokens. ${\bf w}_3$ doesn’t change the distribution of the meter and key tokens, but increases the probability of the measure tokens while decreasing the probability of the pitch tokens. Both increase the probability of duration tokens “2”, “/2”, “>” and “<”. When we add each to the bias with their singular values (and the negative singular values), we find the following probability distributions.

Hence, if the last layer of the LSTM were to output ${\bf w}_3$, then the model would output a “>” token 98% of the time. If it were to output $-{\bf w}_3$, then the model would produce a “b” token almost 60% of the time, and a “g” or “G,” about 30% of the time. If the last layer of the LSTM were to output $-{\bf w}_2$, then it would produce a “K:Cmaj” or “K:Cdor” about 80% of the time. Interesting… We could do this all day, but we don’t know if the folk-rnn v2 model is actually exploiting these directions in $\mathbb{R}^{512}$ to generate single tokens.

Stay tuned!