This is part 8 of what is turning out to be a raft ride on a stream-of-consciousness analysis of the folk-rnn v2 model. After parts 1, 2, 3, 4, 5, 6 and 7 of this series, I don’t feel much closer to answering how it’s working its magic. It’s almost as if this little folk-rnn v2 model doesn’t want me to understand it. Feeling a bit lost, not knowing where to go next, I asked folk-rnn v2 to generate me a tune:

What a quaint and catchy tune! It sounds quite home-y and sentimental to me. It’s exactly the tune I want to play when I’m deep in the muck of inconclusive numerical analyses of computer systems far from a home of understanding. *It makes me love you even more, folk-rnn.*

OK. Let’s summarise the most interesting things so far.

- From part 1, we see how each element of the model’s vocabulary relates to the dimensions of the model input and output.
- From part 1, we see that the model isn’t overfit, but its training could have been halted after about 50 epochs.
- From part 2, we see that the parameters in the softmax layer (its weight matrix and bias vector) are relatively easy to understand with respect to the vocabulary. This also means we can interpret the contributions of each dimension of the output of the third LSTM layer.
- From part 2, we see that the columns of the (full rank) matrix in the softmax layer are by and large highly correlated. (Magnitude correlation with the bias vector is no more than about 0.8 and has a mean of 0.4.) Some columns seem to be very similar to others save a change in polarity. This suggests a lot of redundancy in how this matrix encodes subspaces.
- From part 2, we derive the numerical relationship between a change in a dimension of the input to the softmax, and the change in the probability mass of the associated token. This will be useful for interacting with the model to change its generation disposition, e.g., we can push it to favor or avoid particular pitches, modes, durations, etc.
- From part 3, we see through the singular value decomposition of the softmax layer matrix that one singular value is an order of magnitude larger than all others. This means the last LSTM layer can cause a very large change in the posterior distribution in one specific direction. This left singular vector seems to significantly increase the probability mass of two tokens in particular: “|:” and “</s>”.
- From part 3, we see that other left singular vectors have similar interpretations. The second one is spiky on meter and mode tokens. The third one is spiky on a broken rhythm token. (We don’t know whether the model is exploiting these directions, so we should look at projections onto these directions over the course of a transcription generation.)
- From part 4, we see the parameters of the gates in the first LSTM layer have very different appearances. Since they are close to the input layer, we can interpret them in terms of the vocabulary. For instance, the norms of the columns of the forget gate matrix corresponding to meter and mode tokens are very small. (But does this mean that the forget gate is paying more attention to other kinds of tokens?)
- From part 4, we see the Gramian of the cell gate matrix shows some possible encoding of enharmonic relationships. We see a similar relationship in the Gramian of the output gate matrix.
- From part 4, we see for the first layer LSTM that the inner products between the columns of the matrices multiplying the hidden state vector and the columns of the matrices multiplying the input vector are quite small. This suggests there are two very distinct subspaces where the network is storing complementary information. (What is that information, and how does it relate to the music transcription generation, and the vocabulary? Do the matrices of the other two LSTM layers have the same property?)
- From part 5, we see curious things in the model’s posterior distribution over the generation of a tune: regular spiking of probability mass of measure tokens; a decay of probability mass in broken rhythm tokens; and “echoes” of probability mass in the pitch tokens.
- From part 5 and part 6, we see that the third layer LSTM hidden state activations are very sparse, and show a high degree of periodicity. We see in particular a periodicity with harmonics of about 9 steps. This seems to have to do with the measure tokens.
- From part 6, we see that when we cut the connections of the nearly 20% of third layer LSTM units having 5% of their power around a period of 9 steps, the model can still generate measure tokens where it should. When we increase the number of cut connections to 27%, we see the model is losing its “counting” ability. When we increase to 33%, the model can no longer “count”. If we snip 33% of the 512 final layer connections *at random*, we see that it can still “count”. (These snips essentially restrict the subspaces that the model can reach, but how can we interpret these in terms of the vocabulary?)
- From part 6, we see that when we cut the connections of the third layer LSTM units, the model’s melodic “composition” ability disappears quite quickly. This suggests that the model’s ability to “count” and to “compose” a melody have different sensitivities. Its “counting” ability seems more robust than its melodic “sensibility”. (Does this mean melodic information is encoded in a larger subspace than the timing information?)
- From part 7, we see that hidden state activations are very sparse at each LSTM layer, but they become more saturated with depth. We also see that hidden state activations at the first LSTM layer do not have as strong a periodic appearance as those of the following two LSTM layers.
- From part 7, we see that when we snip 33% of the connections in the first LSTM layer having power in a period of around 9 steps, the model can still generate measure tokens where needed. Its melodic ability does not seem to remain, however.
- From part 7, we see across all hidden state activations of the three LSTM layers, sequences with strong periods of DC, 2.25, 3, 4.5, 6, 8.8, 11.5, 18, and 36 steps. We also see evidence that the periods we observe in the first LSTM layer are due to the input of particular tokens, whereas the periods we observe in the last LSTM layer cause the output of particular tokens.
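
To make the part 2 point about nudging the softmax input a little more concrete, here is a minimal sketch. It relies on a standard property of the softmax: adding δ to the input of one token shifts that token’s log-odds by exactly δ. This may or may not be the form of the relationship derived in part 2. The vocabulary size, the logits, and the token index are toy values, not taken from folk-rnn v2, and the function names are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def shifted_probability(p_k, delta):
    """Probability of token k after adding delta to its softmax input.

    Follows from p_k' = p_k e^delta / (1 - p_k + p_k e^delta).
    """
    boost = np.exp(delta)
    return p_k * boost / (1.0 - p_k + p_k * boost)

def delta_for_target(p_k, q_k):
    """Shift of token k's softmax input that moves its probability from p_k to q_k."""
    return np.log(q_k / (1.0 - q_k)) - np.log(p_k / (1.0 - p_k))

# Toy example (assumed values): 10 "tokens" with random logits.
rng = np.random.default_rng(0)
z = rng.normal(size=10)
k = 5                                # hypothetical token index to favor
p = softmax(z)

delta = delta_for_target(p[k], 0.5)  # push token k to half the probability mass
z_pushed = z.copy()
z_pushed[k] += delta
p_pushed = softmax(z_pushed)

print(f"before: {p[k]:.4f}  after: {p_pushed[k]:.4f}  delta: {delta:.4f}")
```

A negative δ works the same way for making the model avoid a token; the probability mass removed from token k is redistributed over the other tokens in proportion to their original probabilities.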

Now, looking forward, what do we have yet to do (in addition to all those things in parens above)? Here are a few things:

- How can we interpret the parameters of the middle LSTM layer? If we can relate the subspaces of the first LSTM layer’s output to the vocabulary, then we will be able to interpret the transformation performed by the second LSTM layer. Similarly, if we can relate the subspaces of the third LSTM layer’s output to the vocabulary, then we will also be able to interpret the transformation performed by the second LSTM layer.
- What exactly are the gates responsible for in each layer? We have looked at hidden state activations of each layer over the generation of transcriptions, but what about those in the cell gates? Those seem to be unbounded.
- Has dropout in the training of this model obscured the functions of its elements? If we train the same model but without dropout, how does the interpretation change? Does the analysis of the model become more immediate? What if we instead used gated recurrent networks?
- Is there a way to uncover function in a more automated way, rather than having to look for activations that appear to correspond to particular behaviors? (Like the unit that saturates between quotes in a text LSTM.)
- What will happen if we suppress the output/input of measure tokens for a period of time? Or the end of transcription token? Will the model only be able to “count” for 16 measures or so?
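
The last question, suppressing a token for a stretch of the generation, could be tried on the output side with something like the following sketch. It only shows the sampling step: the softmax inputs of the banned tokens are set to minus infinity before normalising, so they receive zero probability. The logits, the token index `BAR`, and the function name are hypothetical stand-ins rather than anything from the folk-rnn code, and withholding a token from the model’s *input* would be a separate change.

```python
import numpy as np

def suppress_and_sample(logits, banned, rng):
    """Sample a token index after giving every banned index zero probability.

    Returns the sampled index and the renormalised distribution.
    """
    z = np.array(logits, dtype=float)
    mask = np.zeros(z.shape[0], dtype=bool)
    mask[np.asarray(sorted(banned), dtype=int)] = True
    if mask.all():
        raise ValueError("cannot suppress every token")
    z[mask] = -np.inf            # banned tokens can no longer be drawn
    z -= z.max()                 # stabilise the exponentials
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(p.shape[0], p=p)), p

# Toy example (assumed values): five "tokens", one of them a measure token.
rng = np.random.default_rng(0)
toy_logits = [2.0, 0.5, 3.0, -1.0, 0.0]
BAR = 2                          # hypothetical index of a measure token

draws = []
for _ in range(1000):
    token, p = suppress_and_sample(toy_logits, {BAR}, rng)
    draws.append(token)

print("probability of suppressed token:", p[BAR])
print("times it was sampled:", draws.count(BAR))
```

In an actual experiment the ban would be switched on for a chosen range of generation steps and then lifted, to see whether the model picks up its “counting” where it left off.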

Regardless, *folk-rnn* brings families together. Here’s my wife and me playing, “Why are you and your 6 million parameters so hard to understand?”