# Making sense of the folk-rnn v2 model, part 9

This is part 9 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, and 8. As a brief review, the folk-rnn v2 model maps elements of the standard basis of $\mathbb{R}^{137}$ to points on the positive surface of the unit L1-ball in $\mathbb{R}^{137}$ by a series of nonlinear transformations. Denote the input at step $t$ by ${\bf x}_t$. The first layer transforms this by the following algorithm:

$${\bf i}_t^{(1)} \leftarrow \sigma({\bf W}_{xi}^{(1)}{\bf x}_t + {\bf W}_{hi}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_i^{(1)})$$

$${\bf f}_t^{(1)} \leftarrow \sigma({\bf W}_{xf}^{(1)}{\bf x}_t + {\bf W}_{hf}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_f^{(1)})$$

$${\bf c}_t^{(1)} \leftarrow {\bf f}_t^{(1)}\odot{\bf c}_{t-1}^{(1)} + {\bf i}_t^{(1)} \odot \tanh({\bf W}_{xc}^{(1)}{\bf x}_t + {\bf W}_{hc}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_c^{(1)})$$

$${\bf o}_t^{(1)} \leftarrow \sigma({\bf W}_{xo}^{(1)}{\bf x}_t + {\bf W}_{ho}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_o^{(1)})$$

$${\bf h}_t^{(1)} \leftarrow {\bf o}_t^{(1)}\odot \tanh({\bf c}_{t}^{(1)})$$
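
To make the five updates above concrete, here is a minimal NumPy sketch of one step of a single layer. It is only an illustration of the equations as written, not the actual folk-rnn code: the function `lstm_step`, the `params` dictionary and its key names are hypothetical, and the weights are random stand-ins rather than the trained parameters. The sizes (137 inputs, 512 units) are the ones mentioned in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of one layer, following the five update equations.

    `params` is a hypothetical dict with keys W_x{g}, W_h{g}, b_{g}
    for each gate g in "ifco" (in, forget, cell, out).
    """
    def pre(g):
        # Affine pre-activation shared by all four gate updates.
        return params[f"W_x{g}"] @ x + params[f"W_h{g}"] @ h_prev + params[f"b_{g}"]

    i = sigmoid(pre("i"))                    # in gate
    f = sigmoid(pre("f"))                    # forget gate
    c = f * c_prev + i * np.tanh(pre("c"))   # cell update
    o = sigmoid(pre("o"))                    # out gate
    h = o * np.tanh(c)                       # hidden state
    return h, c, {"i": i, "f": f, "o": o}

# Toy usage with random weights (assumed values, not the trained model).
rng = np.random.default_rng(0)
n_in, n_hid = 137, 512
params = {}
for g in "ifco":
    params[f"W_x{g}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
    params[f"W_h{g}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    params[f"b_{g}"] = np.zeros(n_hid)

x = np.zeros(n_in)
x[3] = 1.0  # an arbitrary one-hot input
h, c, gates = lstm_step(x, np.zeros(n_hid), np.zeros(n_hid), params)
```

Returning the gate activations alongside `h` and `c` is a design choice made here only because the rest of this post compares those activations step by step.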

The second layer does the same but with different parameters, acting on the first-layer hidden state activation ${\bf h}_t^{(1)}$ to produce ${\bf h}_t^{(2)}$. The third layer does the same, again with different parameters, acting on ${\bf h}_t^{(2)}$ to produce ${\bf h}_t^{(3)}$. The in-gate activation of layer $n$ in step $t$ is ${\bf i}_t^{(n)}$, that of the out gate is ${\bf o}_t^{(n)}$, that of the forget gate is ${\bf f}_t^{(n)}$, and that of the cell gate is ${\bf c}_t^{(n)}$. The final softmax layer maps ${\bf h}_t^{(3)}$ to a point on the positive surface of the L1 unit ball, which is also the probability mass distribution over the vocabulary $\mathcal{V}$:

$${\bf p}_t \leftarrow \textrm{softmax}({\bf V}{\bf h}_t^{(3)} + {\bf v})$$
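
As a quick sanity check of the claim that the softmax output lies on the positive surface of the unit L1-ball, here is a small sketch. The matrix `V`, the bias `v` and the vector `h3` are random placeholders that I am assuming only for illustration; they are not the trained parameters or real activations.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum first; this leaves the result unchanged
    # but avoids overflow in the exponential.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
V = rng.normal(scale=0.1, size=(137, 512))  # placeholder output weights
v = rng.normal(scale=0.1, size=137)         # placeholder output bias
h3 = np.tanh(rng.normal(size=512))          # placeholder third-layer hidden state

p = softmax(V @ h3 + v)
l1_norm = np.abs(p).sum()  # equals 1 up to rounding for any softmax output
```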

In previous parts of this series, we analyzed the parameters, e.g., ${\bf W}_{xi}^{(1)}$. In this post we look at the activations in these layers during the generation of a transcription. Let’s consider this one from my MuMe 2018 paper and part 8 of this endless series:

Let’s first look at the one-hot encoded inputs. Denote by $\mathbf{X} := [\mathbf{x}_1 \mathbf{x}_2 \ldots \mathbf{x}_t]$ the matrix of concatenated one-hot encoded vectors. The following shows the Gramian matrix $\mathbf{X}^T \mathbf{X}$:

The axes are labeled with the tokens of the transcription. I draw a line at each measure token so we can more easily relate the structures we see in the picture to the bars in the notated transcription above. We clearly see a few different structures. Forward-slash diagonals show repetitions. We see material in bar 1 repeated in bars 2, 3, 5, 6, 9, and 13. The largest repetitions are in bars 5-6 and 13-14. We also see some short backward-slash diagonals. These are also repetitions but reversed, e.g., “A, 2 G,” and “G, 2 A,” in bars 1 and 2. We can see all of this from the notated transcription. How will the various nonlinear transformations performed within folk-rnn v2 relate to each other for this generation?
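
Here is a toy sketch of how such a Gramian arises. The eight-token sequence is a made-up fragment, loosely based on the “A, 2 G,” and “G, 2 A,” figures quoted above; it is not the actual transcription, and the token spellings are an assumption. Entry $(j,k)$ of $\mathbf{X}^T \mathbf{X}$ is 1 exactly when steps $j$ and $k$ carry the same token, so a reversed repetition lands on an anti-diagonal.

```python
import numpy as np

# Hypothetical toy sequence: a three-token figure, a barline, then its reversal.
tokens = ["A,", "2", "G,", "|", "G,", "2", "A,", "|"]

vocab = sorted(set(tokens))
index = {tok: k for k, tok in enumerate(vocab)}

# One-hot encode: column t of X is the standard basis vector of token t.
X = np.zeros((len(vocab), len(tokens)))
X[[index[tok] for tok in tokens], np.arange(len(tokens))] = 1.0

G = X.T @ X  # G[j, k] == 1 iff tokens[j] == tokens[k]
```

In this toy case `G[0, 6]`, `G[1, 5]` and `G[2, 4]` are all 1, which is the backward-slash diagonal produced by the reversed figure.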

Let’s first consider the softmax output of the last layer, which is also a probability distribution over the vocabulary. We take all these points output by folk-rnn v2 and find their pairwise L1 distances on the positive surface of the L1-unit ball. If two of these vectors have a distance of 2, then their supports do not overlap; equivalently, the two distributions place their mass on disjoint sets of tokens. Pairs of points colored sanguine have a distance in [0, 0.02], and those colored powder blue have a distance in [1.98, 2]. All others, in [0.02, 1.98], are colored in gray scale.

A few things become clear. First, a large number of pairs are far from each other. Of the 5,671 unique pairs of different vectors, 2,642 have a distance greater than 1.98. Second, the probability distributions in nearly all steps generating measure tokens are close to identical. Third, the distributions co-occurring with the repetitions in the first part are highly similar, reflecting the structures in $\mathbf{X}^T \mathbf{X}$ seen above. We do not, however, see the short backward-slash diagonals. Fourth, the distributions produced during the first and penultimate bars of each part overlap with nearly all other distributions except for those producing measure tokens, and the distributions produced after the middle of each section (bars 5 and 13) seem to have the greatest dissimilarity to all others. (Does the latter arise from the model having learned how parts of a tune define other parts?)
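
The distance computation and the pair counting can be sketched as follows. The three distributions in `P` are invented for illustration (one per column, over a three-token vocabulary), and `pairwise_l1` is a hypothetical helper name. Note that 5,671 unique pairs is what one gets from 107 vectors, since $107 \cdot 106 / 2 = 5671$.

```python
import numpy as np

def pairwise_l1(P):
    """L1 distance between every pair of columns of P (one distribution per column)."""
    return np.abs(P[:, :, None] - P[:, None, :]).sum(axis=0)

# Invented example: columns 0 and 2 are identical, column 1 has disjoint support.
P = np.array([[1.0, 0.0, 1.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0]])

D = pairwise_l1(P)

# Count each unordered pair of different vectors once (strict upper triangle).
iu = np.triu_indices(D.shape[0], k=1)
n_pairs = len(iu[0])
n_far = int(np.sum(D[iu] > 1.98))    # nearly disjoint supports
n_near = int(np.sum(D[iu] <= 0.02))  # nearly identical distributions
```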

Now let’s look at the similarities of activations within the network during the generation of the transcription above. First, let’s look at the hidden state activations in the third layer. We construct the matrix $\widehat{\mathbf{H}}^{(3)} := [{\bf h}_1^{(3)}/\|{\bf h}_1^{(3)}\|, {\bf h}_2^{(3)}/\|{\bf h}_2^{(3)}\|, \ldots ]$, and look at the magnitude of the Gramian $(\widehat{\mathbf{H}}^{(3)})^T \widehat{\mathbf{H}}^{(3)}$. In this case sanguine shows activations that point in the same direction (ignoring polarity), powder blue shows activations that are orthogonal, and all others are colored in gray scale, with darker shades showing smaller angles between pairs of activations.

These patterns are intriguing! It seems that, more or less, as the model steps through the generation, the hidden state activations of the third layer point in similar directions relative to the bar lines. That is, the hidden state activations near step $\tau$ after each barline point in similar directions, but in directions different from those in the other steps after each barline.
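
For reference, the magnitude of the normalized Gramian can be sketched like this. I am assuming that $\|\cdot\|$ denotes the Euclidean norm; the function name `cosine_gram_magnitude` and the tiny matrix `H` are made up for illustration and are not real activations.

```python
import numpy as np

def cosine_gram_magnitude(H):
    """|H_hat^T H_hat|, where H_hat has the columns of H scaled to unit Euclidean norm."""
    norms = np.linalg.norm(H, axis=0, keepdims=True)
    H_hat = H / norms
    return np.abs(H_hat.T @ H_hat)

# Three made-up activation vectors, one per column.
H = np.array([[1.0, -2.0, 0.0],
              [0.0,  0.0, 3.0]])

M = cosine_gram_magnitude(H)
```

Columns 0 and 1 point in opposite directions, so `M[0, 1]` is 1 once polarity is ignored, while column 2 is orthogonal to both and gives 0.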

This feature becomes less present in the hidden state activations of the shallower layers. Here is the magnitude of $(\widehat{\mathbf{H}}^{(2)})^T \widehat{\mathbf{H}}^{(2)}$:

And here is that of the first layer, $(\widehat{\mathbf{H}}^{(1)})^T \widehat{\mathbf{H}}^{(1)}$:

Let’s compare all of these to $\mathbf{X}^T \mathbf{X}$. The following animation cycles through each pair starting with the first layer (and pausing at the last layer).

First, we see a shift of one pixel to the top right, which comes from the one-step delay between the input and the output fed back into the network, i.e., each hidden state activation comes from processing the output generated in the previous step. Second, we see that the magnitude of the Gramian of the second-layer activations bears the most resemblance to $\mathbf{X}^T \mathbf{X}$. I don’t know why. Could it be that the second layer decides what to put and the third layer decides where to put it?

When we look at the activations of the out gate in each layer, we see a high similarity with the hidden state activations of the same layer. This is expected, given that the hidden state activation is a function of the out-gate activation. As before, we assemble the matrix $\widehat{\mathbf{O}}^{(n)} := [{\bf o}_1^{(n)}/\|{\bf o}_1^{(n)}\|, {\bf o}_2^{(n)}/\|{\bf o}_2^{(n)}\|, \ldots ]$, and look at the Gramian $(\widehat{\mathbf{O}}^{(n)})^T \widehat{\mathbf{O}}^{(n)}$. Here is an animation cycling through the Gramian of these out-gate activations for each layer (pausing at the last layer):

Let’s have a look at the cell gate activations. First we look at the Gramian of $\tanh\widehat{\mathbf{C}}^{(n)} := [\tanh{\bf c}_1^{(n)}/\|\tanh{\bf c}_1^{(n)}\|, \tanh{\bf c}_2^{(n)}/\|\tanh{\bf c}_2^{(n)}\|, \ldots ]$. Here is an animation showing these from the first to the third layer (pausing at the last layer):

We can see faint echoes of the structures in the Gramians of the out-gate and hidden state activations of each layer. Now let’s have a look at the Gramians without the hyperbolic tangent, i.e., of $\widehat{\mathbf{C}}^{(n)} := [{\bf c}_1^{(n)}/\|{\bf c}_1^{(n)}\|, {\bf c}_2^{(n)}/\|{\bf c}_2^{(n)}\|, \ldots ]$ (pausing at the last layer):

This shows that the cell gate activation of a layer saturates more and more in the same direction as the generation process runs. This saturation is least pronounced in the first layer, and in the second and third layers it appears to span all of the second part of the transcription.

Here are the Gramians of the in-gate activations of each layer (pausing at the last layer): And that leaves the activations produced by the forget gate: These activations appear to be nearly saturated across the entire generation, but we do see the same structures as in the Gramians of the in- and out-gate activations.

Now, in the work I presented at the 2018 ICML Workshop: Machine Learning for Music, I show how each one of the four gates in the first layer seems to store information about token types in different subspaces of $(0,1)^{512}$ (or $\mathbb{R}^{512}$ for the cell gate). Here’s the relevant slide:

Let’s look at the following matrix: $(\widehat{\mathbf{I}}^{(1)})^T \widehat{\mathbf{O}}^{(1)}$, that is, the set of inner products of all unit-norm activations of the first-layer in gate with those of the first-layer out gate. Here they are with out-gate activations along the vertical axis (pausing on the last layer):

These show none of the structures above. The activations of these two gates are thus pointing in ways that do not strongly relate over the steps. It is the same when we compare the forget-gate activations to both the in-gate and the out-gate activations. Nor do these structures appear when we compare the hidden state activations with the in-, out- and forget-gate activations. So, taken with my earlier observations about the parameters in the gates of the first layer, it seems the same holds true of the deeper layers: each gate of a layer is projecting information in ways that are not directly related.
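
A cross-gate comparison of this kind can be sketched as below. The arrays `I1` and `O1` are random stand-ins that I am assuming in place of real in-gate and out-gate activations (one column per step), and `cross_gram` is a hypothetical helper. One thing the sketch makes visible: because gate activations live in $(0,1)^{512}$, every inner product is positive, so any structure would show up as variation around a fairly high baseline rather than as values near zero.

```python
import numpy as np

def unit_columns(A):
    # Scale each column to unit Euclidean norm (assumed choice of norm).
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def cross_gram(A, B):
    """Inner products between unit-norm columns of A (rows) and of B (columns)."""
    return unit_columns(A).T @ unit_columns(B)

rng = np.random.default_rng(2)
n_steps = 20
I1 = rng.uniform(size=(512, n_steps))  # stand-in for first-layer in-gate activations
O1 = rng.uniform(size=(512, n_steps))  # stand-in for first-layer out-gate activations

K = cross_gram(I1, O1)  # K[j, k]: in-gate step j against out-gate step k
K_plot = K.T            # transpose to put out-gate steps along the vertical axis
```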

Now, how does all of the above change with a different transcription?