Making sense of the folk-rnn v2 model, part 11

This is part 11 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10! Findings from these have appeared in my MuMe 2018 paper, “What do these 5,599,881 parameters mean? An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of the softmax layer”, and my presentation at the Joint Workshop on Machine Learning for Music ICML 2018: “How Stuff Works: LSTM Model of Folk Music Transcriptions” (slides here). There’s still a lot that can be done here to pick apart and understand what is going on in the model. It’s a fascinating exercise, trying to interpret this complex model. In this part, I start looking at beam search as an alternative approach to sampling from the model.

Let’s briefly return to what folkrnn v2 really is: a model of a joint probability mass distribution P_{X_1,X_2, \ldots}(x_1,x_2, \ldots), where X_t: \mathcal{V} \to [1,137] is a random variable at step t \in [1, 2, \ldots], and the vocabulary \mathcal{V} is a set of ABC-like tokens. This function can be equivalently written by the chain rule:

P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) P_{X_3|X_1,X_2}(x_3|x_1, x_2) P_{X_4|X_1,X_2,X_3}(x_4|x_1, x_2, x_3)\ldots

Training a folkrnn model involves adjusting its parameters to maximize each one of these conditional probabilities for real sequences sampled from a dataset. When generating new sequences, we merely sample from each estimate of P_{X_t|X_1,X_2,\ldots, X_{t-1}}(x_t|x_1, x_2, \ldots, x_{t-1}), until we have sampled the stop token. This produces, for instance, the following transcription:

Another random sampling will give a different sequence. This is how generation is implemented: one token sampled at a time from a dynamic posterior distribution over the vocabulary.
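As a minimal sketch of this token-by-token sampling, here is a toy version in Python. The vocabulary `VOCAB` and the function `toy_conditional` are hypothetical stand-ins for the 137-token vocabulary and the network's conditional distribution; only the sampling loop mirrors the procedure described above.

```python
import random

# Hypothetical toy vocabulary; the real model has 137 ABC-like tokens.
VOCAB = ["M:4/4", "K:Cmaj", "C", "D", "E", "|", "</s>"]
STOP = "</s>"

def toy_conditional(prefix):
    """Stand-in for P(x_t | x_1, ..., x_{t-1}); NOT the folk-rnn network.

    Returns one probability per token of VOCAB, in order.
    """
    weights = [1.0] * len(VOCAB)
    # Make the stop token more likely as the sequence grows.
    weights[VOCAB.index(STOP)] = 0.1 * (1 + len(prefix))
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(conditional, rng, max_len=200):
    """Sample one token at a time until the stop token is drawn."""
    seq = []
    while len(seq) < max_len:
        probs = conditional(seq)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        seq.append(token)
        if token == STOP:
            break
    return seq

seq = sample_sequence(toy_conditional, random.Random(0))
print(" ".join(seq))
```

Changing the seed passed to `random.Random` gives a different sequence, as with the model itself.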

We can factor the joint probability distribution another way, however:

P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1,X_2}(x_1,x_2) P_{X_3,X_4|X_1,X_2}(x_3,x_4|x_1, x_2) P_{X_5,X_6|X_1,X_2,X_3,X_4}(x_5,x_6|x_1, x_2, x_3,x_4)\ldots

These are distributions of pairs of random variables. While folkrnn v2 was not explicitly trained to maximize these for real sequences from the training data, I believe that the use of teacher forcing means it was equivalently trained to maximize these joint probabilities. This factorization also shows a different way to generate sequences using folkrnn v2, and one that can be argued to consider more context in generation:

  1. compute P_{X_1}(x_1)
  2. compute P_{X_2|X_1}(x_2|x_1) for each x_1 \in \mathcal{V}
  3. compute P_{X_1,X_2}(x_1,x_2) = P_{X_2|X_1}(x_2|x_1)P_{X_1}(x_1) for each (x_1,x_2) \in \mathcal{V}\times\mathcal{V}
  4. sample a pair of tokens from this distribution
  5. update the states in the model with the sampled tokens, and repeat the above until one of the pairs of sampled tokens is the stop token.
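The five steps above can be sketched as follows. Again `VOCAB` and `toy_conditional` are hypothetical stand-ins for the real vocabulary and network, and the cap `max_pairs` is an assumption added so the loop always terminates.

```python
import random

VOCAB = ["a", "b", "c", "</s>"]  # hypothetical toy vocabulary
STOP = "</s>"

def toy_conditional(prefix):
    """Stand-in for P(x_t | prefix); NOT the folk-rnn network."""
    weights = [1.0] * len(VOCAB)
    if prefix:
        weights[VOCAB.index(prefix[-1])] = 0.25  # discourage immediate repeats
    weights[VOCAB.index(STOP)] = 0.2 * (1 + len(prefix))
    total = sum(weights)
    return [w / total for w in weights]

def pair_distribution(conditional, prefix):
    """Steps 1-3: joint distribution of the next two tokens given the prefix."""
    p1 = conditional(prefix)
    joint = {}
    for i, x1 in enumerate(VOCAB):
        p2 = conditional(prefix + [x1])  # one conditional per choice of x1
        for j, x2 in enumerate(VOCAB):
            joint[(x1, x2)] = p1[i] * p2[j]
    return joint

def sample_by_pairs(conditional, rng, max_pairs=100):
    """Steps 4-5: sample a pair, extend the sequence, repeat until a stop token."""
    seq = []
    for _ in range(max_pairs):
        joint = pair_distribution(conditional, seq)
        pairs = list(joint)
        x1, x2 = rng.choices(pairs, weights=[joint[p] for p in pairs], k=1)[0]
        seq.extend([x1, x2])
        if STOP in (x1, x2):
            break
    return seq

joint = pair_distribution(toy_conditional, [])
seq = sample_by_pairs(toy_conditional, random.Random(1))
print(len(joint), "pair outcomes;", " ".join(seq))
```

One thing worth keeping in mind when comparing outputs: when the full joint distribution is used, a pair drawn this way is distributed exactly as two tokens drawn one after the other, up to the first stop token. For a fixed seed, differences between the two procedures then come from how the random numbers are consumed.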

In this situation, we are computing probability distributions over the sample space \mathcal{V} \times \mathcal{V}. For a vocabulary of 137 tokens, this has 18,769 outcomes. We don’t need to stop at pairs of tokens, because we can also factor the joint probability distribution using tuples of size 3, which leads to a sample space of 2,571,353 outcomes. Pretty quickly, however, our sample space grows larger than trillions of outcomes … so there’s a limit here given the time I have left to live. We can however bring things under control by approximating these joint probability distributions.
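The growth of the sample space is easy to check. This short Python snippet just evaluates 137^n for the first few tuple sizes; the range of n shown is an arbitrary choice.

```python
V = 137  # vocabulary size from the text

# Number of outcomes in the sample space for tuples of size n.
sizes = {n: V ** n for n in range(1, 7)}
for n, size in sizes.items():
    print(f"n={n}: {size:,} outcomes")
```

The count first exceeds one trillion at n=6.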

Let’s think of this computational procedure as building a tree. For the general case of n-tuples of tokens, we create the first |\mathcal{V}| branches by computing P_{X_1}(x_1). From each of those branches we create |\mathcal{V}| more branches by computing P_{X_2|X_1}(x_2|x_1). We continue in this way to a depth of n and create all the leaves of the tree by computing the product of all values on the branches above it:

P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) \ldots P_{X_n|X_1,X_2,\ldots, X_{n-1}}(x_n|x_1, x_2, \ldots,x_{n-1}) = P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n)

Thinking of the procedure in this way motivates a question: how well can we estimate P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n) by searching fewer than |\mathcal{V}|^n leaves? We know from the expression above that we are multiplying conditional probabilities, so if at some depth one of those is zero or very small compared to others, we might as well trim that branch then and there. Another possibility is more strict: search only the \beta branches at each depth that have the greatest probability. In this case, computing the probability distribution for tuples of size 3 requires computing \beta^3 leaves instead of |\mathcal{V}|^3 = 2,571,353. Then after having computed those leaves, we ignore all others and scale the joint probability mass distribution to sum to unity. That strategy is known as beam search. Using a beam width of \beta = \infty computes all of the leaves at any depth. Making the width smaller saves computation, but at the price of a poorer estimate of the joint probability distribution. Let’s see how sampling from folkrnn v2 in this way performs.
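Here is a minimal sketch of the stricter strategy as described above: keep only the \beta most probable branches under every node, which gives \beta^n leaves, then rescale the surviving leaves to sum to one. `VOCAB`, `toy_conditional` and the function name `pruned_tuple_distribution` are hypothetical. Note that this follows the description in the text; it is not the variant that keeps \beta partial sequences in total at each depth.

```python
import heapq

VOCAB = ["a", "b", "c", "d", "</s>"]  # hypothetical toy vocabulary

def toy_conditional(prefix):
    """Stand-in for P(x_t | prefix); NOT the folk-rnn network."""
    weights = [float(k + 1) for k in range(len(VOCAB))]
    if prefix:
        weights[VOCAB.index(prefix[-1])] *= 0.2  # discourage immediate repeats
    total = sum(weights)
    return [w / total for w in weights]

def pruned_tuple_distribution(conditional, prefix, n, beta):
    """Approximate the joint distribution of the next n tokens.

    Under every node only the beta most probable children are expanded.
    Returns the renormalised distribution over the surviving leaves and
    the probability mass those leaves held before renormalisation.
    """
    leaves = {(): 1.0}
    for _ in range(n):
        expanded = {}
        for path, p_path in leaves.items():
            probs = conditional(list(prefix) + list(path))
            keep = heapq.nlargest(beta, range(len(VOCAB)), key=lambda j: probs[j])
            for k in keep:
                expanded[path + (VOCAB[k],)] = p_path * probs[k]
        leaves = expanded
    kept_mass = sum(leaves.values())
    return {path: p / kept_mass for path, p in leaves.items()}, kept_mass

approx, kept = pruned_tuple_distribution(toy_conditional, [], n=3, beta=2)
exact, full = pruned_tuple_distribution(toy_conditional, [], n=3, beta=len(VOCAB))
print(len(approx), "leaves hold", round(kept, 3), "of the mass")
print(len(exact), "leaves hold", round(full, 3), "of the mass")
```

The mass held by the surviving leaves before rescaling gives a rough sense of how much of the joint distribution the pruned tree ignores. Setting `beta` to the vocabulary size or larger recovers the full tree.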

The following transcription was generated using tuples of size n=2, and a beam width of \beta = \infty (meaning we compute the probability distribution over the entire sample space \mathcal{V}\times\mathcal{V}):

Screen Shot 2019-07-17 at 09.10.05.png

Though we use the same random initialization as “Why are you …” above, this produces a dramatically different transcription. The first pair of tokens selected are meter and mode, which are both different from before. Repeat bar lines are missing at the end of the two parts, but as a whole the tune holds together with repetition and variation of simple ideas, and cadences at the right places. Here I play it at a slow pace on the Mean Green Machine Folk Machine, imposing an AABB structure:

Here’s another transcription produced with a different random seed initialization.

Screen Shot 2019-07-17 at 09.17.37.png

There are no obvious mistakes and it doesn’t appear to be copied from the training data. And it’s a nice tune, playing well as a hornpipe. Here it is on Mean Green Machine Folk Machine:

Here’s a jig created by seeding the network with a 6/8 meter token:

Screen Shot 2019-07-17 at 09.17.48.png

Another good tune that appears to be novel. Here it is on Mean Green Machine Folk Machine:

In the next part we will look at making \beta < \infty.




SweDS19: Second call of presentations, posters and sponsors

The Swedish Workshop on Data Science (SweDS) is a national event aiming to maintain and develop data science research and its application in Sweden by fostering the exchange of ideas and promoting collaboration within and across disciplines. SweDS brings together researchers and practitioners working in a variety of academic, commercial or other sectors, and in the past has included presentations from a variety of domains, e.g., computer science, linguistics, economics, archaeology, environmental science, education, journalism, medicine, healthcare, biology, sociology, psychology, history, physics, chemistry, geography, forestry, design, and music.

SweDS19 is organised by the School of Electrical Engineering and Computer Science, KTH.

October 15–16, KTH, Stockholm Sweden


Some observations from my week at the 2019 Joe Mooney Summer School

I arrived at the 2019 Joe Mooney Summer School knowing how to play about 70 tunes, but I left a week later knowing how to play three. That’s a good thing.

This was my first “music camp” – at 43 years old! I didn’t know what to expect, other than lots of music. I signed up for the courses in button accordion, with my D/G box – quite a strange tuning in Ireland but nonetheless not entirely incompatible with the music (more on that below).

The concert to open the week featured the group “Buttons & Bows”. Among the players is the superstar accordion player Jackie Daly. I later learned that Daly’s playing style is quite different from that of the accordion tutors, who seemed to all be students of Joe Burke, who was greatly influenced by the playing of Paddy O’Brien. At a few points in the concert Daly made comments to the effect that polkas aren’t given the respect they deserve. Then he would play a set of polkas. He related one funny story about a friend of his slagging polkas. So Daly wrote a polka and named it after his friend. I will be revisiting the way I play the Ballydesmond polkas, and will model them on Daly’s style. He will also be publishing a book soon collecting his compositions from his many years of playing.

On the first day of classes I found myself in a small room of about 60 accordion students. The average age was surely below 15. The youngest was probably 6 or 7. I was one of 10 adults, at least four of whom had traveled from outside Ireland (including Australia, Canada, England and Sweden). We each had to play a tune individually to be assigned to one of the five tutors – including two All Ireland Champions! When my turn came I started to play “Pigeon on the Gate”, but I wasn’t far into it before I was assigned to level 3, Nuala Hehir. Some students played a scale, a jig or a polka for their grading, but the tutors asked if they knew any reels. Reels are the most technically demanding to play.

There were 15 students in my class, including 6 adults. We each had to play a tune solo again for the tutor to hear. I played a bit of “Drowsy Maggie.” It wasn’t long before the tutor recognised several non-traditional characteristics – which is entirely to be expected since I haven’t had proper lessons in Irish accordion. More on this below.

In the six days of the course, we learned to play four tunes: two reels and two jigs. The first reel we learned is called “Crossing the Shannon” (called “The Funny Reel” elsewhere). The tutor played the entire tune for us to give us an idea of what it sounds like. Then she wrote up a textual ABC-like notation on the whiteboard:


The circles denote crotchets. The ticks on the letters denote an octave above the middle. Numbers denote fingering for B/C accordions. And the slur underneath two pitches denotes sliding a finger on two buttons for B/C accordions.

The course proceeded with the tutor playing a few bars at a time with the ornamentation, and then the students playing along several times. In this tune, the important ornaments are cuts and a roll. Every second D’ can be rolled: D’-E’-D’-C#’-D’. In this case the roll happens in the duration of a crotchet. The E’ is a cut on the D’. A cut should be a nearly imperceptible blip. It doesn’t have any tonal value, but subtly changes the attack of a note. Cuts are often used by accordion, fiddle, flute and whistle players when a note repeats. My D/G box can play a D roll only with a change of bellows direction to catch the E and the C#. A roll has to be smooth, so all pitches of the roll have to be played with the same bellows direction. Since we could not find any alternative, I must live with just cutting the D’ with an F#’.

The tutor had each student individually reproduce bars of the tune and coached them into improving it. Then she continued through these steps until we had a whole part. On the first day we made it through the first part of the tune, and recorded the tutor playing the second part at a slow speed so we could individually work on it for the next day.

On the second day we workshopped the first part of “Crossing the Shannon” and moved on to learning the second part in the same way. Learning the second part wasn’t too hard because it mostly repeats material we already learned in the first part. By the end of the first half we had our first tune! The tutor had each student individually play the entire tune with repeats and helped them improve rolls and cuts, etc. She encouraged the students to not read the notation on the board.

In the second part of the day, the tutor gave us a single reel, “Glentown Reel”:


In this tune we have cuts, rolls, and triplets – all of which are possible on my accordion. The lines over the B’s remind the B/C student to play the outside row B. Some of the cuts are also made explicit. Learning this tune took most of day three. Before the end of the session, the tutor gave us a part of a jig (the second ending would be given the following day):


The tutor didn’t remember what it was called, but remembered she learned it from a particular teacher. She had us play a part of the first section, and then played the entire tune solo so we could record it and learn by ourselves at home. With the help of a friend I learned that the jig is similar to one called “The Road to Granard”.

On day four we went through both “Crossing the Shannon” and “Glentown Reel”, and finished learning the jig with the two endings (not pictured in the notation above). This jig has no rolls, but does involve cuts and triplets. Also, the tutor varied the use of triplets and showed how not every note needs to be ornamented in the same way. She also showed how a tune can be played beautifully without ornamentation.

On day five we went through all our tunes. Then the tutor asked whether any of us had another jig we wanted to work on. I suggested “Scatter the Mud”, but it wasn’t until I played it that she recognized it. Apparently, the version I played was not what she had learned. She confirmed with another tutor that the version she plays is closer to the right one, but she would have to do some research to make sure:


The sources of tunes are very important. The way a tune goes is not to be found on the internet, but in historical sources, like O’Neill’s collections, or the way particular masters play it and have recorded it. She warned us to be careful in considering the sources of our tunes.

The class on day six consisted of playing through all our tunes again, with some individual work, and then meeting with all the accordion students to play one or two tunes we learned. All five groups learned different tunes, none of which I had ever heard. Tutors deliberately choose rare tunes so that everyone can experience learning them fresh.

The week was also filled with many sessions happening around the high street of Drumshanbo, starting early in the day and ending very late at night. In any one of the four pubs, there could be four sessions going on. The high street also featured many children playing music together, some dancing, with hats out for money. It was great to see such enthusiasm from these young kids, many of whom are playing very well! I attended sessions every night for the first four nights, and played in three, but by my third class I realised that I can play many tunes at speed without too many mistakes and can lead sets, but I’m not playing tunes in the “proper” traditional way.

Early on my tutor recognized some of my untraditional characteristics. One is my use of “Sharon Shannon” rolls, which are like triplets on the same note without any cuts. Another characteristic is my use of bass. B/C accordions have a much more limited bass side than my accordion, so the things I was doing didn’t sound right to her. Another characteristic I have is a general lack of rolls, cuts, and proper triplets. These ornaments, along with the rhythms, are what bring these tunes to life and give them a dynamism. A bad habit I have developed is playing staccato. This means that when I play the accordion it doesn’t sound like an accordion. Now, in some contexts that could be called masterful, but this is not one of those contexts. So I decided that I would benefit more from going over the tunes and ornaments I was learning at slow speeds than repeating playing all my tunes at speed in non-traditional ways.

I look forward to next year when I can audition with “Pigeon on the Gate” played in a traditional style!

Making sense of the folk-rnn v2 model, part 10

This is part 10 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8 and 9. In the last part we looked at the similarities of the activations inside folkrnn v2 as it generated a particular transcription. Today we are looking at how our observations change for a different generated transcription.

Here’s a strange transcription generated by folkrnn v2:

Screen Shot 2019-07-21 at 13.13.53.png

The second and third parts have a counting error, which can be fixed easily enough. The Scotch snap in bar 9 is unexpected. I like the raised leading tone in the last bar. Otherwise this is a pretty boring tune.

Here’s the Gramian of the matrix of one-hot encoded input/outputs:

Onehot.png

We see a lot of repetitions, forwards and backwards, which come from the stepwise motion up and down the minor scale.

Here’s the Gramian of the softmax output vectors:

softmax.png

We again see a large number of pairs are far from each other. Of the 12,720 unique pairs of different vectors, 5,745 have a distance greater than 1.98. Second, the probability distributions in some of the steps generating measure tokens are close to identical, which is different from before where nearly all of them appeared quite similar. However, we see at each point when a measure token is produced that the distributions are very different from all others (the crisscrossing powder blue lines). Third, many of the structures seen in \mathbf{X}^T \mathbf{X} are here as well, including some of the backward-slash diagonals. Fourth, we do not see the distributions produced during the first and penultimate bars of each part as overlapping much with other distributions. The distributions produced in the third bar seem to have the greatest dissimilarity to all others, which is curious because of its similarity to the first bar.

Looking at the Gramian of the normalised hidden-state activations of the three layers shows the same kinds of structures we saw before:

ezgif-1-4ac47a82bfe4.gif

Again, the Gramian of the normalised layer-two hidden-state activations appears most similar to the Gramian of the one-hot encoded input. The diagonal lines in the Gramian of the layer-3 hidden state activations are not as strong as before. And there now appear several shorter diagonal lines between the stronger ones.

Here is an animation showing the Gramian from the out gate activations in each layer:

ezgif-1-ac36201b9305.gif

There’s still similarity with the Gramians of the hidden state activations. The grid patterns are interesting. From the first layer output activations they demarcate the three parts of the tune. In the third layer the grid shows the bars.

Here’re the Gramians of the cell gate activations with the hyperbolic tangent:


And here they are without the nonlinearity:


Again we see the cell gate activation of each layer saturates more and more in the same direction as the generation process runs. The extent of this saturation is least present in the first layer, and appears to exist in all of the second and third parts of the transcription in the second layer. The cell gate activations in the third layer are curiously calm.

Here are the Gramians of the in-gate activations of each layer (pausing at the last layer):


Not much going on here that we don’t see in the other gates. And here is the Gramian of the unit norm forget gate activations of all layers:

ezgif-1-7050a92b060a.gif

The three sections are clearly visible.

As before, comparing the activations between gates in each layer does not show any of these structures.

So it seems many of our observations hold!




Making sense of the folk-rnn v2 model, part 9

This is part 9 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, and 8. As a brief review, the folkrnn v2 model maps elements of the standard basis of \mathbb{R}^{137} onto the positive surface of the unit L1-ball in \mathbb{R}^{137} by a series of nonlinear transformations. Denote the input at step t by {\bf x}_t. The first layer transforms this by the following algorithm:

{\bf i}_t^{(1)} \leftarrow \sigma({\bf W}_{xi}^{(1)}{\bf x}_t + {\bf W}_{hi}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_i^{(1)})
{\bf f}_t^{(1)} \leftarrow \sigma({\bf W}_{xf}^{(1)}{\bf x}_t + {\bf W}_{hf}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_f^{(1)})
{\bf c}_t^{(1)} \leftarrow {\bf f}_t^{(1)}\odot{\bf c}_{t-1}^{(1)} + {\bf i}_t^{(1)} \odot \tanh({\bf W}_{xc}^{(1)}{\bf x}_t + {\bf W}_{hc}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_c^{(1)})
{\bf o}_t^{(1)} \leftarrow \sigma({\bf W}_{xo}^{(1)}{\bf x}_t + {\bf W}_{ho}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_o^{(1)})
{\bf h}_t^{(1)} \leftarrow {\bf o}_t^{(1)}\odot \tanh({\bf c}_{t}^{(1)})

The second layer does the same but with different parameters, and acting on the first-layer hidden state activation {\bf h}_t^{(1)} to produce {\bf h}_t^{(2)}. The third layer does the same but with different parameters, and acting on {\bf h}_t^{(2)} to produce {\bf h}_t^{(3)}. The in-gate activation of layer n in step t is {\bf i}_t^{(n)}. That of the out-gate is {\bf o}_t^{(n)}. That of the forget gate is {\bf f}_t^{(n)}. And that of the cell gate is {\bf c}_t^{(n)}. The final softmax layer maps {\bf h}_t^{(3)} to a point on the positive surface of the L1 unit ball, which is also the probability mass distribution over the vocabulary \mathcal{V}:

{\bf p}_t \leftarrow \textrm{softmax}({\bf V}{\bf h}_t^{(3)} + {\bf v})
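To make the review concrete, here is a minimal NumPy sketch of one step through three layers and the softmax, following the equations above. The weights are random stand-ins, NOT the trained folk-rnn parameters. The hidden size of 512 is taken from the (0,1)^{512} gate spaces mentioned later in this post, and the helper names (`random_layer`, `lstm_step`) are hypothetical.

```python
import numpy as np

V_SIZE, HIDDEN = 137, 512  # vocabulary size and units per layer
GATES = "ifco"             # in, forget, cell and out gates

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def random_layer(n_in, rng):
    """Random stand-in parameters (NOT the trained weights)."""
    W_x = {g: 0.1 * rng.standard_normal((HIDDEN, n_in)) for g in GATES}
    W_h = {g: 0.1 * rng.standard_normal((HIDDEN, HIDDEN)) for g in GATES}
    b = {g: np.zeros(HIDDEN) for g in GATES}
    return W_x, W_h, b

def lstm_step(x, h_prev, c_prev, layer):
    """One step of the layer equations above; returns (h_t, c_t)."""
    W_x, W_h, b = layer
    pre = {g: W_x[g] @ x + W_h[g] @ h_prev + b[g] for g in GATES}
    i = sigmoid(pre["i"])
    f = sigmoid(pre["f"])
    c = f * c_prev + i * np.tanh(pre["c"])
    o = sigmoid(pre["o"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
layers = [random_layer(V_SIZE, rng), random_layer(HIDDEN, rng), random_layer(HIDDEN, rng)]
V = 0.1 * rng.standard_normal((V_SIZE, HIDDEN))
v = np.zeros(V_SIZE)

x = np.zeros(V_SIZE)
x[0] = 1.0  # one-hot input: an element of the standard basis of R^137
states = [(np.zeros(HIDDEN), np.zeros(HIDDEN)) for _ in layers]

signal = x
for k, layer in enumerate(layers):
    h, c = lstm_step(signal, *states[k], layer)
    states[k] = (h, c)
    signal = h  # the next layer acts on this layer's hidden state

p = softmax(V @ signal + v)  # a point on the positive surface of the unit L1-ball

n_params = V.size + v.size + sum(
    sum(W.size for W in W_x.values())
    + sum(W.size for W in W_h.values())
    + sum(bias.size for bias in b.values())
    for W_x, W_h, b in layers
)
print(n_params)
```

With these shapes the parameter count comes to 5,599,881, of which 70,281 belong to the softmax layer, which agrees with the figures in the title of the MuMe 2018 paper.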

In previous parts of this series, we analyzed the parameters, e.g., {\bf W}_{xi}^{(1)}. In this post we look at the activations in these layers during the generation of a transcription. Let’s consider this one from my MuMe 2018 paper and part 8 of this endless series:

Screen Shot 2019-07-20 at 00.33.50.png

Let’s first look at the one-hot encoded inputs. Denote \mathbf{X} := [\mathbf{x}_1 \mathbf{x}_2 \ldots \mathbf{x}_t] as the matrix of concatenated one-hot encoded vectors. The following shows the Gramian matrix \mathbf{X}^T \mathbf{X}:

The axes are labeled with the tokens of the transcription. I draw a line at each measure token so we can more easily relate the structures we see in the picture to the bars in the notated transcription above. We clearly see a few different structures. Forward-slash diagonals show repetitions. We see material in bar 1 repeated in bars 2, 3, 5, 6, 9, and 13. The largest repetitions are in bars 5-6 and 13-14. We also see some short backward-slash diagonals. These are also repetitions but reversed, e.g., “A, 2 G,” and “G, 2 A,” in bars 1 and 2. We can see all of this from the notated transcription. How will the various nonlinear transformations performed within folkrnn v2 relate to each other for this generation?
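For readers who want to reproduce this kind of picture, here is a small sketch of the Gramian of one-hot encoded tokens. The token sequence is a toy fragment built from the reversed pair mentioned above, not the full transcription, and the vocabulary is just the set of tokens in that fragment.

```python
import numpy as np

# Toy fragment: "A, 2 G," followed by its reversal "G, 2 A,".
tokens = ["A,", "2", "G,", "|", "G,", "2", "A,", "|"]
vocab = sorted(set(tokens))
index = {tok: k for k, tok in enumerate(vocab)}

# X has one one-hot column per step of the sequence.
X = np.zeros((len(vocab), len(tokens)))
X[[index[tok] for tok in tokens], np.arange(len(tokens))] = 1.0

# Entry (s, t) of the Gramian is 1 exactly when steps s and t hold the same token.
G = X.T @ X
print(G.astype(int))
```

In the printed matrix the reversed repetition appears as ones at (0, 6), (1, 5) and (2, 4), which run along the opposite diagonal to the main one.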

Let’s first consider the softmax output of the last layer, which is also a probability distribution over the vocabulary. We take all these points output by folkrnn v2 and find their pairwise L1 distances on the positive surface of the L1-unit ball. If two of these vectors have a distance of 2, then their supports do not overlap, or equivalently their distributions place all of their mass on disjoint sets of tokens.

Pairs of points colored sanguine have a distance in [0,0.02], and those colored powder blue have a distance in [1.98, 2]. All others in [0.02,1.98] are colored in gray scale. A few things become clear. First, a large number of pairs are far from each other. Of the 5,671 unique pairs of different vectors, 2,642 have a distance greater than 1.98. Second, the probability distributions in nearly all steps generating measure tokens are close to identical. Third, the distributions co-occurring with the repetitions in the first part are highly similar, reflecting the structures in \mathbf{X}^T \mathbf{X} seen above. We do not, however, see the short backward-slash diagonals. Fourth, the distributions produced during the first and penultimate bars of each part overlap with nearly all other distributions except for those producing measure tokens; and the distributions produced after the middle of each section (bars 5 and 13) seem to have the greatest dissimilarity to all others. (Does the latter arise from the model having learned how parts of a tune define other parts?)
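The distance computation and the pair counting can be sketched as follows. The matrix `P` is a made-up example with one distribution per column, and the function names are hypothetical; the threshold of 1.98 is the one used above.

```python
import numpy as np

def pairwise_l1(P):
    """L1 distance between every pair of columns of P (one distribution per column)."""
    return np.abs(P[:, :, None] - P[:, None, :]).sum(axis=0)

def count_far_pairs(D, threshold=1.98):
    """Count unique pairs of different steps whose distance exceeds the threshold."""
    rows, cols = np.triu_indices(D.shape[0], k=1)
    return int((D[rows, cols] > threshold).sum())

# Four made-up distributions over a three-token vocabulary.
P = np.array([
    [0.5, 0.0, 0.5, 0.9],
    [0.5, 0.0, 0.5, 0.1],
    [0.0, 1.0, 0.0, 0.0],
])
D = pairwise_l1(P)
n_steps = P.shape[1]
n_pairs = n_steps * (n_steps - 1) // 2
n_far = count_far_pairs(D)
print(D.round(2))
print(n_far, "of", n_pairs, "pairs are far apart")
```

The second column shares no support with the others, so its distance to each of them is 2. The number of unique pairs for T steps is T(T-1)/2, so the 5,671 pairs reported above are consistent with a transcription of 107 steps.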

Now let’s look at the similarities of activations within the network during the generation of the transcription above. First, let’s look at the hidden state activations in the third layer. We construct the matrix \widehat{\mathbf{H}}^{(3)} := [{\bf h}_1^{(3)}/\|{\bf h}_1^{(3)}\|, {\bf h}_2^{(3)}/\|{\bf h}_2^{(3)}\|, \ldots ], and look at the magnitude of the Gramian (\widehat{\mathbf{H}}^{(3)})^T \widehat{\mathbf{H}}^{(3)}.

hidddenstate_3.png

In this case sanguine shows activations that point in the same direction (ignoring polarity), powder blue shows activations that are orthogonal, and all others are colored gray scale, with darker shades showing smaller angles between pairs of activations. These patterns are intriguing! It seems that more or less as the model steps through the generation, the hidden state activations of the third layer point in similar directions relative to the bar lines. That is, the hidden state activations near to step \tau after each barline point in similar directions, but different from those in the other steps after each barline.
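A minimal sketch of this computation, assuming the activations are collected as the columns of a matrix. The matrix `H` here holds random stand-in values rather than real hidden states, and `gramian_magnitude` is a hypothetical helper name.

```python
import numpy as np

def gramian_magnitude(A):
    """Scale each column of A to unit Euclidean norm, then return |A_hat^T A_hat|."""
    A_hat = A / np.linalg.norm(A, axis=0, keepdims=True)
    return np.abs(A_hat.T @ A_hat)

rng = np.random.default_rng(0)
H = np.tanh(rng.standard_normal((512, 6)))  # stand-in for six steps of activations
H[:, 1] = -0.5 * H[:, 0]                    # same direction as step 0, opposite polarity

G = gramian_magnitude(H)
print(G.round(2))
```

Entries near 1 correspond to activations pointing along the same line, whatever their polarity or length, and entries near 0 to activations that are close to orthogonal. The same function applies unchanged to the gate activations considered below.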

This feature becomes less present in the hidden state activations of the shallower layers above. Here is the magnitude of (\widehat{\mathbf{H}}^{(2)})^T \widehat{\mathbf{H}}^{(2)}:

hidddenstate_2.png

And here is that of the first layer, (\widehat{\mathbf{H}}^{(1)})^T \widehat{\mathbf{H}}^{(1)}:

hidddenstate_1.png

Let’s compare all of these to \mathbf{X}^T \mathbf{X}. The following animation cycles through each pair starting with the first layer (and pausing at the last layer).

ezgif-1-4bc87acfa319.gif

First we see a shift of one pixel to the top right, which comes from the one-step delay between the input and output fed back into the network, i.e., each hidden state activation comes from processing the output generated in the previous step. Second, we see that the magnitude Gramian of the activations in the 2nd layer bears the most resemblance to \mathbf{X}^T \mathbf{X}. I don’t know why. Could it be that the second layer decides what to put and the third layer decides where to put?

When we look at the activations of the out gate in each layer, we see a high similarity with the hidden state activations of the same layer, which is expected given how the hidden state activation is a function of the output gate activations. As before, we assemble the matrix \widehat{\mathbf{O}}^{(n)} := [{\bf o}_1^{(n)}/\|{\bf o}_1^{(n)}\|, {\bf o}_2^{(n)}/\|{\bf o}_2^{(n)}\|, \ldots ], and look at the Gramian (\widehat{\mathbf{O}}^{(n)})^T \widehat{\mathbf{O}}^{(n)}. Here is an animation cycling through the Gramian of these out-gate activations for each layer (pausing at the last layer):

ezgif-1-c3501a120ac6.gif

Let’s have a look at the cell gate activations. First we look at the Gramian of \tanh\widehat{\mathbf{C}}^{(n)} := [\tanh{\bf c}_1^{(n)}/\|\tanh{\bf c}_1^{(n)}\|, \tanh{\bf c}_2^{(n)}/\|\tanh{\bf c}_2^{(n)}\|, \ldots ]. Here is an animation showing these from the first to the third layer (pausing at the last layer):

ezgif-1-6d264c13bdd5.gif

We can see faint echoes of the structures in the Gramian of the activations of the out gate and hidden state of each layer. Now let’s have a look at the Gramians without the hyperbolic tangent, i.e., of \widehat{\mathbf{C}}^{(n)} := [{\bf c}_1^{(n)}/\|{\bf c}_1^{(n)}\|, {\bf c}_2^{(n)}/\|{\bf c}_2^{(n)}\|, \ldots ] (pausing at the last layer):

ezgif-1-a19fd1e096fd.gif

This shows that the cell gate activation of a layer saturates more and more in the same direction as the generation process runs. The extent of this saturation is least present in the first layer, and appears to exist in all of the second part of the transcription in the second and third layers.

Here are the Gramians of the in-gate activations of each layer (pausing at the last layer):

ezgif-1-7fe2f786e9d3.gif

And that leaves the activations produced by the forget gate:


These activations appear to be nearly saturated across the entire generation, but we do see the same structures as in the Gramians of the in- and out-gate activations.

Now, in the work I presented at the 2018 ICML Workshop: Machine Learning for Music, I show how each one of the four gates in the first layer seems to store information about token types in different subspaces of (0,1)^{512} (or \mathbb{R}^{512} for the cell gate). Here’s the relevant slide:

Screen Shot 2019-07-20 at 13.46.45.png

Let’s look at the following matrix: (\widehat{\mathbf{I}}^{(1)})^T \widehat{\mathbf{O}}^{(1)}, that is, the set of inner products of all unit-norm activations of the first-layer in gate with those of the first-layer out gate. Here they are with out-gate activations along the vertical axis (pausing on the last layer):

These show none of the structures above. The activations of these two gates are thus pointing in ways that do not strongly relate over the steps. It is the same when we compare the forget-gate activations to those of both the in- and out-gate activations. These structures do not appear either in the comparison of the hidden state activations and in-, out- and forget-gate activations. So, taken with my above theoretical observation of the parameters in the gates of the first layer, it seems the same holds true of the deeper layers: each gate of a layer is projecting information in ways that are not directly related.

Now, how does all of the above change with a different transcription?

Unintended Uses

When we created the website in 2018, we intended it to be a venue for crowd-sourcing “Machine Folk”: music created by or with artificial intelligence. So far, over 700 tunes have been added, most of them by anonymous users. Over 60 recordings have also been added.

However, outside of my own use of the website, by far the biggest use has been strange automatic registrations like the below. There have been over 800 of these created since we opened the website. Because registration is a two-step process, none of them have become fully registered.

Screen Shot 2019-07-11 at 10.55.41.png

Are these automatic registrations performed by bots? Are these bots coming for the music? Or are they spammers trying to infiltrate the bustling world of machine folk enthusiasts (current population in the low single digits)?