# Musik/Data – Data/Musik Oct 15 2019

Tickets are now available: https://www.kmh.se/konserter—evenemang/alla/musik-data—data-musik.html

Here’s the promotional image:

# Making sense of the folk-rnn v2 model, part 12

This is part 12 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. In this part, I continue looking at beam search for sampling from the model. We will now investigate restricting the search of the sample space.

Let’s limit the number of branches we explore at each depth in computing the joint probability mass function, and consider searching a fraction of them. First, let’s consider a depth $n=2$ and plot the amount of probability mass we miss at the leaves as a function of transcription step and depth-dependent beam width. In this case, we set the beam width at a particular depth to be a fraction of the total number of branches from that depth.

Each line above shows the probability mass we miss at each generation step when we compute $P_{X_{t},X_{t+1}|X_{t-1}, \ldots}(x_t,x_{t+1}| x_{t-1}, \ldots )$ for only a fraction of the sample space (shown in legend). In these realizations, when only considering those 5% of outcomes having the most probability mass (7 tokens at each depth) at each transcription step, we miss on average about 1% of the total probability mass. When only considering the top 10% (14 tokens at each depth), the average mass we miss is 0.1%. Considering only the top 20% (28 tokens at each depth), the average mass we miss drops another order of magnitude to 0.01%. So it seems we can get by reasonably well only searching a few dozen branches from each branch.

Here’s the missing probability mass at the leaves of a tree with depth 3:

The average amount of mass we miss when considering only the top 5% of outcomes is about 2% of the total probability mass. When only considering the top 10%, the average mass we miss is 0.1%. Considering only the top 20%, the average mass we miss drops another order of magnitude to 0.01%. We also see in some cases that the transcriptions generated using at least 15% of the vocabulary are identical.

Finally, let’s consider a depth of 4. This time, we also restrict the sample space to those 4-tuples having a joint probability greater than 1e-6. (Otherwise, a beam width of 28 results in over 3,000,000 outcomes at each step.)

Now we are beginning to see an increase of missing probability mass. For a beam width of 5% at each depth we have about 3% missing. For 10% we have 0.4% missing, and for 15% we have about 0.3%. We also observe, expectedly, that the time it takes to produce a transcription increases. The normal folkrnn v2 model takes about 100 milliseconds to generate a transcription. For a beam width of 10% at each depth, a depth of two takes about 10 seconds. The same width for depth three takes about 30 seconds. And for depth of 4 it is about 5 minutes. The algorithm can be parallelized at the leaves to help reduce this. The beam width can also be restricted to a total number of branches in the entire tree (which we explore below), or adapted to exploring only those branches with the largest probability mass that sum to 0.9 in each layer. Etc.

Let’s look at the transcriptions generated using beam search using a beam width of 10% in trees with depth $n$.

Henrik Norbeck, who is a walking and talking catalogue of Irish traditional dance music, and who runs the weekly session in Stockholm, remarks:

This is very reminiscent of the first part of The Fermoy Lasses (played by The Dubliners here), to the point where I would even call it a variation on that tune – especially the first part. In fact Seamus Egan’s rendition of The Fermoy Lasses is probably further removed from the original tune than the first part of the generated tune. Another thing I observe is that it feels like it needs a third part. The first two parts are rather similar, but a third part that `goes somewhere else´ would make it a good tune. At the moment with two parts it feels rather boring.

Henrik remarks on this one:

This tune sounds like it could have been composed by Paddy Fahy or Sean Ryan. There are already two tunes by them that are similar to each other – so much that in my mind they are connected – and this generated one becomes a third tune in the same class, but still a distinct tune.

Let’s see what happens with a tree of depth 7 initialized with 6/8 meter and major key tokens, and a beam width of 7% at each depth. This roughly corresponds to the model generating a joint probability distribution over entire bars. After about 16 hours of computation, here’s the resulting transcription:

The end of each part gives this the felling of a slide rather than a jig. The second part of this tune is more interesting than the first part, but I do like how the cadences at the end of both parts are in contrary motion.

The computation time could have been longer if I didn’t restrict the sample space at each step to be far smaller than $137\times 10^7$. For instance, I only evaluated the leaves from branches with $P_{X_7| X_1, \ldots, X_6}(x_7|x_1,\ldots,x_6) > 0.01$. Even so, and for this small beam width, the only step in which the probability mass was less than 0.95 was in the first step (generating the “|:” token), where it was about 0.83.

Though these previous four transcriptions come from the same random seed initialization as “Why are you?” (shown at the very top), each is quite different from one another. I especially like the tune produced with a tree of depth 4. I can’t really make any solid claim yet as to whether the quality of the generated transcriptions improve with deeper trees at this time, but my feeling is that the generated transcriptions seem more coherent with parts that make more sense than when having folkrnn v2 generate one token at a time.

Now let’s look what happens when we set the beam width to be a static number $\beta$. This means that we build a tree from the root to the leaves using only $\beta$ branches. Now we are missing a major amount of probability mass.

Here’s a lovely hornpipe generated from a tree with only 7 branches:

Doubling the number of branches but using the same random seed produces a rather poor tune:

Increasing the width to 21 but using the same random seed gives this excellent reel:

That bar 12 and 16 quote the main idea of the first part gives coherence to the tune. That first measure does not appear anywhere in the training data.

With this approach to strict beam width, the generation of entire measures at once becomes very fast. Now the generation of entire transcriptions takes only a few seconds with a beam width of 10. Here’s some example outputs created with the same beam width of $\beta=10$:

One thing we notice is that when we seed folkrnn v2 with a particular meter and mode and sample with small beam widths, it generates quite similar transcriptions each time even though the random seed is different. Here are the results from four different random seeds using 2/4 meter and major mode:

Here’s the same for a 4/4 meter and major mode:

It’s interesting that the melodic contours are very similar. Increasing the beam width introduces more variety in the outputs with the change of the random seed initialization.

Other variations of beam search are possible. For instance, we can restrict searching branches from the root that are within particular pitch classes.

# Making sense of the folk-rnn v2 model, part 11

This is part 11 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10! Findings from these have appeared in my MuMu 2018 paper, “What do these 5,599,881 parameters mean? An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of the softmax layer”, and my presentation at the Joint Workshop on Machine Learning for Music ICML 2018: “How Stuff Works: LSTM Model of Folk Music Transcriptions” (slides here). There’s still a lot that can be done here to pick apart and understand what is going on in the model. It’s a fascinating exercise, trying to interpret this complex model. In this part, I start looking at beam search as an alternative approach to sampling from the model.

Let’s briefly return to what folkrnn v2 really is: a model of a joint probability mass distribution $P_{X_1,X_2, \ldots}(x_1,x_2, \ldots)$, where $X_t: \mathcal{V} \to [1,137]$ is a random variable at step $t \in [1, 2, \ldots]$, and the vocabulary $\mathcal{V}$ is a set of ABC-like tokens. This function can be equivalently written by the chain rule:

$P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) P_{X_3|X_1,X_2}(x_3|x_1, x_2) P_{X_4|X_1,X_2,X_3}(x_4|x_1, x_2, x_3)\ldots$

Training a folkrnn model involves adjusting its parameters to maximize each one of these conditional probabilities for real sequences sampled from a dataset. When generating new sequences, we merely sample from each estimate of $P_{X_t|X_1,X_2,\ldots, X_{t-1}}(x_t|x_1, x_2, \ldots, x_{t-1})$, until we have sampled the stop token. This produces, for instance, the following transcription:

Another random sampling will give a different sequence. This is how folkrnn.org is implemented, one token sampled at a time from a dynamic posterior distribution over the vocabulary.

We can factor the joint probability distribution another way, however:

$P_{X_1,X_2, \ldots}(x_1,x_2, \ldots) = P_{X_1,X_2}(x_1,x_2) P_{X_3,X_4|X_1,X_2}(x_3,x_4|x_1, x_2) P_{X_5,X_6|X_1,X_2,X_3,X_4}(x_5,x_6|x_1, x_2, x_3,x_4)\ldots$

These are distributions of pairs of random variables. While folkrnn v2 was not explicitly trained to maximize these for real sequences from the training data, I believe that the use of teacher forcing means it was equivalently trained to maximize these joint probabilities. This factorization also shows a different way to generate sequences using folkrnn v2, and one that can be argued to consider more context in generation:

1. compute $P_{X_1}(x_1)$
2. compute $P_{X_2|X_1}(x_2|x_1)$ for each $x_1 \in \mathcal{V}$
3. compute $P_{X_1,X_2}(x_1,x_2) = P_{X_2|X_1}(x_2|x_1)P_{X_1}(x_1)$ for each $(x_1,x_2) \in \mathcal{V}\times\mathcal{V}$
4. sample a pair of tokens from this distribution
5. update the states in the model with the sampled tokens, and repeat the above until one of the pairs of sampled tokens is the stop token.

In this situation, we are computing probability distributions over the sample space $\mathcal{V} \times \mathcal{V}$. For a vocabulary of 137 tokens, this has 18,769 outcomes. We don’t need to stop at pairs of tokens, because we can also factor the joint probability distribution using tuples of size 3, which leads to a sample space 2,571,353 outcomes. Pretty quickly however our sample space grows larger than trillions of outcomes … so there’s a limit here given the time I have left to live. We can however bring things under control by approximating these joint probability distributions.

Let’s think of this computational procedure as building a tree. For the general case of $n$-tuples of tokens, we create the first $|\mathcal{V}|$ branches by computing $P_{X_1}(x_1)$. From each of those branches we create $|\mathcal{V}|$ more branches by computing $P_{X_2|X_1}(x_2|x_1)$. We continue in this way to a depth of $n$ and create all the leaves of the tree by computing the product of all values on the branches above it:

$P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1) \ldots P_{X_n|X_1,X_2,\ldots, X_{n-1}}(x_n|x_1, x_2, \ldots,x_{n-1}) = P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n)$

Thinking of the procedure in this way motivates a question: how well can we estimate $P_{X_1,X_2, \ldots, X_n}(x_1,x_2, \ldots, x_n)$ by searching fewer than $|\mathcal{V}|^n$ leaves? We know from the expression above that we are multiplying conditional probabilities, so if at some depth one of those is zero or very small compared to others, we might as well trim that branch then and there. Another possibility is more strict: search only the $\beta$ branches at each depth that have the greatest probability. In this case, computing the probability distribution for tuples of size 3 requires computing $\beta^3$ leaves instead of $|\mathcal{V}|^3 = 2,571,353$. Then after having computed those leaves, we ignore all others and scale the joint probability mass distribution to sum to unity. That strategy is known as beam search. Using a beam width of $\beta = \infty$ computes all of the leaves at any depth. Making the width smaller saves computation, but at the price of a poorer estimate of the joint probability distribution. Let’s see how sampling from folkrnn v2 in this way performs.

The following transcription was generated using tuples of size $n=2$, and beam width of $\beta = \infty$ (meaning we compute the probability distribution over the entire sample space $\mathcal{V}\times\mathcal{V}$):Though we use the same random initialization as “Why are you …” above, this produces a dramatically different transcription. The first pair of tokens selected are meter and mode, which are both different from before. Repeat bar lines are missing at the end of the two parts, but as a whole the tune holds together with repetition and variation of simple ideas, and cadences at the right places. Here I play it at a slow pace on the Mean Green Machine Folk Machine, imposing an AABB structure:

Here’s another transcription produced with a different random seed initialization.

There are no obvious mistakes and it doesn’t appear to be copied from the training data. And it’s a nice tune, playing well as a hornpipe. Here it is on Mean Green Machine Folk Machine:

Here’s a jig created by seeding the network with a 6/8 meter token:

Another good tune that appears to be novel. Here it is on Mean Green Machine Folk Machine:

In the next part we will look at making $\beta < \infty$.

# SweDS19: Second call of presentations, posters and sponsors

http://www.kth.se/sweds19

The Swedish Workshop on Data Science (SweDS) is a national event aiming to maintain and develop data science research and its application in Sweden by fostering the exchange of ideas and promoting collaboration within and across disciplines. SweDS brings together researchers and practitioners working in a variety of academic, commercial or other sectors, and in the past has included presentations from a variety of domains, e.g., computer science, linguistics, economics, archaeology, environmental science, education, journalism, medicine, healthcare, biology, sociology, psychology, history, physics, chemistry, geography, forestry, design, and music.

SweDS19 is organised by the School of Electrical Engineering and Computer Science, KTH.

October 15–16, KTH, Stockholm Sweden

# Some observations from my week at the 2019 Joe Mooney Summer School

I arrived to the 2019 Joe Mooney Summer School knowing how to play about 70 tunes, but I left a week later knowing how to play three. That’s a good thing.

This was my first “music camp” – at 43 years old! I didn’t know what to expect, other than lots of music. I signed up for the courses in button accordion, with my D/G box – quite a strange tuning in Ireland but nonetheless not entirely incompatible with the music (more on that below).

The concert to open the week featured the group “Buttons & Bows“. Among the players is the superstar accordion player Jackie Daly. I later learned that Daly’s playing style is quite different from that of the accordion tutors, who seemed to all be students of Joe Burke, who was greatly influenced by the playing of Paddy O’Brien. At a few points in the concert Daly made comments to the extent that polkas aren’t given the respect they deserve. Then he would play a set of polkas. He related one funny story about a friend of his slagging polkas. So Daly wrote a polka and named it after his friend. I will be revisiting the way I play the Ballydesmond polkas, and will model them on Daly’s style. He will also be publishing a book soon collecting his compositions from his many years of playing.

On the first day of classes I found myself in a small room of about 60 accordion students. The average age was surely below 15. The youngest was probably 6 or 7. I was one of 10 adults, at least four of whom had traveled from outside Ireland (including Australia, Canada, England and Sweden). We each had to play a tune individually to be assigned to one of the five tutors – including two All Ireland Champions! When my turn came I started to play “Pigeon on the Gate”, but I wasn’t far into it before I was assigned to level 3, Nuala Hehir. Some students played a scale, a jig or a polka for their grading, but the tutors asked if they knew any reels. Reels are the most technically demanding to play.

There were 15 students in my class, including 6 adults. We each had to play a tune solo again for the tutor to hear. I played a bit of “Drowsy Maggie.” It wasn’t long before the tutor recognised several non-traditional characteristics – which is entirely to be expected since I haven’t had proper lessons in Irish accordion. More on this below.

In the six days of the course, we learned to play four tunes: two reels and two jigs. The first reel we learned is called “Crossing the Shannon” (called “The Funny Reel” here: https://youtu.be/FXqlUCOZBcc?t=50). The tutor played the entire tune for us to give us an idea of what it sounds like. Then she wrote up a textual ABC-like notation on the white board:

The circles denote crochets. The ticks on the letter denote an octave above the middle. Numbers denote fingering for B/C accordions. And the slur underneath two pitches denote sliding a finger on two buttons for B/C accordions.

The course proceeded with the tutor playing a few bars at a time with the ornamentation, and then the students playing along several times. In this tune, the important ornaments are cuts and a roll. Every second D’ can be rolled: D’-E’-D’-C#’-D’. In this case the roll happens in the duration of a crotchet. The E’ is a cut on the D’. A cut should be a nearly imperceptible blip. It doesn’t have any tonal value, but subtly changes the attack of a note. Cuts are often used by accordion, fiddle and flute and whistle when a note repeats. My D/G box can play a D roll only with a change of bellows direction to catch the E and the C#. A roll has to be smooth, so all pitches of the roll have to be played with the same bellows direction. Since we could not find any alternative, I must live with just cutting the D’ with an #F’.

The tutor had each student individually reproduce bars of the tune and coached them into improving it. Then she continued through these steps until we had a whole part. In the first day we made it through the first part of the tune, and recorded the tutor playing the second part at a slow speed so we can individually work on it for the next day.

On the second day we work-shopped the first part of “Crossing the Shannon” and moved on to learning the second part in the same way. Learning the second part wasn’t too hard because it mostly repeats material we already learned in the first part. By the end of the first half we had our first tune! The tutor had each student individually play the entire tune with repeats and helped them improve rolls and cuts, etc. She encouraged the students to not read the notation on the board.

In the second part of the day, the tutor gave us a single reel, “Glentown Reel”:

In this tune we have cuts, rolls, and triplets – all of which are possible on my accordion. The lines over the B’s remind the B/C student to play the outside row B. Some of the cuts are also made explicit. Learning this tune took most of day three. Before the end of the session, the tutor gave us a part of a jig (the second ending would be given the following day):

The tutor didn’t remember what it was called, but remembered she learned it from a particular teacher. She had us play a part of the first section, and then played the entire tune solo so we could record it and learn by ourselves at home. With the help of a friend I learned that the jig is similar to one called “The Road to Granard”.

On day four we went through both “Crossing the Shannon” and “Glentown Reel”, and finished learning the jig with the two endings (not pictured in the notation above). This jig has no rolls, but does involve cuts and triplets. Also, the tutor varied the use of triplets and showed how not every note needs to be ornamented in the same way. She also showed how a tune can be played beautifully without ornamentation.

On day five we went through all our tunes. Then the tutor asked whether any of us had another jig we wanted to work on. I suggested “Scatter the Mud”, but it wasn’t until I played it that she recognized it. Apparently, the version I played was not what she had learned. She confirmed with another tutor that the version she plays is closer to the right one, but she would have to do some research to make sure:

The sources of tunes are very important. The way a tune goes is not to be found on the internet, but in historical sources, like O’Neill’s collections, or the way particular masters play it and have recorded it. She warned us in considering the sources of our tunes.

The class on day six consisted of playing through all our tunes again, with some individual work, and then meeting with all the accordion students to play one or two tunes we learned. All five groups learned different tunes, none of which I had ever heard. Tutors deliberately choose rare tunes so that everyone can experience learning them fresh.

The week was also filled with many sessions happening around the high street of Drumshanbo, starting early in the day and ending very late at night. In any one of the four pubs, there could be four sessions going on. The high street also featured many children playing music together, some dancing, with hats out for money. It was great to see such enthusiasm from these young kids, many of which are playing very well! I attended sessions every night for the first four nights, and played in three, but by my third class I realised that I can play many tunes at speed without too many mistakes and can lead sets, but I’m not playing tunes in the “proper” traditional way.

Early on my tutor recognized some of my untraditional characteristics. One is my use of “Sharon Shannon” rolls, which are like triplets on the same note without any cuts. Another characteristic is my use of bass. B/C accordions have a much more limited bass side than my accordion, so the things I was doing didn’t sound right to her. Another characteristic I have is a general lack of rolls, cuts, and proper triplets. These ornaments, along with the rhythms, are what bring these tunes to life and gives them a dynamism. A bad habit I have developed is playing staccato. This means that when I play the accordion it doesn’t sound like an accordion. Now, in some contexts that could be called masterful, but this is not one of those contexts. So I decided that I would benefit more from going over the tunes and ornaments I was learning at slow speeds than repeating playing all my tunes at speed in non-traditional ways.

I look forward to next year when I can audition with “Pigeon on the Gate” played in a traditional style!

# Making sense of the folk-rnn v2 model, part 10

This is part 10 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8 and 9. In the last part we looked at the similarities of the activations inside folkrnn v2 as it generated a particular transcription. Today we are looking at how our observations change for a different generated transcription.

Here’s a strange transcription generated by folkrnn v2:

The second and third parts have a counting error, which can be fixed easy enough. The Scotch snap in bar 9 is unexpected. I like the raised leading tone in the last bar. Otherwise this is a pretty boring tune.

Here’s the Gramian of the matrix of one-hot encoded input/outputs:

We see a lot of repetitions, forwards and backwards, which comes from the stepwise up and down the minor scale.

Here’s the Gramian of the softmax output vectors:

We again see a large number of pairs are far from each other. Of the 12,720 unique pairs of different vectors, 5,745 have a distance greater than 1.98. Second, the probability distributions in some of the steps generating measure tokens are close to identical — which is different from before where nearly all of them appeared quite similar. However, we see at each point when a measure token is produced that the distributions are very different from all others (the crisscrossing powder blue lines). Third, many of the structures seen in $\mathbf{X}^T \mathbf{X}$ are here as well, including some of the backward-slash diagonals. Fourth, we do not see the distributions produced during the first and penultimate bars of each part as overlapping much with other distributions. The distributions produced in the third bar seems to have the greatest dissimilarity to all others — which is curious because of its similarity to the first bar.

Looking at the Gramian of the normalised hidden-state activations of the three layers shows the same kinds of structures we saw before:

Again, the Gramian of the normalised layer-two hidden-state activations appears most similar to the Gramian of the one-hot encoded input. The diagonal lines in the Gramian of the layer-3 hidden state activations are not as strong as before. And there now appear several shorter diagonal lines between the stronger ones.

Here is an animation showing the Gramian from the out gate activations in each layer:

There’s still similarity with the Gramians of the hidden state activations. The grid patterns are interesting. From the first layer output activations they demarcate the three parts of the tune. In the third layer the grid shows the bars.

Here’re the Gramians of the cell gate activations with the hyperbolic tangent:

And here they are without the nonlinearity:

Again we see the cell gate activation of each layer saturates more and more in the same direction as the generation process runs. The extent of this saturation is least present in the first layer, and appears to exist in all of the second and third parts of the transcription in the second layers. The cell gate activations in the third layer are curiously calm.

Here are the Gramians of the in-gate activations of each layer (pausing at the last layer):

Not much going on here that we don’t see in the other gates. And here is the Gramian of the unit norm forget gate activations of all layers:

The three sections are clearly visible.

As before, comparing the activations between gates in each layer does not show any of these structures.

So it seems many of our observations hold!