# Making sense of the folk-rnn v2 model, part 12

This is part 12 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. In this part, I continue looking at beam search for sampling from the model. We will now investigate restricting the search of the sample space.

Let’s limit the number of branches we explore at each depth in computing the joint probability mass function, and consider searching a fraction of them. First, let’s consider a depth $n=2$ and plot the amount of probability mass we miss at the leaves as a function of transcription step and depth-dependent beam width. In this case, we set the beam width at a particular depth to be a fraction of the total number of branches from that depth. Each line above shows the probability mass we miss at each generation step when we compute $P_{X_{t},X_{t+1}|X_{t-1}, \ldots}(x_t,x_{t+1}| x_{t-1}, \ldots )$ for only a fraction of the sample space (shown in legend). In these realizations, when only considering those 5% of outcomes having the most probability mass (7 tokens at each depth) at each transcription step, we miss on average about 1% of the total probability mass. When only considering the top 10% (14 tokens at each depth), the average mass we miss is 0.1%. Considering only the top 20% (28 tokens at each depth), the average mass we miss drops another order of magnitude to 0.01%. So it seems we can get by reasonably well only searching a few dozen branches from each branch.

Here’s the missing probability mass at the leaves of a tree with depth 3: The average amount of mass we miss when considering only the top 5% of outcomes is about 2% of the total probability mass. When only considering the top 10%, the average mass we miss is 0.1%. Considering only the top 20%, the average mass we miss drops another order of magnitude to 0.01%. We also see in some cases that the transcriptions generated using at least 15% of the vocabulary are identical.

Finally, let’s consider a depth of 4. This time, we also restrict the sample space to those 4-tuples having a joint probability greater than 1e-6. (Otherwise, a beam width of 28 results in over 3,000,000 outcomes at each step.) Now we are beginning to see an increase of missing probability mass. For a beam width of 5% at each depth we have about 3% missing. For 10% we have 0.4% missing, and for 15% we have about 0.3%. We also observe, expectedly, that the time it takes to produce a transcription increases. The normal folkrnn v2 model takes about 100 milliseconds to generate a transcription. For a beam width of 10% at each depth, a depth of two takes about 10 seconds. The same width for depth three takes about 30 seconds. And for depth of 4 it is about 5 minutes. The algorithm can be parallelized at the leaves to help reduce this. The beam width can also be restricted to a total number of branches in the entire tree (which we explore below), or adapted to exploring only those branches with the largest probability mass that sum to 0.9 in each layer. Etc.

Let’s look at the transcriptions generated using beam search using a beam width of 10% in trees with depth $n$. Henrik Norbeck, who is a walking and talking catalogue of Irish traditional dance music, and who runs the weekly session in Stockholm, remarks:

This is very reminiscent of the first part of The Fermoy Lasses (played by The Dubliners here), to the point where I would even call it a variation on that tune – especially the first part. In fact Seamus Egan’s rendition of The Fermoy Lasses is probably further removed from the original tune than the first part of the generated tune. Another thing I observe is that it feels like it needs a third part. The first two parts are rather similar, but a third part that `goes somewhere else´ would make it a good tune. At the moment with two parts it feels rather boring.  Henrik remarks on this one:

This tune sounds like it could have been composed by Paddy Fahy or Sean Ryan. There are already two tunes by them that are similar to each other – so much that in my mind they are connected – and this generated one becomes a third tune in the same class, but still a distinct tune.

Let’s see what happens with a tree of depth 7 initialized with 6/8 meter and major key tokens, and a beam width of 7% at each depth. This roughly corresponds to the model generating a joint probability distribution over entire bars. After about 16 hours of computation, here’s the resulting transcription: The end of each part gives this the felling of a slide rather than a jig. The second part of this tune is more interesting than the first part, but I do like how the cadences at the end of both parts are in contrary motion.

The computation time could have been longer if I didn’t restrict the sample space at each step to be far smaller than $137\times 10^7$. For instance, I only evaluated the leaves from branches with $P_{X_7| X_1, \ldots, X_6}(x_7|x_1,\ldots,x_6) > 0.01$. Even so, and for this small beam width, the only step in which the probability mass was less than 0.95 was in the first step (generating the “|:” token), where it was about 0.83.

Though these previous four transcriptions come from the same random seed initialization as “Why are you?” (shown at the very top), each is quite different from one another. I especially like the tune produced with a tree of depth 4. I can’t really make any solid claim yet as to whether the quality of the generated transcriptions improve with deeper trees at this time, but my feeling is that the generated transcriptions seem more coherent with parts that make more sense than when having folkrnn v2 generate one token at a time.

Now let’s look what happens when we set the beam width to be a static number $\beta$. This means that we build a tree from the root to the leaves using only $\beta$ branches. Now we are missing a major amount of probability mass. Here’s a lovely hornpipe generated from a tree with only 7 branches: Doubling the number of branches but using the same random seed produces a rather poor tune: Increasing the width to 21 but using the same random seed gives this excellent reel: That bar 12 and 16 quote the main idea of the first part gives coherence to the tune. That first measure does not appear anywhere in the training data.

With this approach to strict beam width, the generation of entire measures at once becomes very fast. Now the generation of entire transcriptions takes only a few seconds with a beam width of 10. Here’s some example outputs created with the same beam width of $\beta=10$:   One thing we notice is that when we seed folkrnn v2 with a particular meter and mode and sample with small beam widths, it generates quite similar transcriptions each time even though the random seed is different. Here are the results from four different random seeds using 2/4 meter and major mode: Here’s the same for a 4/4 meter and major mode: It’s interesting that the melodic contours are very similar. Increasing the beam width introduces more variety in the outputs with the change of the random seed initialization.

Other variations of beam search are possible. For instance, we can restrict searching branches from the root that are within particular pitch classes.