This is part 11 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10! Findings from these have appeared in my MuMu 2018 paper, “What do these 5,599,881 parameters mean? An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of the softmax layer”, and my presentation at the Joint Workshop on Machine Learning for Music ICML 2018: “How Stuff Works: LSTM Model of Folk Music Transcriptions” (slides here). There’s still a lot that can be done here to pick apart and understand what is going on in the model. It’s a fascinating exercise, trying to interpret this complex model. In this part, I start looking at beam search as an alternative approach to sampling from the model.
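Let’s briefly return to what folkrnn v2 really is: a model of a joint probability mass distribution $P_{X_1, X_2, \ldots}(x_1, x_2, \ldots)$, where $X_t$ is a random variable at step $t$, and the vocabulary $\mathcal{V}$ is a set of ABC-like tokens. This function can be equivalently written by the chain rule:

$$P_{X_1, X_2, \ldots}(x_1, x_2, \ldots) = P_{X_1}(x_1)\, P_{X_2|X_1}(x_2|x_1)\, P_{X_3|X_1,X_2}(x_3|x_1,x_2) \cdots$$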
Training a folkrnn model involves adjusting its parameters to maximize each one of these conditional probabilities for real sequences sampled from a dataset. When generating new sequences, we merely sample from each estimate of $P_{X_t|X_1,\ldots,X_{t-1}}(x_t|x_1,\ldots,x_{t-1})$, until we have sampled the stop token. This produces, for instance, the following transcription:
Another random sampling will give a different sequence. This is how folkrnn.org is implemented, one token sampled at a time from a dynamic posterior distribution over the vocabulary.
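The token-by-token procedure can be sketched in a few lines of Python. This is only an illustration: `next_token_probs` is a hypothetical stand-in for the trained network, the six-token vocabulary is a toy rather than the real 137 tokens, and none of the names below come from the folk-rnn code base.

```python
import random

# Toy vocabulary (an assumption for illustration; the real one has 137 tokens).
VOCAB = ["M:6/8", "K:Cmaj", "A", "B", "|", "</s>"]
STOP = "</s>"

def next_token_probs(history):
    """Hypothetical stand-in for the model: P(x_t | x_1, ..., x_{t-1}) over VOCAB."""
    # Weights change with the history so the distribution is "dynamic".
    weights = [1.0 + ((len(history) + i) % 3) for i in range(len(VOCAB))]
    # Make stopping more likely as the sequence grows.
    weights[VOCAB.index(STOP)] += 0.5 * len(history)
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(rng, max_len=50):
    """Sample one token at a time until the stop token (or max_len) is reached."""
    history = []
    while len(history) < max_len:
        probs = next_token_probs(history)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        history.append(token)
        if token == STOP:
            break
    return history

tune = sample_sequence(random.Random(0))
```

Changing the seed passed to `random.Random` gives a different sequence, which mirrors the behaviour described above.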
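We can factor the joint probability distribution another way, however:

$$P_{X_1, X_2, \ldots}(x_1, x_2, \ldots) = P_{X_1,X_2}(x_1,x_2)\, P_{X_3,X_4|X_1,X_2}(x_3,x_4|x_1,x_2)\, P_{X_5,X_6|X_1,\ldots,X_4}(x_5,x_6|x_1,\ldots,x_4) \cdots$$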
These are distributions of pairs of random variables. While folkrnn v2 was not explicitly trained to maximize these for real sequences from the training data, I believe that the use of teacher forcing means it was equivalently trained to maximize these joint probabilities. This factorization also shows a different way to generate sequences using folkrnn v2, and one that can be argued to consider more context in generation:
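- compute $P_{X_1}(x_1)$ for each $x_1 \in \mathcal{V}$
- compute $P_{X_2|X_1}(x_2|x_1)$ for each $(x_1, x_2) \in \mathcal{V} \times \mathcal{V}$, and from it the joint probability $P_{X_1,X_2}(x_1,x_2) = P_{X_2|X_1}(x_2|x_1)\, P_{X_1}(x_1)$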
- sample a pair of tokens from this distribution
- update the states in the model with the sampled tokens, and repeat the above until one of the pairs of sampled tokens is the stop token.
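Here is a minimal sketch of the pair-sampling loop above. As before, `next_token_probs` is a hypothetical stand-in for the network and the vocabulary is a toy; the sketch also assumes that when the first token of a sampled pair is the stop token, the second token is simply discarded.

```python
import random

# Toy vocabulary (an assumption for illustration; the real one has 137 tokens).
VOCAB = ["M:6/8", "K:Cmaj", "A", "B", "|", "</s>"]
STOP = "</s>"

def next_token_probs(history):
    """Hypothetical stand-in for the model: maps each token to P(token | history)."""
    weights = {tok: 1.0 + ((len(history) + 2 * i) % 5) for i, tok in enumerate(VOCAB)}
    weights[STOP] += 0.5 * len(history)  # stopping gets likelier over time
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def pair_distribution(history):
    """Joint distribution over the next two tokens: P(x1 | h) * P(x2 | h, x1)."""
    first = next_token_probs(history)
    joint = {}
    for x1 in VOCAB:
        second = next_token_probs(history + [x1])
        for x2 in VOCAB:
            joint[(x1, x2)] = first[x1] * second[x2]
    return joint

def sample_by_pairs(rng, max_pairs=25):
    """Sample two tokens at a time until a sampled pair contains the stop token."""
    history = []
    for _ in range(max_pairs):
        joint = pair_distribution(history)
        pairs = list(joint)
        x1, x2 = rng.choices(pairs, weights=[joint[p] for p in pairs], k=1)[0]
        if x1 == STOP:
            history.append(x1)  # assumption: drop whatever followed the stop token
            break
        history.extend([x1, x2])
        if x2 == STOP:
            break
    return history

tune = sample_by_pairs(random.Random(0))
```

Each call to `pair_distribution` needs one forward pass per candidate first token, which is where the extra cost of this approach comes from.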
In this situation, we are computing probability distributions over the sample space $\mathcal{V} \times \mathcal{V}$. For a vocabulary of 137 tokens, this has 18,769 outcomes. We don’t need to stop at pairs of tokens, because we can also factor the joint probability distribution using tuples of size 3, which leads to a sample space of 2,571,353 outcomes. Pretty quickly, however, our sample space grows larger than trillions of outcomes … so there’s a limit here given the time I have left to live. We can, however, bring things under control by approximating these joint probability distributions.
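The growth is easy to check with a few lines of arithmetic. The names `VOCAB_SIZE` and `sample_space_size` are just labels chosen for this sketch.

```python
VOCAB_SIZE = 137  # vocabulary size quoted above

def sample_space_size(n):
    """Number of distinct n-tuples of tokens."""
    return VOCAB_SIZE ** n

# Sizes for tuples of length 1 through 6.
sizes = {n: sample_space_size(n) for n in range(1, 7)}

# Smallest tuple size whose sample space exceeds one trillion outcomes.
first_over_trillion = next(n for n in sizes if sizes[n] > 10**12)
```

With 137 tokens, the sample space passes a trillion outcomes at tuples of size 6.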
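Let’s think of this computational procedure as building a tree. For the general case of $n$-tuples of tokens, we create the first branches by computing $P_{X_1}(x_1)$ for each $x_1 \in \mathcal{V}$. From each of those branches we create more branches by computing $P_{X_2|X_1}(x_2|x_1)$. We continue in this way to a depth of $n$ and create all the leaves of the tree by computing, for each leaf, the product of all values on the branches above it:

$$P_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = P_{X_1}(x_1)\, P_{X_2|X_1}(x_2|x_1) \cdots P_{X_n|X_1,\ldots,X_{n-1}}(x_n|x_1,\ldots,x_{n-1})$$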
Thinking of the procedure in this way motivates a question: how well can we estimate $P_{X_1,\ldots,X_n}(x_1,\ldots,x_n)$ by searching fewer than $137^n$ leaves? We know from the expression above that we are multiplying conditional probabilities, so if at some depth one of those is zero or very small compared to others, we might as well trim that branch then and there. Another possibility is more strict: at each depth, follow only the $\beta$ branches out of each node that have the greatest probability. In this case, computing the probability distribution for tuples of size 3 requires computing $\beta^3$ leaves instead of $137^3$. Then after having computed those leaves, we ignore all others and scale the joint probability mass distribution to sum to unity. That strategy is known as beam search. Using a beam width of $\beta = 137$ computes all of the leaves at any depth. Making the width smaller saves computation, but at the price of a poorer estimate of the joint probability distribution. Let’s see how sampling from folkrnn v2 in this way performs.
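A rough sketch of the pruned tree follows, with the beam width called `beta` and the tuple size called `n`. It takes the reading in which every node keeps its `beta` most probable children, so a full-width search visits every leaf; the more common formulation of beam search instead keeps `beta` partial sequences in total at each depth. The model `next_token_probs` is again a hypothetical stand-in with a toy vocabulary, and stop tokens inside a tuple are not treated specially.

```python
# Toy vocabulary (an assumption for illustration; the real one has 137 tokens).
VOCAB = ["M:6/8", "K:Cmaj", "A", "B", "|", "</s>"]

def next_token_probs(history):
    """Hypothetical stand-in for the model: maps each token to P(token | history)."""
    weights = {tok: 1.0 + ((len(history) + 2 * i) % 5) for i, tok in enumerate(VOCAB)}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def pruned_joint(history, n, beta):
    """Estimate the joint distribution of the next n tokens on a pruned tree.

    Every node keeps only its beta most probable children, so the tree has at
    most beta**n leaves. The leaf probabilities are rescaled to sum to one.
    """
    leaves = {(): 1.0}  # maps a tuple of tokens to the product of its branch values
    for _ in range(n):
        expanded = {}
        for prefix, p_prefix in leaves.items():
            probs = next_token_probs(history + list(prefix))
            best = sorted(probs, key=probs.get, reverse=True)[:beta]
            for token in best:
                expanded[prefix + (token,)] = p_prefix * probs[token]
        leaves = expanded
    total = sum(leaves.values())
    return {tup: p / total for tup, p in leaves.items()}

full = pruned_joint([], n=2, beta=len(VOCAB))  # full width: every pair is a leaf
narrow = pruned_joint([], n=3, beta=2)         # narrow beam: 2**3 leaves
```

Sampling a tuple from the returned dictionary, appending it to the history, and repeating gives the generation procedure used below.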
The following transcription was generated using tuples of size $n = 2$, and beam width of $\beta = 137$ (meaning we compute the probability distribution over the entire sample space $\mathcal{V} \times \mathcal{V}$):

Though we use the same random initialization as “Why are you …” above, this produces a dramatically different transcription. The first pair of tokens selected is the meter and mode, both of which differ from before. Repeat bar lines are missing at the end of the two parts, but as a whole the tune holds together with repetition and variation of simple ideas, and cadences at the right places. Here I play it at a slow pace on the Mean Green Machine Folk Machine, imposing an AABB structure:
Here’s another transcription produced with a different random seed initialization.
There are no obvious mistakes and it doesn’t appear to be copied from the training data. And it’s a nice tune, playing well as a hornpipe. Here it is on Mean Green Machine Folk Machine:
Here’s a jig created by seeding the network with a 6/8 meter token:
Another good tune that appears to be novel. Here it is on Mean Green Machine Folk Machine:
In the next part we will look at making .