Making sense of the folk-rnn v2 model, part 1

This holiday break gives me time to dig into the music transcription models that we trained and used this year, and figure out what those ~6,000,000 parameters actually mean. In particular, I’ll be looking at the “version 2” model, which generated all of the 30,000 transcriptions compiled in ten volumes, and which helped compose a large amount of new music, including such greats as X:488 (“The Mainframe March”):

And the three tunes used by Úna Monaghan in her composition, “Safe Houses” for concertina and tape:

And the 19 tunes used by Luca Turchet in his composition “Dialogues with folk-rnn” for smart mandolin:

The folk-rnn v2 model consists of four layers: three LSTM layers and a fully connected output layer. We input vectors of dimension 137 and the model spits out vectors of the same dimension. Each dimension of the input and output vectors has a clear reference to an element in the model vocabulary. In the case of the folk-rnn v2 model, the vocabulary is a set of ABC-notation-like tokens, e.g., “=C” and “>2” and “(3”. (This representation is one big difference from folk-rnn v1, which is essentially char-rnn. The other is that v2 is implemented in Python, not Lua.) The relationship between dimension and token is given by the following list:

{“0”: “d'”, “1”: “=A,”, “2”: “^c'”, “3”: “=e”, “4”: “=d”, “5”: “=g”, “6”: “=f”, “7”: “=a”, “8”: “=c”, “9”: “=b”, “10”: “_G”, “11”: “_E”, “12”: “_D”, “13”: “_C”, “14”: “_B”, “15”: “_A”, “16”: “2<“, “17”: “2>”, “18”: “=E”, “19”: “=D”, “20”: “_B,”, “21”: “=F”, “22”: “=A”, “23”: “4”, “24”: “=C”, “25”: “=B”, “26”: “_g”, “27”: “8”, “28”: “_e”, “29”: “_d”, “30”: “_c”, “31”: “<“, “32”: “_a”, “33”: “(9”, “34”: “|2”, “35”: “D”, “36”: “|1”, “37”: “(2”, “38”: “(3”, “39”: “|:”, “40”: “(7”, “41”: “(4”, “42”: “(5”, “43”: “:|”, “44”: “9”, “45”: “3/2”, “46”: “3/4”, “47”: “=f'”, “48”: “2”, “49”: “d”, “50”: “_E,”, “51”: “B,”, “52”: “16”, “53”: “|”, “54”: “^A,”, “55”: “b'”, “56”: “_e'”, “57”: “M:9/8”, “58”: “E,”, “59”: “</s>”, “60”: “3”, “61”: “7”, “62”: “^F,”, “63”: “=G,”, “64”: “C”, “65”: “G”, “66”: “e'”, “67”: “_d'”, “68”: “^f'”, “69”: “[“, “70”: “b”, “71”: “c”, “72”: “z”, “73”: “g”, “74”: “^G,”, “75”: “=F,”, “76”: “K:Cmin”, “77”: “K:Cmix”, “78”: “=c'”, “79”: “C,”, “80”: “<s>”, “81”: “]”, “82”: “=G”, “83”: “M:12/8”, “84”: “6”, “85”: “=E,”, “86”: “K:Cmaj”, “87”: “>”, “88”: “B”, “89”: “F”, “90”: “c'”, “91”: “^C,”, “92”: “5/2”, “93”: “G,”, “94”: “f”, “95”: “=e'”, “96”: “_b”, “97”: “_A,”, “98”: “F,”, “99”: “/2>”, “100”: “/2<“, “101”: “f'”, “102”: “M:6/8”, “103”: “4>”, “104”: “M:4/4”, “105”: “A,”, “106”: “M:2/4”, “107”: “=C,”, “108”: “5”, “109”: “M:3/4”, “110”: “12”, “111”: “M:3/2”, “112”: “K:Cdor”, “113”: “A”, “114”: “E”, “115”: “a'”, “116”: “(6”, “117”: “^A”, “118”: “^C”, “119”: “^D”, “120”: “^F”, “121”: “^G”, “122”: “a”, “123”: “g'”, “124”: “D,”, “125”: “/4”, “126”: “e”, “127”: “/3”, “128”: “7/2”, “129”: “=B,”, “130”: “/8”, “131”: “^a”, “132”: “^c”, “133”: “^d”, “134”: “/2”, “135”: “^f”, “136”: “^g”}

So, the 105th dimension (index 104 in the zero-based list above) corresponds to the token “M:4/4”, which signifies a transcription meter of 4/4.
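To make the dimension-to-token relationship concrete, here is a minimal sketch of how a token could be turned into a model input. It uses only a handful of entries copied from the list above; the names `idx2token`, `token2idx` and `one_hot` are my own for illustration and are not taken from the folk-rnn code.

```python
import numpy as np

# A few entries copied from the dimension-to-token list above
# (not the full 137-token vocabulary).
idx2token = {104: "M:4/4", 86: "K:Cmaj", 53: "|", 80: "<s>", 59: "</s>"}
token2idx = {token: idx for idx, token in idx2token.items()}

VOCAB_SIZE = 137


def one_hot(token):
    """Return the standard basis vector of R^137 that represents `token`."""
    x = np.zeros(VOCAB_SIZE)
    x[token2idx[token]] = 1.0
    return x


x = one_hot("M:4/4")
# Zero-based index 104, i.e. the 105th dimension when counting from one.
print(int(np.argmax(x)))
```

Going the other way, `idx2token[int(np.argmax(x))]` recovers the token from the vector.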

The first three layers, the LSTM layers, transform their inputs by a specific iterative process described below. The output layer is a softmax layer, which maps the output of the last LSTM layer to a point on the positive face of the \ell_1 unit ball. We interpret this output as a probability distribution over the vocabulary. We generate tokens by sampling from this distribution. Let’s get more formal.

Denote the sequence index by t. We input into the v2 model a standard basis vector {\bf x}_t of \mathbb{R}^{137}, i.e., a 137-dimensional vector that has a 1 in exactly one row and 0 in all the others. First, {\bf x}_t is transformed by the first LSTM layer according to the following algorithm:

{\bf i}_t^{(1)} \leftarrow \sigma({\bf W}_{xi}^{(1)}{\bf x}_t + {\bf W}_{hi}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_i^{(1)})
{\bf f}_t^{(1)} \leftarrow \sigma({\bf W}_{xf}^{(1)}{\bf x}_t + {\bf W}_{hf}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_f^{(1)})
{\bf c}_t^{(1)} \leftarrow {\bf f}_t^{(1)}\odot{\bf c}_{t-1}^{(1)} + {\bf i}_t^{(1)} \odot \tanh({\bf W}_{xc}^{(1)}{\bf x}_t + {\bf W}_{hc}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_c^{(1)})
{\bf o}_t^{(1)} \leftarrow \sigma({\bf W}_{xo}^{(1)}{\bf x}_t + {\bf W}_{ho}^{(1)}{\bf h}_{t-1}^{(1)} +{\bf b}_o^{(1)})
{\bf h}_t^{(1)} \leftarrow {\bf o}_t^{(1)}\odot \tanh({\bf c}_{t}^{(1)})

where the function \sigma() is the sigmoid applied elementwise to its argument, and \odot is an elementwise multiplication. The subscript * stands for any of the four labels i, f, c and o. The vectors {\bf h}_{t-1}^{(1)} and {\bf c}_{t-1}^{(1)} are initialised at the beginning of the process, and updated throughout. Each matrix {\bf W}_{x*}^{(1)} maps the input vector in \mathbb{R}^{137} to a vector in \mathbb{R}^{512}. Each matrix {\bf W}_{h*}^{(1)} rotates and scales the previous state vector {\bf h}_{t-1}^{(1)} in \mathbb{R}^{512}, and each bias vector {\bf b}_*^{(1)} pushes the linear combination {\bf W}_{x*}^{(1)}{\bf x}_t + {\bf W}_{h*}^{(1)}{\bf h}_{t-1}^{(1)} to a new origin in \mathbb{R}^{512}.
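The five update equations can be sketched in NumPy as follows. This is only an illustration of the algorithm as written above, not the folk-rnn implementation: the weights are random placeholders rather than the trained parameters, the zero initial states are an assumption, and the function and variable names are mine.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM update. W_x, W_h and b are dicts keyed by 'i', 'f', 'c', 'o'."""
    i = sigmoid(W_x["i"] @ x + W_h["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W_x["f"] @ x + W_h["f"] @ h_prev + b["f"])  # forget gate
    c = f * c_prev + i * np.tanh(W_x["c"] @ x + W_h["c"] @ h_prev + b["c"])
    o = sigmoid(W_x["o"] @ x + W_h["o"] @ h_prev + b["o"])  # output gate
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
n_in, n_hid = 137, 512

# Random placeholder parameters (assumption: NOT the trained folk-rnn values).
W_x = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in "ifco"}
W_h = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in "ifco"}
b = {g: np.zeros(n_hid) for g in "ifco"}

x = np.zeros(n_in)
x[104] = 1.0  # one-hot input for the token "M:4/4"

# Assumption: both state vectors start at zero.
h, c = lstm_step(x, np.zeros(n_hid), np.zeros(n_hid), W_x, W_h, b)
print(h.shape, c.shape)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of `h` has magnitude below 1.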

Next, {\bf h}_t^{(1)} is transformed by the 2nd LSTM layer according to the same algorithm as above, but with different parameters. Then {\bf h}_t^{(2)} is transformed by the 3rd LSTM layer according to the same algorithm, but with different parameters again.

Finally, the last layer takes {\bf h}_t^{(3)} from the third LSTM layer, and computes

{\bf p}_t \leftarrow \textrm{softmax}({\bf V}{\bf h}_t^{(3)} + {\bf v})

This last layer maps every point in the \ell_\infty unit ball of \mathbb{R}^{512} to a point on the positive face of the \ell_1 unit ball in \mathbb{R}^{137}. We can see this as “decoding” the third layer activation {\bf h}_t^{(3)}, where {\bf p}_t describes a multinomial probability distribution over the sample space of the vocabulary above.
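As a rough sketch of this decoding step, the snippet below applies a softmax to an affine map of a stand-in activation and then samples a token index from the result. The activation, {\bf V} and {\bf v} here are random placeholders (an assumption, not the trained values), and the names are mine.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)

# Stand-in for the third-layer activation: entries in (-1, 1), like o * tanh(c).
h3 = np.tanh(rng.standard_normal(512))

# Random placeholders for the output layer parameters (assumption).
V = 0.1 * rng.standard_normal((137, 512))
v = np.zeros(137)

p = softmax(V @ h3 + v)           # non-negative entries that sum to one
next_index = rng.choice(137, p=p) # sample a token index from the distribution
print(p.sum(), next_index)
```

In generation, the sampled index would be mapped back to its token, one-hot encoded, and fed in as the next input.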

Training the folk-rnn v2 model means finding values for all of these parameters that minimise a cost on a training dataset, while keeping an eye on the same cost on a validation dataset. All the parameters are: the four 512×137 matrices {\bf W}_{x*}^{(1)}, the eight 512×512 matrices {\bf W}_{x*}^{(2,3)}, the twelve 512×512 matrices {\bf W}_{h*}^{(1,2,3)}, the twelve 512-dimensional vectors {\bf b}_*^{(1,2,3)}, and finally the 137×512 matrix {\bf V} and 137-dimensional vector {\bf v}. That’s 5,599,881 numbers, exactly the kind of thing that enthrals computers around the world.
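The total can be checked with a few lines of arithmetic. The variable names are mine; the sizes are the ones listed in the paragraph above.

```python
n_vocab, n_hidden, n_gates, n_layers = 137, 512, 4, 3

input_weights_layer1 = n_gates * n_vocab * n_hidden          # W_x* of layer 1
input_weights_layers23 = 2 * n_gates * n_hidden * n_hidden   # W_x* of layers 2 and 3
recurrent_weights = n_layers * n_gates * n_hidden * n_hidden # W_h* of all layers
biases = n_layers * n_gates * n_hidden                       # b_* of all layers
output_layer = n_hidden * n_vocab + n_vocab                  # V and v

total = (input_weights_layer1 + input_weights_layers23
         + recurrent_weights + biases + output_layer)
print(total)  # 5599881
```

Over 90% of the parameters sit in the 512×512 matrices of the three LSTM layers.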

Using dropout, learning rate decay, minibatches, and whatnot (see the configuration file), we trained the folk-rnn v2 model on subsets of about 23,000 transcriptions over 35,000 training iterations, or 100 epochs. The parameters of the model are adjusted little by little to minimise the negative log loss. The log loss for the t-th symbol of a transcription is the log of the corresponding element in the predicted posterior {\bf p}_t. Let’s denote the dimension of that symbol as S(t), and its conditional probability as [{\bf p}_t]_{S(t)}. The negative log loss for an entire transcription of length T is given by

N(S) := -\frac{1}{T} \sum_{t=1}^{T} \log ( [{\bf p}_t]_{S(t)} )

The mean negative log loss is the above averaged over a batch of transcriptions. Below we see the mean negative log loss as the training proceeded for the folk-rnn v2 model in a training dataset and validation dataset. Clearly, the model did not improve much after about 30 epochs since the validation loss stays nearly the same… which might have more to do with the learning rate decaying after 20 epochs.
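For concreteness, here is a small sketch of the loss computation described above, averaging over the T symbols of a transcription and then over a batch. The predictions are a made-up uniform distribution and the target sequence is purely illustrative, so the numbers say nothing about the trained model; the function name is mine.

```python
import numpy as np


def transcription_loss(P, S):
    """Negative log loss of one transcription.

    P: array of shape (T, 137), row t holding the predicted distribution p_t.
    S: integer array of length T, S[t] being the dimension of the t-th symbol.
    """
    T = len(S)
    return -np.mean(np.log(P[np.arange(T), S]))


# Hypothetical example: a model that always predicts the uniform distribution.
T = 5
P = np.full((T, 137), 1.0 / 137)
S = np.array([80, 104, 86, 53, 59])  # <s>, M:4/4, K:Cmaj, |, </s>

loss = transcription_loss(P, S)

# The mean negative log loss averages the per-transcription losses over a batch.
batch_loss = np.mean([transcription_loss(P, S), transcription_loss(P, S)])
print(loss, batch_loss)
```

A uniform predictor scores log(137) ≈ 4.92, which gives a reference point: any model that has learned something should come in below that.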


Anyhow, it’s high time to pick this model apart and understand how it works. What do its parameters mean? How do they relate to the qualities of the transcriptions it generates? We are particularly interested in examining its characteristics and explaining some of its apparent abilities:

  1. How is the model generating bar symbols (“|”, “|:”, “:|”, “|1”, and “|2”) at correct positions in sequences? (This “ability to count” disappears when we nudge the model in directions that should be completely reasonable for any system that has learned to count in terms of this ABC-like representation.)
  2. How is it repeating and varying material to create plausible long-term structures in the transcriptions (e.g., AABB with 8 bars for each part)?
  3. How is it generating cadences at the right places?
  4. Has it learned relationships between equivalent tokens, e.g., enharmonic pitches like “_C” and “B” or “^f” and “_g”?


Stay tuned!

9 thoughts on “Making sense of the folk-rnn v2 model, part 1”

  1. Thanks for the description! Do you have plans on how you would go ahead and answer those questions? Maybe looking at the weights and activations? Don’t know if the LSTMVis project would be helpful in this case…


