It was just a matter of time before someone retrained the massive GPT-2 language model (up to 1.5 billion parameters) on folk music. That's what Gwern Branwen has done, as described in this comment thread. I'm not sure which model size he used, but for training he seems to have used a concatenation of four folk-rnn datasets. In this post I want to analyze some of the samples generated by the resulting model, to help me determine whether they differ much in quality from transcriptions generated by folk-rnn models (100,000 examples here).
The file Gwern links to contains 12929 lines. I will generate 5 random numbers in [1, 12929] and then analyze the transcriptions closest to those line numbers.
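This sampling procedure can be sketched in a few lines of Python. The sketch is mine, not Gwern's code; it assumes each transcription in the file begins with an "X:" header line, and extracts the tune whose header is nearest to each sampled line number:

```python
import random

def sample_transcriptions(path, n=5, seed=None):
    """Pick n random line numbers; return the transcription nearest each one."""
    with open(path) as f:
        lines = f.read().splitlines()
    # each tune is assumed to start with an "X:" reference-number field
    starts = [i for i, line in enumerate(lines) if line.startswith("X:")]
    rng = random.Random(seed)
    picks = sorted(rng.sample(range(1, len(lines) + 1), n))  # 1-indexed
    tunes = []
    for p in picks:
        # "X:" header closest to the sampled line number
        k = min(range(len(starts)), key=lambda j: abs(starts[j] - (p - 1)))
        lo = starts[k]
        hi = starts[k + 1] if k + 1 < len(starts) else len(lines)
        tunes.append((p, "\n".join(lines[lo:hi]).strip()))
    return tunes
```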
Here’s the transcription closest to line number 2881:
X: 47015
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
BA|GFGA BGGB|dBGB AGFA|GFGA BGBc|deed B2 BA|
GFGA BGGB|dBGB AGFA|GFGA BGBc|defd B2 Bc||
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 Bc|
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 dc||
Here is the tune notated:
The transcription appears to be a novel one. The training data has a tune and settings with the same name, but these are quite different from the above. This generated tune has a conventional AABB structure (assuming implicit repeat signs). Within each part there is some repetition and variation consistent with the style. The B part goes higher in register than the A part, which is also stylistically consistent. Both parts lack a cadence, however, sticking to the dominant. It is very easy to play on the accordion:
Overall, this is a boring tune since there is too much repetition and not enough variation. But I would say it is plausible since there is nothing clearly wrong with it.
The two transcriptions that follow the above in the generated file are given the same title by the model:
X: 47016
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bd d2 |
|g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d2 (3Bcd |
g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d3c ||

X: 47017
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bcdf |
g2 gf gdBd | efge fgaf | gagf gabg | afge d2 Bc |
g2 gf gdBd | efge fgaf | gagf gabg | afge d3c ||
This behavior likely comes from the fact that in the dataset we created from thesession.org, many tunes have "settings", which are variations of a tune, so titles repeat while the tunes vary. Let's look at how these three transcriptions relate by notating them as three voices:
We can see the model has actually generated two reasonable variations of the tune. This is very impressive behavior! Neither setting, however, improves on the first.
Here’s the generated transcription closest to line number 8623:
X: 33166
T: Breakfast Epiphanies
M: 4/4
L: 1/8
K: Dmaj
(3def|:gecA G2FE|DFAF GBAG|FDAD BDAB|FAAG (3ABA (3def|
gecA G2FE|DFAF GBAG|FDAD BDAB|1FGFD D2 (3def:|2FGFD D2EF||
Geed cAA2|dedA BAGB|Ad~d2 ed (3Bcd|(3efe (3dcB A2cd|
ea~a2 bgaf|gedB cAGB|Ad~d2 ed (3Bcd|1(3efe d2 d2 (3ABc:|2(3efe d2 d2|
I love that title, but it actually appears in the training data. Here’s the dots for the generated transcription:
This tune again has an AABB structure, with the B part going higher in register than the A part. We see variation and repetition. The A part starts in a very interesting way, moving V to IV to I, but it has too much repetition. The B part frankly doesn't make much sense, and doesn't relate to the A part. I find it so strange that I'm not going to bother learning to play this tune. I would say this transcription is not plausible.
Here’s the generated transcription closest to line number 7798:
X: 47628
T: Shoe The Donkey
M: 2/4
L: 1/8
K: Gmaj
|: G/ |B>c BA | AG E/F/G/A/ | B>c BA | d/e/d/B/ G>G |
B>c BA | AG E/F/G/A/ | Bd ef | gB G3/ :|
|: d/ |Bd dB | ce ed/c/ | Bd c/B/A/G/ | AG E2 |
Bd dB | ce e/f/e/d/ | Bd ef | gB G3/ :|
The model has again plagiarized this name from tunes in the training data, but it has created a good, coherent polka.
The B part resembles a part in The Banks of Inverness. The sixteenth-note pickup is kind of strange. The fact that the A and B parts end in the same way lends a nice coherence. And it's pretty easy to play on the accordion:
The only changes I would make are to the pickups: raising the G to an A before bar 1 and at the end of bar 4, and lowering the D to a C before bar 9.
Here’s the generated transcription closest to line number 1187:
X: 97541
T:The Victory
M:4/4
K:Cmaj
|: E G | c G E G F A d c | B d g f e c G E | F E D C B, D G, B, | C E G c B G F E |
D E F G A c B c | d c B A G F E D | C E D F E G c e | (3 d e d c B c 2 :|
|: B c | d G e G f G e G | d G e d c B A G | e c f d g e c A | B c d e d 2 B c |
d G e G f G e G | d G e d c B A G | f e d c B d c A | (3 B A G (3 F E D C 2 :|
Whereas all the transcriptions we have seen so far resemble the training data we used to create the folk-rnn (v1) model, this one resembles the dataset we used to create the second version, in which we transposed and tokenized the transcriptions. Removing the spaces, rebeaming, and transposing to D produces this transcription:
This is different from The Victory in the training data. Again we see an AABB structure. There are cadences, but the one in the A part is unexpected because the part is by and large aimless. The B part is better, and is plausible. The two parts do not relate. I don’t want to spend time learning to play this.
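The detokenizing and transposing steps above can be sketched in Python. This is my own rough sketch, not the folk-rnn preprocessing code run in reverse: it shifts every note up one diatonic step (C major to D major, where the key signature absorbs the sharps) and joins the space-separated tokens back together. Accidentals and stacked octave marks are not handled, and rebeaming is left to do by hand:

```python
# diatonic shift up one step; wrapping B -> c moves into the upper octave
UP = {"C": "D", "D": "E", "E": "F", "F": "G", "G": "A", "A": "B", "B": "c",
      "c": "d", "d": "e", "e": "f", "f": "g", "g": "a", "a": "b", "b": "c'"}

def transpose_token(tok):
    head, rest = tok[0], tok[1:]
    if head not in UP:
        return tok                      # barline, duration digit, tuplet marker...
    if head in ("B", "b") and rest.startswith(","):
        # the wrap B -> c already raises the register, cancelling one
        # octave-down comma: "B," becomes "C", "b," becomes "c"
        return {"B": "C", "b": "c"}[head] + rest[1:]
    return UP[head] + rest

def detokenize_to_d(body):
    """Transpose a tokenized C-major body up to D and remove the spaces."""
    return "".join(transpose_token(t) for t in body.split())

print(detokenize_to_d("|: E G | c G E G F A d c |"))  # -> |:FA|dAFAGBed|
```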
Finally, here’s the transcription at line 7929:
This one is a total failure (and again the model has plagiarized the name). The counting is wrong in all bars except one. The melody doesn’t make a lick of sense.
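Miscounted bars like these can be caught mechanically. Here is a quick-and-dirty checker of my own, not a full ABC parser: with L:1/8 and M:4/4 (which most of these transcriptions use), each bar should sum to 8 unit lengths. Triplets, broken rhythms (> and <), and grace notes are not handled, and pickup bars will be flagged too, so it is only a rough screen:

```python
import re
from fractions import Fraction

# a note: optional accidental, letter or rest, octave marks, then a duration
# written as an integer multiplier and/or a halving slash (e.g. G2, G/, G3/2)
NOTE = re.compile(r"[_^=]?[A-Ga-gz][,']*(\d+)?(/(\d+)?)?")

def bar_lengths(line, units_per_bar=8):
    """Return (bar number, actual length) for every bar that miscounts."""
    bad = []
    for i, bar in enumerate(filter(None, re.split(r":?\|:?", line)), 1):
        total = Fraction(0)
        for m in NOTE.finditer(bar):
            dur = Fraction(int(m.group(1) or 1))
            if m.group(2):                  # "/" halves, "/4" quarters, etc.
                dur /= int(m.group(3) or 2)
            total += dur
        if total != units_per_bar:
            bad.append((i, total))
    return bad
```

For example, a well-formed reel bar like "GFGA BGGB" sums to 8 units and passes, while a bar of only 6 units is reported.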
So of the five transcriptions above, two are plausible, and the polka is actually pretty good! All of the titles GPT-2 generated are plagiarized from the training data, but I haven't found much plagiarism in the tunes themselves.
In a future post I will select five transcriptions at random created by folk-rnn (v2) and perform the same kind of analysis. Will the quality of the transcriptions be as good as these ones created by GPT-2? What is gained by increasing the number of model parameters from millions to hundreds of millions, and using a model pretrained on written English text?