- folk-rnn v1: Basically Andrej Karpathy’s char-rnn applied to 23,000+ transcriptions from thesession.org. (It learned to generate titles too, which is absolutely charming.)
- folk-rnn v2: A theano- and lasagne-based implementation using a tokenised representation of the transcriptions, trained with all transcriptions transposed to have the root C.
- folk-rnn v3: v2 with a few changes to the tokenisation, trained on the 23,000+ transcriptions with the root C plus the same transcriptions transposed up a half step to have the root C# (a dataset of over 46,000 transcriptions).
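To make the difference between a character-level and a tokenised representation concrete, here is a minimal sketch. The regular expression and the name `tokenise` are assumptions made up for illustration; this is not the actual folk-rnn tokeniser, and the real vocabulary may split symbols differently.

```python
import re

# Hypothetical token pattern for a small ABC-like subset.
# NOT the actual folk-rnn tokeniser; for illustration only.
TOKEN_RE = re.compile(
    r"""
      [MK]:\S+                 # header fields such as M:4/4 or K:Cmaj
    | \|: | :\| | \|[12] | \|  # repeats, numbered endings, bar lines
    | \(\d                     # tuplet markers such as (3
    | [=^_]?[A-Ga-g][,']*      # pitch with optional accidental and octave marks
    | /?\d+ | /                # duration modifiers such as 2, /2, /
    | [<>]                     # broken rhythm
    """,
    re.VERBOSE,
)


def tokenise(transcription):
    """Split a transcription into tokens; whitespace is skipped."""
    return TOKEN_RE.findall(transcription)


tune = "M:4/4 K:Cmaj |:G2AB c2BA|G4:|"
tokens = tokenise(tune)
print(len(tune), "characters vs", len(tokens), "tokens")
print(tokens)
```

A character-level model has to learn that `K`, `:`, `C`, `m`, `a`, `j` belong together; in a tokenised representation the mode is a single symbol.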
In addition, and for comparison purposes, we have created the Endless magenta Traditional Music Session. All 6000+ tunes here are generated by a magenta system (basic_rnn) that has the same architecture as the folk-rnn versions (three hidden layers of 512 LSTM units). It is trained on the same 46,000+ transcriptions as folk-rnn v3 (but in MIDI representation), with the same settings as folk-rnn v3 (dropout, batch size, comparable learning and decay rates and decay steps), for 80,000 “Global Steps” (roughly comparable to the 100 epochs of folk-rnn v3). We then use the trained magenta model to generate 500 new tunes of 300 MIDI events each, starting from each initial MIDI pitch from 60 to 72. We synthesize the resulting 6400+ output MIDI files using the same scripts as the other folk-rnn tunes.
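The generation plan described above can be laid out as a simple grid of jobs. This is only a sketch of the bookkeeping, not the magenta command line; the variable names are made up, and it assumes the pitch range 60 to 72 is inclusive at both ends.

```python
# Assumption: priming pitches 60..72 inclusive, i.e. 13 pitches.
primer_pitches = list(range(60, 73))
tunes_per_pitch = 500
events_per_tune = 300

# One (primer pitch, tune index) pair per requested output file.
jobs = [(pitch, i) for pitch in primer_pitches for i in range(tunes_per_pitch)]

print(len(primer_pitches), "primer pitches")
print(len(jobs), "tunes requested")
print(len(jobs) * events_per_tune, "MIDI events in total")
```

Under that assumption the grid holds 6,500 requested tunes, an upper bound on the number of output files.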
We don’t pick our favourites in any of these sessions. We just upload it all as it comes.
It is interesting to compare the results of the two systems by listening, knowing that by and large the only difference between them is the data representation. The benefits of the textual abstraction of these transcriptions (essentially recording events rather than slices of time) really come through here. If the magenta model quantises MIDI events every sixteenth note, the number of MIDI events the model must learn to reproduce is around 500 for a typical Irish reel (32 bars of 4 quarter notes each, so 32 × 16 = 512 sixteenth-note steps). For the same reel, the number of tokens a folk-rnn model must learn to reproduce is around 100.
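The sequence-length arithmetic can be checked in a few lines. The function name `time_slice_length` is made up for this sketch, and the figure of 100 tokens is simply the rough estimate quoted above, not something computed here.

```python
def time_slice_length(bars, quarters_per_bar=4, steps_per_quarter=4):
    """Number of time steps when every sixteenth note is one step."""
    return bars * quarters_per_bar * steps_per_quarter


reel_steps = time_slice_length(32)  # a typical 32-bar reel
folk_rnn_tokens = 100               # rough figure quoted in the text

print(reel_steps, "time steps for the time-sliced representation")
print(reel_steps / folk_rnn_tokens, "times longer than the token sequence")
```

In other words, the time-sliced model has to carry structure across a sequence roughly five times longer to cover the same tune.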
It seems that both the magenta and folk-rnn models capture local aspects of this style, but the magenta model doesn’t produce longer scale structures. Its melodies just wander without much direction, without meaningful pauses, and rarely end with cadences. The magenta tunes rarely feature repetition and variation. The AABB structure common to a lot of this music is entirely missing from magenta tunes. On the other hand, folk-rnn seems to be producing longer scale structures, and generating many tunes that have a clear AABB structure. The melodies sound more musical, with shapes and figures that are common in this kind of music.
Working with the two models is also different. The folk-rnn model explicitly specifies meter and mode, and generates bar lines, first and second ending symbols, repeats, etc. It ends each tune when it generates a specific transcription ending symbol. The magenta model, however, does not specify meter or mode, and it must be told how many time steps of events to generate. If we are to work with material generated by our magenta model, we have to identify its key (relatively easy) and meter (relatively hard).
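The two stopping rules can be contrasted with a toy sketch. Everything here is hypothetical: `END_TOKEN`, the function names, and the `scripted` stand-in (which just replays a fixed sequence in place of a trained network) are assumptions for illustration, not code from folk-rnn or magenta.

```python
END_TOKEN = "</s>"  # hypothetical end-of-tune symbol


def generate_until_end(sample_next, max_tokens=1000):
    """Token-model style: stop when the model emits the end symbol."""
    tokens = []
    while len(tokens) < max_tokens:  # guard against a model that never stops
        token = sample_next(tokens)
        if token == END_TOKEN:
            break
        tokens.append(token)
    return tokens


def generate_fixed_steps(sample_next, num_steps):
    """Time-slice style: the caller decides how many steps to produce."""
    events = []
    for _ in range(num_steps):
        events.append(sample_next(events))
    return events


def scripted(sequence):
    """Stand-in for a trained model: replays a fixed sequence, looping."""
    def sample_next(history):
        return sequence[len(history) % len(sequence)]
    return sample_next


tune = generate_until_end(
    scripted(["M:4/4", "K:Cmaj", "|:", "G", "2", ":|", END_TOKEN])
)
steps = generate_fixed_steps(scripted([60, -2, 62, -2]), num_steps=6)
print(tune)
print(steps)
```

In the first case the output carries its own meter, mode, and ending; in the second the output is cut wherever the requested step count runs out, which is why key and meter have to be inferred afterwards.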