Hello, and welcome to Paper of the Day (Po’D): Recurrent Neural Networks for Music Computation Edition. Today’s paper is a historical one! J. A. Franklin, “Recurrent Neural Networks for Music Computation”, INFORMS Journal on Computing, vol. 18, no. 3, Summer 2006, pp. 321–338.
It has been almost ten years now since Franklin explored the application of recurrent neural networks (RNN) with long short-term memory (LSTM) units to music. Four years before that, Eck and Schmidhuber explored the same idea (D. Eck and J. Schmidhuber, “Learning the long-term structure of the blues,” in Proc. Int. Conf. on Artificial Neural Networks, 2002; D. Eck and J. Schmidhuber, “A first look at music composition using LSTM recurrent neural networks,” tech. rep., Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 2002), which Eck and Lapalme returned to in 2008 (D. Eck and J. Lapalme, “Learning musical structure directly from sequences of music,” tech. rep., University of Montreal, 2008). More than a decade before those, Todd explored recurrent neural networks for composition (P. M. Todd, “A connectionist approach to algorithmic composition,” Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989). Certainly then, by 2006 the application of neural networks to music composition was already old hat, e.g.,
- N. Griffith and P. M. Todd, “Musical networks: Parallel distributed perception and performance,” MIT Press, 1999.
- M. C. Mozer, “Neural network music composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing”, Connection Science, Vol. 6, pp. 247–280, 1994.
- P. M. Todd, “A sequential network design for musical applications,” In D. Touretzky, G. Hinton, and T. Sejnowski (Eds.), Proceedings of the Connectionist Models Summer School, pp. 76-84, 1988.
Franklin focuses on using RNN with LSTM for modeling pitch and duration for “jazz-related tasks”, e.g., “whether recurrent networks can learn a long and cohesive composition and remember earlier motifs and structured song forms, as well as whether it can generate a new one.” Franklin devises new representations for pitch (nine bits), chord (seven bits) and duration (16 bits). Her representation also separates pitch from octave.
Franklin tests RNN with LSTM networks in four tasks:
- “given a dominant 7th chord as input, output in sequence the four chord tones”
- “given each of 14 pairs of five-pitch sequences, output 1 at the end if the second, third, and fourth notes in the sequence are ordered chromatically and otherwise output 0”
- “learn to reproduce one specific 32 note melody of the form AABA, given only the first note as input.”
- having a pitch-focused network and duration-focused network learn a jazz melody.
In the first task, the network (“one cell in each of ten memory blocks”) is fed the same 7-dim vector at each time step, and the network outputs a 7-dim vector at each time step. Franklin codes the pitches via a “circle of thirds” representation (Fig. 7 is the decoder). Chords are made by summing the vectors of each pitch (e.g., [1, 0, 0, 0, 1, 0, 0] for C), and dividing by the largest value. So, for a C7 chord, each input is [0.6, 0, 0.3, 0.3, 0.3, 0.9, 0]. The system must then output [1, 0, 0, 0, 1, 0, 0] first, then [1, 0, 0, 0, 0, 1, 0], then [0, 0, 0, 1, 0, 1, 0], and finally, [0, 0, 1, 0, 0, 1, 0].
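To make the encoding concrete, here is a minimal sketch of how I read the “circle of thirds” code. The mod-4/mod-3 indexing and the function names are my assumptions, not taken from the paper; they do reproduce the four target vectors quoted above for C, E, G and B-flat. Note that summing those four vectors gives [2, 0, 1, 1, 1, 3, 0]: multiplying by 0.3 yields the quoted C7 input, whereas dividing by the largest value would give roughly [0.67, 0, 0.33, 0.33, 0.33, 1, 0], so the sketch computes both.

```python
def circle_of_thirds(pitch_class):
    """Encode a pitch class (0 = C, ..., 11 = B) as 7 bits.

    Assumption: the first four bits mark one of four circles of major
    thirds (pitch_class mod 4) and the last three bits mark one of three
    circles of minor thirds (pitch_class mod 3).
    """
    major = [0] * 4
    minor = [0] * 3
    major[pitch_class % 4] = 1
    minor[pitch_class % 3] = 1
    return major + minor


def chord_sum(pitch_classes):
    """Sum the 7-bit vectors of the chord tones, element by element."""
    vectors = [circle_of_thirds(pc) for pc in pitch_classes]
    return [sum(column) for column in zip(*vectors)]


C7 = [0, 4, 7, 10]  # C, E, G, B-flat as pitch classes

# The four vectors the network must output, one per time step.
targets = [circle_of_thirds(pc) for pc in C7]

summed = chord_sum(C7)

# Two candidate scalings of the summed vector (see the note above).
scaled_by_max = [round(v / max(summed), 2) for v in summed]
scaled_by_0_3 = [round(0.3 * v, 1) for v in summed]

print(targets)
print(summed)
print(scaled_by_max)
print(scaled_by_0_3)
```

Under this reading, every one of the 12 pitch classes gets a distinct 7-bit code, since the pair (p mod 4, p mod 3) identifies p mod 12 uniquely.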
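In the second task, the network (“one cell in each of ten memory blocks”) outputs 0 at every step until it has been given a complete five-pitch sequence; it outputs 1 at the end only if that sequence contains a chromatic run (consecutive semitone steps) in its second, third, and fourth notes.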
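The target for this task can be sketched in a few lines. This is only an illustration under my own assumptions: pitches are given as MIDI note numbers, and “ordered chromatically” is taken to mean ascending semitone steps (the paper may also count descending runs). The function names are hypothetical.

```python
def chromatic_target(pitches):
    """Return 1 if notes 2-4 of a five-pitch sequence rise by semitones.

    Assumption: pitches are MIDI note numbers and only ascending
    chromatic runs count.
    """
    if len(pitches) != 5:
        raise ValueError("expected a five-pitch sequence")
    second, third, fourth = pitches[1:4]
    return int(third - second == 1 and fourth - third == 1)


def targets_per_step(pitches):
    """Desired output at each time step: 0 until the final note."""
    return [0] * 4 + [chromatic_target(pitches)]


print(targets_per_step([60, 62, 63, 64, 67]))  # D, E-flat, E in the middle
print(targets_per_step([60, 62, 64, 65, 67]))  # whole step inside the run
```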
In the third task, the network (“2 cells per block of the 15 blocks”) is to learn a pitch sequence of length 32. Once trained, the system must output exactly that pitch sequence given only the first note as input.
Finally, in the fourth task, a jazz song is broken into two sequences of vectors, one for pitch and one for duration. Two networks are trained (“The pitch network contains 15 blocks, with four cells each. The duration network contains 17 blocks, four cells each.”) to reproduce the correct sequence.
I like that Franklin takes pains to test the networks on a variety of tasks of growing complexity, but I can’t see their relationship to music. Furthermore, most of the tests involve only a few observations for training. In the last two tasks, there is only one sequence that each network is taught to reproduce. I also can’t shake the feeling that separating pitch and duration is arbitrary. Nonetheless, it is a historic paper!