I gave the full album a listening-to and find it remarkably palatable. It’s weird, but in the right ways, like PC music. And it’s not too sugar-coated. It’s really exciting work!
Here’s a great talk from Barbara Engelhardt about “open box” machine learning applied to genomics and other medical research:
She convincingly shows the need for domain expertise in building models that help explain the world around us. I love her description of this kind of work as “detective work”. That is also reflected in the Science article from last year, “The AI detectives“. She also has a deep personal connection to questions she is trying to answer with her research, having lost her mother to Lewy Body Dementia. (My father passed away last year from this poorly understood disease.)
Along with these, here’s some excellent work on adversarial attacks on an end-to-end speech transcription system (Deep Speech): http://nicholas.carlini.com/code/audio_adversarial_examples/
Listen to the sound examples and their hilarious transcriptions! I don’t believe speech transcription systems that use “laboriously engineered processing pipelines” would be so susceptible. It can pay to understand a domain and build invariances into a pipeline rather than try to learn everything from data.
Over the holiday break I started to dig into the music transcription models that we trained and used since 2016, and figure out what the ~6,000,000 parameters actually mean. In particular, we have been looking at the “version 2” model (see parts 1, 2, 3, 4, 5 and 6 of this series), which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including this new little work which I title, “Experimental lobotomy of a deep network with subsequent stimulation (2)”:
This little piece was the outcome of stimulating the v2 model with a two-hot input (or forcing), created from the generated token and a random pitch class token. I gave a weight of 0.7 to the token generated by the model, and 0.3 to the other. The resulting piece is weird, but a good kind of weird.
Anyhow, last time we performed some Fourier analysis on the sequences generated by the last LSTM layer. We then identified units producing sequences having a significant amount of power at a period related to the measure tokens (9 steps), and then excised these from the network to see how the “counting” ability of the system is affected. This showed just how distributed the model’s “counting” ability is.
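The spectral analysis described above can be sketched in NumPy. The function name and the (T, n_units) array layout are my assumptions for illustration, not folk-rnn's actual code:

```python
import numpy as np

def magnitude_spectra(hidden):
    """Hann-windowed magnitude spectra of per-unit hidden state sequences.

    hidden: array of shape (T, n_units), one sequence per LSTM unit.
    Returns (freqs, mags), with freqs in cycles per step and mags of
    shape (T//2 + 1, n_units).
    """
    T = hidden.shape[0]
    window = np.hanning(T)[:, None]           # same window applied to every unit
    mags = np.abs(np.fft.rfft(hidden * window, axis=0))
    freqs = np.fft.rfftfreq(T, d=1.0)
    return freqs, mags
```

A unit "counting" measures would show a peak in its column of `mags` at a frequency near 1/9 cycles per step.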
Now, what is going on in the first and second layer LSTM hidden states during the steps? Here’s the first layer hidden states:
Here’s the second layer hidden states:
Time runs from bottom to top. We see a more active scene in the second layer than the first. Here are the distributions of activation values in the hidden states of all three LSTM layers:
This shows that the activations of the hidden states in all three layers are sparse, but that the deeper we travel in the network the more likely we will see large magnitudes. This is also shown in the following picture of the sorted maximum magnitude values of the sequences of each unit.
Let’s look now at the magnitude spectra of the hidden state sequences produced by each unit in the first LSTM layer:
This is quite a different picture from what we saw in the third layer, reproduced here:
The sequences of the first layer have less diversity of periodicities compared with those of the third layer. For instance, we can see some sequences of period 9 steps and 2.5 steps, but there aren’t many of them. We also see that most of the power in the first layer sequences seems to occur in periods longer than 8 steps.
When we snip 33% of the first layer units exhibiting a sequence with more than 0.025% of its power in periods between [8.7, 9.3], and have the model generate from the same random seed as before, it produces the following transcription:
M:4/4 K:Cmaj A D | E 3 C (3 D E D C D | G E C E (3 F G F E C | G F D C E D F D | G, 3 E C 2 C A, | (3 G ^F G F E E D C D | E 2 D G F A c A | G 2 C G c B d c | B c B G F G D F | D E D F B D c B | A /2 B /2 c B c d c A G | (3 E F G C B D E G E | F F F C F E F D | E C 2 C B, G, B, A, B, | B, 2 E C (3 F F F F E | z C B, C D C C C | B, 2 E C E F z c | E F B c G F D c | B c B c g c c d | z g B a B c e G | B F F F c G G G | (3 A G F G c c D z B | B 2 E c B G E C | (3 A G F A F (3 c G F E F | A G F c G c e d | d c G d c z B G | B c B A c G F E | B F E F B E E E |
Here’s the dots:
Not even a good tune, but every measure except one is correctly counted. This is quite different from what we observed when we did the same in the 3rd LSTM layer. These observations suggest that the first LSTM layer is not “counting” measures, but just reacting to the periodic input of the measure token.
Let’s now look at the magnitude spectra of the hidden state of the second LSTM layer:
This looks more similar to what we see in the last LSTM layer.
Now, thinking about the sequences produced by all layers, how is power concentrated over the periods? Let’s look at the mean of the normalised spectral magnitudes (computed using a Hann window) for each layer (with one standard error of the mean above and below):
The x-axis is period in steps. For layers 1-3, I normalise by the maximum power in layer 1. The magnitude spectrum of the softmax output is normalised with respect to its own maximum power. For layers 1-3, we see a lot of power at DC (coming from the polarity of the sequences). We see large peaks around periods of 36, 18, 9, 4.5, 3 and 2.25 steps. We also see that the powers of the sequences of LSTM layer 1 are by and large less than those in the other two LSTM layers. Layer 2 seems to be producing sequences with the most power, except at periods of about 2.25 and 9 steps. The spectrum of the softmax layer sequences is more compressed than the others, which is due to the softmax operation. We see a high degree of similarity between all of the magnitude spectra. Taking into consideration the standard error of the mean of the sequences in each layer, there seems to be less variation in the DC component of all sequences than at any other period. For most of the peaks, we see power variations of about 1 dB centred on the mean. Now, what do these peaks mean? Why is there so much power at specific periods, e.g., 2.25 and 8.8 steps?
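The layer averaging just described can be sketched as follows, again assuming hidden states arrive as a (T, n_units) NumPy array; names are mine, not folk-rnn's:

```python
import numpy as np

def layer_mean_spectrum(hidden, ref_power=None):
    """Mean Hann-windowed power spectrum across a layer's units, with
    the standard error of the mean.

    hidden: (T, n_units). ref_power: normalisation constant; pass the
    maximum power measured in layer 1 to make layers 1-3 comparable, as
    in the plot described above; defaults to this layer's own maximum.
    """
    T, n_units = hidden.shape
    window = np.hanning(T)[:, None]
    power = np.abs(np.fft.rfft(hidden * window, axis=0)) ** 2
    power /= ref_power if ref_power is not None else power.max()
    mean = power.mean(axis=1)
    sem = power.std(axis=1, ddof=1) / np.sqrt(n_units)
    freqs = np.fft.rfftfreq(T, d=1.0)
    return freqs, mean, sem        # plot 10*log10(mean) against 1/freqs
```

The one-standard-error band in the plot is then `mean - sem` to `mean + sem`, converted to dB.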
Before we go any deeper with this analysis, we have to make sure that what we are observing is actually in the sequences and not a product of our analysis decisions. Let’s compare the magnitude spectra of Hann and rectangular windows of length 147 (the length of our sequences):
The width of the main lobe of the Hann window (about 0.014 cycles per step) is a little less than twice that of the rectangular window (0.008 cycles per step) (I’m defining width as twice the frequency of the -6 dB point); however, the power of the first sidelobe of the Hann window is more than 30 dB below that of the main lobe, whereas that of the rectangular window is only about 13 dB lower. So, if we use a rectangular window for analysis, we will have narrower peaks in the magnitude spectrum, but more spurious peaks coming from the sidelobes rather than from frequencies that actually exist in a sequence. Since we are using a Hann window for analysis, we will have wider peaks, and so more uncertainty about the frequencies present in a sequence, but we can be relatively sure that none of the peaks we see at sufficiently high powers (relative to nearby larger peaks) are contributed by sidelobes.
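Those sidelobe figures can be checked numerically. This sketch measures the first sidelobe level of each window from a zero-padded DFT; the helper name is mine:

```python
import numpy as np

def first_sidelobe_db(window, pad=64):
    """Height of the window's first sidelobe relative to its main lobe, in dB."""
    spec = np.abs(np.fft.rfft(window, n=pad * len(window)))
    spec_db = 20 * np.log10(spec / spec.max() + 1e-12)
    i = 0
    while spec_db[i + 1] < spec_db[i]:   # descend the main lobe to the first null
        i += 1
    while spec_db[i + 1] > spec_db[i]:   # climb to the first sidelobe peak
        i += 1
    return spec_db[i]

N = 147                                       # the length of our sequences
rect_db = first_sidelobe_db(np.ones(N))       # about -13 dB
hann_db = first_sidelobe_db(np.hanning(N))    # about -31 dB
```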
The range of periodicities covered by the main lobes of these analysis windows (their “support”) depends on the frequency to which the window is modulated. This relationship is given in the following table for some key periodicities we are observing in the sequences.
Main lobe support (steps):

| Sequence period (steps) | Rectangular | Hann |
| --- | --- | --- |
| 2.25 | [2.23, 2.27] | [2.21, 2.29] |
| 4.5 | [4.4, 4.6] | [4.3, 4.7] |
| 8.8 | [8.5, 9.1] | [8.3, 9.4] |
| 18 | [17, 19] | [16, 21] |
| 36 | [31, 42] | [28, 48] |
| Inf | [250, Inf] | [143, Inf] |
So, the Hann window modulated to a period of 8.8 steps will have a main lobe between 8.3 and 9.4 steps. This means that for a peak around 8.8 steps, the “true” period (or periods) lies somewhere between 8.3 and 9.4 steps. (There may be a few frequencies present contributing to the single peak.)
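The table entries follow from a small calculation. This sketch assumes the measured main-lobe widths quoted above (0.014 and 0.008 cycles per step) and the -6 dB width definition; the function name is my own:

```python
import numpy as np

def main_lobe_support(period, half_width):
    """Range of periods covered by a window's main lobe when the window
    is modulated to the frequency 1/period (cycles per step).

    half_width: the -6 dB frequency of the unmodulated window, i.e. half
    the main-lobe width as defined in the text.
    """
    f0 = 1.0 / period
    lower = 1.0 / (f0 + half_width)
    upper = np.inf if f0 <= half_width else 1.0 / (f0 - half_width)
    return lower, upper

HANN_HALF = 0.014 / 2   # half the measured Hann main-lobe width
RECT_HALF = 0.008 / 2   # half the measured rectangular main-lobe width
```

For example, `main_lobe_support(8.8, HANN_HALF)` recovers the [8.3, 9.4] entry in the table.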
The locations of the first sidelobes are important to distinguish real spectral peaks from fake ones. These are given in the following table for some of the key periods we are seeing:
First sidelobe locations (steps):

| Sequence period (steps) | Rectangular | Hann |
| --- | --- | --- |
| 2.25 | 2.20, 2.30 | 2.17, 2.33 |
| 4.5 | 4.3, 4.7 | 4.2, 4.8 |
| 8.8 | 8.0, 9.6 | 7.7, 10.2 |
| 18 | 15, 22 | 14, 25 |
| 36 | 26, 56 | 23, 85 |
The first sidelobes of the Hann window are located at 7.7 and 10.2 steps, but the power of the sidelobes of the Hann window is at least 30 dB below the peak. That means for the peak we see at 8.8 steps, the first sidelobes of the Hann window are at 7.7 and 10.2 steps at a power less than -38 dB. This is to say that the peaks we see around 7 and 11.5 steps are real, and are not coming from the peak at 8.8 steps.
Since the mean value of the sequences (the DC component) contributes the most power of the sequences from all layers, we need to consider its sidelobes when interpreting long periods. The first sidelobe of the Hann window centred on DC occurs at a period of 63 steps. This means that the peak we see at a periodicity of 36 steps about 8 dB lower than DC is not coming from a sidelobe of the window. It is real.
So, now that we are sure which peaks in the mean magnitude are real, how can we interpret them? What is contributing to the peaks at 2.25, 3, 4.5, 6, 8.8, 11.5, 18, 36 steps and to DC? Knowing how the system operates, the magnitude spectra of any sequence produced by our LSTM layers must arise from the sequence input to the model, which itself arises from the model’s output sequence (recurrence!). First, the peak we see at DC comes from the input of tokens that appear every step or only once, which includes any token, and in particular “M:4/4”, “K:Cmaj”, “(3” and “4”. Second, the peaks we see at around 8.8 steps in the magnitude spectra of the three layers are coming from, at least, the input of tokens that appear every 8-9 steps, e.g., the measure token “|”. (Review the transcription generated by the model from these sequences.) The input of “|” occurring every 8-9 steps might also be contributing to the peaks we see at multiples of that period, e.g., 18 and 36 steps, if the sequence shows modulation (which we do see on some of the sequences). We can’t say that all of the peak at 8.8 steps is caused by “|” because there may be other things with such a period.
Third, looking back at the generated transcription, we see the repeat measure token “:|” occurring twice: at step 75 and then 69 steps later. So, we should see a peak around that period. However, we won’t see any peak there in our Hann-windowed sequences because its main lobe is too wide. Using a rectangular window might resolve this:
We see a peak around a period of 100 steps in LSTM layers 2 and 3, which is where the first sidelobe of the rectangular window occurs, so caution must be taken with that peak. But we do see peaks between 60 and 67 steps. The latter is probably related to the “:|” token output, but 60 steps seems too short a period. (Performing an analysis on the rectified sequences does not show a peak in the expected place either.)
What is causing the peak around 2.25 steps in every layer? One cause could be that every other token is a pitch token, but that would make a peak at exactly 2 steps. Instead, it seems that every other token is a pitch token most of the time, and sometimes, as when a duration or a bar is specified, the next pitch token occurs 3 steps later. Since the pitch tokens are quite frequent in this transcription, it makes sense that sequences would exhibit power at a period related to pitch tokens occurring every other step most of the time.
What is causing the peak at 36 steps? One cause could be the repetition of some of the transcription, e.g., “d c d e c 2 A G | A d d” in the second half of the transcription occurs 35 steps later, and “A D D” occurs 20 steps later and 38 steps later. So, the power we see at longer periodicities can be related to larger structures.
Other clues for interpreting the magnitude spectrum of the posterior sequences comes from inspecting its output during the generation of the transcription:
We see some specific tokens that turn on and off with a period of about 2: “(3”, “2”, “/2”. So, at least the peak around 2 steps that we see in the magnitude spectrum of the mean softmax sequence comes from these. These particular tokens do not appear much in the generated transcription: “/2” never appears, “(3” appears once, and “2” appears 9 times.
Now, are the periodicities we see in the sequences of the layers caused by the output of the model, or do they cause the output? Surely the answer depends on which layer we are talking about. We sample from the posterior sequence, and that means the posterior sequence causes the output. We input a token to the first layer, and that causes its sequence. But what about layers 2 and 3? Is the periodicity of 8.8 steps in LSTM layer 3 caused by the input of a measure token every 8.8 steps, or is it causing the output of a measure token every 8.8 steps?
Let’s see how two of the 9-step periodic outputs in the first and third LSTM layers coincide with the input and output of measure tokens. Let’s take the sequence produced by a unit in each of the LSTM layers exhibiting a periodicity of 9 steps and plot them against the generated transcription.
The vertical dashed lines show those steps in which the measure token shown at bottom is generated. The next step is when it is input to the model. We can clearly see the sequence produced by LSTM layer 1 unit 247 spikes in every step but three when a measure token is the input. It doesn’t spike the two times when the input is “:|”, or the first time a “|” is input. LSTM layer 3 unit 156 spikes in every step when a measure token is output. And LSTM layer 2 unit 97 is almost zero for two steps when the output is a measure token. We also see the sequence produced by the LSTM layer 2 unit has a period of slightly more than 2 steps.
We need to disambiguate temporal characteristics of the sequences of each layer that share similar periods. This will hopefully help us see relationships between the model input and output.
I love the author photos at the end!
There are certainly far more basic things going on in that GAIdar machine than a sensitivity to the sexual preference of a living, breathing social animal reduced to a self-selected composition of pixel intensities in RGB colour space.
This holiday break has given me time to dig into the music transcription models that we trained and used in 2016-2017, and figure out what the ~6,000,000 parameters actually mean. In particular, we have been looking at the “version 2” model (see parts 1, 2, 3, 4 and 5 of this series), which generated all of the 30,000 transcriptions compiled in ten volumes, as well as helped compose a large amount of new music, including the lovely fourth movement of Oded Ben-Tal‘s “Bastard Tunes”:
The last part of our exploration looked at another specific transcription. We looked at how much the generated transcription copied from the training transcriptions, and found some but not excessive overlap. We looked at the posterior distributions over the generation steps and saw some interesting things, e.g., mirrored probability mass in the pitch tokens. We looked at the hidden state of the third LSTM layer over the generation steps and saw some other interesting properties, e.g., very sparse activity overall, some units activating only once, some units nearly always saturated, and some units emitting a periodic sequence. Our spectral analysis of the hidden state sequences of the third LSTM layer shows that sequences emitted by many units are similar, including harmonics with a fundamental period of 9 steps. This fundamental period seems to correspond with the output of the bar tokens. At the end, we asked how things will change if we have the v2 model generate another C-major 4/4 transcription. Will the behaviours stay the same? What will happen when we intervene on the dimensions of the hidden state that exhibit a 9-period sequence? Will the system not be able to correctly output measure tokens? Also, do the sequences produced by the units on the other two LSTM layers exhibit such periodicity?
After I initialised the v2 model with a new random seed, it generated the following transcription:
M:4/4 K:Cmaj A D D E F 2 A c | G A (3 B A G c B A G | A D D 2 F E D C | D E G A B G E G | A D D 2 F E F A | G B c d c e d c | A G A B c B A G | A c G E D 2 C 2 :| d c d e c 2 A G | A d d c d e e d | c c d e c A G A | A G F G D 4 | d c d e c 2 A G | A d d e c d e g | f e d f e d c A | G E C E D 2 C 2 :|
In common practice notation, this appears like so:
All measures are correctly counted. This transcription features two repeated 8-bar parts (“tune” and “turn”). The transcription is missing a repeat sign at the beginning of the first and second parts, but human transcribers often make the same omission. There are a few peculiar things here, though. The tune is not in C-major but D-dorian. An inexperienced human ABC transcriber might make the same confusion because the key signatures are the same. Each part ends on C, however, which makes for a surprise after hearing D-dorian throughout. There is not much resemblance between the tune and turn. Also, it seems the tune is not composed of two equal-length phrases, but of one 10-beat phrase answered by a 9-beat phrase, followed by a 13-beat phrase ending with a “question mark” (C). The turn is made of two four-bar phrases, ending in the same way as the tune, which is perhaps the only strong bond between the two parts.
Again, I would make several changes to this transcription to improve it, but here’s my rendition of the tune on the Mean Green Machine Folk Machine:
It kind of sounds medieval to me … transcriptions of which do exist in the model’s training data. Let’s see what the network is “thinking” as it generates this output.
First, I find little similarity between this generated output and any of the 23,646 training transcriptions.
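One way to quantify such similarity is the overlap of token n-grams between the generated output and the training data. This is a sketch of the idea, with hypothetical names, and not necessarily the comparison performed here:

```python
def ngram_overlap(candidate, corpus, n=6):
    """Fraction of the candidate's token n-grams appearing anywhere in
    the corpus. Transcriptions are whitespace-separated token strings,
    as in the ABC token sequences folk-rnn works with.
    """
    def ngrams(text):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0
    seen = set()
    for transcription in corpus:
        seen |= ngrams(transcription)
    return len(cand & seen) / len(cand)
```

A score near 1 would indicate heavy copying from the training transcriptions; a score near 0, little similarity at the chosen n-gram length.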
Next, here’s a picture of the softmax output:
As before, we see a mirroring of probability mass in the pitch tokens with accidentals. We can see a lot of probability mass on the measure tokens every 7-11 steps. We can see probability mass is distributed across several pitch tokens at the beginning of the two phrases — but also around the middle and ending of the first phrase. The final two pitch tokens of the second phrase seem quite certain. We also see probability mass given to the triplet grouping token “(3” throughout the generation.
Visualising the hidden state of the third LSTM layer as the sequence steps shows a picture similar to before:
Here is the plot from before:
We see similar behaviours in several units, e.g., 60, 81, 300, 325, 326, 445, 464, and 509. Overall, the activations remain quite sparse:
A Fourier analysis of the output sequences produces the following spectra (the magnitude spectrum of each sequence is normalised by its maximum value, and not the maximum value of all spectra combined):
Here is the one from before:
This shows some clear differences. We still have some periodicity at 9 steps, but it is not as strong as before. This difference is also evinced by the odd-length phrases in this transcription.
Let’s look at the outputs of the same units we inspected last time. Units 208 and 327 output single spikes last time. Now their outputs are quite different:
Units 34 and 148 output rather periodic sequences last time, which they continue here:
And units 60, 156, and 203 produce output that looks quite similar to what they produced last time:
So, it seems some things have changed, but others have not.
Let’s go a bit further now and find the units producing a sequence exhibiting a significant amount of power at a period of 9 steps. Here are the magnitude spectra of the sequences from the 96 units that have at least 5% of their power in the 5 bins corresponding to periods between [8.7, 9.3]:
Here are the sequences produced by these 96 units:
And here’s the spectra of the sequences of the remaining 416 units:
In the latter we can still see some power around a period of 9 steps, but there is more power in frequencies having a period of about 2.25 steps.
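The selection of those 96 units can be sketched like so, assuming hidden states come as a (T, n_units) NumPy array; the function name and thresholds here are illustrative:

```python
import numpy as np

def units_with_band_power(hidden, band=(8.7, 9.3), frac=0.05):
    """Indices of units whose Hann-windowed power spectrum places at
    least `frac` of its total power in the DFT bins whose periods fall
    inside `band` (in steps). hidden: (T, n_units).
    """
    T = hidden.shape[0]
    window = np.hanning(T)[:, None]
    power = np.abs(np.fft.rfft(hidden * window, axis=0)) ** 2
    freqs = np.fft.rfftfreq(T, d=1.0)
    with np.errstate(divide='ignore'):
        periods = 1.0 / freqs                      # DC maps to inf
    in_band = (periods >= band[0]) & (periods <= band[1])
    band_frac = power[in_band].sum(axis=0) / power.sum(axis=0)
    return np.nonzero(band_frac >= frac)[0]
```

Lowering `frac` to 0.04 or 0.035 widens the selection, as in the experiments that follow.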
Now let’s do something dramatic. For the 96 third layer units that have produced a sequence having a period of 9 steps, we will snip their connections to the softmax layer and run the generation procedure again with the same random seed. In essence, this entails setting to zero 49,152 values in the softmax layer. Will the lobotomised system not be able to count correctly now? Here’s the output transcription generated with the same random seed:
M:4/4 K:Cmaj A _A | E C 3 c d e c | =c A G E D E F A | G C G C E G c d | e c d e c A A G | G C 3 D C e c | d c A G E D /2 E /2 F A | G c d A G d c A | G E D E C 2 :| |: c d | e 2 e d c 3 d | e G A /2 A /2 c e c d e c d c d e d c d | (3 e f e d c A c G F | E D G E G 2 c d | e A A G c d e g | a g e c A d c A | G E D E C 2 :|
Here’s the dots:
This is a terrible tune from the pickup (which I am not going to learn)! But by golly, our lobotomised model can still “count” well enough to put the measure bars where they belong, save for the 2nd bar in the turn. The model even accounted for the pickups to both 8-bar sections.
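The snipping itself amounts to zeroing columns of the output weight matrix. This is a sketch assuming the weights mapping the last hidden state to token logits are stored as a (vocab_size, n_units) array; the actual implementation's layout may differ:

```python
import numpy as np

def snip_units(W_out, unit_indices):
    """Return a copy of the softmax layer's weight matrix with the
    connections from the given last-LSTM-layer units set to zero.

    W_out: (vocab_size, n_units), mapping the final hidden state to
    token logits.
    """
    W = W_out.copy()
    W[:, unit_indices] = 0.0    # every token loses its input from these units
    return W
```

Generation then proceeds as usual with the snipped matrix in place of the original.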
What if we snip all units producing sequences that have 4% of their power in the 5 bins corresponding to periods between [8.7, 9.3]? That adds another 42 units to the blade (cutting a total of 138 units). Here’s the output transcription generated with the same random seed:
M:4/4 K:Cmaj A _B | A G F A G E B, C D E | E 2 A E A C 3 | A 2 A F G F E D | (3 E F G ^F G A 3 B | (3 c /2 d /2 c /2 G B d c /2 z /2 A G E | F 2 A G F G A F (3 G A B c c d c B d | c (3 E G E G D [ C 3 E c ] :| | G, C D E E 2 C D E | 3/2 D | _E G, E ^D E G A B | c e c d c A G c | G E D ^C D G (3 G ^F G | A B A G ^F G A B | c 2 c B c d c c A G ^F E D G A E | | G 3/2 A G F D G c B c :|
Here’s the dots:
This shows we have now cut a bit too much. The model is losing its ability to count. It’s even committing errors like putting a duration token before a pitch token, e.g., “| 3/2 D |”.
Snipping all units producing sequences that have 3.5% of their power in the 5 bins corresponding to periods between [8.7, 9.3] adds another 31 units to the blade (cutting a total of 169 units). Here’s the output transcription generated with the same random seed:
M:4/4 K:Cmaj A _B | A G F A G E | G _E =G D E C F A G 2 | G A _B A G C |: E D F 4 F F | ^F G =F D =G 2 _D 2 =B, 2 | c A _B 2 (3 A A G c 3 | B A G C | =F 4 C 2 z 4 :| |: G c c 2 G c (3 c c =B A _B | G D 2 B, C E ^G A | G c c _B c A _B G |1 F 4 D _B, :| _E F F E |:
Here’s the dots:
It seems to have no problem reproducing the meter and mode tokens, but after that it’s just nonsense. Now only 2 measures are correctly counted. It decides to end the transcription with an open repeat sign.
So, cutting the connections from 96 of the 512 units in the last LSTM layer to the softmax layer does not reduce the model’s ability to produce measure line tokens at the right places. That’s nearly 19% of those connections. However, after snipping 27% of them, we see a marked effect on this ability. And increasing this to 33%, the ability nearly disappears.
What if we snip 33% of these connections at random? Here’s the output transcription generated with the same random seed:
M:4/4 K:Cmaj A 4 G E G c | c F B c _E G C ^C | G, F, D, 2 G, C, F, ^G, | E, C C D, E, F, 3 | F, E, D, F, E, C, F, G, | E, /2 F, /2 F, D, C, E, C, E, /2 G, /2 G, | F, G, F, E, C, E, C, C, | A, B, C D C G F D | G 2 F D _B, E D G, | C G G F E /2 F /2 G F G | f /2 f /2 d B d c 2 G _E | A, =B, C B, G, F, E, G, | F, D D D, C, 2 C B, | C E G F =E C E G | c c E G E C G G | F =E F G A G G 2 | G /2 A /2 G F E F G F D | _E G _d c =G f c B | G 3 B G c G D, | G /2 C /2 C G, G, F, C E, C, | G, _B, A, D _B, 2 B, d :|
Here’s the dots:
It’s a real dog’s breakfast — there’s very little resemblance to any conventions of folk music, other than some stepwise motion and a few arpeggiations — but at least every measure is correctly counted! Repeating the above shows the same behaviour. So, it appears that our model has a significant amount of redundancy built into it when it comes to outputting the measure tokens at the correct steps. This is not really surprising because we trained the model with 50% dropout, which means for every training batch, only 50% of the connections were active. So, this has nicely distributed some of the “concepts” through the model.
The “concepts” of pitch and melody, however, seem to occupy a much larger portion of the model’s parameters. This makes sense because melody is a far more complex thing than counting. Also, the vocabulary has 85 pitch tokens, versus 5 measure tokens and 25 duration tokens.
Time to go deeper into the network. Stay tuned!
One thing I can say about folk-rnn as opposed to other AI composition systems is that it can bring families together, ’round the table after dinner to peck out some glorious tunes of a totally synthetic tradition. Here’s my wife and I playing a folk-rnn (v1) original, which it has titled, “The Cunning Storm”:
We aren’t sure about the title, because it doesn’t fit the tune… at least with the common definition of “storm” — but maybe that’s just what makes this storm “cunning”. It’s a quaint little storm! Or maybe the system translated our last name from German, and so it should be “The Cunning Sturm”?
Either way, here’s the verbatim output of the system (which can be found on page 3,090 of the folk-rnn (v1) Session book, Volume 4 of 20):
You can see I wasn’t quite satisfied with bars 5, 7 and 8. The turn is totally unchanged, save the second ending. Anyhow, with these few changes, “The Cunning Storm” becomes one of my favourite machine folk ditties!
Finally, the 60,000+ transcriptions generated by the first version of folk-rnn (the one that also generates titles), are now available for an unlimited time only in this 20-volume set:
Last year we released 10 volumes of the 30,000 transcriptions generated by the second version of our system:
And here’s the 4 volumes of 10,000 transcriptions generated by the third version of our system:
Off the cuff, one of our collaborating musicians said the transcriptions of the first version are their favourite. I wonder whether a large part of that has to do with the charm of the titles generated by the system. I know from my own experience that the generated titles often draw me in for a closer look.