Normalised Compression Distances for Comparing Melodies

A fundamental problem we face with a system that can generate an endless number of folk tunes is how to find the interesting ones. In this direction, we have started to look at simple measures of similarity. Since we are dealing with sequences of symbols, a natural first step is to think about comparing those sequences according to their compressibility: how much does knowing one sequence help us describe another sequence with parsimony? In this post we briefly look at the normalised compression distance (NCD) as some measure of similarity (which is at the core of the evaluation approach proposed in J. Ens and P. Pasquier, “A Cross-Domain Analytic Evaluation Methodology for Style Imitation“, in Proc. Int. Conf. Computational Creativity, 2018).

Given two strings x and y, the NCD between them is defined by

D(x,y) = \frac{K(xy)-\min[K(x),K(y)]}{\max(K(x),K(y))}

where K(x) is the Kolmogorov complexity of the string x, and xy is the concatenation of x and y. The Kolmogorov complexity of any string is not computable, but an upper bound can be estimated via using a compression algorithm, e.g., using the Lempel-Ziv-Welch algorithm followed by Huffman coding (as done in gzip). K(x) expresses the degree to which x can be compressed. So K(xy)-\min[K(x),K(y)] expresses the degree to which either of the two strings helps in compressing the other. In the limit that the two strings are the same, this value will be zero because K(xx) = K(x). In the limit that the two strings are completely different, K(xy) will grow to be no longer than twice \max(K(x),K(y)). Normalising the numerator with \max(K(x),K(y)) then will give a value close to 1 (but could actually be greater than 1).

In our work, we are looking at symbolic transcriptions of folk tunes. Here’s an example transcription of length 109 tokens from our training data:

M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 | G A G F E G c f | 
e c d B G c c 2 :| e 2 c e d e c d | e 2 c e d B G 2 | e 2 c e d e c d | 
e c d B G c c 2 | e 2 c e d e c d | e 2 c e d B G ^F | G A B c d e f d | 
g e f d e c c 2 |

Let’s see how this is compressed by a text compression algorithm. LZW (we are using python-lzw) iteratively builds a codebook from an input sequence. It starts with codewords consisting of single items, and then adds longer sequences that it finds are unique. For the example above, this codebook of symbols looks like (each pair of square braces designates a codeword):

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['A'] ['|'] ['_B'] ...
['M:4/4', 'K:Cmaj'] ['K:Cmaj', 'G'] ['G', '3'] ['3', 'F'] ['F', 'E'] 
['E', 'F'] ['F', 'G'] ['G', 'A'] ['A', '|'] ['|', '_B'] ['_B', 'G'] 
['G', 'c'] ['c', 'B'] ['B', 'A'] ['A', 'F'] ['F', 'F'] ['F', '2'] 
['2', '|'] ['|', 'G'] ['G', 'A', 'G'] ['G', 'F'] ['F', 'E', 'G'] 
['G', 'c', 'f'] ... ['g', 'e'] ['e', 'f', 'd'] ['d', 'e', 'c', 'c']

This is ordered in the sequence that the codebook is created. Here we see unique individual elements are added first, followed by pairings “M:4/4 K:Cmaj” and then “K:Cmaj G”, … and then “G A G” since the pair “G A” had already been seen. Towards the end of building the codebook we see longer sequences, e.g., “d e c c”.

The LZW encoding of the length-109 transcription above results in a sequence of length 83 , which is decoded according to:

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['F'] ['G'] ['A'] ['|'] 
['_B'] ['G'] ['c'] ['B'] ['A'] ['F'] ['F'] ['2'] ['|'] ['G', 'A'] 
['G'] ['F', 'E'] ['G', 'c'] ['f'] ['|'] ['e'] ['c'] ['d'] ['B'] 
['G', 'c'] ['c'] ['2'] [':|'] ... ['e'] ['f'] ['d', '|'] ['g'] 
['e', 'f'] ['d', 'e', 'c'] ['c', '2', '|']

This stream of bytes can be further compressed by allocating bytes to codewords depending on their frequency, i.e., via Huffman coding. For the transcription above, the symbols “G”, “F”, and “E” will be given the fewest bits since they are the most frequent, while “M:4/4”,  “K:Cmaj” and “d e c” will be given the most bits since they are the least frequent. For now we will just stick to computing the NCD using LZW compression.

Now that we have an idea of how to estimate the Kolmogorov complexity of strings, let’s look at the NCD within and across two corpora: the 23,636 tunes in the training data of the folk-rnn v2 model; the 60,000 tunes generated by v2 (available in the folk-rnn (v2) Session Books Vols 1-10). We draw a tune from the training corpus, a tune from the generated corpus, and sample a random sequence having a length that is the maximum of the two tunes. We then compute the NCD of each tune with itself (setting x = y), and paired with the others (where x and y come from the two corpora, or is a random sequence). We repeat this 10,000 times and then plot the histogram of the resulting NCD.

By way of examples, here’s a generated tune:

M:3/4 K:Cmaj G, 2 | C 2 E /2 G /2 B /2 d /2 c 2 | B > A G E /2 C /2 D 2 | E 2 
(3 F D B, C > C | C 4 | G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 | 
c > A B G (3 A G E | G 4 G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 | 
c > A B G (3 A G E | G 4 E > D | C 2 E /2 G /2 B /2 d /2 c 2 | 
B > A G E /2 C /2 D 2 | E 2 (3 F D B, C > C | C 4 |

And here’s a random tune:

B, ^f' ^a M:2/4 | E F a' 12 M:3/2 _C ^c E, =D |1 _B M:6/8 C K:Cmin d' ^G /4 
_e _B e M:6/8 /3 4 =g _D 12 ^A, G =d > 3/4 =C, D _d =E K:Cmaj _E, ^D ^F, C, =G 
=g 3 5 b 9 f' /2< C /3 D =B _E, /2> K:Cmix =a _G ^c' ^f G, ^C, =A, M:3/4 b' =a 
| A, d' ^A, 3/4 =G M:9/8 7 ^F =b _b e M:2/4 |1 _d' /2 e' 7 c ^A _D K:Cmaj a' 
=E (6 3/4 G, > _E, _e' =E, :| 3/4 _D b ^f ^C K:Cmix ^F /2> =C, 16 F C =G, C 
^C, =F, (2 _e' =F K:Cdor /2 e' (3 7/2 5/2 F, _E _e =F, ^A c 2> |2 =b 4 (5 /2> 
7 M:9/8 f _B /2 7/2 K:Cmaj |2 [ B =f' _E, _E, _e _c ^c M:3/2 K:Cmin

Below are the distributions of the NCD resulting from the six pairings of the three types of sequences: training to training, generated to generated, training to generated, and random to random, training to random, and generated to random. For each, we sampled 50,000 times.


In sky blue is the NCD distribution for tunes drawn from the training corpus; in coral the NCD distribution for generated tunes; in gold khaki is the NCD distribution for random sequences; in mauve is the NCD distribution of real tunes and generated tunes; and in matcha and light periwinkle are the NCD distributions of the real and generated tunes with the random sequences, respectively. We see markedly different distributions. The three NCD of the random sequences and their pairings with real and generated tunes are predictably high. The three NCD distributions of the real tunes and generated tunes have smaller means.

LZW compression works better as strings become longer. For instance, compressing a novel rather than a paragraph will result in a greater percentage of compression. So let’s run the same experiment as above but using multiple transcriptons. Here’s the results when we build each string from a pair of tunes:


The means of each distribution has increased, but their order remains the same. Here’s the same for strings built from 5 tunes:


Note the difference in the y-scale. It seems that the distributions are converging. Here are the results using strings composed of 100 transcriptions.



Several questions now arise:

  1. It looks like we have several different distributions. Why are the distributions what they are?
    • Why does the NCD distribution of paired random sequences (Random-Random) have a smaller mean than those of training or generated sequence paired with random sequences (Training-Random, Generated-Random)? Is it that on average two Random sequences are more “similar” than a Random sequence and a Training or Generated one?
    • Why is the mean of Random-Random larger than the distributions for training and generated tunes (Training-Training, Generated-Generated, Training-Generated)? This is likely caused by the structure present in the Training and Generated transcriptions.
    • Why is Generated-Generated different from Training-Training and Training-Generated? Is there less “diversity” in the Generated corpus?
  2. What does this have to with similarity or typicality as they relate to music? Is there any construct validity to this experiment? Is NCD really any measure of similarity? And what is similarity?
    • If we keep NCD, how might we change our music representation such that the experiment has construct validity in our limited case (Training vs. Generated)?
    • If we keep our ABC-like music representation, is there estimate of the Kolmogorov complexity such that the experiment has construct validity in our limited case (Training vs. Generated)? (I don’t think so.)
    • For what kinds of music/style and representation does the experiment have construct validity using NCD?



MFTOTM August 2018 has been decided

The machine folk tune of the month, with three votes, is good old Folk RNN Tune №1931… which needs a catchier title. Maybe something August themed:

August Reel
The Corn is Nearly Ready
Summer’s Almost Done and Gone
Looking Forward to Cooler Weather

Anyhow, let’s get learning!

Sailing Through The Hoples Mac

Everytime I open up The Endless folk-rnn Traditional Music Session, I find something interesting. Here’s a 9/8 tune generated by the first version of the folk-rnn system, which it has titled, “Sailing Through The Hoples Mac”:

T: Sailing Through The Hoples Mac
M: 9/8
K: Gmaj
|:Bdd d2B A2 A|Bdd d2 B G2A|Bdd d2B d2B|ged BAB G3||
dBd g2g b2d|de2 f2g a2a|bgd B2d e2d|BBB B3 dBA|
dz f3 b3|afd ~a3 b2a|g2B B2g B2c|d2B A2D EFG:||

In this case, folk-rnn has done an ok job with a time signature that is not too common in the training data (only 847 examples out of 23,636).

I have added a setting of this transcription with a few minor changes to The Machine Folk Session.

The Machine Folk Session is Launched!

The Machine Folk Session is now officially launched (another result of our project funded by the UK Arts and Humanities Research Council, grant no. AH/R004706/1: “Engaging three user communities with applications and outcomes of computational music creativity”):

This is a community website dedicated to folk music generated by, or co-created with, machines (and so the designation, “machine folk” — also known as “compufolk“, “data trad”, and “algo dance“). Anyone can browse and search the current collections; but only users can submit tunes and recordings, and build their own tunebooks. Some tunes are crazy, some are a tad boring, and some are great!

We also are running a Tune of the Month, where users get to vote for one of four tunes to work on for the month (any instrument, any level, any interpretation). We then record and share our interpretations and learn from each other.

So get exploring and help the machine folk tradition blossom.


Submit to this Special Issue on Data Science: Machine Learning for Audio Signal Processing

IEEE Journal of Selected Topics in Signal Processing (J-STSP)

Special Issue on Data Science: Machine Learning for Audio Signal Processing

Important Dates:

  • Submission deadline: October 1, 2018
  • 1st review completed: December 1, 2018
  • Revised manuscript due: February 1, 2019
  • 2nd review completed: March 1, 2019
  • Final Manuscript due: April 1, 2019
  • Publication: May, 2019

Slides from 2018 ICML Workshop: Machine Learning for Music

At the 2018 Joint Workshop on Machine Learning for Music at ICML, I delivered my talk “How Stuff Works: LSTM Model of Folk Music Transcriptions.” Here are my slides: Sturm_ICML2018

While I don’t yet have a complete picture of how this folk-rnn model is working, my talk illuminated a bit more about the workings of its first LSTM layer. In particular, we see that the gates of this layer (before nonlinearities) are mapping the input into different subspaces depending on the type of token (made possible by choosing the number of LSTM units in this layer to be approximately four times the number of vocabulary elements). We also find some neat emergent behaviours at the first layer, such as its similar treatment of octave pitches and enharmonics. We also see each gate is using information fused from the other three gates in the layer’s hidden state (from the previous time step). The next challenge that lies ahead now is figuring out how to talk about these things considering the nonlinearities (each a one-to-one mapping). Then we can move to interpret the second and third LSTM layers. And finally we can link this with our understanding of the softmax layer, described in my paper, “What do these 5,599,881 parameters mean? An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of the softmax layer” presented at MuMe 2018. (The slides for that talk are here: Sturm_MuMe2018.)

In general, this workshop was excellent! There was a nice selection of talks and posters addressing problems with music data in either symbolic and acoustic domains, or both tied together. The results of the deep singing voice synthesis work of Gómez et al. and Cheng-Wei et al. are very impressive. Plus Gómez took time to highlight ethical issues surrounding such work, as well as the pursuit of research in general. Also, the generated piano music samples of Huang et al. (symbolic) are simply amazing. Have a listen here. The happy hour concluding the workshop was as intellectually stimulating as the rest of the day! Thanks to Pandora for facilitating this event, and big kudos to the orgainsers, especially Erik Schmidt and José Iñesta for running the event.