Normalised Compression Distances for Comparing Melodies

A fundamental problem we face with a system that can generate an endless number of folk tunes is how to find the interesting ones. In this direction, we have started to look at simple measures of similarity. Since we are dealing with sequences of symbols, a natural first step is to think about comparing those sequences according to their compressibility: how much does knowing one sequence help us describe another sequence with parsimony? In this post we briefly look at the normalised compression distance (NCD) as some measure of similarity (which is at the core of the evaluation approach proposed in J. Ens and P. Pasquier, “A Cross-Domain Analytic Evaluation Methodology for Style Imitation“, in Proc. Int. Conf. Computational Creativity, 2018).

Given two strings $x$ and $y$, the NCD between them is defined by

$D(x,y) = \frac{K(xy)-\min[K(x),K(y)]}{\max(K(x),K(y))}$

where $K(x)$ is the Kolmogorov complexity of the string $x$, and $xy$ is the concatenation of $x$ and $y$. The Kolmogorov complexity of any string is not computable, but an upper bound can be estimated via using a compression algorithm, e.g., using the Lempel-Ziv-Welch algorithm followed by Huffman coding (as done in gzip). $K(x)$ expresses the degree to which $x$ can be compressed. So $K(xy)-\min[K(x),K(y)]$ expresses the degree to which either of the two strings helps in compressing the other. In the limit that the two strings are the same, this value will be zero because $K(xx) = K(x)$. In the limit that the two strings are completely different, $K(xy)$ will grow to be no longer than twice $\max(K(x),K(y))$. Normalising the numerator with $\max(K(x),K(y))$ then will give a value close to $1$ (but could actually be greater than 1).

In our work, we are looking at symbolic transcriptions of folk tunes. Here’s an example transcription of length 109 tokens from our training data:

M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 | G A G F E G c f |
e c d B G c c 2 :| e 2 c e d e c d | e 2 c e d B G 2 | e 2 c e d e c d |
e c d B G c c 2 | e 2 c e d e c d | e 2 c e d B G ^F | G A B c d e f d |
g e f d e c c 2 |


Let’s see how this is compressed by a text compression algorithm. LZW (we are using python-lzw) iteratively builds a codebook from an input sequence. It starts with codewords consisting of single items, and then adds longer sequences that it finds are unique. For the example above, this codebook of symbols looks like (each pair of square braces designates a codeword):

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['A'] ['|'] ['_B'] ...
['M:4/4', 'K:Cmaj'] ['K:Cmaj', 'G'] ['G', '3'] ['3', 'F'] ['F', 'E']
['E', 'F'] ['F', 'G'] ['G', 'A'] ['A', '|'] ['|', '_B'] ['_B', 'G']
['G', 'c'] ['c', 'B'] ['B', 'A'] ['A', 'F'] ['F', 'F'] ['F', '2']
['2', '|'] ['|', 'G'] ['G', 'A', 'G'] ['G', 'F'] ['F', 'E', 'G']
['G', 'c', 'f'] ... ['g', 'e'] ['e', 'f', 'd'] ['d', 'e', 'c', 'c']


This is ordered in the sequence that the codebook is created. Here we see unique individual elements are added first, followed by pairings “M:4/4 K:Cmaj” and then “K:Cmaj G”, … and then “G A G” since the pair “G A” had already been seen. Towards the end of building the codebook we see longer sequences, e.g., “d e c c”.

The LZW encoding of the length-109 transcription above results in a sequence of length 83 , which is decoded according to:

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['F'] ['G'] ['A'] ['|']
['_B'] ['G'] ['c'] ['B'] ['A'] ['F'] ['F'] ['2'] ['|'] ['G', 'A']
['G'] ['F', 'E'] ['G', 'c'] ['f'] ['|'] ['e'] ['c'] ['d'] ['B']
['G', 'c'] ['c'] ['2'] [':|'] ... ['e'] ['f'] ['d', '|'] ['g']
['e', 'f'] ['d', 'e', 'c'] ['c', '2', '|']


This stream of bytes can be further compressed by allocating bytes to codewords depending on their frequency, i.e., via Huffman coding. For the transcription above, the symbols “G”, “F”, and “E” will be given the fewest bits since they are the most frequent, while “M:4/4”,  “K:Cmaj” and “d e c” will be given the most bits since they are the least frequent. For now we will just stick to computing the NCD using LZW compression.

Now that we have an idea of how to estimate the Kolmogorov complexity of strings, let’s look at the NCD within and across two corpora: the 23,636 tunes in the training data of the folk-rnn v2 model; the 60,000 tunes generated by v2 (available in the folk-rnn (v2) Session Books Vols 1-10). We draw a tune from the training corpus, a tune from the generated corpus, and sample a random sequence having a length that is the maximum of the two tunes. We then compute the NCD of each tune with itself (setting $x = y$), and paired with the others (where $x$ and $y$ come from the two corpora, or is a random sequence). We repeat this 10,000 times and then plot the histogram of the resulting NCD.

By way of examples, here’s a generated tune:

M:3/4 K:Cmaj G, 2 | C 2 E /2 G /2 B /2 d /2 c 2 | B > A G E /2 C /2 D 2 | E 2
(3 F D B, C > C | C 4 | G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 |
c > A B G (3 A G E | G 4 G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 |
c > A B G (3 A G E | G 4 E > D | C 2 E /2 G /2 B /2 d /2 c 2 |
B > A G E /2 C /2 D 2 | E 2 (3 F D B, C > C | C 4 |


And here’s a random tune:

B, ^f' ^a M:2/4 | E F a' 12 M:3/2 _C ^c E, =D |1 _B M:6/8 C K:Cmin d' ^G /4
_e _B e M:6/8 /3 4 =g _D 12 ^A, G =d > 3/4 =C, D _d =E K:Cmaj _E, ^D ^F, C, =G
=g 3 5 b 9 f' /2< C /3 D =B _E, /2> K:Cmix =a _G ^c' ^f G, ^C, =A, M:3/4 b' =a
| A, d' ^A, 3/4 =G M:9/8 7 ^F =b _b e M:2/4 |1 _d' /2 e' 7 c ^A _D K:Cmaj a'
=E (6 3/4 G, > _E, _e' =E, :| 3/4 _D b ^f ^C K:Cmix ^F /2> =C, 16 F C =G, C
^C, =F, (2 _e' =F K:Cdor /2 e' (3 7/2 5/2 F, _E _e =F, ^A c 2> |2 =b 4 (5 /2>
7 M:9/8 f _B /2 7/2 K:Cmaj |2 [ B =f' _E, _E, _e _c ^c M:3/2 K:Cmin


Below are the distributions of the NCD resulting from the six pairings of the three types of sequences: training to training, generated to generated, training to generated, and random to random, training to random, and generated to random. For each, we sampled 50,000 times.

In sky blue is the NCD distribution for tunes drawn from the training corpus; in coral the NCD distribution for generated tunes; in gold khaki is the NCD distribution for random sequences; in mauve is the NCD distribution of real tunes and generated tunes; and in matcha and light periwinkle are the NCD distributions of the real and generated tunes with the random sequences, respectively. We see markedly different distributions. The three NCD of the random sequences and their pairings with real and generated tunes are predictably high. The three NCD distributions of the real tunes and generated tunes have smaller means.

LZW compression works better as strings become longer. For instance, compressing a novel rather than a paragraph will result in a greater percentage of compression. So let’s run the same experiment as above but using multiple transcriptons. Here’s the results when we build each string from a pair of tunes:

The means of each distribution has increased, but their order remains the same. Here’s the same for strings built from 5 tunes:

Note the difference in the y-scale. It seems that the distributions are converging. Here are the results using strings composed of 100 transcriptions.

Curious!

Several questions now arise:

1. It looks like we have several different distributions. Why are the distributions what they are?
• Why does the NCD distribution of paired random sequences (Random-Random) have a smaller mean than those of training or generated sequence paired with random sequences (Training-Random, Generated-Random)? Is it that on average two Random sequences are more “similar” than a Random sequence and a Training or Generated one?
• Why is the mean of Random-Random larger than the distributions for training and generated tunes (Training-Training, Generated-Generated, Training-Generated)? This is likely caused by the structure present in the Training and Generated transcriptions.
• Why is Generated-Generated different from Training-Training and Training-Generated? Is there less “diversity” in the Generated corpus?
2. What does this have to with similarity or typicality as they relate to music? Is there any construct validity to this experiment? Is NCD really any measure of similarity? And what is similarity?
• If we keep NCD, how might we change our music representation such that the experiment has construct validity in our limited case (Training vs. Generated)?
• If we keep our ABC-like music representation, is there estimate of the Kolmogorov complexity such that the experiment has construct validity in our limited case (Training vs. Generated)? (I don’t think so.)
• For what kinds of music/style and representation does the experiment have construct validity using NCD?