Summer’s Almost Done and Gone

Here’s my contribution to The Machine Folk Session Tune of the Month for August 2018:

With just a few adjustments, I think it turned out to be a nice tune. Here’s the original tune, produced by folkrnn:


To me, the weakest part of this tune is bars 11-12. I felt that repeating the idea of bar 10 with a descending harmony perfectly describes how I feel at the close of summer. A further adjustment was adding first and second endings to each part. Here’s my tune:



Normalised Compression Distances for Comparing Melodies

A fundamental problem we face with a system that can generate an endless number of folk tunes is how to find the interesting ones. In this direction, we have started to look at simple measures of similarity. Since we are dealing with sequences of symbols, a natural first step is to compare those sequences according to their compressibility: how much does knowing one sequence help us describe another sequence with parsimony? In this post we briefly look at the normalised compression distance (NCD) as a measure of similarity (which is at the core of the evaluation approach proposed in J. Ens and P. Pasquier, “A Cross-Domain Analytic Evaluation Methodology for Style Imitation”, in Proc. Int. Conf. Computational Creativity, 2018).

Given two strings x and y, the NCD between them is defined by

D(x,y) = \frac{K(xy)-\min[K(x),K(y)]}{\max(K(x),K(y))}

where K(x) is the Kolmogorov complexity of the string x, and xy is the concatenation of x and y. The Kolmogorov complexity of a string is not computable, but an upper bound can be estimated using a compression algorithm, e.g., the Lempel-Ziv-Welch (LZW) algorithm, or LZ77 followed by Huffman coding (as done in gzip). K(x) expresses the degree to which x can be compressed. So K(xy)-\min[K(x),K(y)] expresses the degree to which either of the two strings helps in compressing the other. In the limit that the two strings are identical, this value will be zero because K(xx) = K(x). In the limit that the two strings are completely unrelated, K(xy) approaches K(x)+K(y), which is at most twice \max(K(x),K(y)). Normalising the numerator by \max(K(x),K(y)) then gives a value close to 1 (though it could actually be slightly greater than 1).
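As a rough illustration, the NCD can be computed with any off-the-shelf compressor standing in for K. Here is a minimal sketch using Python’s zlib (DEFLATE, i.e., LZ77 plus Huffman coding) rather than the LZW compressor used later in this post:

```python
import zlib

def approx_K(s: bytes) -> int:
    # Upper bound on the Kolmogorov complexity: the length of the
    # DEFLATE-compressed string (LZ77 followed by Huffman coding).
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    # D(x,y) = (K(xy) - min(K(x), K(y))) / max(K(x), K(y))
    kx, ky = approx_K(x), approx_K(y)
    return (approx_K(x + y) - min(kx, ky)) / max(kx, ky)

tune = b"M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 |"
print(ncd(tune, tune))  # small, but not exactly 0, for identical strings
```

Because a real compressor has header overhead and imperfect modelling, the self-distance ncd(x, x) is small but not exactly zero, and values slightly above 1 can occur for unrelated strings.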

In our work, we are looking at symbolic transcriptions of folk tunes. Here’s an example transcription of length 109 tokens from our training data:

M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 | G A G F E G c f | 
e c d B G c c 2 :| e 2 c e d e c d | e 2 c e d B G 2 | e 2 c e d e c d | 
e c d B G c c 2 | e 2 c e d e c d | e 2 c e d B G ^F | G A B c d e f d | 
g e f d e c c 2 |

Let’s see how this is compressed by a text compression algorithm. LZW (we are using python-lzw) iteratively builds a codebook from an input sequence. It starts with codewords consisting of single items, and then adds longer sequences the first time it encounters them. For the example above, the codebook looks like this (each pair of square brackets designates a codeword):

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['A'] ['|'] ['_B'] ...
['M:4/4', 'K:Cmaj'] ['K:Cmaj', 'G'] ['G', '3'] ['3', 'F'] ['F', 'E'] 
['E', 'F'] ['F', 'G'] ['G', 'A'] ['A', '|'] ['|', '_B'] ['_B', 'G'] 
['G', 'c'] ['c', 'B'] ['B', 'A'] ['A', 'F'] ['F', 'F'] ['F', '2'] 
['2', '|'] ['|', 'G'] ['G', 'A', 'G'] ['G', 'F'] ['F', 'E', 'G'] 
['G', 'c', 'f'] ... ['g', 'e'] ['e', 'f', 'd'] ['d', 'e', 'c', 'c']

The codewords are listed in the order in which they are created. We see the unique individual elements added first, followed by the pairings “M:4/4 K:Cmaj” and then “K:Cmaj G”, … and then “G A G”, since the pair “G A” had already been seen. Towards the end of building the codebook we see longer sequences, e.g., “d e c c”.
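For concreteness, here is a minimal LZW-style encoder over token sequences. This is an illustrative sketch, not the python-lzw implementation: following the description above, it seeds the codebook with only the single tokens that appear in the input, rather than a full byte alphabet.

```python
def lzw_encode(tokens):
    # Seed the codebook with the unique single tokens, then greedily
    # extend: when the current phrase plus the next token is unseen,
    # emit the phrase's code and add the extended phrase as a codeword.
    codebook = {(t,): i for i, t in enumerate(dict.fromkeys(tokens))}
    out, phrase = [], ()
    for t in tokens:
        if phrase + (t,) in codebook:
            phrase = phrase + (t,)
        else:
            out.append(codebook[phrase])
            codebook[phrase + (t,)] = len(codebook)
            phrase = (t,)
    out.append(codebook[phrase])
    return out, codebook

transcription = (
    "M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 | G A G F E G c f | "
    "e c d B G c c 2 :| e 2 c e d e c d | e 2 c e d B G 2 | e 2 c e d e c d | "
    "e c d B G c c 2 | e 2 c e d e c d | e 2 c e d B G ^F | G A B c d e f d | "
    "g e f d e c c 2 |"
)
tokens = transcription.split()
codes, book = lzw_encode(tokens)
print(len(tokens), len(codes))  # the code sequence is shorter than the input
```

The exact output length can differ slightly from python-lzw’s depending on how the initial codebook is seeded, but the repeated phrases in the transcription make the code sequence shorter than the token sequence either way.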

The LZW encoding of the length-109 transcription above results in a sequence of length 83, which is decoded according to:

['M:4/4'] ['K:Cmaj'] ['G'] ['3'] ['F'] ['E'] ['F'] ['G'] ['A'] ['|'] 
['_B'] ['G'] ['c'] ['B'] ['A'] ['F'] ['F'] ['2'] ['|'] ['G', 'A'] 
['G'] ['F', 'E'] ['G', 'c'] ['f'] ['|'] ['e'] ['c'] ['d'] ['B'] 
['G', 'c'] ['c'] ['2'] [':|'] ... ['e'] ['f'] ['d', '|'] ['g'] 
['e', 'f'] ['d', 'e', 'c'] ['c', '2', '|']

This stream of codewords can be further compressed by allocating bits to codewords according to their frequency, i.e., via Huffman coding. For the transcription above, the symbols “G”, “F”, and “E” will be given the fewest bits since they are the most frequent, while “M:4/4”, “K:Cmaj” and “d e c” will be given the most bits since they are the least frequent. For now we will just stick to computing the NCD using LZW compression.
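A sketch of how Huffman coding assigns code lengths from token frequencies (illustrative only, since the NCD computations below use LZW alone):

```python
import heapq
from collections import Counter

def huffman_code_lengths(tokens):
    # Repeatedly merge the two least-frequent subtrees; every merge adds
    # one bit to the code length of each token in the merged subtrees.
    freq = Counter(tokens)
    heap = [(n, i, {t: 0}) for i, (t, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {t: l + 1 for t, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

tokens = ("M:4/4 K:Cmaj G 3 F E F G A | _B G c B A F F 2 | "
          "G A G F E G c f |").split()
lengths = huffman_code_lengths(tokens)
# frequent tokens like 'G' receive codes no longer than rare ones like 'M:4/4'
```

The returned code lengths satisfy the Kraft equality of a full binary code tree, so an actual prefix code with those lengths always exists.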

Now that we have an idea of how to estimate the Kolmogorov complexity of strings, let’s look at the NCD within and across two corpora: the 23,636 tunes in the training data of the folk-rnn v2 model, and the 60,000 tunes generated by v2 (available in the folk-rnn (v2) Session Books Vols 1-10). We draw a tune from the training corpus, draw a tune from the generated corpus, and sample a random sequence whose length is the maximum of the lengths of the two tunes. We then compute the NCD of each tune with itself (setting x = y), and of each pairing (where x and y come from the two corpora, or one of them is a random sequence). We repeat this 10,000 times and then plot the histograms of the resulting NCDs.
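The sampling loop might be sketched as follows. Since the corpora are not reproduced here, a random-token generator stands in for drawing tunes; in the actual experiment, draw_sequence would pull transcriptions from the training and generated corpora, and the compressor here is zlib rather than LZW.

```python
import random
import statistics
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # NCD with zlib's DEFLATE standing in for the Kolmogorov complexity.
    K = lambda s: len(zlib.compress(s, 9))
    kx, ky = K(x), K(y)
    return (K(x + y) - min(kx, ky)) / max(kx, ky)

# Hypothetical stand-in for drawing a tune from a corpus: a random
# token sequence over a small ABC-like vocabulary.
VOCAB = "A B c d e f g F G 2 3 /2 | :|".split()

def draw_sequence(n: int) -> bytes:
    return " ".join(random.choices(VOCAB, k=n)).encode()

random.seed(0)
ncds = []
for _ in range(200):
    x = draw_sequence(100)  # e.g., a tune from the training corpus
    y = draw_sequence(100)  # e.g., a tune from the generated corpus
    ncds.append(ncd(x, y))
print(statistics.mean(ncds))
```

Plotting a histogram of ncds (e.g., with matplotlib) then gives one of the distributions discussed below.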

By way of examples, here’s a generated tune:

M:3/4 K:Cmaj G, 2 | C 2 E /2 G /2 B /2 d /2 c 2 | B > A G E /2 C /2 D 2 | E 2 
(3 F D B, C > C | C 4 | G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 | 
c > A B G (3 A G E | G 4 G A /2 B /2 | c > B c d e < c | d > B G 3 B /2 d /2 | 
c > A B G (3 A G E | G 4 E > D | C 2 E /2 G /2 B /2 d /2 c 2 | 
B > A G E /2 C /2 D 2 | E 2 (3 F D B, C > C | C 4 |

And here’s a random tune:

B, ^f' ^a M:2/4 | E F a' 12 M:3/2 _C ^c E, =D |1 _B M:6/8 C K:Cmin d' ^G /4 
_e _B e M:6/8 /3 4 =g _D 12 ^A, G =d > 3/4 =C, D _d =E K:Cmaj _E, ^D ^F, C, =G 
=g 3 5 b 9 f' /2< C /3 D =B _E, /2> K:Cmix =a _G ^c' ^f G, ^C, =A, M:3/4 b' =a 
| A, d' ^A, 3/4 =G M:9/8 7 ^F =b _b e M:2/4 |1 _d' /2 e' 7 c ^A _D K:Cmaj a' 
=E (6 3/4 G, > _E, _e' =E, :| 3/4 _D b ^f ^C K:Cmix ^F /2> =C, 16 F C =G, C 
^C, =F, (2 _e' =F K:Cdor /2 e' (3 7/2 5/2 F, _E _e =F, ^A c 2> |2 =b 4 (5 /2> 
7 M:9/8 f _B /2 7/2 K:Cmaj |2 [ B =f' _E, _E, _e _c ^c M:3/2 K:Cmin

Below are the distributions of the NCD resulting from the six pairings of the three types of sequences: training to training, generated to generated, training to generated, random to random, training to random, and generated to random. For each pairing, we sampled 50,000 times.


In sky blue is the NCD distribution for tunes drawn from the training corpus; in coral, the NCD distribution for generated tunes; in khaki, the NCD distribution for random sequences; in mauve, the NCD distribution of real tunes paired with generated tunes; and in matcha and light periwinkle, the NCD distributions of the real and generated tunes paired with the random sequences, respectively. We see markedly different distributions. The three NCD distributions involving random sequences are predictably high. The three NCD distributions involving only real and generated tunes have smaller means.

LZW compression works better as strings become longer. For instance, compressing a novel rather than a paragraph will result in a greater percentage of compression. So let’s run the same experiment as above, but using multiple transcriptions. Here are the results when we build each string from a pair of tunes:


The mean of each distribution has increased, but their order remains the same. Here’s the same for strings built from 5 tunes:


Note the difference in the y-scale. It seems that the distributions are converging. Here are the results using strings composed of 100 transcriptions.



Several questions now arise:

  1. It looks like we have several different distributions. Why are the distributions what they are?
    • Why does the NCD distribution of paired random sequences (Random-Random) have a smaller mean than those of training or generated sequence paired with random sequences (Training-Random, Generated-Random)? Is it that on average two Random sequences are more “similar” than a Random sequence and a Training or Generated one?
    • Why is the mean of Random-Random larger than the means of the distributions for training and generated tunes (Training-Training, Generated-Generated, Training-Generated)? This is likely caused by the structure present in the Training and Generated transcriptions.
    • Why is Generated-Generated different from Training-Training and Training-Generated? Is there less “diversity” in the Generated corpus?
  2. What does this have to do with similarity or typicality as they relate to music? Is there any construct validity to this experiment? Is NCD really any measure of similarity? And what is similarity?
    • If we keep NCD, how might we change our music representation such that the experiment has construct validity in our limited case (Training vs. Generated)?
    • If we keep our ABC-like music representation, is there an estimate of the Kolmogorov complexity such that the experiment has construct validity in our limited case (Training vs. Generated)? (I don’t think so.)
    • For what kinds of music/style and representation does the experiment have construct validity using NCD?


Sailing Through The Hoples Mac

Every time I open up The Endless folk-rnn Traditional Music Session, I find something interesting. Here’s a 9/8 tune generated by the first version of the folk-rnn system, which it has titled “Sailing Through The Hoples Mac”:

T: Sailing Through The Hoples Mac
M: 9/8
K: Gmaj
|:Bdd d2B A2 A|Bdd d2 B G2A|Bdd d2B d2B|ged BAB G3||
dBd g2g b2d|de2 f2g a2a|bgd B2d e2d|BBB B3 dBA|
dz f3 b3|afd ~a3 b2a|g2B B2g B2c|d2B A2D EFG:||

In this case, folk-rnn has done an ok job with a time signature that is not too common in the training data (only 847 examples out of 23,636).

I have added a setting of this transcription with a few minor changes to The Machine Folk Session.

The Machine Folk Session is Launched!

The Machine Folk Session is now officially launched (another result of our project funded by the UK Arts and Humanities Research Council, grant no. AH/R004706/1: “Engaging three user communities with applications and outcomes of computational music creativity”):

This is a community website dedicated to folk music generated by, or co-created with, machines (and so the designation, “machine folk” — also known as “compufolk“, “data trad”, and “algo dance“). Anyone can browse and search the current collections; but only users can submit tunes and recordings, and build their own tunebooks. Some tunes are crazy, some are a tad boring, and some are great!

We also are running a Tune of the Month, where users get to vote for one of four tunes to work on for the month (any instrument, any level, any interpretation). We then record and share our interpretations and learn from each other.

So get exploring and help the machine folk tradition blossom.


Submit to this Special Issue on Data Science: Machine Learning for Audio Signal Processing

IEEE Journal of Selected Topics in Signal Processing (J-STSP)

Special Issue on Data Science: Machine Learning for Audio Signal Processing

Important Dates:

  • Submission deadline: October 1, 2018
  • 1st review completed: December 1, 2018
  • Revised manuscript due: February 1, 2019
  • 2nd review completed: March 1, 2019
  • Final Manuscript due: April 1, 2019
  • Publication: May, 2019