This is the second part of live blogging my analysis of the 365 double jigs collected in “O’Neill’s 1001”. Part 1 is here.
Here is the full similarity matrix with all the corrections I applied yesterday. (I set the values on the diagonal to 0.25 in the plot.)
Below is a histogram of the values we see in the similarity matrix.
The little ticks we see at values greater than 0.6 are those pairs of transcriptions investigated in part 1. The smallest value in this matrix is 0.048, which is the normalized DL similarity of jigs #178 (“Paddy O’Rafferty”) and #327 (“Stay where you are”):
I was thinking that this small number would be due to the comparison of a jig with many parts and one with fewer parts, and here we see a five-part jig and a three-part jig. We also notice that the third part of “Paddy O’Rafferty” has 7 measures instead of 8! It’s another transcription mistake (which motivates making sure all jigs in the collection have the right number of measures). Nonetheless, these jigs are quite different and so do deserve a small similarity score… but I do wonder why the smallest normalized DL similarity doesn’t come from the longest and shortest jigs in the dataset.
Wait a second! I forgot that Python indexes from 0. This means the smallest similarity is actually between jigs #179 (“I do not incline”) and #328 (“Miss Walsh’s fancy”):
Huh. These are quite different, but again not as different as I expected in terms of structure. The normalized DL similarity between “I do not incline” and “Paddy O’Rafferty” is 0.31… which is more than six times larger than the smallest normalized DL similarity we see (0.048). Perhaps my confusion comes from the fact that comparing strings is not like comparing tunes.
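To make the string comparison and the off-by-one slip concrete, here is a minimal sketch. It assumes the "optimal string alignment" variant of the Damerau-Levenshtein distance and a normalization of 1 minus the distance divided by the length of the longer string; the normalization actually used in part 1 may differ. The function names and the toy strings are made up for illustration.

```python
def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_similarity(a, b):
    """ASSUMED normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - dl_distance(a, b) / longest


def least_similar_pair(strings):
    """Return (similarity, tune_i, tune_j) using 1-based tune numbers."""
    best = None
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s = normalized_similarity(strings[i], strings[j])
            if best is None or s < best[0]:
                best = (s, i + 1, j + 1)  # +1 because Python indexes from 0
    return best


# Toy strings, not taken from the collection.
toy = ["DEFG2G", "DEGF2G", "ABcdef|gfedcB|ABcdef|g3"]
print(least_similar_pair(toy))
```

Under this assumed normalization, the similarity of two strings can never exceed the ratio of the shorter length to the longer one, but it can fall all the way to 0 for two strings of equal length that share nothing. That may be one reason the smallest value need not involve the longest and shortest jigs.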
Let’s collapse the similarity matrix to find the mean similarities of each transcription in the collection:
The transcription with the smallest mean DL similarity is #257 (“Morgan Rattler”), which is a very long one… and now I find mistakes in its transcription. Here’s the corrected version:
The tune with the largest mean normalized DL similarity is #114 (“The maid on the green”):
So it seems my intuition is right, but in a limiting sense: the tunes with the smallest normalized DL similarity among all of them are long ones. And the tunes with the greatest normalized DL similarity are short ones. Let’s plot the mean normalized DL similarity of tunes as a function of their lengths in characters to confirm this. Below is a scatter plot for all 365 jigs (numbers correspond to the jigs). We see a clear negative correlation between tune length and mean similarity.
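The collapse to mean similarities and the check against length can be sketched as follows. This is only an illustration on a made-up 4×4 matrix: it assumes the diagonal is excluded from each mean and uses the Pearson coefficient as the measure of correlation, neither of which is stated above.

```python
from math import sqrt


def mean_similarities(sim):
    """Mean of each row of a square similarity matrix, skipping the diagonal."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Made-up data: lengths in characters and a symmetric similarity matrix.
lengths = [120, 200, 310, 450]
sim = [
    [1.00, 0.50, 0.40, 0.30],
    [0.50, 1.00, 0.35, 0.25],
    [0.40, 0.35, 1.00, 0.20],
    [0.30, 0.25, 0.20, 1.00],
]

means = mean_similarities(sim)
r = pearson(lengths, means)
print(means, r)  # a negative r means longer tunes have lower mean similarity
```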
The shortest jig with the smallest mean normalized DL similarity is #164 (“The ladies of Carrick”):
This motivates another approach to analyzing relationships in the dataset. Let us perform multidimensional scaling on the similarity matrix (or rather the dissimilarity matrix), projecting onto two dimensions. This essentially tries to find a way to position the transcriptions on a plane such that their pairwise distances reflect their pairwise dissimilarities. Here’s the result:
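For readers who want to see the mechanics, here is a minimal sketch of one flavor of multidimensional scaling: classical (Torgerson) MDS, written in plain Python with power iteration for the eigenvectors. It is not necessarily the variant used for the plot, and it assumes the two dominant eigenvalues of the double-centered matrix are positive. A dissimilarity could be taken as 1 minus the similarity, which is also an assumption. The toy distances come from the corners of a 3×1 rectangle, so a two-dimensional embedding can reproduce them exactly.

```python
import random
from math import dist, sqrt


def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def double_center(d):
    """B = -1/2 * J * D^2 * J for a symmetric dissimilarity matrix d."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # equals the column means by symmetry
    grand = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
            for i in range(n)]


def top_eigenpair(b, iters=200, seed=0):
    """Dominant eigenvalue and unit eigenvector of b by power iteration."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in b]
    for _ in range(iters):
        w = matvec(b, v)
        norm = sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract from b
            return 0.0, v
        v = [x / norm for x in w]
    value = sum(x * y for x, y in zip(v, matvec(b, v)))  # Rayleigh quotient
    return value, v


def classical_mds(d, dims=2):
    """Coordinates whose Euclidean distances approximate d."""
    b = double_center(d)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, v = top_eigenpair(b)
        scale = sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - value * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


# Toy dissimilarities: corners of a 3-by-1 rectangle.
corners = [(0, 0), (3, 0), (3, 1), (0, 1)]
toy_d = [[dist(p, q) for q in corners] for p in corners]
pts = classical_mds(toy_d)
print(pts)
```

The recovered positions match the original corners only up to rotation and reflection, which is why compass directions in such a plot carry no meaning of their own; only the relative distances do.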
The shade of the labels encodes the length of the transcriptions, with lighter labels being longer transcriptions. It appears that all of the longest transcriptions are at the periphery. Jig #257 is in the southeast corner. Jig #164 (the shortest jig with the smallest mean norm. DL similarity) is on the periphery in the southwest corner.
Closest to the centroid of this mass is jig #134 (“Young Tim Murphy”):
which we saw yesterday has the fifth largest normalized DL similarity in all pairs (with jig #296, the next closest to the center). So, according to the normalized DL similarity, these transcriptions are the most similar to all others in the dataset.
In conclusion, looking at this set of 365 transcriptions through the musically naive approach of edit distances between strings yields a good number of useful observations … including several error detections! Now we must move to be less musically naive. For instance, the edit distance between the transcription strings “DEFG2G” and “DE^FG2G” should be zero if they both occur in D major, since the key signature already sharpens the F. Conversely, “DEFG2G” and “DEFG2G” should not have an edit distance of zero if one is in D major and the other is in D minor. The next complication, then, is to take the mode into account when computing the similarity, which means all pitches will have to be made explicit. After that we will have to consider variations: e.g., “DEFG2G”, “DEFGGG”, and “DEFG3” should also have edit distances of zero because they are equivalent as variations.
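One way these two steps might look is sketched below. It is only an illustration under simplifying assumptions: the key-signature table covers just the two keys mentioned, accidentals are not carried through the rest of a bar, fractional and double-accidental notation is ignored, and a held note is expanded into repeated unit-length notes. The names `KEY_SIGNATURES` and `explicit_pitches` are hypothetical.

```python
import re

# Accidentals implied by the key signature; only the two keys discussed here.
KEY_SIGNATURES = {
    "Dmaj": {"F": "^", "C": "^"},
    "Dmin": {"B": "_"},
}

# Optional accidental, note letter, optional octave marks, optional length.
NOTE = re.compile(r"([_=^]?)([A-Ga-g])([,']*)(\d*)")


def explicit_pitches(abc, key):
    """Turn a simple ABC fragment into a list of unit-length explicit pitches."""
    signature = KEY_SIGNATURES[key]
    out = []
    for accidental, letter, octave, length in NOTE.findall(abc):
        if accidental == "":
            # No written accidental: take it from the key, else natural.
            accidental = signature.get(letter.upper(), "=")
        # Expand the duration so that "G2G", "GGG" and "G3" look alike.
        out.extend([accidental + letter + octave] * int(length or 1))
    return out


print(explicit_pitches("DEFG2G", "Dmaj"))
print(explicit_pitches("DE^FG2G", "Dmaj"))
print(explicit_pitches("DEFG2G", "Dmin"))
```

An edit distance would then be computed over these token lists rather than over raw characters, so that each note counts as one symbol regardless of how it was written.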