Hello, and welcome to Paper of the Day (Po’D): Music Cover Song Identification Edition, pt. 5 (of how many more I don’t know). We continue today with a recent paper: T. E. Ahonen, “Combining chroma features for cover version identification,” in Proc. Int. Soc. Music Info. Retrieval, Utrecht, Netherlands, Aug. 2010.

To evaluate the distance between two pieces of audio, the author proposes using an information theoretic measure, called the normalized compression distance (NCD), on chromagrams. This measure is defined for two strings \(x,y\)

$$N(x,y) := \frac{C\{x \cup y\} - \min\{C\{x\}, C\{y\}\}}{\max\{C\{x\}, C\{y\}\}}$$

where \(C\{x\}\) is the compressed size of \(x\) using a lossless compression algorithm, e.g., Lempel-Ziv with a sufficiently large sliding window, and \(x \cup y\) denotes the concatenation of the two strings.

The NCD approximates the normalized information distance, which is defined in terms of Kolmogorov complexity and is therefore not computable. For the NCD, we can see that when \(y = x\), then \(C\{x \cup y\} \approx C\{x\}\), and thus \(N(x,y) \to 0\), because concatenating two identical strings only minimally increases the compressed size beyond that of the original string. On the other hand, when the two strings share little structure, \(C\{x \cup y\} \to C\{x\} + C\{y\}\), so the numerator approaches \(\max\{C\{x\}, C\{y\}\}\) and thus \(N(x,y) \to 1\).
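This behavior is easy to sketch with an off-the-shelf compressor. Below is a minimal Python illustration using zlib in place of the Lempel-Ziv and PPMZ compressors discussed here; the function names and test strings are my own.

```python
import random
import zlib

def compressed_size(s: bytes) -> int:
    """C{s}: the size of s under a lossless compressor (zlib here,
    standing in for the LZ-style compressors mentioned above)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

x = b"the quick brown fox jumps over the lazy dog " * 50
rng = random.Random(0)
y = bytes(rng.getrandbits(8) for _ in range(len(x)))  # incompressible noise

print(ncd(x, x))  # small: concatenating x with itself adds little
print(ncd(x, y))  # near 1: the strings share almost no structure
```

Note that the NCD is only approximately a metric in practice, since real compressors fall short of the Kolmogorov-complexity ideal.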

The author finds for each song a chromagram using windows of 186 ms with a hop size of 162 ms, uses no tempo estimation or beat averaging, and transposes the chroma sequences using the optimal transposition index method of Serrà et al. From these he creates several different features. First, he learns 12-state and 24-state hidden Markov models, yielding 12- and 24-chord label sequences. Second, he computes the Manhattan distance between successive chroma vectors, which is then discretized. Finally, he takes the strongest pitch class in each chroma frame. He then compresses each of these three feature sequences (I think) with a prediction by partial matching compression algorithm (PPMZ), which uses Markov prediction. The distance between two songs is then either the mean or the median of the NCDs across the feature sets.
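Two of these feature streams are simple enough to sketch. The Python below is a hypothetical illustration, not the paper's implementation: the quantizer and the median combination rule are my own simplifications, and zlib again stands in for PPMZ.

```python
import statistics
import zlib

def compressed_size(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

def strongest_pitch_class(chromagram):
    """One feature stream: the index (0-11) of the strongest pitch
    class in each 12-bin chroma frame, serialized to bytes."""
    return bytes(max(range(12), key=frame.__getitem__) for frame in chromagram)

def quantized_frame_distances(chromagram, n_levels=8):
    """Another stream: Manhattan distance between successive chroma
    frames, discretized to n_levels symbols (a hypothetical quantizer;
    the paper's exact discretization is not reproduced here)."""
    dists = [sum(abs(a - b) for a, b in zip(f0, f1))
             for f0, f1 in zip(chromagram, chromagram[1:])]
    hi = max(dists) or 1.0
    return bytes(min(n_levels - 1, int(n_levels * d / hi)) for d in dists)

def song_distance(chroma_x, chroma_y):
    """Combine per-feature NCDs with the median, one of the two
    combination rules (the other being the mean)."""
    features = [strongest_pitch_class, quantized_frame_distances]
    return statistics.median(ncd(f(chroma_x), f(chroma_y)) for f in features)
```

A chromagram here is just a list of 12-element lists of floats; the HMM-based chord streams would slot in as a third feature function.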

In his experiments, the author assembled a new dataset with six versions each of 25 songs, plus 600 pieces having no other versions. With all features combined, his approach has a mean average precision of 0.41, which is a little better than the approach of Ellis et al. (2007). The author remarks that live versions of songs are much more identifiable than remixes.
