Hello, and welcome to Paper of the Day (Po’D): Minimum Distances in High-Dimensional Musical Feature Spaces Edition. Today’s paper is: M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1015-1028, July 2008.

In the interest of extreme brevity, here is my one-line description of the work in this paper:

With a proper analysis of the distribution of distances in groups of features that are known to be unrelated, and the application of locality sensitive hashing, a range of problems in audio similarity search is essentially nailed in a highly efficient manner (Kudos!).

This paper is at once a presentation of a new method for similarity queries in a large music database using nearest neighbors with locality-sensitive hashing and hypothesis testing, and an analysis of the expected distances between high-dimensional feature vectors relevant for music indexing, search, and retrieval, used to determine a good distance threshold for the search process.

Based on the specific tasks considered here (fingerprinting, cover song identification, and remixed retrieval), the authors construct sequences of feature vectors called *shingles* from 3-second segments of audio data, with a shift of 100 ms between successive segments. The similarity of two tracks (songs) is calculated by counting their matching shingles, where two shingles match if their distance is less than a specified threshold. This threshold is set by estimating from the database the background distance distribution (assumed to be a sum of Gaussians) of the features used; this establishes the conditions under which we can assume, with high probability, that two tracks are unrelated.

The authors use two features to create the shingles: log-frequency cepstral coefficients (LFCC, 20 coefficients) and equal-temperament pitch class profiles (PCP, 12 coefficients). Silence is removed before calculating the features, and each shingle is normalized to unit norm. The LFCC features are used for high-specificity tasks (fingerprinting), while the PCP features are used for mid-specificity tasks (cover song identification and remixed audio).

Each type of feature vector is created from the audio (sampled at 44.1 kHz) by windowing (8192 samples), zero-padding to twice the length, and transforming to the frequency domain with a DFT. To create the PCP features, the magnitudes in the 8193 unique elements of the DFT are each assigned to one of 12 pitch classes, summed within each class, and the logarithm is taken. To create the LFCC features, the DFT magnitudes are summed in equal-temperament (logarithmically spaced) frequency bands, and the DCT is used to reduce the dimensionality. For both types, the window is shifted by 100 ms and the process repeats.
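
The PCP step might be sketched as follows. This is a minimal illustration, not the authors' implementation: the Hann window, the reference frequency `fmin`, and the round-to-nearest-semitone bin-to-pitch-class mapping are my assumptions.

```python
import numpy as np

def pcp_frame(frame, sr=44100, fmin=27.5):
    """Sketch of one PCP frame: window 8192 samples, zero-pad to
    twice the length, DFT, then sum magnitudes into 12 pitch
    classes and take the log."""
    n = len(frame)                                      # e.g. 8192 samples
    windowed = frame * np.hanning(n)
    spectrum = np.abs(np.fft.rfft(windowed, n=2 * n))   # 8193 unique bins
    freqs = np.fft.rfftfreq(2 * n, d=1.0 / sr)
    valid = freqs > 0                                   # skip the DC bin
    # Assumed mapping: nearest semitone above fmin, folded to 12 classes.
    classes = (np.round(12 * np.log2(freqs[valid] / fmin)) % 12).astype(int)
    pcp = np.zeros(12)
    np.add.at(pcp, classes, spectrum[valid])            # sum per pitch class
    return np.log(pcp + 1e-12)                          # log compression
```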

Each shingle is created by concatenating into one long vector the feature vectors computed over 3 sequential seconds of audio; the next shingle is created by shifting this 3-second window by 100 ms. Each LFCC shingle thus has \(M = 600\) dimensions (30 frames over 3 seconds at a 100 ms shift, with 20 coefficients per frame); each PCP shingle has \(M = 360\) dimensions (30 frames with 12 coefficients per frame).
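
The shingling step can be sketched as follows, assuming the per-frame features have already been computed at a 100 ms hop; the function name and array layout are mine.

```python
import numpy as np

def make_shingles(features, frames_per_shingle=30):
    """features: (n_frames, d) array of per-frame feature vectors at a
    100 ms hop. Each shingle concatenates 30 consecutive frames (3 s of
    audio), advancing one frame (100 ms) per shingle, and is normalized
    to unit norm."""
    n, d = features.shape
    shingles = np.stack([
        features[i:i + frames_per_shingle].ravel()
        for i in range(n - frames_per_shingle + 1)
    ])
    norms = np.linalg.norm(shingles, axis=1, keepdims=True)
    return shingles / np.maximum(norms, 1e-12)
```

With 20-dimensional LFCC frames this yields 600-dimensional shingles, and with 12-dimensional PCP frames, 360-dimensional shingles, as in the paper.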

The authors define the distance between two shingles \(\vq \in \mathcal{R}^M\) and \(\va \in \mathcal{R}^M\) as simply their Euclidean distance:

$$
d(\vq, \va) = || \vq - \va ||_2 = \left [ ||\vq||_2^2 + ||\va||_2^2 - 2\vq^T\va \right ]^{1/2} = \left [ 2 - 2\vq^T\va \right ]^{1/2}
$$

where we have used the fact that each shingle has unit norm.
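
For unit-norm shingles this identity gives the distance from a single inner product, which is how one might compute it in practice (a minimal sketch):

```python
import numpy as np

def shingle_distance(q, a):
    """Euclidean distance between two unit-norm shingles via
    d = sqrt(2 - 2 q.a); the clip guards against tiny negative
    values from floating-point round-off when q and a are nearly equal."""
    return np.sqrt(np.clip(2.0 - 2.0 * np.dot(q, a), 0.0, None))
```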

Given the sets of shingles of two tracks, \(\mathcal{Q} = \{ \vq_i \}\) and \(\mathcal{A} = \{ \va_j \}\), the similarity between the tracks is given by the number of shingle pairs lying within a distance threshold \(x_{thresh}\) of each other:

$$
C(\mathcal{Q}, \mathcal{A}) := \#\{ (i, j) : d(\vq_i, \va_j) \le x_{thresh}, \; i \in \{1, \ldots, |\mathcal{Q}|\}, \; j \in \{1, \ldots, |\mathcal{A}|\} \}.
$$
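
This count might be computed in vectorized form as below. Note this is a brute-force sketch over all pairs; the paper's point is precisely that locality-sensitive hashing avoids this full pairwise comparison at scale.

```python
import numpy as np

def track_similarity(Q, A, x_thresh):
    """Count shingle pairs (i, j) with d(q_i, a_j) <= x_thresh.
    Q: (m, M) and A: (n, M) arrays with unit-norm rows, so the full
    distance matrix follows from one matrix product."""
    d = np.sqrt(np.clip(2.0 - 2.0 * Q @ A.T, 0.0, None))
    return int(np.count_nonzero(d <= x_thresh))
```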

The specification of \(x_{thresh}\) is critical to the performance of the query task, and so the authors set it by estimating the distribution of distances between shingles of tracks known to be unrelated (training data). They assume that each element of a feature vector is independently and identically distributed as a Gaussian, so that each shingle is an \(M\)-dimensional i.i.d. Gaussian vector, and thus the PDF of the squared distance above is a (scaled) chi-squared distribution. When a pair of shingles has a distance falling in the left tail of this distribution (below \(x_{thresh}\)), the authors assume the sources are related.
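
As a simple empirical stand-in for this calibration, one could take a low quantile of the background (unrelated-pair) distance distribution rather than the paper's parametric fit; the function name and the false-match rate `alpha` are my assumptions.

```python
import numpy as np

def estimate_threshold(bg_q, bg_a, alpha=0.01):
    """Set x_thresh as the alpha-quantile of distances between shingles
    of tracks known to be unrelated. bg_q: (m, M) and bg_a: (n, M)
    arrays with unit-norm rows drawn from unrelated training tracks."""
    d = np.sqrt(np.clip(2.0 - 2.0 * bg_q @ bg_a.T, 0.0, None))
    return float(np.quantile(d.ravel(), alpha))
```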

The authors test this approach in three tasks: fingerprinting, cover song identification, and remixed retrieval.

In the first, they use a collection of 2741 commercial recordings of Chopin music, among which it is known that 49 recordings were fraudulently created by one pianist by splicing together other artists' recordings and passing the results off as her own, a fraud not uncovered until 2007! The proposed method was able to find all 49 frauds using the LFCC features.

In the cover song retrieval task, they used the same database of piano music, but now looked for versions of the same piece, this time using the PCP features. Their results show that their method has very high precision over a large range of recall (above 90% for recall between 0 and 72%). In their final experiment, the authors attempt to detect remixed versions of a query track, a task whose specificity they position between the two previous experiments. Here they used a collection of 2018 jazz and popular recordings, with the PCP features. Their results show a precision of 75% at 70% recall!