After having reproduced the log-frequency cepstral coefficient and pitch-class profile shingle extraction technique of Casey et al. 2008 (M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1015-1028, July 2008, Po’D here), it is time for me to understand the derivation of their decision criteria for audio shingles being significantly close in a Euclidean sense.

Say we have sets of unit-norm shingles for each song. We are interested in the number of shingles across songs that are close to each other in a Euclidean sense, which is equivalent to being separated by a small angle since all features exist on the unit ball.

Let’s take two songs and make each shingle of each song a column of matrices \(\MX, \MY\). These matrices have the same number of rows, but possibly different numbers of columns (depending on the lengths of the songs).

The Euclidean distances of all possible shingle pairs is given by \(\MS = \mathbf{2} – 2\MX^T\MY\), where \(\mathbf{2}\) is just a big matrix of 2s.

So we are interested in the number of entries of \(\MS\) that are smaller than \(\delta\)

with \(\delta > 0\).

(Casey et al. 2008 count the number of rows or number of columns of \(\MS\) that have at least one element less than \(\delta\). Shouldn’t we normalize by by the cardinality of the sets?) The problem is how to choose \(\delta\) such that we can reduce the number of false negatives and false positives. This appears to be the “fun” part.

**Assumption #1**

Every element of a shingle is distributed zero-mean Gaussian.

It is known (and celebrated) that the squared Euclidean distance between two \(M\)-dimensional standard normal vectors (\(\sim \mathcal{N}(\mathbf{0}, \MI)\)) is a random variable \(R\) distributed chi-squared:

$$f_R(r; d) := \frac{1}{2^{d/2} \Gamma(d/2)} r^{d/2-1} e^{-r/2} \mu(r)$$

where \(\mu(r)\) is the step function, and \(d \in \mathcal{Z}\) is a number of independent degrees of freedom. The Gamma function also makes an appearance. If the elements of the random vectors are all distributed independently (as they are with iid normal vectors), then \(d=M\). If instead all the components consist of the same random variable, then \(d=1\).

Casey et al. 2008 now consider two extreme scenarios for

- (Complete independence of feature dimensions) If each \(M\)-dimensional random vector \(\sim \mathcal{N}(\mathbf{0},\sigma^2\MI)\), then the distance between the pair is a random variable \(S \sim 1/\sigma^2 f_R(s/\sigma^2; M) \).
- (Complete dependence of feature dimensions) If each \(M\)-dimensional random vector is a replication of a single random variable \(\sim \mathcal{N}(0,\sigma^2)\),

then the distance between the pair is a random variable \(S \sim 1/M\sigma^2 f_R(s/M\sigma^2; 1) \).

The differences between these is in how independent the feature dimensions are. In the pitch-class profile features of dimension 12, for instance, there will be a rather large dependence between harmonically related dimensions. The cepstral coefficient feature dimensions will be less correlated.

**Assumption #2**

Given

*shingles of unrelated songs*(i.e., the random vectors are distributed independently), the squared distance between each pair is a random variable distributed between these two extremes. In other words, \(S \sim d/M\sigma^2 f_R(rd/M\sigma^2; d) \) with \(1 \le d \le M\) being the “effective dimension”. In other other words, the probability distribution function of the squared distance is

$$f_S(s; d, \sigma^2) := \left ( \frac{d}{2M\sigma^2} \right )^{d/2} \frac{1}{\Gamma(d/2)} s^{d/2-1} e^{-sd/2M\sigma^2} \mu(s)$$

(What about the fact that our features live on the unit ball and thus \(0 \le S \le 2\)?)

The problem is now to estimate \(d\) and \(\sigma^2\), which can be done by using maximum likelihood and a set of \(N\) shingle pairs, as long as

**Assumption #3**

The squared distances of a set of shingle pairs (not just one shingle pair, but all those in \(\MX, \MY\)) are independent and identically distributed.

Thus, the likelihood of parameters \((d,\sigma^2)\) given \(N\) pairs of iid shingles \(\{(\vx_i, \vy_i) : i=1, \ldots, N\}\) is just a product of the \(N\) individual distributions:

$$\begin{align}

L(d,\sigma^2 | \{ s_i = ||\vx_i – \vy_i||^2 : i=1, \ldots, N\}) & = \prod_{i=1}^N f_S(s_i; d, \sigma^2) \\

& = \left ( \frac{d}{2M\sigma^2} \right )^{Nd/2} \frac{1}{\Gamma^N(d/2)} \prod_{i=1}^N s_i^{d/2-1} e^{-s_i d/2M\sigma^2}

\end{align}$$

and from this we can find the maximum likelihood estimates of \((d,\sigma^2)\) with partial differentiation:

$$\begin{align}

\hat{\sigma}^2 & = \frac{1}{M} \frac{1}{N}\sum s_i \\

\log\frac{\hat d}{2} -\psi\left(\frac{\hat d}{2}\right) & = \log\left [ \frac{1}{N} \sum s_i \right ] -\frac{1}{N} \sum \log s_i

\end{align}$$

where \(\psi()\) is a digamma function! Now given some pairs of shingles from unrelated songs, we can estimate the variance and effective dimension of the probability distribution function modeling their distances.

Tomorrow I continue, but the road looks a bit precarious with more Gamma functions, overbars, lots of approximations, and the appearance of the Euler-Mascheroni constant … which I have seen before!