# Digging deeper into minimum distances in high-dimensional musical spaces, pt. 2

Continuing from yesterday,
Casey et al. 2008 wish to find the probability distribution of the minimum among the squared distances of $$N$$ shingle pairs from unrelated songs.
With this we can say two songs are related when the minimum squared distance among their shingle pairs is inferior.

Assumption #4
The cumulative distribution function of the squared distance of a pair of shingles from unrelated songs is distributed like
$$F_S(s; d, \sigma^2) := \frac{2}{d} \left ( \frac{d}{2M\sigma^2} \right )^{d/2} \frac{1}{\Gamma(d/2)} s^{d/2} \mu(s)$$
for small $$s$$.
Essentially, this is saying that the pdf is a power law for the minimum squared distances in which we are interested.

Now, given $$N$$ shingle pairs, the probability that all their squared distances are greater than some $$s_{\min}$$ is
\begin{align} P(s \ge s_{\min}) & = \left [ 1 – F_S(s_{\min}; d, \sigma^2) \right ]^N \\ & = \left [ 1 – C(d,\sigma^2) s_{\min}^{d/2} \right ]^N \end{align}
since we are assuming the pairs are iid,
and where
$$C(d,\sigma^2) = \left ( \frac{d}{2M\sigma^2} \right )^{d/2} \frac{1}{\Gamma(d/2)}.$$
Now, if we very slyly make the following change of variables
$$s_{\min} = w_{\min}/\left [ C(d,\sigma^2)N\right ]^{2/d}$$
then a wonderful thing happens:
$$P(s \ge w_{\min}) = \left [ 1 – w_{\min}^{d/2}/N \right ]^N.$$
Assumption #5
$$N$$ is large, i.e., we have many pairs of shingles from unrelated songs.

With this assumption, we can say
$$\lim_{N \to \infty} P(s \ge w_{\min}) = \exp(-w_{\min}^{d/2})$$
and finally,
$$\lim_{N \to \infty} P(s < s_{\min}) = 1 – \exp(-C(d,\sigma^2) N s_{\min}^{d/2}).$$
Now, it is with this that we can find a $$s_{\min}$$ such that the probability of finding shingle pairs with smaller squared distance is some value.
Casey et al. 2008 find the $$s_{\min}$$ such that the false positive rate (for shingles pairs being within a certain squared distance, and not songs being similar) is $$f_p := P(s < s_{\min}) = 0.01$$. This becomes the threshold, which is
$$s_\textrm{thresh}(f_p; d,\sigma^2) = \left [ -\frac{\Gamma(d/2)}{N}\ln(1-f_p) \right ]^{2/d} \frac{2M\sigma^2}{d}.$$

So the procedure is very clear to me now:

1. Take $$N$$ pairs of shingles (each with dimension $$M$$) from a query and some unrelated songs, the more the better.
2. Estimate the variance $$\sigma^2$$ and effective dimension $$d$$ of the chi-squared distribution modeling the squared distances:
\begin{align} \hat{\sigma}^2 & = \frac{1}{M} \frac{1}{N}\sum s_i \\ \log\frac{\hat d}{2} -\psi\left(\frac{\hat d}{2}\right) & = \log\left [ \frac{1}{N} \sum s_i \right ] -\frac{1}{N} \sum \log s_i \end{align}
where $$\psi()$$ is defined for integers and half integers:
$$\psi\left(n\right) = \begin{cases} – \gamma + \sum_{k=1}^{n-1} \frac{1}{k}, & n \in \mathcal{N} \\ – \gamma – 2 \left [\ln 2 – \sum_{k=1}^n \frac{1}{2k-1} \right ], & n+\frac{1}{2} \in \mathcal{N} \end{cases}$$
where $$\gamma$$ is the Euler-Mascheroni constant
(solve for $$\hat d \in \mathcal{N}$$ numerically).
3. For a given false positive rate $$f_p$$, find the squared distance threshold $$s_\textrm{thresh}(f_p; d,\sigma^2) = \left [ – \, \frac{\Gamma(d/2)}{N}\ln(1-f_p) \right ]^{2/d} \frac{2M\sigma^2}{d}.$$
4. For the shingles of songs with unknown relationship to the query, find those with squared distances less than this threshold and count as a match.
5. That is all.