Continuing from yesterday,

Casey et al. 2008 wish to find the probability distribution of the minimum among the squared distances of \(N\) shingle pairs from unrelated songs.

With this we can say two songs are related when the minimum squared distance among their shingle pairs is inferior.

**Assumption #4**

The cumulative distribution function of the squared distance of a pair of shingles from unrelated songs is distributed like

$$F_S(s; d, \sigma^2) := \frac{2}{d} \left ( \frac{d}{2M\sigma^2} \right )^{d/2} \frac{1}{\Gamma(d/2)} s^{d/2} \mu(s)$$

for small \(s\).

Essentially, this is saying that the pdf is a power law for the minimum squared distances in which we are interested.

Now, given \(N\) shingle pairs, the probability that all their squared distances are greater than some \(s_{\min}\) is

$$\begin{align}

P(s \ge s_{\min}) & = \left [ 1 – F_S(s_{\min}; d, \sigma^2) \right ]^N \\

& = \left [ 1 – C(d,\sigma^2) s_{\min}^{d/2} \right ]^N

\end{align}$$

since we are assuming the pairs are iid,

and where

$$C(d,\sigma^2) = \left ( \frac{d}{2M\sigma^2} \right )^{d/2} \frac{1}{\Gamma(d/2)}.$$

Now, if we very slyly make the following change of variables

$$s_{\min} = w_{\min}/\left [ C(d,\sigma^2)N\right ]^{2/d}$$

then a wonderful thing happens:

$$P(s \ge w_{\min}) = \left [ 1 – w_{\min}^{d/2}/N \right ]^N.$$

**Assumption #5**

\(N\) is large, i.e., we have many pairs of shingles from unrelated songs.

With this assumption, we can say

$$\lim_{N \to \infty} P(s \ge w_{\min}) = \exp(-w_{\min}^{d/2})$$

and finally,

$$\lim_{N \to \infty} P(s < s_{\min}) = 1 – \exp(-C(d,\sigma^2) N s_{\min}^{d/2}).$$

Now, it is with this that we can find a \(s_{\min}\) such that the probability of finding shingle pairs with smaller squared distance is some value.

Casey et al. 2008 find the \(s_{\min}\) such that the false positive rate (for shingles pairs being within a certain squared distance, and not songs being similar) is \(f_p := P(s < s_{\min}) = 0.01\). This becomes the threshold, which is

$$s_\textrm{thresh}(f_p; d,\sigma^2) = \left [ -\frac{\Gamma(d/2)}{N}\ln(1-f_p) \right ]^{2/d} \frac{2M\sigma^2}{d}.$$

- Take \(N\) pairs of shingles (each with dimension \(M\)) from a query and some unrelated songs, the more the better.
- Estimate the variance \(\sigma^2\) and effective dimension \(d\) of the chi-squared distribution modeling the squared distances:

$$\begin{align}

\hat{\sigma}^2 & = \frac{1}{M} \frac{1}{N}\sum s_i \\

\log\frac{\hat d}{2} -\psi\left(\frac{\hat d}{2}\right) & = \log\left [ \frac{1}{N} \sum s_i \right ] -\frac{1}{N} \sum \log s_i

\end{align}$$

where \(\psi()\) is defined for integers and half integers:

$$\psi\left(n\right) = \begin{cases}

– \gamma + \sum_{k=1}^{n-1} \frac{1}{k}, & n \in \mathcal{N} \\

– \gamma – 2 \left [\ln 2 – \sum_{k=1}^n \frac{1}{2k-1} \right ], & n+\frac{1}{2} \in \mathcal{N}

\end{cases}$$

where \(\gamma\) is the Euler-Mascheroni constant

(solve for \(\hat d \in \mathcal{N}\) numerically). - For a given false positive rate \(f_p\), find the squared distance threshold $$s_\textrm{thresh}(f_p; d,\sigma^2) = \left [ – \, \frac{\Gamma(d/2)}{N}\ln(1-f_p) \right ]^{2/d} \frac{2M\sigma^2}{d}.$$
- For the shingles of songs with unknown relationship to the query, find those with squared distances less than this threshold and count as a match.
- That is all.