Hello, and welcome to Paper of the Day (Po’D): Recursive Nearest Neighbor Search in a Sparse Domain Applied to Comparing Audio Signals Edition. Today’s paper is our just-yesterday-submitted revised article: B. Sturm and L. Daudet, “Recursive Nearest Neighbor Search in a Sparse and Multiscale Domain for Comparing Audio Signals,” submitted to Signal Processing, Sep. 2010. And with that revision done, I had my first good night’s rest last night.

For past entries related to this revision, see the past two months:

In the interest of extreme brevity, here is my one line description of the work in this paper:

A previously proposed approach for nearest neighbor search in a sparse domain is not practical for comparing audio signals, but its general framework produces surprising yet sensible results.

The whole problem is like so: given several sparse models of audio signals, how can we find the nearest neighbors of a signal by comparing the elements of their models? Building upon previous work (P. Jost, Algorithmic aspects of sparse approximations. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, June 2007; P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008; B. L. Sturm and L. Daudet, “On similarity search in audio signals using adaptive sparse approximations,” in Proc. Int. Workshop Adaptive Multimedia Retrieval, (Madrid, Spain), Sep. 2009.), we define nearest neighbors as being a small Euclidean distance from a signal in the time domain. Even though that is highly specific, and extremely limiting when talking about comparing audio signals, we find that using only a few pairwise comparisons of atoms we can locate the position of an excerpt of a longer signal.

Above we see in gray in the background the time-domain waveform of a 25 second audio signal. We extracted a small segment (the word “cheese”), decomposed both signals using MP and the 8xMDCT dictionary, and computed pairwise correlations between atoms. The black line at back is the time-domain correlation of each signal, where we can clearly see the location of the excerpt. The first black line in the foreground is the resulting “score” using only one atom from each model for a fine partition of the time domain. We clearly see there is a single spike at the location of the excerpt. And when we increase the number of atoms compared, we begin to find other portions of the signal that have similar time-frequency structures (many of these other portions contain the word “cheese”). We see the same behavior if we seriously distort the excerpt before decomposition with AWGN, and with a signal that has a high coherence with the dictionary. The time domain correlations between these distorted signals show multiple spikes; but comparing pairs of atoms we find a more sparse signal that describes in a hierarchic manner relationships between the extract and the original signal.

This does not mean that we have created robust speech recognition. Rather, it means that we can compare relatively few elements from sparse models to find portions that share the same time-frequency structures, and rank them in terms of this overlap. Because of the specificity of the distance we use, this can be seen more as “fingerprinting” than “similarity search”. However, it is not that far to leap to something more akin to similarity search when we consider comparing the parameters of parametric sparse models.

In the figure above, we are comparing the sparse domain cosine distances between two sets of recorded individual piano notes using only 10 atoms from each model of each signal. These signals are decomposed by MP with a Gabor dictionary of several scales. The rows are a diatonic scale played on a very out-of-tune piano recorded in noisy conditions, and the columns are a chormatic scale played on a piano that is more in tune and recorded in much nicer conditions. In (a) we are summing the correlations of 55 atom pairs in the order of their approximation weights. We can see somewhat of a diatonic scale that is in-between the notes of the other. If we restrict ourselves to only looking at the ten highest-weighted atoms that are longer than 180 ms, then we get (b). We can more clearly see the harmonic relationships between the two sets of signals because we are comparing the tonal aspects of the signals and not also their high-energy and time-localized attacks of each note.