Hello, and welcome to Paper of the Day (Po’D): Finding Similar Acoustic Events Using Matching Pursuit Edition. Today’s paper is one of those few that I have read where it feels like I am looking at the back of the book to see the solution to my odd numbered problem: C. Cotton and D. P. W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. IEEE Workshop App. Signal Process. Audio and Acoustics, (Mohonk, NY), pp. 125-128, Oct. 2009. The reason I say this is because I worked on similar problems last year during my postdoc, and I just received word (peer review of my article, “Audio Similarity in a Multiscale and Sparse Domain,” submitted in January) that I must revisit the problem in no minor way — but more on this in the days and weeks to comes.

For reasons that I am not omniscient (Mom always said that it’s no fun being a know-it-all, barring certain deities), I have been completely unaware of this work until yesterday; and of course, when I read the title my stomach sank — but that’s just from my piece of Danish dream cake.

The authors of this paper propose a method for finding occurrences of repeated and perceptually similar sound events, such as clopping horse hooves, within a recording — which grows out of their work in attempting to link together events in audio and visual data.

The authors approach this problem by building features from low-order atomic models found by matching pursuit (MP) with an overcomplete time-frequency (Gabor) dictionary.

For all pairs of atoms with center time differences within some threshold, the authors create a set of three-dimensional feature vectors consisting of the two modulation frequencies, and their center-time difference.

Then the authors assume that within this space there exist clusters of features (or “landmarks” as they call them) corresponding to particular acoustic events, and that we can detect the presence of similar events by comparing the landmarks with the features built from the recordings of unknown events.

(In a similar direction, see this Paper of the Day (Po’D): Environmental Sound Recognition with Time-frequency Features Edition.)

The final aspect of their approach is in pruning the atomic model of redundant atoms. They assume that psychoacoustic masking principles apply to the atomic model, and so remove atoms that are too close in time or frequency to other atoms with higher energies. In their experiments, this sometimes halves the number of atoms in the models.

Since the authors compare the statistics of pairs to pairs, the number of possibilities of course explodes combinatorially. So they employ locality sensitive hashing, which gives me a good opportunity to finally learn what that is.

A hash function maps a large piece of data to a smaller piece of data, e.g., going from \(n\)-dimensional space to the natural numbers,

to ease the process of searching and sorting.

And locality sensitive hashing performs this mapping such that

we have a high probability that two data close together in the \(n\)-dimensional space are mapped to the same “bucket” in the lower-dimensional space.

In the case of the present paper, the authors perform this hashing by projecting each set of vectors onto a set of iid three-dimensional random vectors, and then quantizing and averaging the results.

Then, using the same hash functions and quantization, and projecting the features of the unknown source, we can easily compare the similarity between the signals within the hashed pair space by starting with the largest and most near cluster of atom pairs.

For the training phase in their experiments, they decompose, pair atoms, and hash a sound file having several instances of horse-hoof sounds. They select those atom pairs having at least 20 nearest neighbors (I think this means 20 in the same bucket), which gives 25 “patterns” representing “horse hoof sound.”

Then they decompose the same signal mixed with other interfering sounds at different energies (SNR), pair atoms, hash, and then find the nearest patterns in the training set.

Below we see the precision and recall as a function of SNR and varying by the hash quantization.

As these buckets get wider, we will get more false positives, and thus the precision will drop; but we will find more of the true positives, and so the recall will increase.

Clearly, the choice of the quantization is critical to the performance of the classificator.

Also critical is the interference of the other signals with high correlation in a

time-frequency dictionary. This leads to decreased precision, but the recall appears much less affected.

Below are the sonograms of the original signal, and the original with interference to the tune of -7 dB SNR. Superposed onto each of them in black are the horse hoove patterns, and their detection in the noisy signal. The magenta colored ones are “false positives” … which doesn’t make sense to me because I can see all but one of them are next to the original horse hoove.

*Now switching to critique mode:*

I think this approach is interesting in that it attempts to generalize sparse atomic models to something like a flexible template into which we can try to fit other models to see if they work — kind of like the pegs of different shapes and the box with holes of different shapes. It is not the approach I take in my article-in-writing, but it does offer a contrasting method. However, I am intellectually uncomfortable with a few aspects.

- The first is the idea that the collection of pairs of atoms, and their frequencies and center-time differences, are sufficient for characterizing a signal and facilitating measurements of similarity. If we merely time-reverse the signal, it will have the exact same features in this space (barring the “noise” injected by MP), but it won’t sound at all similar to the original.
- The second is the assumption that MP is finding atoms that are relevant to our perception of the source. Sure, it is using a time-frequency dictionary, but it is working on the time-domain signal, not to mention MP with very coherent dictionaries has a nasty habit of injecting its own biases into data models. A portion of these artifacts might be addressed by the post-processing step of the proposed method, but that brings me to my next problem.
- The third problem is the assumption that psychoacoustic masking principles even apply to an atomic and multiresolution sparse approximation. When we hear the original signal we are not hearing atoms are we? It doesn’t make sense then to get rid of those atoms “that we don’t hear anyway.” It does make sense, when trying to create a template of a sound source, to reduce the redundancy of the model or equivalently to make sure the time-frequency energy representation of the signal is adequately represented. But I don’t see why this should be approached with masking … especially since the source of interest (the sounds we are searching for) may be partially masked to our ears by an interfering source. (Why don’t we want to detect all horse hooves in the noisy signal above?)
- Finally, the fourth problem is the use of the exact same signal in training and testing. I want to know how well these patterns represent the sound of horse hooves.

So what is my one line description of this paper?

We can gauge acoustic similarity using generalized features from sparse parametric time-frequency signal models.