# Comparing Audio Signals in a Sparse Domain

It is push-time for me in the process of revising an article on comparing audio signals in a sparse domain. (Nine days left.) Our main point (at least in today’s version) is that sparse and parametric models of data provide flexible ways of comparing and/or classifying the contents in audio signals — which has been argued before, e.g., just to cite a few:

1. R. Gribonval, E. Bacry, S. Mallat, P. Depalle, and X. Rodet, “Analysis of sound signals with high resolution matching pursuit,” in Proc. IEEE-SP Int. Symp. Time-Freq. Time-Scale Anal., (Paris, France), pp. 125-128, June 1996.
2. K. Umapathy, S. Krishnan, and S. Jimaa, “Multigroup classification of audio signals using time-frequency parameters,” IEEE Trans. Multimedia, vol. 7, pp. 308-315, Apr. 2005.
3. L. Daudet, “Sparse and structured decompositions of signals with the molecular matching pursuit,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, pp. 1808-1816, Sep. 2006.
4. P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008.
5. P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008.
6. E. Ravelli, G. Richard, and L. Daudet, “Fast mir in a sparse transform domain,” in Int. Conf. Music Info. Retrieval, (Philadelphia, PA), Sep. 2008.
7. P.-A. Manzagol, T. Bertine-Mahieux, and D. Eck, “On the use of sparse time-relative auditory codes for music,” in Proc. Int. Soc. Music Information Retrieval, (Philadelphia, PA), Sep. 2008.
8. S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition with time-frequency audio features,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 1142-1158, Aug. 2009.
9. C. Cotton and D. P. W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. IEEE Workshop App. Signal Process. Audio and Acoustics, (Mohonk, NY), pp. 125-128, Oct. 2009.
10. E. Ravelli, G. Richard, and L. Daudet, “Audio signal representations for indexing in the transform domain,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 434-446, Mar. 2010.
11. M. Morvidone, B. L. Sturm, and L. Daudet, “Incorporating scale information with cepstral features: experiments on musical instrument recognition,” Patt. Recgn. Lett., vol. 31, pp. 1489-1497, Sep. 2010.
12. S. Scholler and H. Purwins, “Sparse coding for drum sound classification and its use as a similarity measure,” in Proc. Int. Workshop Machine Learning Music ACM Multimedia, (Firenze, Italy), Oct. 2010.

What is novel about our particular work is that we explore the evaluation of correlations between signals and subsequences in the time domain by comparing elements of their sparse models —
which builds upon the work in P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008.
Our initial thoughts were that the simultaneous description and “compression” of signals through sparse approximation using parametric overcomplete dictionaries can provide extremely efficient ways of searching audio data by looking at pairwise comparisons of their elements.
With regards to the method of Jost et al., however, we are finding that the compressibility of audio in time-frequency dictionaries is not enough to compete with state-of-the-art algorithms for, e.g., exact search (e.g., A. Kimura, K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio signals based on piecewise linear representation of feature trajectories,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 396-407, Feb. 2008.).
Indeed, such a claim of efficient and quick search using the method of Jost et al. is not be feasible for at least two reasons:

1. audio signals, while compressible in time-frequency dictionaries, are not compressible enough, and thus the number of pairwise comparisons necessary grows combinatorially (which can be somewhat addressed by a smart ordering of pairs in terms of importance);
2. the sparse approximation process is extremely expensive.

The experiments made by Jost et al. use small images (128 square) and wavelet decompositions, and thus their results do not completely translate to audio data (high dimensional) decomposed over redundant million-atom time-frequency dictionaries.
In such a case, there is no possibility of tabulating the Gramian for quick lookup of atom pair correlations. And the cost of looking up atoms in million-atom decompositions is expensive as well.

These things being said, however, we do observe some intriguing and attractive behaviors.
Below in (a) we see the magnitude time-domain correlations between sampled individual piano notes (dimension 176,400).
The overtone series is very clear.
In (b) we see each magnitude time-domain correlation between two signals approximated with pairwise comparisons of only ten atoms in each sparse model generated by matching pursuit.
The overtone series is again clear.
Here we select these ten atoms are selected based only on their weights,
but we can also imagine selecting the atoms based on other parameters, such as scale and modulation frequency.
In (c) we see the magnitude time-domain correlations between two different sets of sampled piano notes.
The second set is a diatonic scale played on a piano with tuning “issues”.
These recordings are also poorly made, and have a lot of noise.
In this matrix we can see some sort of progression, but the noise makes comparisons difficult.
In (d) we see approximations of these magnitude time-domain correlations using pairwise comparisons of only three atoms.
Now we can see a somewhat diatonic scale starting around MIDI note number 49-50.
The notes on the detuned piano sit between those of the other piano.

Well, back to revisioning!

2. Hi Igor, In this work we are not working with compressed audio data. (Also, transform coding of audio is not expensive.) We are only trying to take advantage of the adaptivity of sparse approximation in an overcomplete dictionary of multiscale time-frequency atoms to create signal approximations/models that offer features more descriptive than low-level features from, e.g., short-time Fourier transforms, or bags of frames of features (BFFs) calculated without localized considerations (like window size). One of our real problems is to find some solution to $$|| \vx – \MD\vs|| \lt \epsilon$$ where $$\MD$$ contains 2,194,730,297 complex atoms (which almost doubles when we include the conjugates to model a real signal), and $$\vx$$ is of dimension 54,772,200. In such a case, the only feasible approach to find some satisfactory solution $$\vs$$ is a greedy iterative method like matching pursuit (MPTK holla!). The idea of sensing such a large $$\MD$$ first with a random matrix, and then finding a $$\vs$$ through an $$\ell_1$$ minimization of a system of millions of equations, is not an idea for which I am prepared! :)