This morning I attended the séminaire “Méthodes Mathématiques du Traitement d’Images” (Mathematical Methods in Image Processing) at the Laboratoire Jacques-Louis Lions at Paris 6, where Prof Laurent Daudet presented a talk titled “Compression et indexation échelonnables de la musique” (Scalable compression and indexing of music). Prof Daudet has produced much fine work in sparse representations of audio signals, and is an authority in this area.
Today Prof Daudet discussed the problems inherent in producing representations and models that richly describe the characteristics of audio signals: finding “the music subspace,” as he called it. Some approaches, such as time-frequency transforms, come with a trade-off between time and frequency resolution; sparse approximation with an overcomplete dictionary is capable of “overcoming” such a trade-off, though in an artificial way. (The “uncertainty principle” is not broken, but sparse approximation is not tied to a single time-frequency resolution.) Daudet presented three attempts at using sparse approximation with overcomplete dictionaries to create rich representations of audio signals that are useful in a number of ways, including source detection and separation in polyphonic music, simultaneous audio coding and description, and efficient and hierarchical similarity search in audio signals.
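To make the idea of not being tied to one resolution concrete, here is a minimal sketch of plain matching pursuit over a toy overcomplete dictionary. The dictionary is my own assumption for illustration (a union of a Dirac basis, which is perfectly localized in time, and an orthonormal DCT-II basis, which is perfectly localized in frequency); it is not one of the dictionaries from the talk, and the helper names `dct_basis` and `matching_pursuit` are hypothetical.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis; each column is one unit-norm atom."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    B = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    B[0, :] /= np.sqrt(2.0)            # rescale the DC atom to unit norm
    return B.T

def matching_pursuit(x, D, n_iter):
    """Greedy matching pursuit; D must have unit-norm columns."""
    residual = x.astype(float).copy()
    picks = []                          # list of (atom index, weight)
    for _ in range(n_iter):
        corr = D.T @ residual           # correlate residual with every atom
        j = int(np.argmax(np.abs(corr)))
        picks.append((j, corr[j]))
        residual -= corr[j] * D[:, j]   # remove the selected atom's contribution
    return picks, residual

n = 64
D = np.hstack([np.eye(n), dct_basis(n)])      # n x 2n: Diracs, then cosines

# Toy signal (an assumption): one click at sample 10 plus one cosine (DCT atom 5).
x = 3.0 * D[:, 10] + 2.0 * D[:, n + 5]

picks, residual = matching_pursuit(x, D, n_iter=10)
print("first two atoms selected:", [j for j, _ in picks[:2]])
print("residual norm:", np.linalg.norm(residual))
```

Neither basis alone describes this signal with two coefficients, but the union does: the pursuit picks the click from the Dirac part and the cosine from the DCT part. That is the “artificial” sense in which the resolution trade-off is sidestepped.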
The first example Daudet discussed is the “mid-level” representation of polyphonic music audio signals created by sparse approximation with harmonic and chirped time-frequency dictionaries learned from the steady states of isolated instrument sources. This work is well documented in: P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. Leveau et al. show that matching pursuit with such a trained dictionary can perform a rough automatic transcription of a polyphonic musical signal. Instead of individual atoms, the decomposition yields a set of molecules, each pertaining to a specific instrument, and the result resembles a piano-roll notation.
The second example Daudet presented is the simultaneous compression and description of audio signals using matching pursuit with an overcomplete dictionary consisting of a union of MDCT bases at eight different scales. This work is well documented in: E. Ravelli, G. Richard, and L. Daudet, “Union of MDCT bases for audio coding,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1361-1372, Nov. 2008. Ravelli et al. show that adapting a signal representation using a multiscale dictionary and greedy matching pursuit can improve on the quality of a parametric signal model or a transform coder at low bit rates. At high bit rates, however, it performs similarly to a transform coder, but with a humongous computational cost (decomposition and compression take 60 to 100 times the signal duration). What is interesting about this approach is that it creates a “content-aware representation,” as Daudet called it. With the signal so described in a sparse domain, we can determine higher-level characteristics such as chords, genre, and rhythm. This is the subject of the article: E. Ravelli, G. Richard, and L. Daudet, “Audio signal representations for indexing in the transform domain,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 434-446, Mar. 2010. Daudet argued that this all leads to “semantic audio coding.” Now that audio coding is a solved problem, we are moving toward better acquisition systems that provide perfect or versatile descriptions of audio signals to facilitate useful search and retrieval of data.
The final example Daudet presented was our work on efficient and hierarchical similarity search in audio signals using a sparse-domain approach. In this work we show how sparse and multiresolution models of signals can be used to search for similar content at various levels of detail. This moves us from correlating waveforms to correlating larger-scale structures in signals by comparing pairs of atoms, starting with the “most significant” as determined by a sparse approximation. This work is part of a recent submission (which can be made available to those interested): B. L. Sturm and L. Daudet, “Audio similarity in a multiscale and sparse domain,” Signal Process. (submitted), Dec. 2009.
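As a rough illustration of comparing pairs of atoms rather than waveforms, the sketch below approximates the inner product of two signals from only their largest-magnitude atoms. If x ≈ Σ a_i g_i and y ≈ Σ b_j h_j, then ⟨x, y⟩ ≈ Σ_i Σ_j a_i b_j ⟨g_i, h_j⟩, and the sum can be truncated to the most significant terms. This is a simplification under my own assumptions (a random unit-norm dictionary, hand-picked sparse codes, and a hypothetical helper `partial_similarity`); it is not the algorithm from the submitted paper.

```python
import numpy as np

def partial_similarity(coef_x, atoms_x, coef_y, atoms_y, m):
    """Approximate <x, y> using the m largest-magnitude atoms of each signal."""
    ix = np.argsort(-np.abs(coef_x))[:m]        # most significant atoms of x
    iy = np.argsort(-np.abs(coef_y))[:m]        # most significant atoms of y
    gram = atoms_x[:, ix].T @ atoms_y[:, iy]    # inner products of atom pairs
    return coef_x[ix] @ gram @ coef_y[iy]

# Toy dictionary (an assumption): random atoms normalized to unit norm.
rng = np.random.default_rng(0)
D = rng.standard_normal((128, 256))
D /= np.linalg.norm(D, axis=0)

# Two hand-made sparse codes that share their two strongest atoms (3 and 40).
idx_x = np.array([3, 40, 99, 200]); coef_x = np.array([5.0, -2.0, 1.0, 0.5])
idx_y = np.array([3, 40, 150, 17]); coef_y = np.array([4.0, -1.5, 0.8, 0.2])
x = D[:, idx_x] @ coef_x
y = D[:, idx_y] @ coef_y

# Refine the estimate by admitting more atom pairs, coarse to fine.
estimates = [partial_similarity(coef_x, D[:, idx_x], coef_y, D[:, idx_y], m)
             for m in range(1, 5)]
exact = x @ y
print("partial estimates:", estimates)
print("waveform inner product:", exact)
```

With all four atoms of each signal included, the estimate matches the waveform inner product up to rounding; with fewer atoms it is a coarse estimate that costs only a handful of atom-pair products, which is what makes a hierarchical search plausible.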
On top of these, some recent papers I am looking forward to reviewing in a “paper of the day” context are:
- M. Davies and L. Daudet, “Fast sparse subband decomposition using FIRSP (Fast iteratively reweighted sparsifier),” in Proc. EUSIPCO, 2004.
- M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, “Sparse representations in audio and music: from coding to source separation,” Proc. IEEE, 2010. Accepted for publication.