# Four-hour Power Hour (not in one sitting)

Hello, and welcome to my first Four-hour Power Hour (not in one sitting), a span of time during which I will attempt to confront and control the near-critical mass of literature that is threatening the structural integrity of my desk, coaxing my colleagues to call me disposophobic, and impeding my research. Below, I review and briefly summarize several papers (some of which have moved with me from Paris, some of which have moved with me from Santa Barbara to Paris and then from Paris, and some of which are completely new). My selection of papers and brief discussion of each are made only in the context of my present research task (a major revision of a submission on audio similarity, search, and retrieval in sparse domains), so any brevity should not be taken as a reflection of the quality of the work. (And of course, if I have misrepresented anything, please let me know and I will correct it.)

J. A. Arias, J. Pinquier, and R. André-Obrecht, “Evaluation of classification techniques for audio indexing,” in Proc. European Signal Process. Conf., (Istanbul, Turkey), Sep. 2005.

The authors discuss the use of SVMs and GMMs trained on MFCCs and their derivatives for four simple audio content discrimination tasks: speech/non-speech, music/non-music, applause/non-applause, and laughter/non-laughter.

J. A. Arias, R. André-Obrecht, and J. Farinas, “Automatic low-dimensional analysis of audio databases,” in Proc. Content-based Multimedia Indexing, (London, U.K.), pp. 556-559, June 2008.

The authors propose a method for reducing each element of a heterogeneous collection of audio recordings to a point in 3-D, where clustering can discriminate between the signal content, e.g., spoken speech, singing voice, music, speaker identity, etc. They apply this method to three tasks: speech/music discrimination, speaker identification, and 3-class language detection. In their method, the authors first extract for each segment 15 MFCCs (20 ms windows, hops of 10 ms; which coefficients I don’t know, but I assume the first 15), and then bag these features together by training a GMM. Then, for each pair of GMMs, they estimate the symmetric Kullback-Leibler divergence. From these values they generate 3-D vectors for each signal using multidimensional scaling (which I think finds a low-dimensional embedding that preserves the pairwise distance relationships of the high-dimensional space). Finally, kernel PCA is applied to this set of points (if at all) to reduce within-cluster distances and increase between-cluster separation. The results look very convincing. Related to this work, see: J. Pinquier and R. André-Obrecht, “Audio indexing: primary components retrieval: Robust classification in audio documents,” Multimedia Tools and Applications, vol. 30, no. 3, pp. 313-330, 2006.
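
The divergence-then-embed step can be sketched in a few lines. A hedged simplification: the paper bags features into GMMs (whose KL divergence has no closed form and must be estimated, e.g., by Monte Carlo), but if each recording is summarized by a single diagonal-covariance Gaussian, the symmetric KL divergence is available in closed form, and classical MDS then places the recordings in 3-D. Everything below (number of recordings, features, the data itself) is made up for illustration:

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def classical_mds(D, k=3):
    """Embed points in k dims from a matrix of pairwise dissimilarities D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]              # keep the k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy "recordings": each summarized by one diagonal Gaussian over 15 features
rng = np.random.default_rng(0)
means = rng.normal(size=(6, 15))
varis = rng.uniform(0.5, 2.0, size=(6, 15))
D = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        D[i, j] = kl_diag_gauss(means[i], varis[i], means[j], varis[j]) \
                + kl_diag_gauss(means[j], varis[j], means[i], varis[i])
X = classical_mds(D, k=3)                      # 6 recordings as points in 3-D
```

Note the divergences are not a true metric, so classical MDS can only approximate them; negative eigenvalues are clamped to zero.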

D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, “Semantic annotation and retrieval of music and sound effects,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, 2008.

The authors create an automatic method for describing music signals (songs or sound effects) at a high level (e.g., a music review) using semantic annotations acquired from a user study and metadata on the web. This method (born out of research in image retrieval) also serves as a query-by-text interface into an annotated database. For each annotation, the authors learn a GMM of audio features (bags of MFCCs and deltas) from songs positively associated with that annotation (like “calming”, “not calming”, or “screaming vocals”). They compare models built in two ways: averaging the models of a specific word built over each track; and a mixture-hierarchies approach (which I am not too clear on). In their results, we see that for music annotation both approaches perform close to humans (in both average precision and recall of related semantic concepts), except for genre and instrumentation. The mixture-hierarchies approach outperforms the averaged-models approach. For the sound-effects tests, we see a much higher average recall (~36%) than precision (~18%) for both approaches. For a retrieval task, the concept of emotion provides the highest average precision (~51%).
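
As a rough sketch of the annotation idea, not the paper’s method (they use GMMs with averaged or mixture-hierarchies estimation; here each tag gets a single diagonal Gaussian, and the tags and data are made up): fit one model per tag from the pooled frames of positively labeled tracks, then annotate a new track with the tag under which its frames are most likely.

```python
import numpy as np

def fit_gaussian(frames):
    """Fit one diagonal-covariance Gaussian to a bag of feature frames (rows)."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def avg_loglik(frames, mean, var):
    """Mean per-frame log-likelihood under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

rng = np.random.default_rng(1)
# Hypothetical tags, each with pooled MFCC-like frames from positive tracks
tag_frames = {"calming": rng.normal(0.0, 1.0, (500, 13)),
              "screaming vocals": rng.normal(3.0, 1.0, (500, 13))}
models = {tag: fit_gaussian(f) for tag, f in tag_frames.items()}

query = rng.normal(0.0, 1.0, (200, 13))        # frames of an unlabeled track
scores = {tag: avg_loglik(query, m, v) for tag, (m, v) in models.items()}
best = max(scores, key=scores.get)             # most probable annotation
```

Retrieval by text works the same way in reverse: rank database tracks by their likelihood under the queried tag’s model.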

L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet, “Audio information retrieval using semantic similarity,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., (Honolulu, Hawaii), Apr. 2007.

In the sequel to the work above (Turnbull et al., TASLP 2008), the authors explore the use of semantic similarity between annotated audio for retrieval. Compared to query by acoustic example, this approach has a higher mean average precision (17.5% vs. 15.9%). The returned sounds also overlap more with the query in semantic content, rather than merely sounding similar.

A. Kimura, K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio signals based on piecewise linear representation of feature trajectories,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 396-407, Feb. 2008.

The authors build upon previous work (e.g., K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio and video signals based on histogram pruning,” IEEE Trans. Multimedia, vol. 5, pp. 348-357, Sept. 2003) to devise an efficient audio search method for near-exact matches to a query. This involves vector quantizing (128 codewords) the magnitude spectrum of each frame in a learned codebook (7-channel band-pass filterbank, equally spaced in log frequency, 60 ms windows with 10 ms hops), creating histograms of these codes over larger frames, dynamically segmenting the histograms in time based on piecewise-linear constraints, compressing the histograms in each segment (Karhunen-Loève transform, lossy), and then decimating the compressed features in time (selecting one feature from N as representative). The same procedure (without the feature decimation) is applied to a query signal. Even with all this compression, the method guarantees that a query will not return false negatives (as long as certain conditions are met). (Some of the experimental results are strange, e.g., why are there ~70,000 matches to a query in 200 hours of audio?)
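
The front end of this pipeline (codeword assignment, then windowed histograms of codes) might look like the minimal numpy sketch below; the codebook, window sizes, and data are placeholders, and the paper’s later stages (piecewise-linear segmentation, Karhunen-Loève compression, decimation) are omitted:

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each feature frame to its nearest codeword (Euclidean)."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def code_histograms(codes, win, hop, n_codes):
    """Histogram of VQ codes over sliding windows of `win` frames."""
    hists = []
    for start in range(0, len(codes) - win + 1, hop):
        h = np.bincount(codes[start:start + win], minlength=n_codes)
        hists.append(h)
    return np.array(hists)

rng = np.random.default_rng(2)
codebook = rng.normal(size=(128, 7))           # e.g. 128 codewords, 7 bands
frames = rng.normal(size=(1000, 7))            # filterbank output per frame
codes = quantize(frames, codebook)
H = code_histograms(codes, win=100, hop=50, n_codes=128)
# Each row of H sums to the window length, so histograms of matching
# segments can be compared directly
```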

A. Divekar and O. Ersoy, “Compact storage of correlated data for content based retrieval,” in Proc. Asilomar Conf. Signals, Systems, and Computers, 2009.

The authors discuss the application of $$\ell_1$$ minimization for recovering image data from a compressed set of image feature vectors (and residues) queried by a user. Images are broken into small blocks, each of which is described in a compressible (sparse) manner in some orthogonal basis. The residual is kept for each block. The transformed blocks are downsampled and then “measured” by a sensing matrix satisfying a restricted isometry property (RIP). (At this point I am thinking how this is similar to sketching: P. Indyk, N. Koudas, and S. Muthukrishnan, “Identifying representative trends in massive time series data sets using sketches,” in Proc. Int. Conf. Very Large Data Bases, (Cairo, Egypt), pp. 363-372, Sep. 2000.) Somehow they measure similarity between two compressed features, and the database image that matches best (over all blocks?) is recovered by $$\ell_1$$ minimization from the compressed feature and coded residual.
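
For the unfamiliar, here is a self-contained toy of sparse recovery from compressed measurements. This is not the paper’s pipeline; it uses ISTA (iterative soft thresholding), one standard solver for the $$\ell_1$$-penalized least-squares problem, on synthetic data:

```python
import numpy as np

def soft_threshold(g, t):
    """Proximal operator of t * ||x||_1 (elementwise shrinkage)."""
    return np.sign(g) * np.maximum(np.abs(g) - t, 0.0)

def ista(A, b, lam=0.05, n_iter=3000):
    """Iterative soft thresholding for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(3)
n, m, k = 100, 256, 5                          # measurements, ambient dim, sparsity
A = rng.normal(size=(n, m)) / np.sqrt(n)       # random sensing matrix
supp = rng.choice(m, size=k, replace=False)
x_true = np.zeros(m)
x_true[supp] = 3.0                             # sparse signal to recover
b = A @ x_true                                 # compressed measurements
x_hat = ista(A, b)
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

With many more ambient dimensions than measurements, the system is underdetermined; it is the $$\ell_1$$ penalty that picks out the sparse solution.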

R. Bardeli, “Similarity search in animal sound databases,” IEEE Trans. Multimedia, vol. 11, pp. 68-76, Jan. 2009.

In this work the author explores the application of structure tensors (a matrix describing the dominant orientations in local regions) to describe short-time Fourier magnitude spectra for use in similarity search (query-based retrieval) within a database of animal sounds. The structure tensors are used to find time-frequency regions with high activity, and then these regions are essentially vector quantized. The author adopts a fast indexing and retrieval scheme based on “inverted lists” (he points to the paper: M. Clausen and F. Kurth, “Content-based information retrieval by group theoretical methods,” Proc. Computational Noncommutative Algebra and Applications, (Tuscany, Italy), July 2003). The author conducted an experiment in retrieving and ranking the similarities of bird songs to several queries, using both his approach and one with HMMs trained on MPEG-7 “spectrum projection features” of each species. For most birdsong his approach fared better, specifically for non-noisy songs with pitch curves (something structure tensors are sensitive to).
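
The structure tensor itself is cheap to compute. A minimal sketch (not the author’s full method) that averages the 2x2 tensor over blocks of a spectrogram-like array and reports the dominant orientation and a coherence measure per block:

```python
import numpy as np

def block_structure_tensors(S, b=8):
    """Average the 2x2 structure tensor over b-by-b blocks of a 2-D array S
    (e.g. a log-magnitude spectrogram); return per-block dominant
    orientation (radians) and coherence (1 = strongly oriented)."""
    gy, gx = np.gradient(S)                    # derivatives along axis 0, axis 1
    J11, J22, J12 = gx * gx, gy * gy, gx * gy  # tensor entries per pixel
    H, W = S.shape[0] // b, S.shape[1] // b
    orient = np.zeros((H, W))
    coher = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            sl = np.s_[i * b:(i + 1) * b, j * b:(j + 1) * b]
            a, c, d = J11[sl].mean(), J22[sl].mean(), J12[sl].mean()
            orient[i, j] = 0.5 * np.arctan2(2 * d, a - c)
            coher[i, j] = np.sqrt((a - c) ** 2 + 4 * d ** 2) / (a + c + 1e-12)
    return orient, coher

# A steady tone: energy varies along frequency (rows) but not time (columns),
# so every block should be strongly oriented
S = np.tile(np.sin(2 * np.pi * 4 * np.linspace(0, 1, 64))[:, None], (1, 64))
orient, coher = block_structure_tensors(S, b=8)
```

High-coherence blocks are the “high activity” time-frequency regions a retrieval scheme would keep; pitch curves show up as consistently oriented blocks.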

G. Peeters and E. Deruty, “Sound indexing using morphological description,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 675-687, Mar. 2010.

The aim of this work is to automatically organize a set of recorded sounds based on high-level descriptions of the “shape of sounds,” motivated by the work of electroacoustic music pioneer Pierre Schaeffer. The three descriptions proposed here are: “dynamic profiles” (behavior of loudness over time); “melodic profiles” (behavior of pitch over time); and “complex-iterative sound” description (repetition of a sound element over time).

R. J. Weiss and J. P. Bello, “Identifying repeated patterns in music using sparse convolutive non-negative matrix factorization,” in Proc. Int. Symp. Music Info. Retrieval, (Utrecht, Netherlands), Aug. 2010.

The authors approach song segmentation using a probabilistic analogue of non-negative matrix factorization called probabilistic latent component analysis (PLCA), which they extend to be shift-invariant using convolution. For features they use beat-synchronous chromagrams, which yield time-frequency templates signifying chord patterns common to popular music structures.
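
As a simplified illustration of the underlying decomposition, here is plain NMF with multiplicative updates under the generalized KL divergence; the paper’s shift-invariant PLCA additionally lets templates span multiple time frames (convolution) and adds a sparsity prior, both of which this sketch omits. The toy “chromagram” and its two templates are made up:

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0):
    """Multiplicative-update NMF minimizing the generalized KL
    divergence between nonnegative V (F x T) and W @ H."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.uniform(0.1, 1.0, (F, K))
    H = rng.uniform(0.1, 1.0, (K, T))
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        W *= (V / WH) @ H.T / H.sum(axis=1)            # update templates
        WH = W @ H + 1e-12
        H *= W.T @ (V / WH) / W.sum(axis=0)[:, None]   # update activations
    return W, H

# Toy beat-synchronous "chromagram": two chord templates, random activations
rng = np.random.default_rng(4)
templates = rng.uniform(0, 1, (12, 2))         # 12 pitch classes, 2 "chords"
act = np.abs(rng.normal(size=(2, 40)))         # activations over 40 beats
V = templates @ act
W, H = nmf_kl(V, K=2)
err = np.abs(V - W @ H).sum() / V.sum()        # relative reconstruction error
```

In the paper, segment labels come from the (shift-invariant) activations: repeated sections activate the same template.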

And with that my stack is nine papers (~9 mm) shorter. Only a few meters to go!