# Paper of the Day (Po’D): Natural Sounds Segmenting, Indexing, and Retrieval Edition

Hello, and welcome to Paper of the Day (Po’D): Natural Sounds Segmenting, Indexing, and Retrieval Edition. Today we continue with the theme of working with natural and/or organic sounds: G. Wichern, J. Xue, H. Thornburg, B. Mechtley, and A. Spanias, “Segmentation, indexing, and retrieval for environmental and natural sounds,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 688-707, Mar. 2010.

In the interest of extreme brevity, here is my one-line description of the work in this paper:

Here we find an all-in-one system for automatically segmenting and efficiently searching environmental audio recordings, based in no minor way on probabilistic models of six simple short- and long-term features… very impressive!

The authors present a complete system for segmenting and efficiently retrieving sounds from a database of environmental recordings. They compute features over several time scales (40 ms to 1 s, all advanced in 20 ms steps). The short-term features are loudness, spectral centroid (Bark-weighted), and “spectral sparsity” (the ratio of the $$\ell_\infty$$ and $$\ell_1$$ norms of the Fourier transform). The long-term features are computed from collections of the short-term ones: “temporal sparsity” (the ratio of the $$\ell_\infty$$ and $$\ell_1$$ norms of the frame energies), “transient index” (the average cepstral flux between successive short-term frames), and harmonicity.

Their approach to segmentation uses a dynamic Bayesian network over the feature frames, with each frame modeled by a Markov chain having three states: silence (no sound event), onset of a new sound event, and continuation of an ongoing sound event. The model includes hidden variables that track the activity of particular features (I think; I will leave the rest of this portion aside). The authors show that their audio segmentation gives results that make sense, and that it performs better than an established method for speech segmentation.
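To make the two sparsity features concrete, here is a minimal sketch of how I read them, using only the Python standard library. The function names, the naive DFT, and the choice to use magnitudes of the non-negative-frequency bins are my assumptions, not details taken from the paper.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a real frame.

    Naive O(N^2) DFT, fine for a short illustrative frame.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def sparsity(values):
    """Ratio of the l-infinity norm to the l1 norm (0.0 if the input is all zero)."""
    magnitudes = [abs(v) for v in values]
    total = sum(magnitudes)
    return max(magnitudes) / total if total > 0 else 0.0


def spectral_sparsity(frame):
    """Sparsity of one short-term frame's magnitude spectrum (assumed definition)."""
    return sparsity(dft_magnitudes(frame))


def temporal_sparsity(frames):
    """Sparsity of the energies of the short-term frames inside one long-term window."""
    energies = [sum(x * x for x in frame) for frame in frames]
    return sparsity(energies)
```

Under this reading, a pure tone gives a spectral sparsity near 1 (one dominant bin), while a click gives a small value (energy spread over all bins). Temporal sparsity behaves the same way across frames: a single loud frame among silent ones scores near 1, and a steady sound scores near one over the number of frames.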

The retrieval procedure proposed by the authors involves creating and comparing templates of “feature trajectories” derived for each feature as a hidden Markov model with the five visible states: constant, increasing, decreasing, increasing and then decreasing, and decreasing and then increasing. With these templates for each feature, and an assumption of conditional independence between the features, it is relatively simple to calculate the likelihood of a database signal given the features of the specified query signal, and then rank the matches. Doing this exhaustively is computationally prohibitive, however, since it must be done for each frame of audio. The authors therefore propose to first cluster the database sounds using a spectral clustering algorithm, and to create templates representative of the clusters for an initial comparison with the query template. The templates in the nearest cluster are then exhaustively compared with the query.
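The two-stage search can be sketched in a few lines. This is only an illustration under loud assumptions: the per-feature score is a stand-in Gaussian on a single number per feature (the paper scores HMM trajectory templates), the clusters and their representatives are taken as given rather than produced by spectral clustering, and all names and toy values are mine.

```python
import math


def gaussian_loglik(query_value, template_value, sigma=1.0):
    """Stand-in per-feature log-likelihood (NOT the paper's HMM template score)."""
    z = (query_value - template_value) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def total_loglik(query, template, feature_loglik=gaussian_loglik):
    """Conditional independence: the joint log-likelihood is a sum over features."""
    return sum(feature_loglik(q, t) for q, t in zip(query, template))


def rank_exhaustive(query, database):
    """Score every database sound against the query; best match first."""
    return sorted(database,
                  key=lambda name: total_loglik(query, database[name]),
                  reverse=True)


def rank_clustered(query, database, clusters):
    """Pick the cluster whose representative scores best, then rank only its members."""
    best = max(clusters, key=lambda c: total_loglik(query, c["representative"]))
    members = {name: database[name] for name in best["members"]}
    return rank_exhaustive(query, members)


# Toy, made-up feature vectors (three features per sound) for illustration only.
database = {
    "rain": [0.20, 0.10, 0.30],
    "stream": [0.25, 0.15, 0.35],
    "door_slam": [0.90, 0.80, 0.10],
    "footsteps": [0.80, 0.70, 0.20],
}
clusters = [
    {"representative": [0.225, 0.125, 0.325], "members": ["rain", "stream"]},
    {"representative": [0.85, 0.75, 0.15], "members": ["door_slam", "footsteps"]},
]
query = [0.21, 0.10, 0.30]

print(rank_exhaustive(query, database))
print(rank_clustered(query, database, clusters))
```

The clustered search scores two representatives plus two members instead of all four sounds. The saving only matters for large databases, and the price is that a relevant sound sitting in a different cluster is never examined, which is one way precision and recall can drop.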

The authors had 10 people rank the relevance of several hundred sounds to queries, which they then used as a ground truth for the automatic retrieval system. They show that their approach has a higher precision (still only 42%) than a non-template (i.e., Euclidean distance) approach. In testing their cluster-based retrieval approach, the authors find they usually pay a price in terms of decreased precision and recall for indoor environments, but not for outdoor and noisier environments. (Their graphs are confusing me.) We also see that retrieving indoor sounds using outdoor noisy queries gives a low precision (12.6% with exhaustive search).
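For reference, this is the usual way precision and recall are computed for a single query, given a set of sounds judged relevant. It is a generic sketch with hypothetical names; I am not claiming it matches how the authors aggregated the rankings from their 10 listeners.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: sounds returned by the system (assumed to contain no duplicates).
    relevant: sounds judged relevant to the query (the ground truth).
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: four sounds returned, three sounds judged relevant, two in common.
p, r = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
print(p, r)
```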