Paper of the Day (Po’D): Environmental Sound Recognition with Time-frequency Features Edition

Hello, and welcome to the Paper of the Day (Po’D): Environmental Sound Recognition with Time-frequency Features Edition. Today’s papers address the general problem of detecting the activities within some environment using acoustic data:

  • S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition with time-frequency audio features,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 1142-1158, Aug. 2009.
  • S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition using MP-based features,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Las Vegas, NV), pp. 1-4, Mar. 2008.

In the surveillance of an outdoor environment, we probably want to recognize particular events without the labor-intensive process of annotating hundreds of hours of data, both visual and acoustic. We could be interested in anything from detecting the presence of a particular bird species to detecting the presence of particular football hooligans. So how can we do this within the audio domain?
Chu et al. explore the automatic discrimination and recognition of general classes of recorded sound environments (sampled at 22.05 kHz), such as “casino,” “restaurant,” “daytime nature,” and “inside moving vehicle” (fourteen in all), using a variety of features including Mel-frequency cepstral coefficients (MFCCs), delta MFCCs, linear prediction coefficients, and the statistics of matching pursuit (MP) decompositions over a multiresolution Gabor dictionary. (All features are calculated over 50% overlapped rectangular windows of 256 samples, or 11.6 ms.)
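To make the windowing concrete, here is a minimal sketch of cutting a clip into 256-sample rectangular frames with 50% overlap (a hop of 128 samples). The function name `frame_signal` and the random stand-in for a 4-second clip are my own placeholders, not anything from the papers.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into rectangular frames (hop=128 gives 50% overlap)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i*hop, i*hop + frame_len); trailing samples are dropped.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

fs = 22050                                                # sampling rate in Hz
clip = np.random.default_rng(0).standard_normal(4 * fs)   # stand-in for a 4 s clip
frames = frame_signal(clip)
frame_ms = 1000 * 256 / fs                                # about 11.6 ms per frame
print(frames.shape, round(frame_ms, 1))
```

Each row of `frames` is one analysis segment from which a feature vector would be computed.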

In their MP-derived features, the authors decompose each 11.6 ms segment using MP and a Gabor dictionary of 1120 real atoms (thus phase is not optimized), with atoms of dyadic scales from 2 to 256 samples, translations in \(\{0, 64, 128, 192\}\) samples (???), and 35 different exponentially distributed modulation frequencies for each atom (no matter the scale??? I am totally confused now). Anyhow, from the MP decomposition of the segment, the authors keep the parameters of the first \(n = 5\) atoms (empirically chosen to give the best recognition accuracy, but what is the logic?). From these atoms, the authors calculate a four-dimensional feature vector consisting of the mean and standard deviation of the modulation frequencies and of the scales of the 5 atoms.
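To make the dictionary bookkeeping concrete, here is a rough sketch of how I read the MP feature extraction. Dyadic scales from 2 to 256 give 8 scales, and 8 scales × 4 translations × 35 frequencies = 1120 atoms, which matches the stated dictionary size only if every scale shares the same translations and frequencies. The Gaussian envelope, the geometric frequency grid from 0.005 to 0.5 cycles per sample, and the names `gabor_dictionary` and `mp_features` are my assumptions, not details taken from the paper.

```python
import numpy as np

def gabor_dictionary(n=256, shifts=(0, 64, 128, 192), n_freqs=35):
    """Build unit-norm real Gabor atoms; returns (atoms, params) with params = (scale, freq)."""
    scales = 2 ** np.arange(1, 9)                # 2, 4, ..., 256 samples
    freqs = np.geomspace(0.005, 0.5, n_freqs)    # ASSUMED grid, in cycles/sample
    t = np.arange(n)
    atoms, params = [], []
    for s in scales:
        for u in shifts:
            env = np.exp(-np.pi * ((t - u) / s) ** 2)       # ASSUMED Gaussian window
            for f in freqs:
                g = env * np.cos(2 * np.pi * f * (t - u))   # zero phase: real atom
                atoms.append(g / np.linalg.norm(g))
                params.append((s, f))
    return np.array(atoms), np.array(params, dtype=float)

def mp_features(x, D, params, n_atoms=5):
    """Greedy matching pursuit; returns [mean f, std f, mean scale, std scale]."""
    r = np.asarray(x, dtype=float).copy()
    picked = []
    for _ in range(n_atoms):
        c = D @ r                          # correlation of residual with every atom
        k = int(np.argmax(np.abs(c)))      # best-matching atom
        r -= c[k] * D[k]                   # subtract its projection from the residual
        picked.append(k)
    s, f = params[picked, 0], params[picked, 1]
    return np.array([f.mean(), f.std(), s.mean(), s.std()])

D, params = gabor_dictionary()
frame = np.random.default_rng(0).standard_normal(256)   # stand-in for one segment
features = mp_features(frame, D, params)
print(D.shape, features)
```

Note that the features discard the atom amplitudes and translations entirely; only which scales and frequencies were picked first survives.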

The authors explore the effectiveness of combinations of all these features
in training and testing for two different classifiers:
K nearest neighbors (\(K=1\)), and Gaussian mixture models (orders of 1-20).
Their database consists of 4-second sound clips taken from the (not free) BBC Sound Effects Library and the free, online Freesound Project.
The size of the database used in this paper is unknown.
We are told that training and test sources were never the same.
When the authors used the MP features only,
the correct recognition rate was 72.5%;
using only MFCC features, the recognition rate was 70.9%;
and when combining these together the authors obtained 83.9%.
The presentation of the results is generally confusing,
but it can be read that the MP and MP + MFCC features
perform the best of all.
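For reference, the \(K=1\) nearest-neighbor classifier mentioned above amounts to very little code. This is only a sketch: the Euclidean distance, the function name `nn1_predict`, and the toy two-class data are my assumptions, since none of these details are given here.

```python
import numpy as np

def nn1_predict(train_X, train_y, test_X):
    """Label each test vector with the class of its closest training vector."""
    # Squared Euclidean distance between every test and training vector (ASSUMED metric).
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]

# Toy stand-in for 4-D feature vectors from two well-separated classes.
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                     rng.normal(1.0, 0.1, (20, 4))])
train_y = np.array([0] * 20 + [1] * 20)
test_X = np.array([[0.05, 0.0, -0.05, 0.1],
                   [0.9, 1.1, 1.0, 0.95]])
pred = nn1_predict(train_X, train_y, test_X)
print(pred)
```

Combining feature sets, as in the MP + MFCC case, would then just mean concatenating the per-segment vectors before calling the classifier, although features on very different numeric ranges would likely need scaling first.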

Though similar results are reported in K. Umapathy, S. Krishnan, and S. Jimaa, “Multigroup classification of audio signals using time-frequency parameters,” IEEE Trans. Multimedia, vol. 7, pp. 308-315, Apr. 2005,
I am generally skeptical of classification approaches that use features
created at the level of the atoms found in an MP decomposition,
a decomposition process that has a well-documented propensity to produce very bad atoms, and thus bad features. I remember reading one award-winning conference paper that posited that heavy metal music has more short-scale atoms than smooth jazz, or something like that. And thus, if one detects many short-scale atoms, the music is likely to be heavy metal. No, no, no.

More atoms. Many more atoms.
Few atoms. Very few atoms.
