Paper of the Day (Po’D): Fingerprinting to Identify Repeated Sound Events Edition

Hello, and welcome to Paper of the Day (Po’D): Fingerprinting to Identify Repeated Sound Events Edition. Today’s paper is: J. Ogle and D. P. W. Ellis, “Fingerprinting to identify repeated sound events in long-duration personal audio recordings,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., vol. 1, (Honolulu, Hawaii), pp. 233-236, Apr. 2007. This work is essentially a study of the ShazamWow! Fingerprint (Po’D: A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Conf. Music Info. Retrieval, (Baltimore, Maryland, USA), pp. 1-4, Oct. 2003.) applied not just to songs but to audio in general, e.g., environmental sounds like ringing phones, door bells, and other sounds common to technological human existence.

In the interest of extreme brevity, here is my one line description of the work in this paper:

The ShazamWow! Fingerprint works far better at identifying long sounds with a spectral structure rich in tonal components than sounds that are short and/or have a simple spectral structure.

In their experiments, the authors use sound material from the following three event classes: recorded music and radio broadcasts; alert sounds like telephone rings and door bells; and “organic sounds” like doors closing and garage doors opening. (Why not splattering melons, like those used in: L. Chu and G. Kendall, “The Sound of Fruits and Vegetables: Scientific Visualization with Auditory Tokens,” in Proc. Int. Computer Music Conf., Banff, Canada, 1995?) The data set was 40 hours of recordings from around the office of the first author, to which the authors artificially added 30 songs, 45 telephone rings, and 20 door closures and beeps (like from a computer?). The locations of these events provide a ground truth for evaluating the identification algorithm. They then took a sound from one of the classes (how long were the music queries?), reduced it to its fingerprint hash, and searched for its occurrences in the 40-hour recording. Locating signals of the type “production music” was the most successful, with high recall (97%) and precision (85%); for the alert sounds the recall was much lower (69%), but the precision was perfect (100%). For the organic sounds the performance was dismal: 0% in both recall and precision.
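For readers who want the bookkeeping spelled out, here is a minimal sketch of how such recall and precision figures are computed from retrieved event locations against a ground truth. The sets and numbers below are illustrative, not the paper's data.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a set of retrieved event locations.

    retrieved: set of event locations returned by the fingerprint search
    relevant:  set of ground-truth event locations
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative case mirroring the alert-sound result: every retrieved
# event is correct (100% precision), but not all events are found.
relevant = set(range(100))   # 100 ground-truth alert events
retrieved = set(range(69))   # 69 of them retrieved, no false alarms
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 1.0 0.69
```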

The authors rightly observe that this fingerprinting approach is great for music, which is complex in time-frequency components and of long duration. But for identifying events in environmental audio, the problem is that many sounds do not last as long as a song. This limits the query length, which will reduce the recall when retrieving based on chronological hash matches. Furthermore, simple sounds like alerts will not be spectrally rich enough, and their fingerprints might use elements belonging to the noise component. What is more, sounds like door closures, short ticks, and beeps will not be well-represented in a Fourier basis of a single scale. This is where sparse approximation with a time-frequency dictionary can make a positive impact, and indeed we have already seen an extension of this work in such a direction, cf. Po’D: C. Cotton and D. P. W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. IEEE Workshop App. Signal Process. Audio and Acoustics, (Mohonk, NY), pp. 125-128, Oct. 2009.
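To make the above concrete, here is a rough sketch of the landmark-pair hashing idea behind this style of fingerprint: pick prominent spectrogram peaks, then hash pairs of nearby peaks as (f1, f2, Δt) keys. This is not the authors' implementation; the FFT size, hop, fan-out, and peak-picking rule are all hypothetical choices for illustration. One can see why a short, spectrally simple sound yields few landmarks, and hence few hashes to match.

```python
import numpy as np

def landmark_hashes(signal, n_fft=512, hop=256, fan_out=3):
    """Sketch of landmark-pair fingerprinting (hypothetical parameters).

    1. Compute a magnitude spectrogram.
    2. Keep time-frequency bins that dominate their 3x3 neighborhood
       and exceed the mean magnitude (a crude peak picker).
    3. Hash pairs of nearby peaks as (f1, f2, time difference) keys,
       each stored with the anchor time for chronological matching.
    """
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(window * signal[s:s + n_fft]))
              for s in range(0, len(signal) - n_fft, hop)]
    spec = np.array(frames)  # shape: (time, freq)

    peaks = []
    for t in range(1, spec.shape[0] - 1):
        for f in range(1, spec.shape[1] - 1):
            neighborhood = spec[t-1:t+2, f-1:f+2]
            if spec[t, f] == neighborhood.max() and spec[t, f] > spec.mean():
                peaks.append((t, f))

    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.append(((f1, f2, t2 - t1), t1))  # (hash key, anchor time)
    return hashes
```

A long tonal recording produces many peaks per frame and thus a dense cloud of hashes; a click or door slam yields only a handful, so few hashes survive to vote for a match, consistent with the poor recall on organic sounds.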

