Paper of the Day (Po’D): An Industrial Strength Audio Search Algorithm Edition

Hello, and welcome to Paper of the Day (Po’D): An Industrial Strength Audio Search Algorithm Edition. Since I am working to address the critiques of a recently reviewed article of mine, I am performing a thorough literature search on similarity search in audio, one corner of the massive world of time-series data. Among these numerous works is this paper, which details (though not completely) what appears to be the correct solution to a thorny problem: A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Conf. Music Info. Retrieval, (Baltimore, Maryland, USA), pp. 1-4, Oct. 2003.

In the interest of extreme brevity, here is my one line description of the work in this paper:

With the ShazamWow! Fingerprint we can perform song search and information retrieval in a scalable and robust manner by comparing hashes of short-time magnitude spectra reduced to high-entropy fingerprints.


This problem is nowadays very common, and solutions to it are convincingly implemented on the iPhone, for instance. Say I am at a bar, or a sporting event, and I hear a song that I like and of which I want to know more. Due to the nonexistence of printed music programs at these events, I must use other means to gather some information. I can ask people what it is, but this could be embarrassing when word gets around that I like Lady \(\MG_a^T\MG_a\) — the Gramian of Pop Stars. I could try to remember it, and sing it over the telephone to a radio DJ; but that assumes the expert system on the other end can parse my input. Instead, I can hold up my cellular phone, record a bit of the song, and have an automated system tell me what is playing.

The author tackles this problem by comparing hashes of high-dimensional “fingerprints” created for all known signals in the music database, and for the unknown signal. Each fingerprint is derived from a “constellation” of the largest peaks in localized time-frequency energy distributions of a signal, subject to a density criterion that keeps the selected peaks uniformly distributed in time and frequency. (I don’t know the duration or hop of the frames used.) These constellations of points are reduced to many hashes, each one consisting of a pair of frequencies and a time offset between an “anchor point” and a peak in its associated target region; the time-location of the anchor is kept alongside each hash. Matches between an unknown signal and those in the database are found by looking for correctly time-ordered hashes, e.g., a diagonal streak in a scatter plot of matching hash times. In the experiments, the author finds that the music identification method works extremely efficiently and accurately at low (additive) SNR, and all the more so with longer query recordings. Shazam is also still in existence, so no doubt people are finding this application useful.
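
Since the paper omits some specifics (frame length, hop, peak density, fan-out), here is a minimal Python sketch of how I read the fingerprinting stage; every parameter value below is a guess of mine for illustration, not something taken from the paper or from Shazam’s actual code.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def constellation(x, fs, nperseg=1024, noverlap=512, neighborhood=(15, 15)):
    """Return (frame, bin) coordinates of prominent local spectrogram peaks."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    S = np.abs(Z)
    # A candidate peak equals the maximum over its time-frequency neighborhood;
    # the neighborhood size stands in for the paper's (unstated) density criterion.
    is_peak = (S == maximum_filter(S, size=neighborhood)) & (S > S.mean())
    bins, frames = np.nonzero(is_peak)
    order = np.argsort(frames)
    return list(zip(frames[order], bins[order]))

def hashes(points, fan_out=5, max_dt=64):
    """Pair each anchor peak with a few later peaks in its "target zone".
    Each hash key packs (anchor freq, target freq, time difference); the anchor
    time is kept alongside so matches can be checked for time consistency."""
    out = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                out.append(((f1, f2, dt), t1))
    return out
```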

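And here is a companion sketch of the matching step as I understand it: the scatter-plot test described above amounts to voting for the database track whose matching hashes pile up at a single query-to-database time offset. Again, the data structures here are only my illustration, not Wang’s implementation.

```python
from collections import defaultdict

def build_index(track_hashes):
    """track_hashes: {track_id: [(hash_key, t_db), ...]} -> inverted index on hash keys."""
    index = defaultdict(list)
    for track_id, hs in track_hashes.items():
        for key, t_db in hs:
            index[key].append((track_id, t_db))
    return index

def match(query_hashes, index):
    """Vote for (track, t_db - t_query) pairs; a true match concentrates votes on
    one offset, which is the diagonal streak in the scatter plot mentioned above."""
    votes = defaultdict(int)
    for key, t_q in query_hashes:
        for track_id, t_db in index.get(key, []):
            votes[(track_id, t_db - t_q)] += 1
    if not votes:
        return None
    (track_id, offset), count = max(votes.items(), key=lambda kv: kv[1])
    return track_id, offset, count
```
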
It is at first unbelievable to think about identifying unknown music recorded by a cell phone, because CELP is not music-friendly. (The codec the author considers here is not CELP, but a “Regular Pulse Excited-Linear Predictive Coder with a Long Term Predictor Loop”, which I still think is not friendly to polyphonic music because of the whole “regular pulse” excitation.) However, we see that identifying music by finding chronologically ordered hashes exploits the time-domain structure of music to form a robust method. I wonder, though, whether this system as it stood in 2001 could identify cover songs, remixes, etc. I am sure Shazam has already looked into that …
