A Briefly Annotated Timeline of Similarity in Audio Signals

Oct. 26, 2010, NB: This “living document” assembles a collection of works that address similarity search in audio data. Have I forgotten anything so far? My briefly annotated timeline of similarity search in time series is here.

It is difficult to define boundaries on work in and applications of similarity search in audio data, so I restrict the below annotated timeline to work that addresses the following problems:

  1. querying a collection of sampled sound signals by a sampled sound query (query by example, fingerprinting);
  2. clustering a collection of audio signals based on a measure of similarity.

I do not refer to any work that operates only on metadata supplied by humans, assigned by other sources, or high-level representations, such as MIDI, etc. For instance, I don’t consider works describing querying a MIDI database by humming. I am also not considering speech verification, although that is certainly using interesting concepts of similarity.


B. Feiten, S. Günzel, “Automatic Indexing of a Sound Database Using Self-organizing Neural Nets,” Computer Music J., vol.18, no.3, pp. 53-65, Fall 1994.


D. Keislar, T. Blum, J. Wheaton and E. Wold, “Audio Analysis for Content-Based Retrieval,” in Proc. Int. Computer Music Conf., Banff, Alberta, Canada, 1995.

T. Blum, D. Keislar, J. Wheaton and E. Wold, “Audio Databases with Content-Based Retrieval,” in Proc. Int. Joint Conf. Artificial Intelligence and Intelligent Multimedia Information Retrieval, Montreal, Quebec, Canada, 1995.


E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based classification, search and retrieval of audio,” IEEE Multimedia, vol. 3, no. 3, pp. 27-36, Fall 1996.

The authors describe the sound searching methods used by the
then-start-up company Muscle Fish (http://www.musclefish.com).
Also see U.S. patent number 5,918,223: “Method and article of manufacture
for content-based analysis, storage, retrieval, and segmentation of audio information.”
Features that they compare when searching and retrieving sounds are:
duration, loudness, pitch, brightness, bandwidth, and harmonicity.
The last five features are computed in time-localized fashion,
and their statistics (mean, variance, autocorrelation) are calculated over the entire sound.
The database is queried by either specifying a class (e.g., “scratchy”),
or supplying another sound and finding its nearest neighbors
with a Euclidean distance.
The nice thing about this work is that classes can be abstracted
by supplying sound examples during the training.
Thus, it is operating closer to the level of content with supervised learning.


S. R. Subramanya, R. Simha, B. Narahari, and A. Youssef, “Transform-based indexing of audio data for multimedia databases,” in Proc. Int. Conf. Multimedia Computing Syst., (Ottawa, Canada), pp. 3-6, June 1997.

The authors propose two algorithms for indexing the audio databases of multimedia databases
using frequency domain representations created from one of several transforms: DCT, DFT, and Haar.
For each fixed-sized window of audio, they transform it and keep only a small fraction of the coefficients.
Comparisons are made using Euclidean distance.
These approaches are very similar to those proposed in time-series similarity search, e.g. Agrawal et al. 1993,
but missing from the references.

S. Pfeiffer, S. Fischer, and W. Effelsberg, “Automatic audio content analysis,” in Proc. Int. Multimedia Conf., (Boston, MA), pp. 21-30, Nov. 1997.

The authors propose a method of searching audio data
by segmenting the music data, reducing it to “fuf signatures”, or fundamental frequency signatures,
and comparing these to a query similarly fufed.
It appears that they just correlate the fufs directly.
They specify an example of searching broadcast video and audio
for repetitions of commercials.

J. T. Foote, “Content-based retrieval of music and audio,” in Proc. SPIE Multimedia Storage Archiving Syst., (Dallas, TX), pp. 138-147, Nov. 1997.

The author presents an audio content search engine
that compares sounds based on their “templates”, or
histograms of their vector quantized features (13-dim MFCCs).
For a set of class-labeled training data, a tree-structured quantizer is built (in a greedy fashion),
which for each feature dimension tries to maximizes the mutual information
between the feature and class.
A query sound is quantized by this tree,
and its histogram is computed and compared with those of the classes in the database, I think.
The author compares the results of this approach with those from Muscle Fish (Wold et al., 1996),
and finds that it performs better for certain types of sound (like agogo bell and touchtone phone),
but much worse for others (like oboe and rain/thunder).
(I should read this paper more closely because this tree-based approach is interesting. See also: J. Foote, “A similarity measure for automatic audio classification,” Tech. Rep. SS-97-03, AAAI, 1997.)

K. Melih, R. Gonzalez, and P. Ogubona, “An audio representation for content based retrieval,” in Proc. IEEE Conf. Speech Image Tech. Computing and Telecommunications, pp. 207-210, Brisbane, Australia, Dec. 1997.

The authors propose a non-redundant and perceptually based time-frequency audio representation.
They reduce frames of sound to three classes: noise, tone, and sweep.
Though they say this representation can be useful for browsing and searching audio data,
they report no experiments.

T. Blum, D. Keislar, J. Wheaton and E. Wold, “Audio Databases with Content-Based Retrieval,” in Intelligent Multimedia Information Retrieval, AAAI Press, Menlo Park, CA, pp. 113-135, 1997.


T. Zhang and C.-C. J. Kuo, “Hierarchical system for content-based audio classification and retrieval,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Seattle, WA), pp. 398-409, Apr. 1998.

(Also see: T. Zhang and C.-C. J. Kuo, “Content-based classification and retrieval of audio,”
Proc. Conf. Advanced Signal Process. Algorithms, Arch. Implementations, San Diego, CA, July 1998.)
The authors propose a hierarchical organization of sounds,
first classified and segmented into speech, music, environment, and silence
(using energy, mean zero crossing rate, f0);
and then into more specific classes (e.g., man/woman)
defined by HMMs trained on GMMs of STFT data.
(One state of each HMM is a timbre within a class.)
Sounds similar to an example are retrieved by
modeling the query feature vectors with the set of HMM
in the database, and finding the ones with high likelihoods.
The authors specify three model matching modes:
no restrictions on durations or state transitions (useful for monotimbral sounds like babbling brooks);
arbitrary durations but fixed state transitions (useful for held musical notes);
and state durations and transitions are fixed (useful for footsteps).
The experimental results appear convincing.
(This paper takes a very original approach!)

G. Smith, H. Murase, and K. Kashino, “Quick audio retrieval using active search,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Seattle, WA), pp. pp. 3777-3780, May 1998.

The authors propose an approach to searching for audio
which they later improve in
K. Kashino, G. Smith, and H. Murase, “Time-series active search for quick retrieval of audio and video,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 6, pp. 2993-2996, Mar. 1999.

S. R. Subramanya and A. Youssef, “Wavelet-based indexing of audio data in audio/multimedia databases,” in Proc. Int. Workshop Multimedia Database Management Syst., (Dayton, OH), Aug. 1998.

The authors replace the monoscale transforms in their work in Subramanya et al. 1997
with a dyadic wavelet transform.
They keep only 10-15\% of the coefficients to index each audio signal.
Compared to the previous approach,
we see an increase in mean precision by a few percentiles.


J. Foote, “An overview of audio information retrieval,” Multimedia Systems, vol. 7, pp. 2-10, Jan. 1999.

This article presents a good overview of work up to 1998.

K. Kashino, G. Smith, and H. Murase, “Time-series active search for quick retrieval of audio and video,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 6, pp. 2993-2996, Mar. 1999.

The authors propose a method for the search of audio and video data within a large audio or video stream. For audio data, the feature vector is a set of N-band (e.g.,7) normalized short-term power spectrum (found using N second order IIR bandpass filters (e.g., Q=15) with center frequencies equally spaced in logarithmic frequency). These features are used because they are less expensive to compute than MFCCs. The feature vectors of the windowed data (e.g., 128 samples at 11 kHz sampling rate) are reduced to a histogram representation. The distance between two histograms is calculate using “histogram intersection,” which is the sum of the minimum values in bin pairs over all bins. Then the distance between two windows of data is the minimum histogram intersection computed over all sub-windows.

M. Welsh, N. Borisov, J. Hill, R. von Behren, and A. Woo,
“Querying large collections of music for similarity,” tech. rep., University of California, Berkeley, 1999.

The authors describe an approach to similarity search in a music database
using Euclidean distances (within unit ball) and kNN with feature vectors (one per song) of up to dimension 1248(!).
Features include tonal histograms (chromagrams?),
tonal transition statistics, spectral flatness, volume variance,
and tempo statistics.

E. Wold, T. Blum, D. Keislar and J. Wheaton, “Classification, Search and Retrieval of Audio,” in Handbook of Multimedia Computing, CRC Press, pp. 207-226, 1999.

D. Keislar, T. Blum, J. Wheaton and E. Wold, “A Content-Aware Sound Browser,” in Proc. Int. Computer Music Conf., Beijing, China, pp. 457-459, 1999.


S. E. Johnson and P. C. Woodland, “A method for direct audio search with applications to indexing and retrieval,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Istanbul, Turkey), pp. 1427-1430, June 2000.

The authors propose a fast and accurate method for finding exact matches of cue audio in news broadcasts,
building upon their work in text-free speaker recognition.
For features they use Mel-frequency perceptual linear prediction (PLP) cepstral coefficients.
(see: H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoustical Soc. America, vol. 87, pp. 1738-1752, 1990.)
To calculate distances, they compare the
covariance matrix of the features of the cue audio with that of the features
of each same-size segment of the database
(apparently this is a measure called “arithmetic harmonic sphericity”).
A window shifting approach similar to that in Kashino et al. 1998, 1999,
is used to decrease the number of necessary comparisons
(with queries longer than 1 s, they find shifts of one half the duration is acceptable).

D. Pye, “Content-based Methods for the Management of Digital Music,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process.,
pp. 2437-2440, (Istanbul, Turkey), June 2000.

The author compares Foote’s approach (Foote 1997) using vector quantization
with Gaussian mixture models of cepstral features for audio classification and retrieval.
He mainly addresses performance speed issues,
and proposes an approach using partially decompressed MPEG 1, layer 3 coded audio.

Z. Liu and Q. Huang, “Content-based indexing and retrieval-by-example in audio,” in Proc. IEEE Int. Conf. Multimedia Expo, vol. 2, (New York, NY, USA), pp. 877-880, Aug. 2000.

To aid in browsing and retrieving data from audio databases,
the authors propose an approach that first segments the audio data,
clusters its segments,
and represents them using a probabilistic model.
The unsupervised segmentation method
compares the MFCCs of windows neighboring each candidate break point
using a Kullback-Leibler divergence (KLD) assuming each MFCC is iid Gaussian.
This is followed by a merging step using again the KLD
but over the distribution of the features of the segment
(again assuming MFCC is iid Gaussian).
Finally, each segment is represented as a Gaussian mixture model (GMM).
A hierarchical clustering of segments is then performed
based on the similarity of pairs of GMMs,
each of which is measured by a linear program
(see Z. Liu and Q. Huang, “A new distance measure for probability distribution
function of mixture types,” Proc. Int. Conf. Acoustics, Speech, Signal Process., Istanbul, Turkey,
June 2000).
To query by example, the query data is similarly processed
and its features (fist 13 MFCCs) are represented by a series of GMMs.
Distances between pairs of GMMs are computed
using the KLD, and those below a specified threshold
are deemed to match.

G. Li and A. A. Khokhar, “Content-based indexing and retrieval of audio data using wavelets,” in Proc. IEEE Int. Conf. Multimedia Expo, (New York, NY, USA), pp. 885-888, Aug. 2000.

The authors propose an audio indexing scheme based on
statistics of each band of their discrete wavelet transforms:
mean, variance, and zero crossing rates (of coefficients).
These values form a 3-dimensional feature vector for each sound (entire sound or each frame of sound?).
A weighted Euclidean distance is computed to quantify similarity.
The results are comparable with those of Wold et al. 1996.

S. Z. Li, “Content-based audio classification and retrieval using the nearest feature line method,” IEEE Trans. Speech Audio Process., vol. 8, pp. 619-625, Sep. 2000.

The authors present a method for audio content classification and retrieval
using a pattern classification method called “nearest feature line” (NFL)
(applied previously by the author to face recognition).
A feature line is a generalization of multiple feature vectors
in the same class to a line that attempts to encompass variability.
Classification is done by finding the feature line nearest to a point.
To describe an audio signal they find the mean and variance
of the perceptual features (e.g., brightness, pitch, bandwidth) and MFCCs
computed in a time-localized manner (32 ms hamming windows with 8 ms overlap).
They also take into account the number of silent frames.
In their experiments they find that combining cepstral and perceptual features
with NFL classification performs better than nearest neight (1 and 5) and nearest centroid.
Unlike the approach of Wold et al. 1996,
they show much fewer errors with sounds of the class “percussion”
on the same dataset.

J. Foote, “Arthur: Retrieving orchestral music by long-term structure,” in Proc. Int. Symp. Music Info. Retrieval, (Plymouth, MA), Oct. 2000.

The author proposes a fingerprint method specialized for classical music
that takes advantage of its large and long-term changes in dynamics.
Normalized energy profiles (1 Hz sampling rate)
of audio signals are compared using dynamic time warping (squared Euclidean distance),
and the cost of the path quantifies the similarity of two signals.
The precision and recall of orchestral music are very high,
but are low for piano concertos.
These results are improved when using subband energy profiles (e.g., downsampled spectrogram).


C. Papaodysseus, G. Roussopoulos, D. Fragoulis, A. Panagopoulos, and C. Alexiou, “A new approach to the automatic recognition of musical recordings,” J. Audio Eng. Soc., vol. 49, pp. 23-35, Feb. 2001.

Behind the JAES paywall.

D. Fragoulis, G. Rousopoulos, T. Panagopoulos, C. Alexiou, and C. Papaodysseus,
“On the automated recognition of seriously distorted musical recordings,” IEEE Trans. Signal Process., vol. 49, pp. 898-908, Apr. 2001.

The author address the problem of identifying music
distorted by, e.g., compression, speed-up, transmission by radio and TV stations,
and additive noise, by measuring and comparing
relative and perceptually-weighted spectral peak positions.
For segments of the unmodified signal,
they create a set of “band representative vectors” for comparison
to those of a signal from an unknown source.
(There are too many heuristics here!)

B. Logan and A. Salomon, “A content-based music similarity function,” Tech. Rep. CRL 2001/02, Cambridge Research Laboratory, June 2001.

(Also see: B. Logan and A. Salomon, “A music similarity function based on signal analysis,”
in Proc. IEEE Int. Conf. Multimedia Expo., Aug. 2001.)
As in Foote 1997,
the authors use histograms of MFCCs to compare the similarity of audio data.
Their approach is similar to Liu et al. 2000,
but does not use Gaussian mixture models.
They compute MFCCs of 25.6 ms frames (10 ms overlap) of the song,
then cluster (K-means, K=16) these together,
and then form a “signature” from the means, covariances, and weights (percentage of frames) of each cluster.
Two signatures are compared using the earth movers distance,
which is a calculation of the amount of work required to
change one signature into the other.
The distance between two sets of parameters
is defined by a symmetrized Kullback-Leibler divergence.
In their experiments, the authors compare the generated results
using human listening tests.
We also see that the music of Johann Sebastian Bach
is closer to Ravi Shankar and the Spice Girls than it is to Mozart!
On recalling the original song,
we see the system performs at extremely high recall.

M. Liu and C. Wan, “A study on content-based classification and retrieval of audio database,”
in Proc. IEEE Int. Database Eng. App. Symp., (Grenoble France), pp. 339-345, July 2001.

The authors test a probabilistic neural network (PNN) classifier
with Euclidean distance between feature vectors
to query by example a database of audio data.
They take a hierarchical approach like Zhang and Kuo 1998,
but use a sequential forward feature selection (greedy?) approach to select the best
set from 87 different features in time (RMS, zero crossings, etc.)
and frequency (centroid, bandwidth, pitch, etc.),
time-frequency (spectrogram), and coefficient domains (MFCC, LPC).
The PNN classifier implements a GMM.
In retrieval, the query sound is first classified by its features using the PNN,
and then the similarity of its (best selected) features to those of other sounds with the class
is measured by mean Euclidean distance — which is strange to me.
The precision and recall performance is better using the classification step first,
than not, but even then the scores are not very high.

C. Yang, “Macs: music audio characteristic sequence indexing for similarity retrieval,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, NY), pp. 1-4, Oct. 2001.

In previous work (C. Yang, “Music database retrieval based on spectral similarity,” Tech. Rep. Stanford University Database Group. No. 2001-14, 2001),
the author proposes a dynamic time warping approach like Foote 2000,
but matching only spectral peaks in the spectrogram.
In this work, the authors propose a more efficient and scalable approach
using locality sensitive hashing of the indexes.
The author first generates a characteristic sequence of each audio file
by finding peaks in the power profile (as in Foote 2000),
then in the time-neighborhood of each peak,
finding 180 significant frequency components (between 0.2 – 2 kHz),
sum up the energy contributions in a 24-band comb filter representing different pitch levels (I guess log freq. domain?),
zero mean, and normalize each 24-dim. vector,
and adaptively downsample the set of these vectors to reduce redundancy.
Finally, the distance between a pair of elements from two characteristic sequences
is defined by the Euclidean distance.
Dynamic time warping with these distances
is then used to quantify the similarity between two audio signals.

E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, T. Kastner, and M. Cremer,
“Content-based identification of audio material using MPEG-7 low level description,”
in Proc. Int. Symp. Music Info. Retrieval, (Bloomington, IN), Oct. 2001.

(Also see: J. Herre, E. Allamanche, and O. Hellmuth, “Robust matching of audio signals using spectral flatness features,”
in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127-130, 2001.)
The authors propose a robust (to common distortions) system for the efficient identification of an audio signal
from some of the descriptors specified by MPEG-7
(loudness and spectral flatness in four bands, and the spectral crest factor).
Features are compared by Euclidean distances to feature vectors, I think,
but somehow vector quantization is involved.
(The writing and organization confuse me.)

H. Neuschmied, H. Mayer, and E. Batlle, “Identification of audio titles on the internet,”
in Proc. Int. Conf. Web Delivering of Music, (Florence, Italy), Nov. 2001.

The authors create a system they call “AudioDNA”
in order to perform a monitoring of music content on the WWW.
AudioDNA is created from each window of data,
decorrelation of some timbre features,
computing some probabilities using HMMs,
and then a Viterbi algorithm is thrown in there.
(Significant details are missing!)
The identification problem reduces to
string matching of subsequences of AudioDNA.


S. Sukittanon and L. E. Atlas, “Modulation frequency features for audio fingerprinting,”
in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Orlando, FL), pp. 1773-1776, May 2002.

The authors propose using “joint acoustic modulation frequency” features
for robust fingerprinting, which appears to be like a transform of the Wigner-Ville distribution.
The experiments show fingerprints are robust to several distortions.

A. Rauber, E. Pampalk, and D. Merkl, “Using psycho-acoustic models and self-organizing maps
to create hierarchical structuring of music by sound similarity,”
in Proc. Int. Symp. Music Info. Retrieval, (Paris, France), Oct. 2002.

The authors attempt to measure the perceived degree of similarity between
pieces of music based on psycho-acoustic models (specific loudness sensation)
and rhythm patterns (modulations in 20 frequency bands) found for 6-second segments,
and then use this to cluster music together in a hierarchical self-organizing map.
The Euclidean distance between features measures similarity.
The issue of genre definition is avoided since music is clustered based on
how similar they sound.

J.-J. Aucouturier and F. Pachet, “Music similarity measures: What’s the use?,”
in Proc. Int. Symp. Music Info. Retrieval, (Paris, France), Oct. 2002.

In this paper, the authors propose using MFCCs (50 ms frames, first 8 coeffs.)
collectively modeled by GMMs, to measure timbral similarity between music.
Similarity between songs can then be gauged by sampling from each model
and computing the mean likelihoods,
or by finding the likelihood of the features of one song given each song model
(but what happens when matching partial songs?).
Such probabilistic models of songs represent an enormous savings in space.
The authors report that the system is able to
recognize (with varying success) cover songs, songs by the same artist,
and songs in the same genre — though timbre is not always the best feature to do so.
They even report surprising results that highlight the benefits of
music data mining in timbral spaces;
and attempt to formalize this concept in a measure of “interestingness.”
Thus, we see the “use”.

J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. Int. Symp. Music Info. Retrieval, (Paris, France), pp. 107-115, Oct. 2002.

(Also see: J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,”
J. New Music Research, vol. 32, pp. 211-221, June 2003;
and J. Haitsma, T. Kalker, and J. Oostveen, “Robust audio hashing for content identification,”
Proc. Content-Based Multimedia Indexing, Brescia, Italy, Sep. 2001.)
The authors describe a robust fingerprinting technique for identifying audio data.
Essentially frames of subfingerprints are generated for 3 seconds of audio
using time and frequency energy differences between 33 frequency bands
of a magnitude DFT of 370 ms windows shifted 11.6 ms
in the range 300 — 2000 Hz.
The fingerprint blocks are thus binary,
and the distance between blocks is computing by the Hamming distance.
They design an efficient search method (which I don’t understand),
and show that with amazing effectiveness they are able to identify
a source even with GSM coding, and RealAudio 20 kbps destruction.

P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” in Proc. Workshop Multimedia Signal Process., (St. Thomas, US Virgin Islands), pp. 169-173, Dec. 2002.

The authors present a thorough overview of current approaches to
audio fingerprinting.


G. Guo and S. Z. Li, “Content-based audio classification and retrieval by support vector machines,”
IEEE Trans. Neural Networks, vol. 14, pp. 209-215, Jan. 2003.

For audio retrieval the authors propose trained SVMs,
and an SVM-inspired metric called “distance-from-boundary” to find sounds similar those in a database.
As features they use combinations of perceptual and mel-cepstral coefficients
(means and variances of total spectrum power, subband powers, brightness, bandwidth, pitch frequency, MFCCs).
They use a binary tree structure for multiclass recognition.
They compare the performance of this approach
with kNN, and nearest centroid classifiers,
and to the approach by Wold et al. 1996.
Their experiments show that the SVM approach
has the highest accuracy.

C. J. C. Burges, J. C. Platt, and S. Jana, “Distortion discriminant analysis for audio fingerprinting,”
IEEE Trans. Speech Audio Process., vol. 11, pp. 165-174, May 2003.

(Also see: C. Burges, J. Platt, and S. Jana, “Extracting noise robust features from audio data,” in
Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 1021-1024, 2002.)
The authors propose learning good sets of features
from the audio data using linear discriminant analysis (LDA) (or oriented PCA) on
artificially distorted (non-linear as well) data to simulate the various problems
one can encounter when trying to identifying distorted audio data.
They apply the LDA in two layers, which they call “distortion discriminant analysis.”
The fingerprints are created in the following way (for 6-s segments
30 seconds from “the beginning of the clip”):
every 372 ms segment (50\% overlap) is transformed by a
modified complex lapped transform (MCLT) into 2048 points,
the log values of which are compressed by LDA into a 64-dim vector,
and 32 of these vectors is further reduced by another application of LDA
into a compressed vector of dimension 64.
Preprocessing of the MCLT log spectrum
removes equalization by lowpass filtering it;
and then applying a perceptual threshold.
Random alignment can also increase robustness.
(Confused on this, but details may be in ICASSP paper.)
The authors show good results for a variety of distortions,
and very large databases.

J. Gao, G. Tzanetakis, and P. Steenkiste, “Content-based retrieval of music in scalable peer-to-peer networks,”
in Proc. Int. Conf. Multimedia Expo., (Baltimore, Maryland, USA), pp. 309-312, July 2003.

The authors consider scaling similarity search to work over many computers
containing different audio files.
They propose to use a set of audio descriptors
(attribute value pairs given in Tzanetakis and Cook 2002),
and define distances between value pairs (vector) by \(\ell_1\) of the difference.

K. Kashino, T. Kurozumi, and H. Murase,
“A quick search method for audio and video signals based on histogram pruning,”
IEEE Trans. Multimedia, vol. 5, pp. 348-357, Sept. 2003.

This work is built upon in
A. Kimura, K. Kashino, T. Kurozumi, and H. Murase,
“A quick search method for audio signals based on piecewise linear representation of feature trajectories,”
IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 396-407, Feb. 2008.

A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Conf. Music Info. Retrieval, (Baltimore, Maryland, USA), pp. 1-4, Oct. 2003.

(See PoD here.)
With the ShazamWow! Fingerprint we can perform song search and information retrieval
in a scalable and robust manner by comparing hashes of short-time magnitude spectra
reduced to high-entropic fingerprints.


A. Berenzweig, B. Logan, D. P. W. Ellis, and B. Whitman,
“A large-scale evaluation of acoustic and subjective music-similarity measures,”
Computer Music J., vol. 28, pp. 63-76, Mar. 2004.

H. Özer, B. Sankur, and N. Memon, “Robust audio hashing for content identification,”
in Proc. European Signal Process. Conf., (Vienna, Austria), Sep. 2004.

M. Clausen and F. Kurth, “A unified approach to content-based and fault-tolerant music recognition,”
IEEE Trans. Multimedia, vol. 6, pp. 717-731, Oct. 2004.


C. J. C. Burges, D. Plastina, J. C. Platt, E. Renshaw, and H. S. Malvar, “Using audio fingerprinting for duplicate detection and thumbnail generation,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Philadelphia, PA), pp. 9-12, Mar. 2005.

Y. Ke, D. Hoiem, and R. Sukthankar, “Computer vision for music identification,”
in Proc. IEEE Conf. Computer Vision Pattern Recog., pp. 597-604, June 2005.

M. Müller, F. Kurth, and M. Clausen, “Audio matching via chroma-based statistical features,”
in Proc. Int. Conf. Music Info. Retrieval, (London, U.K.), pp. 288-295, Sep. 2005.

N. Bertin and A. de Cheveigné, “Scalable metadata and quick retrieval of audio signals,” in Proc. Int. Symp. Music Info. Retrieval, (London, UK), pp. 238-244, Sep. 2005.

J. A. Arias, J. Pinquier, and R. André-Obrecht, “Evaluation of classification techniques for audio indexing,” in Proc. European Signal Process. Conf., (Istanbul, Turkey), Sep. 2005.

R. Typke, F. Wiering, and R. C. Veltkamp, “A survey of music information retrieval systems,” in Proc. Int. Symp. Music Info. Retrieval, (London, U.K.), Sep. 2005.

The authors provide a good overview of state-of-the-art MIR, commercialized examples, and future challenges and directions.

M. Müller, F. Kurth, and M. Clausen, “Chroma-based statistical audio features for audio matching,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), pp. 275-278, Oct. 2005.

J.-J. Aucouturier, F. Pachet, and M. Sandler, “‘The way it sounds’: Timbre models for analysis and retrieval of music signals,” IEEE Trans. Multimedia, vol. 7, pp. 1-8, Dec. 2005.

P. Cano, E. Batlle, E. Gómez, L. Gomes, and M. Bonnet, Audio Fingerprinting Concepts and Applications, ch. 17, pp. 233-245. Studies in Computational Intelligence, Berlin, Germany: Springer-Verlag, 2005.


D. Schwarz, “Concatenative sound synthesis: The early years,” J. New Music Research, vol. 35, pp. 3-22, Mar. 2006.

A review of a particular creative application of similarity measures to audio synthesis.

J. Pinquier and R. André-Obrecht,
“Audio indexing: primary components retrieval: Robust classification in audio documents,”
Multimedia Tools and Applications, vol. 30, no. 3, pp. 313-330, July 2006.

T. Pohle, P. Knees, M. Schedl, and G. Widmer, “Independent component analysis for music similarity computation,” in Proc. Int. Symp. Music Info. Retrieval, (Victoria, BC), pp. 228-233, Oct. 2006.

N. Orio, “Music retrieval: A tutorial and review,” Foundations and Trends in Information Retrieval, vol. 1, pp. 1-90, Nov. 2006.

E. Pampalk, Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Technology, Vienna, Austria, March 2006.

Chapter 2 presents a summary of many audio-based similarity measures,
and their performance in various problems, including comparisons between music genres.
Chapter 3 presents three specific applications of music similarity:
visualization of collections, artist-level organization, and playlist generation.


J. Merrell, D. Ventura, and B. Morse, “Clustering streaming music via the temporal similarity of timbre,” in Proc. IJCAI Workshop Artificial Intell. and Music, (Hyderabad, India), pp. 153-164, Jan. 2007.

J. Ogle and D. P. W. Ellis, “Fingerprinting to identify repeated sound events in long-duration personal audio recordings,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., vol. 1, (Honolulu, Hawaii), pp. 233-236, Apr. 2007.

(See Po’D here.) The ShazamWow! Fingerprint works far better in identifying longer sounds with a spectral structure rich in tonal components than for sounds that are short and/or have a simple spectral structure. (This work is extended in: C. Cotton and D. P. W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. IEEE Workshop App. Signal Process. Audio and Acoustics, (Mohonk, NY), pp. 125-128, Oct. 2009, Po’D here.)

D. P. W. Ellis and G. E. Poliner, “Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Honolulu, Hawaii), Apr. 2007.

(See also: D. P. W. Ellis, “Identifying ‘cover songs’ with beat- synchronous chroma features,” Music Information Retrieval Evaluation eXchange (MIREX), 2006; D. P. W. Ellis and C. V. Cotton, “The 2007 Labrosa Cover Song Detection System,” Music Information Retrieval Evaluation eXchange (MIREX) 2007.)
The authors propose building and comparing chroma features in a tempo invariant way for the automatic identification of different versions (covers) of the same song. First, they segment the musical audio into beats (see: D. W. P. Ellis, “Beat Tracking by Dynamic Programming,” J. New Music Research, vol. 36 no. 1, pp. 51-60, March 2007). Then, for each of these segments they construct a 12-element chroma feature from the energy measured at the instantaneous frequency calculated in each DFT bin up to 1 kHz, with a margin added of +- 1/2 semitones to take care of people playing out of tune. Then, they treat two sets of features as if they are images, and perform a 2D correlation between them with repetition at the borders. They then take the row with the largest magnitude correlation, high pass filter it, and compute the “distance” as the reciprocal of the output maximum value. Their system performs well for a variety of test sets, recalling 761 cover songs from 3,300, and with a mean reciprocal rank of 0.49 for the MIREX 2006 set. Their MATLAB software is available here. Of course, the effectiveness of this approach strongly relies on music having well-defined beats.

R. Virtanen and M. Helén, “Probabilistic model based similarity measures for audio query-by-example,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, NY), Oct. 2007.


F. Kurth and M. Müller, “Efficient index-based audio matching,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 382-395, Feb. 2008.

A. Kimura, K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio signals based on piecewise linear representation of feature trajectories,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 396-407, Feb. 2008.

The authors build upon previous work (e.g., K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio and video signals based on histogram pruning,” Multimedia, IEEE Transactions on, vol. 5, pp. 348-357, Sept. 2003.), to devise an efficient audio search method for near-exact matches to a query. This involves vector quantizing (128 codewords) the magnitude spectrum of each frame in a learned codebook (7-channel band pass filterbank, equally spaced in log frequency, 10 ms hops of 60 ms windows), creating histograms of these codes over larger frames, then dynamically segmenting the histograms in time based on piecewise linear constraints, then compressing the histograms in each segment (KL transform, lossy), and then decimating the compressed features in time (selecting one feature from N features as representative). The same procedure (without the feature decimation) is done to a query signal. Even with all this, there exist assurances that a query will not return false negatives (as long as certain conditions are met). (Some of the experimental results are strange, e.g., why are there ~70,000 matches to a query in 200 hours of audio?)

M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, “Content-based music information retrieval: Current directions and future challenges,” Proc. IEEE, vol. 96, pp. 668-696, Apr. 2008.

V. A. Larsen, “Combining audio fingerprints,” Master’s thesis, Norwegian University of Science and Technology, June 2008.

Y. Yu, K. Joe, and J. S. Downie, “Efficient query-by-content audio retrieval by locality sensitive hashing and partial sequence comparison,” IEICE Trans. Information and Systems, vol. E91-D, pp. 1730-1739, June 2008.

J. A. Arias, R. André-Obrecht, and J. Farinas, “Automatic low-dimensional analysis of audio databases,” in Proc. Content-based Multimedia Indexing, (London, U.K.), pp. 556-559, June 2008.

(Also see: J. Pinquier and R. André-Obrecht, “Audio indexing: primary components retrieval: Robust classification in audio documents,” Multimedia Tools and Applications, vol. 30, no. 3, pp. 313-330, 2006.)
The authors propose a method for reducing each element of a heterogeneous collection of audio recordings to a point in 3D where clustering can discriminate between the signal content, e.g., spoken speech, singing voice, music, speaker id, etc. They apply this method to three tasks: speech/music discrimination, speaker identification, and 3-class language detection. In their method, the authors first extract for each segment 15 MFCCs (20 ms windows, hops of 10 ms, which coefficients I don’t know, but I assume the first 15), and then bag these features together by training a GMM. Then, for each pair of GMMs, they estimate the Kullback-Leiber symmetric divergence. Then, from these values they generate 3D vectors for each signal using multidimensional scaling (which I think is a projection that maintains distance relationships in a high-dimensional space). Finally, kernal PCA is applied to this set of points (if at all) to reduce in-cluster distances, and increase the separation between-clusters. The results look very convincing.

M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1015-1028, July 2008.

(See Po’D here. Also see: M. Casey and M. Slaney, “Fast recognition of remixed music audio,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Honolulu, Hawaii), pp. 1425-1428, Apr. 2007.)
With a proper analysis of the distribution of distances in groups of features that are known to be unrelated, and the application of locality sensitive hashing, a range of problems in audio similarity search is essentially nailed in a highly efficient manner (Kudos!)

Q. Li, L. Wu, and X. He, “Content-based audio retrieval using perceptual hash,” in Proc. Int. Conf. Intelligent Info. Hiding and Multimedia Signal Process., (Harbin, China), pp. 791-794, Aug. 2008.


R. Bardeli, “Similarity search in animal sound databases,” IEEE Trans. Multimedia, vol. 11, pp. 68-76, Jan. 2009.

In this work the author explores the application of structure tensors (a matrix describing the dominant orientations in local regions) to describe short-time Fourier magnitude spectra for use in similarity search (query-based retrieval) within a database of animal sounds. The structure tensors are used to find time-frequency regions with high activity, and then these regions are essentially vector quantized. The author adopts a fast indexing and retrieval scheme based on “inverted lists” (he points to the paper: M. Clausen F. Kurth, “Content-based information retrieval by group theoretical methods,” Proc. Computational Noncommutative Algebra and Applications, (Tuscany, Italy), July 2003.) The author conducted an experiment in retrieving and ranking the similarities of bird songs to several queries, using both his approach and one with HMMs trained on MPEG-7 “spectrum projection features” of each species. For most birdsong his approach fared better, and specifically for non-noisy songs with pitch curves (something structure tensors are sensitive to).

Y. Yu, K. J. V. Oria, F. Mörchen, J. S. Downie, and L. Chen, “Multi-version music search using acoustic feature union and exact soft mapping,” Int. J. Semantic Computing, vol. 3, pp. 209-234, May 2009.

J. H. Jensen, M. G. Christensen, D. P. W. Ellis, and S. H. Jensen, “Quantitative analysis of a common audio similarity measure,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 693-703, May 2009.

(See Po’D here.) This paper provides a significant amount of insight into where and why MFCCs work and don’t work in tasks involving musical signals.

B. L. Sturm and L. Daudet, “On similarity search in audio signals using adaptive sparse approximations,” in Proc. Int. Workshop Adaptive Multimedia Retrieval, (Madrid, Spain), Sep. 2009.

The authors investigate extending an image similarity search approach using sparse representations (P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008) to use in gauging similarity between audio signals.

C. Cotton and D. P. W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. IEEE Workshop App. Signal Process. Audio and Acoustics, (Mohonk, NY), pp. 125-128, Oct. 2009.

(See Po’D here.) The authors of this paper propose a method for finding occurrences of repeated and perceptually similar sound events, such as clopping horse hooves, within a recording — which grows out of their work in attempting to link together events in audio and visual data. The authors approach this problem by building features from low-order atomic models found by matching pursuit (MP) with an overcomplete time-frequency (Gabor) dictionary.

D. Rafailidis, A. Nanopoulos, and Y. Manolopoulos, “Nonlinear dimensionality reduction for efficient and effective audio similarity searching,” Multimedia Tools and Applications, Nov. 2009.


M. Müller and S. Ewert, “Towards timbre-invariant audio features for harmony-based music,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 649-662, Mar. 2010.

G. Wichern, J. Xue, H. Thornburg, B. Mechtley, and A. Spanias, “Segmentation, indexing, and retrieval for environmental and natural sounds,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 688-707, Mar. 2010.

(See Po’D here.) Here we find an all-in-one system for automatically segmenting and efficiently searching environmental audio recordings, based in no minor way on probabilistic models of six simple short- and long-term features… very impressive!

S. Ravuri and D. P. W. Ellis, “Cover song detection: from high scores to general classification,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Dallas, TX), Mar. 2010.

The authors address two problems with current approaches to cover song identification: 2) the assumption that one feature provides enough discriminability between all possible transformations of a song expected in a cover (e.g, tempo, rhythm, instrumentation, key, etc.); and 2) the lack of a meaningful scale and threshold for the general problem of cover song detection. As in the Ellis and Poliner system (2007), the authors construct chromagrams of beat segmented audio signals, but with “preference windows” set at three different tempo means (part of the beat detection process). From the cross-correlated chromagrams of two songs (test and reference), they compute three features: the value of the maximum peak from the highpassed cross-correlation (Ellis’ 2006 submission to MIREX); the values of the maximum peaks in each chroma lag (using Ellis’ 2007 submission to MIREX); and a feature adapted from J. Serrà and E. Gómez, “A cover song identification system based on sequences of tonal descriptors,” MIREX 2007 (Also described in: J. Serrà, E. Gómez, P. Herrera, and X. Serra, “Chroma binary similarity and local alignment applied to cover song identification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1138-1151, Aug. 2008). The system then normalizes these nine features by calculating the features of the test song by using features of an unrelated reference song, and finds their mean and variances to compute standard scores (or z-score) (which reminds me of the approach in M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1015-1028, July 2008). These values are used to normalize the features computed between each test and reference pair. Finally, the authors train either a SVM (linear kernel) or multi-layer perceptron (9 input, 75 hidden, and 2 output) to classify pairs as being covers or not. In the simulations (using the dataset of Ellis’ 2006 and 2007), their system performs very well at discriminating between the two different pairs, with an equal error rate well-below other approaches.

R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, pp. 2390-2416, Sep. 2010.

(See Po’D here.)
We can learn links between high-level textual keywords and low-level auditory features using principles of sparse coding, and then use this to retrieve and rank audio data using textual and semantic queries.

S. Scholler and H. Purwins, “Sparse coding for drum sound classification and its use as a similarity measure,” in Proc. Int. Workshop Machine Learning Music ACM Multimedia, (Firenze, Italy), Oct. 2010.

(See Po’D here.) The authors use histograms of atomic parametric decompositions of signals to discriminate between percussive sounds.

Thank you

Helpful contributions made by: Doug Keislar.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s