# Un Homme et Une Femme: Audio Fingerprinting and Matching in MATLAB

On my software page I have made available a nice demonstration of audio fingerprinting using a Shazam-like approach.
I have discussed this approach in Paper of the Day (Po’D): An Industrial Strength Audio Search Algorithm Edition, which refers to the following paper: A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Conf. Music Info. Retrieval, (Baltimore, Maryland, USA), pp. 1-4, Oct. 2003. In my code example I use 56 different versions of the music theme from the 1966 French film “Un Homme et Une Femme”, composed by Francis Lai, one of the great French film composers of the last century. I spent a fun Sunday afternoon collecting all of these versions from YouTube. And now, two days later, my wife and I are suffering from an insufferable earworm.

Essentially, the fingerprint is made of a collection of duples of the following form:
$$ \{ (\vh_i = [ f_i, f_t, t_t - t_i ]^T,\, t_i) : t_t > t_i \} $$
where \((t_i, f_i)\) is the time-frequency location of the \(i\)th anchor point,
\((t_t, f_t)\) is the time-frequency location of some neighboring target point at a later time.
For an entire song we have a collection of these duples, which together form the fingerprint of the song. Each duple alone is only a pair of frequencies and their time difference; but together, they form a large collection of points in three-dimensional frequency-frequency-time-difference space. (I find my anchor and target points differently from Wang.)
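My code is MATLAB, but the duple construction can be sketched in Python/NumPy as follows. This is a hypothetical illustration of the idea only: the function name `make_duples` and the parameters `fanout` and `max_dt` are my own here, not taken from the post's code, and the peak-picking step (finding the time-frequency anchor points in the first place) is assumed to have been done already.

```python
import numpy as np

def make_duples(peaks, fanout=3, max_dt=50):
    """Build fingerprint duples from time-frequency peaks.

    peaks: list of (t, f) peak locations, sorted by time.
    For each anchor peak, pair it with up to `fanout` later target peaks
    no more than `max_dt` frames away, recording the vector
    [f_anchor, f_target, t_target - t_anchor] together with the anchor time.
    """
    duples = []
    for i, (t_i, f_i) in enumerate(peaks):
        n_paired = 0
        for t_t, f_t in peaks[i + 1:]:
            if t_t <= t_i:
                continue               # targets must be strictly later in time
            if t_t - t_i > max_dt:
                break                  # peaks are time-sorted, so we can stop
            duples.append((np.array([f_i, f_t, t_t - t_i]), t_i))
            n_paired += 1
            if n_paired == fanout:
                break
    return duples
```

With a handful of peaks, each anchor yields at most `fanout` duples, all pointing forward in time, which is what makes the later diagonal-line test possible.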

Now, when we have some signal (an excerpt for instance) that we wish to match to the fingerprints we have, we compute its fingerprint in the same way.
Then, for each frequency-time difference point \(\vu_j\) in the fingerprint of the unknown, we find the time of the frequency-time difference point in the fingerprint of the known that is closest in a Euclidean sense:
$$ i_{\min} = \arg \min_i \| \vh_i - \vu_j \|_2. $$
With this, we form a matrix of matched times,
and look for long diagonal lines as in one of the figures below.
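The matching step above can be sketched in Python/NumPy (again, an illustrative translation of the idea; my actual code is MATLAB and the names here are my own). For each unknown duple we find its nearest known duple in the three-dimensional space and pair the two anchor times; a true match then shows up as a unit-slope line in the resulting scatter of time pairs.

```python
import numpy as np

def match_times(H, tH, U, tU):
    """Pair anchor times of nearest-neighbour duples.

    H: (n, 3) array of known duple vectors [f_anchor, f_target, dt]
    tH: length-n array of the corresponding known anchor times
    U, tU: the same for the unknown excerpt
    Returns an (m, 2) array of (known time, unknown time) pairs.
    """
    pairs = []
    for u, t_j in zip(U, tU):
        d = np.linalg.norm(H - u, axis=1)   # Euclidean distance to every known duple
        i_min = int(np.argmin(d))           # the arg-min from the equation above
        pairs.append((tH[i_min], t_j))
    return np.array(pairs)
```

Plotting the first column against the second gives the matrix-of-matched-times picture: random scatter for a non-match, a long straight diagonal for a match.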

We take an excerpt from Johnny Mathis singing the English version of “Un Homme et Une Femme”, which roughly translates to “A Man and A Woman.” When we compare the elements of the Mathis fingerprints to those from an extremely flat (wow, it hurts so much) version sung by Engelbert Humperdinck, we find the following pattern:

Then, comparing Mathis to Mathis, we find the following pattern:

Clearly, we see a very straight line (with unit slope) in the latter, which indicates we have found a match.
Below we compute, for each song, the largest percentage of the pairs that lie along a diagonal line of unit slope (with some buffer). Johnny Mathis is item number 2. All the other songs are way below this value.
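One way to compute that score, sketched in Python/NumPy (a hypothetical reconstruction, not my MATLAB code; `buffer` is an illustrative tolerance parameter): pairs lying on a unit-slope diagonal share a common offset between known and unknown times, so we take the fraction of pairs whose offset falls within the buffer of the single best offset.

```python
import numpy as np

def diagonal_score(pairs, buffer=1):
    """Fraction of matched time pairs on the best unit-slope diagonal.

    pairs: (m, 2) array of (known time, unknown time) pairs.
    Points on a unit-slope line satisfy t_known - t_unknown = const,
    so we count, over candidate constants, the largest number of pairs
    whose offset lies within +/- buffer of that constant.
    """
    offsets = pairs[:, 0] - pairs[:, 1]
    best = 0
    for c in np.unique(offsets):
        best = max(best, int(np.sum(np.abs(offsets - c) <= buffer)))
    return best / len(pairs)
```

A matching song concentrates nearly all of its offsets at one value, giving a score near 1; unrelated songs scatter their offsets, giving scores near 0.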

With this collection of covers of “Un Homme et Une Femme” we see the extremely high specificity of this fingerprinting approach. No other song comes close to having such a marked diagonal line as the song by Johnny Mathis. And that goes for the rest of the excerpts I have tested. (This will change with shorter excerpts, which is no surprise.) Now, I wonder how these conditions can be relaxed so that I can obtain a measure of how close one cover is to the original (or another cover) based on comparing the fingerprints? Johnny Mathis is no Engelbert Humperdinck, and Engelbert Humperdinck is no Johnny Mathis; but how close to Johnny Mathis is Engelbert Humperdinck? Though they are both dreamy, Engelbert Humperdinck is so much more fun to say. (Much more fun than his birth name: Arnold George Dorsey. Which makes me wonder: who would take Engelbert over Arnold? And Humperdinck over Dorsey???)

Other MATLAB code demonstrating the same fingerprinting technique is that by Dan Ellis at the Laboratory for the Recognition and Organization of Speech and Audio, at Columbia University.