Papers of the Day (Po’D): Music Cover Song Identification Edition, pt. 1

Hello, and welcome to Papers of the Day (Po’D): Music Cover Song Identification Edition, pt. 1. I am currently looking at methods of measuring music similarity for cover song identification. Previously at CRISSP, I have discussed music fingerprinting, e.g., the Shazam-Wow! approach in Paper of the Day (Po’D): An Industrial Strength Audio Search Algorithm Edition. That particular approach uses features that are much too specific to be insensitive to the wealth of possible transformations a song undergoes when it “goes under cover.” For instance, see my examples here: Un Homme et Une Femme: Audio Fingerprinting and Matching in MATLAB. I have also discussed other approaches that attempt to generate features of varied specificity for similarity calculation, as in Paper of the Day (Po’D): Minimum Distances in High-Dimensional Musical Feature Spaces Edition. I note many other approaches in my big A Briefly Annotated Timeline of Similarity in Audio Signals. With today’s papers, we add a few more to the stack:

  1. D. P. W. Ellis and G. E. Poliner, “Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Honolulu, Hawaii), Apr. 2007.
  2. S. Ravuri and D. P. W. Ellis, “Cover song detection: from high scores to general classification,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., (Dallas, TX), Mar. 2010.


In Ellis and Poliner (see also: D. P. W. Ellis, “Identifying ‘cover songs’ with beat-synchronous chroma features,” Music Information Retrieval Evaluation eXchange (MIREX), 2006; D. P. W. Ellis and C. V. Cotton, “The 2007 LabROSA Cover Song Detection System,” Music Information Retrieval Evaluation eXchange (MIREX), 2007), the authors propose building and comparing chroma features in a tempo-invariant way for the automatic identification of different versions (covers) of the same song. First, they segment the musical audio into beats (see: D. P. W. Ellis, “Beat Tracking by Dynamic Programming,” J. New Music Research, vol. 36, no. 1, pp. 51-60, Mar. 2007). Then, for each of these segments they construct a 12-element chroma feature from the energy measured at the instantaneous frequency calculated in each DFT bin up to 1 kHz, with a margin of ±1/2 semitone to accommodate out-of-tune playing. Next, they treat two sets of features as images and perform a 2D cross-correlation between them, with repetition at the borders. They then take the row with the largest-magnitude correlation, high-pass filter it, and compute the “distance” as the reciprocal of the maximum value of the output. Their system performs well on a variety of test sets, recalling 761 cover songs from 3,300, and achieving a mean reciprocal rank of 0.49 on the MIREX 2006 set. Their MATLAB software is available here. Of course, the effectiveness of this approach relies strongly on the music having well-defined beats. But who is going to do a cover of an ambient track?
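To make the comparison step concrete, here is a minimal Python sketch of a cross-correlation distance along these lines. It assumes the beat-synchronous chromagrams have already been computed elsewhere; the normalization, the filter coefficients, and the function name are my own illustrative choices, not the authors’ MATLAB implementation.

```python
import numpy as np
from scipy.signal import lfilter

def cover_distance(chroma_a, chroma_b):
    """Beat-synchronous chroma cross-correlation "distance" (sketch).

    chroma_a, chroma_b: arrays of shape (12, n_beats), assumed to be
    beat-synchronous chromagrams computed elsewhere.
    """
    # Normalize overall energy so correlation magnitudes are comparable.
    a = chroma_a / (np.linalg.norm(chroma_a) + 1e-12)
    b = chroma_b / (np.linalg.norm(chroma_b) + 1e-12)

    n_lags = a.shape[1] + b.shape[1] - 1
    xcorr = np.zeros((12, n_lags))
    # Cross-correlate over all beat lags and all 12 circular chroma
    # rotations (to allow for transposition between the two versions).
    for rot in range(12):
        b_rot = np.roll(b, rot, axis=0)
        xcorr[rot] = sum(np.correlate(a[p], b_rot[p], mode="full")
                         for p in range(12))

    # Keep the transposition whose correlation has the largest peak.
    best = xcorr[np.argmax(xcorr.max(axis=1))]

    # High-pass filter along lag so that only sharp, well-localized peaks
    # survive; the pole at 0.99 is an illustrative choice, not the
    # authors' coefficient.
    filtered = lfilter([1.0, -1.0], [1.0, -0.99], best)

    # The "distance" is the reciprocal of the peak filtered correlation.
    return 1.0 / (np.max(np.abs(filtered)) + 1e-12)
```

The high-pass filtering along lag is what favors sharp, sustained beat-to-beat alignment between the two chromagrams over broad harmonic similarity.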

In Ravuri and Ellis, the authors address two problems with current approaches to cover song identification: 1) the assumption that one feature provides enough discriminability across all possible transformations of a song expected in a cover (e.g., tempo, rhythm, instrumentation, key, etc.); and 2) the lack of a meaningful scale and threshold for the general problem of cover song detection. As in the Ellis and Poliner system (2007), the authors construct chromagrams of beat-segmented audio signals, but with “preference windows” set at three different mean tempos (as part of the beat detection process). From the cross-correlated chromagrams of two songs (test and reference), they compute three features: the value of the maximum peak of the high-passed cross-correlation (Ellis’ 2006 submission to MIREX); the values of the maximum peaks in each chroma lag (using Ellis’ 2007 submission to MIREX); and a feature adapted from J. Serrà and E. Gómez, “A cover song identification system based on sequences of tonal descriptors,” MIREX 2007 (also described in: J. Serrà, E. Gómez, P. Herrera, and X. Serra, “Chroma binary similarity and local alignment applied to cover song identification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1138-1151, Aug. 2008). The system then normalizes these nine features (three features at each of the three tempos) per test song: it computes the same features between the test song and unrelated reference songs, and uses their means and variances to form standard scores (z-scores), which reminds me of the approach in M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1015-1028, July 2008. These statistics are then used to normalize the features computed between each test and reference pair. Finally, the authors train either an SVM (linear kernel) or a multi-layer perceptron (9 inputs, 75 hidden units, and 2 outputs) to classify pairs as covers or not. In their simulations (using the datasets of Ellis’ 2006 and 2007 submissions), the system performs very well at discriminating between cover and non-cover pairs, with an equal error rate well below those of other approaches.
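Here is a rough Python sketch of the per-song z-score normalization and classification step as I understand it; the array shapes, the background pool, and the scikit-learn classifier are assumptions for illustration, not the authors’ code.

```python
import numpy as np
from sklearn.svm import SVC

def zscore_features(pair_features, background_features):
    """Per-song normalization of pairwise cross-correlation features.

    pair_features:       shape (9,), features for one (test, reference) pair
    background_features: shape (n_background, 9), the same features computed
                         between the test song and unrelated reference songs
    """
    mu = background_features.mean(axis=0)
    sigma = background_features.std(axis=0) + 1e-12
    return (pair_features - mu) / sigma

# Hypothetical usage: X stacks z-scored feature vectors for labeled pairs
# (1 = cover pair, 0 = non-cover pair); a linear-kernel SVM then yields
# scores on a common scale, so one threshold works across all songs.
# X, y = np.vstack(all_pair_features), np.array(all_pair_labels)
# clf = SVC(kernel="linear").fit(X, y)
# is_cover = clf.predict(zscore_features(f_pair, f_bg).reshape(1, -1))
```

Putting the pairwise scores on a common per-song scale is what makes a single decision threshold meaningful across the whole collection, which is exactly the second problem the authors set out to address.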

This reduces my stack by a few. Now I have the following to review in the coming days:

  1. M. Müller and S. Ewert, “Towards timbre-invariant audio features for harmony-based music,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 649-662, Mar. 2010.
  2. M. Müller, F. Kurth, and M. Clausen, “Audio matching via chroma-based statistical features,” in Proc. Int. Conf. Music Info. Retrieval, (London, U.K.), pp. 288-295, Sep. 2005.
  3. M. Müller, F. Kurth, and M. Clausen, “Chroma-based statistical audio features for audio matching,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), pp. 275-278, Oct. 2005.
  4. J. Serrà, E. Gómez, P. Herrera, and X. Serra, “Chroma binary similarity and local alignment applied to cover song identification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1138-1151, Aug. 2008.
  5. F. Kurth and M. Müller, “Efficient index-based audio matching,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 382-395, Feb. 2008.