Hello, and welcome to Paper of the Day (Po’D): Music Cover Song Identification Edition, pt. 3. We continue today with a paper mentioned in the post from a few days ago: J. Serrà, E. Gómez, and P. Herrera, “Audio cover song identification and similarity: Background, approaches, evaluation, and beyond,” in Advances in Music Information Retrieval (Z. Ras and A. Wieczorkowska, eds.), vol. 274 of Studies in Computational Intelligence, pp. 307-332, Springer, 2010.
The authors provide a thorough description of the problem of cover song identification, and of approaches to evaluating algorithm performance. They identify the different types of covers: remaster (adjusting levels), instrumental (a song once sung, now just played), live performance (also jamming?), acoustic (like MTV Unplugged), demo (sketchy and unrefined), duet (larger ensemble?), medley (combination of themes), remix (minor or substantial changes to the structural aspects), and quotation (good composers borrow, great composers steal). They also identify the various characteristics of music that can be changed in a cover: timbre, tempo, timing/rhythm, structure, key, harmonization, lyrics/language, and noise (from audience noise to recording equipment, compression artifacts, and so on). A robust cover song identification algorithm should be able to accommodate all types of covers, and should be invariant to the variety of changes in the characteristics of a piece of music. (I also think there should be some symmetry, such that the original tune is retrieved using a cover, and vice versa.)
The authors discuss a variety of methods that use features in ways that are more or less invariant to key, tempo, and/or structure. For key invariance, the best approach appears to be using pitch class profiles (computed over frames of about 100 ms) with similarity maximization over circular transpositions. For tempo invariance, the best approach (though computationally expensive) appears to be dynamic time warping of descriptors. For structural invariance, the best approach is not as clear, although dynamic programming and windowing across descriptors perform very well in the face of structural changes.
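The transposition-maximization idea for key invariance can be sketched in a few lines. This is only an illustrative toy, assuming each song is summarized by a single averaged 12-bin pitch class profile and using cosine similarity; the function name is my own, and the methods surveyed in the paper operate on full descriptor sequences rather than one averaged vector.

```python
import numpy as np

def key_invariant_similarity(chroma_a, chroma_b):
    """Compare two 12-bin pitch class profiles under all 12 circular
    transpositions, returning the best cosine similarity and the
    semitone shift that achieves it."""
    a = np.asarray(chroma_a, dtype=float)
    b = np.asarray(chroma_b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Try every circular shift of b (one per semitone transposition)
    # and keep the best-matching one.
    scores = [float(a @ np.roll(b, k)) for k in range(12)]
    best_shift = int(np.argmax(scores))
    return scores[best_shift], best_shift
```

For example, a C major triad profile compared against its copy transposed up three semitones recovers a perfect match at the compensating shift, which is the invariance one wants when a cover is performed in a different key.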
The authors also discuss the difficulties in evaluating algorithms using various statistics, and with limited datasets that consist of only a few types of covers. For instance, versions of classical music will have limited changes in key, timbre, and structure, and thus will be easier to identify than popular or folk music. Of course, with testing on collections of more songs with more covers, and the addition of songs that do not have covers, the performance results will better reflect those of systems working in realistic scenarios.
In their “beyond” portion — in which I am most interested — the authors discuss the following:
- does naivete of timbre, vocals, and melody play any role in low performance?
- should we also be looking at differences, instead of just similarities?
- how can we incorporate meaningful degrees of similarity into the retrieval results?
- how can we identify quotations of short duration?
- how can we improve the efficiency and performance of these systems?
- how can we create a ground truth for a collection of covers in the sense that we categorize the entries based on their musical characteristics?
- is there a trade-off between system accuracy and efficiency?
While not quite detecting renditions of holiday tunes by dogs, these questions are indeed important and relevant to work in the coming years. Now, how can I fit sparse approximation in there?