Hello, and welcome to Paper of the Day (Po’D): Three current issues in music autotagging edition. Today’s paper is a provocative one: G. Marques, M. Domingues, T. Langlois, and F. Gouyon, “Three current issues in music autotagging,” in Proc. ISMIR, 2011.
What is “music autotagging”? Let’s not get ahead of ourselves. What is a “tag”? In general, it is a term applied by someone to music to make it more useful to them (thanks to Mark Levy for that great definition). So, on last.fm, if we view Roger Whittaker’s “The Last Farewell”, we see people have applied several tags to the song: “easy listening”, “roger whittaker”, “schlager”, and “oldies”.
Some of those tags are quite useless and/or meaningless to me. One of these tags notes a use of the music (easy listening, maybe), one redundantly names the artist/performer/singer in the music, and two are completely unrelated to the musical content (schlager, which is “hit” in German apparently). (Here are many other tags people have applied to this song.)
In T. Bertin-Mahieux, D. Eck, and M. Mandel, Machine Audition: Principles, Algorithms and Systems, ch. Automatic tagging of audio: The state-of-the-art. IGI Publishing, 2010, we learn that in 2007, 68% of the tags on last.fm describe genre (“rock”); 12% describe location (“Brooklyn” for Beastie Boys); 5% describe mood (“chill”); 4% describe opinion (“favorite”); 4% describe instrumentation (“contrabassoon”); 3% describe ‘style’ (“political”); and the rest is a mixed bag.
So, what is “music autotagging” (MA)? That is just the act of a machine programmed to apply tags to a piece of music like we see above. And the Po’D provides illumination on three concerning issues in this line of research:
- Current approaches to MA evaluation are too sensitive with respect to imbalances in the data, and end up painting too rosy of a picture of performance and progress.
- Current approaches to MA do not generalize across datasets, and measuring performance in one dataset ends up painting too rosy of a picture of performance and progress.
- Tag post-processing using tag co-occurrences does not work well because, you know, it amounts to pulling oneself up out of the mire by one’s own bootstraps, i.e., what is necessary is achieving good tags to begin with.
With regard to the first concern, Marques et al. discuss how a high mean F-score (over all tags) can be achieved easily when a large portion of the data is tagged with “foo”, and a system simply applies “foo” to all data. Hence, they recommend also reporting mean F-scores per tag. Furthermore, one must consider that some of the data has many tags, while a lot of the data has few tags.
There is also the problem of data integrity: what to do with misspellings? duplicated excerpts? excerpts without tags? mutually exclusive tags? and so on.
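The imbalance effect behind the first concern can be illustrated with a toy example. This is a minimal sketch, not the evaluation code of Marques et al.: the tags, the ten made-up excerpts, and the function names are all assumptions, and "global" here is taken to mean pooling the counts over all tags before computing one F-score, which is only one reading of "mean F-score over all tags".

```python
def f_score(tp, fp, fn):
    """F1 from raw counts; taken to be 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def counts(truth, pred, tag):
    """True positives, false positives, false negatives for one tag."""
    tp = sum(1 for t, p in zip(truth, pred) if tag in t and tag in p)
    fp = sum(1 for t, p in zip(truth, pred) if tag not in t and tag in p)
    fn = sum(1 for t, p in zip(truth, pred) if tag in t and tag not in p)
    return tp, fp, fn


def global_f(truth, pred, tags):
    """Pool the counts over all tags, then compute a single F-score."""
    totals = [sum(c) for c in zip(*(counts(truth, pred, tag) for tag in tags))]
    return f_score(*totals)


def per_tag_mean_f(truth, pred, tags):
    """Compute one F-score per tag, then average them."""
    return sum(f_score(*counts(truth, pred, tag)) for tag in tags) / len(tags)


TAGS = ["foo", "bar", "baz"]

# Hypothetical ground truth: 9 of 10 excerpts carry "foo".
truth = [{"foo"}] * 8 + [{"foo", "bar"}] + [{"baz"}]

# A lazy "system" that applies "foo" to everything.
pred = [{"foo"}] * 10

global_score = global_f(truth, pred, TAGS)
per_tag_score = per_tag_mean_f(truth, pred, TAGS)
print(f"global F: {global_score:.2f}, mean per-tag F: {per_tag_score:.2f}")
```

On this toy data the pooled score comes out around 0.86 while the per-tag mean is around 0.32, because the lazy system scores zero on "bar" and "baz". Reporting both numbers makes that gap visible.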
With regard to the second concern, Marques et al. show extremely interesting results of training two MA systems on one dataset, and testing them on another. This shows the systems to be quite different in how well they generalize across the two datasets.
Furthermore, when they apply tags to two different and untagged datasets and look at tag frequencies, they find nearly the same behaviors! That is, to excerpts in both datasets, the MA system applies “man singing” with the same frequency; and “acoustic”; and “duet”. And this occurs in the face of the systems having “good” F-scores when evaluated in their own dataset (2-fold CV).
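The tag-frequency comparison described above can be sketched in a few lines. This is only an illustration under assumptions: the two small lists of autotagger output are invented, and `tag_frequencies` and `frequency_gaps` are hypothetical helper names, not anything from the paper.

```python
from collections import Counter


def tag_frequencies(predicted_tags):
    """Fraction of excerpts in a dataset to which each tag was applied."""
    n = len(predicted_tags)
    counts = Counter(tag for tags in predicted_tags for tag in set(tags))
    return {tag: count / n for tag, count in counts.items()}


def frequency_gaps(freq_a, freq_b):
    """Absolute difference in application frequency for every tag seen."""
    all_tags = sorted(set(freq_a) | set(freq_b))
    return {tag: abs(freq_a.get(tag, 0.0) - freq_b.get(tag, 0.0))
            for tag in all_tags}


# Hypothetical autotagger output on two untagged datasets.
dataset_a = [["man singing", "acoustic"], ["man singing"], ["duet"], ["acoustic"]]
dataset_b = [["man singing"], ["man singing", "duet"], ["acoustic"], ["acoustic"]]

gaps = frequency_gaps(tag_frequencies(dataset_a), tag_frequencies(dataset_b))
for tag, gap in gaps.items():
    print(f"{tag}: gap = {gap:.2f}")
```

If the two datasets are known to contain quite different music, gaps near zero for every tag suggest the system is applying tags at its habitual rates rather than responding to the content.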
With regard to the last concern, Marques et al. look at the quantitative differences in F-scores (for individual tags) before and after a post-processing stage. The idea of the stage is that if the tags “rock,” “Springsteen” and “oboe” are applied to a song, then “oboe” should be removed or replaced with “guitar”.
Here the experiment reveals that the benefits achieved by this approach are quite limited by the success of the application of the tags in the first place. Hence, work must be focused on improving the initial application of tags.
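To see why co-occurrence post-processing depends on the initial tags, consider the following toy sketch. It is not the post-processing stage used by Marques et al.; it swaps in a simple rescoring rule of my own choosing (blend each tag's score with the co-occurrence-weighted support from the other tags), and the training annotations, scores, and the `alpha` mixing weight are all assumptions.

```python
def cooccurrence_probs(annotations, tags):
    """Estimate P(b | a): how often tag b appears on items that carry tag a."""
    probs = {}
    for a in tags:
        with_a = [item for item in annotations if a in item]
        for b in tags:
            if a == b:
                continue
            hits = sum(1 for item in with_a if b in item)
            probs[(a, b)] = hits / len(with_a) if with_a else 0.0
    return probs


def rescore(scores, probs, alpha=0.5):
    """Blend each tag's own score with the support it gets from other tags."""
    new_scores = {}
    for b in scores:
        others = [a for a in scores if a != b]
        weight = sum(scores[a] for a in others)
        support = (sum(scores[a] * probs[(a, b)] for a in others) / weight
                   if weight > 0 else 0.0)
        new_scores[b] = (1 - alpha) * scores[b] + alpha * support
    return new_scores


TAGS = ["rock", "guitar", "oboe", "classical"]

# Hypothetical training annotations.
train = [{"rock", "guitar"}] * 3 + [{"classical", "oboe"}] * 2 + [{"rock"}]
probs = cooccurrence_probs(train, TAGS)

# Case 1: the initial tagger is confident about "rock", so "guitar" overtakes "oboe".
good_start = {"rock": 0.9, "guitar": 0.4, "oboe": 0.5, "classical": 0.1}
fixed = rescore(good_start, probs)

# Case 2: the initial tagger wrongly favors "classical", so "oboe" is reinforced.
bad_start = {"rock": 0.1, "guitar": 0.4, "oboe": 0.5, "classical": 0.9}
worse = rescore(bad_start, probs)

print(fixed["guitar"] > fixed["oboe"], worse["oboe"] > worse["guitar"])
```

In the first case the rescoring corrects the stray “oboe”; in the second it amplifies the initial mistake. The co-occurrence statistics are the same in both cases, so the outcome is decided by the quality of the initial scores.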
Bottom line: We have a ways to go.