Hello, and welcome to the Paper of the Day (Po’D): Music Genre Classification Edition, no. 3. Today’s Po’D follows on the heels of the Po’D from a few days ago: R. O. Gjerdingen and D. Perrott, “Scanning the Dial: The Rapid Recognition of Music Genres,” Journal of New Music Research, vol. 37, no. 2, pp. 93-100, Spring 2008.
In this paper, the authors explore the limits of human music genre classification, in particular, the consistency with which listeners classify genre as a function of music recording segment length. They provide an illuminating discussion of the plastic set of categories that is music genre, which they note is controlled and constrained by several subjective factors of the human condition. In my humble opinion, this makes talking about music genres almost seem (were it not for its basis in things that are real) like talking about the differences between unfalsifiable doctrines of particular systems of thinking, where facts and observations are aggressively replaced by Hokum and Woo™. The authors, however, take a very good approach: by measuring how consistent people are in their genre classifications with respect to segment duration, they sidestep the pitfall of having to define genre at all.
The authors select eight musical tracks (MP3 format, though with no mention of whether they are stereo) from each of ten broad music genre classes (relevant in the 1990s according to the business models of music consumerism): blues, classical, country, dance, jazz, Latin, pop, rhythm and blues, rap, and rock. Examples are selected based on the labels assigned by on-line shops. The authors avoid selecting tracks “that might suggest genre crossover” (which doesn’t sit well with me). For each of the selected tracks (are all the tracks unfamiliar to the test subjects, e.g., no New Kids on the Block?), the authors create eight snippets at each of four durations (250, 325, 400, and 475 ms). For each duration, four segments are selected to have prominent vocals (so in classical, did they select opera, lieder with piano, Pierrot Lunaire, Bobby McFerrin doing Bach?), and the other four to be purely instrumental with no voice (no Bobby McFerrin, or Michael Winslow :( ). They also select a 3-second segment of each piece that they “subjectively felt was characteristic of both the song and its genre.”
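As an aside, this kind of stimulus construction is easy to sketch. A minimal version, assuming the tracks have been decoded to mono sample arrays at 44.1 kHz (the sample rate, offsets, and toy signal below are my assumptions, not details from the paper), might look like:

```python
import numpy as np

SR = 44100  # assumed sample rate (Hz); the paper does not specify
DURATIONS_MS = (250, 325, 400, 475)

def cut_snippet(signal: np.ndarray, start_s: float, dur_ms: float, sr: int = SR) -> np.ndarray:
    """Return a snippet of dur_ms milliseconds starting at start_s seconds."""
    start = int(round(start_s * sr))
    length = int(round(dur_ms * sr / 1000.0))
    return signal[start:start + length]

# toy "track": 30 s of noise standing in for a decoded MP3
rng = np.random.default_rng(0)
track = rng.standard_normal(30 * SR)

# one snippet at each short duration, plus the 3 s characteristic segment
snippets = {ms: cut_snippet(track, start_s=5.0, dur_ms=ms) for ms in DURATIONS_MS}
characteristic = cut_snippet(track, start_s=10.0, dur_ms=3000)

print(len(snippets[250]))  # 11025 samples, i.e., 250 ms at 44.1 kHz
```

In the actual experiment there would of course be eight snippets per duration per track, with the start points chosen by hand for vocal or instrumental content rather than fixed offsets.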
The authors tested the genre recognition capabilities of 52 students described as “ordinary undergraduate fans of music” (but probably not fans of or experienced with music from all ten genres), in 400 different trials using all variations of segment duration, genre, and voice/no voice content. The genres of the 3-second segments were assigned by each participant at the end of the experiment.
The figure below (reproduced without permission) shows, for their experiment, the overall average consistency of human music genre classification with respect to the genre labels assigned to the longest segments: in other words, how often each individual’s short-segment labels matched the labels they assigned to the 3-second segments. I expected that as the short segments grow longer, their labels would become more likely to match those assigned to the longest segments; but I did not expect such a small decrease in classification consistency for segments as short as 250 ms. (As in the Po’D a few days ago, we see a high consistency in labeling classical music. Why? Is it the lack of compression? There must be something more discriminating than content, because there really isn’t any content in 250 ms for any of the genres, unless they include in Rock some ultra hyperspeed happy metal gabber Japanese garagecore.) Also, I wonder why “blues” and “country” are, ironically, so sadly inconsistent.
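The consistency measure itself is simple to express. A minimal sketch (with made-up responses, not the paper’s data), treating the label a subject assigns to the 3-second excerpt as the reference and counting how often the short-excerpt labels agree with it per duration:

```python
from collections import defaultdict

# hypothetical responses: (duration_ms, label_for_short_excerpt, label_for_3s_excerpt)
responses = [
    (250, "jazz", "jazz"),
    (250, "blues", "jazz"),
    (400, "jazz", "jazz"),
    (475, "classical", "classical"),
]

agree = defaultdict(int)   # agreements per duration
total = defaultdict(int)   # trials per duration
for dur, short_label, long_label in responses:
    total[dur] += 1
    agree[dur] += (short_label == long_label)

# fraction of short-excerpt labels matching the 3 s reference label
consistency = {dur: agree[dur] / total[dur] for dur in total}
print(consistency)  # {250: 0.5, 400: 1.0, 475: 1.0}
```

Averaging this fraction over subjects and genres gives curves like those in the figure.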
The authors observe that test subjects were more consistent when the music segments did not involve voice. This could be due, I say, to the compositional strategy in some music styles of suppressing the activity of all other instruments so that the voice and lyrics can be heard.
The authors suggest on the basis of these results that,
a highly reduced combination of melodic, bass, harmonic, and rhythmic features can help to classify genre if these features are coupled with an accurate acoustic signal… This study demonstrates the near immediacy of genre recognition as a response to a musical stimulus.
with which I don’t agree, even though I don’t know what they mean by “an accurate acoustic signal.” This experiment only suggests that humans are consistent across a range of time scales, even short ones, in labeling music of these ten genres (depending largely on the genre), assuming that they have experience with the ten genres being tested. Furthermore, I am as hard-pressed to call 250 ms “a musical stimulus” as I am to call coding artifacts “musical noise.”
Anyhow, as I discussed in a previous Po’D, I am uncomfortable with the notion of measuring the genre of a piece of music because I cannot see how music genre can be defined in any useful way except within a framework that is socially and historically constructed, and that will be irrelevant in a year’s time. Automatically classifying the genre of a music piece is not a signal processing problem in the same way as classifying an acoustic signal segment as speech, music, or silence. Nor is it a signal processing problem in the same way as automatic speech recognition, or characterizing the emotional or even sarcastic content of speech. Those problems are much less ill-posed than classifying music genre because there is less argument over what makes a word sound like “eggs”, or even what makes speech sound happy or patronizing. Furthermore, the qualities that make speech sound happy change, if at all, on a glacial timescale compared with the near-monthly appearance, redefinition, and recategorization of music genres.
On a more personal note, I have observed that I am extremely accurate when it comes to discriminating between orchestral music created for movies and orchestral music created for music (with an extremely high sensitivity to the “music” of Clint Eastwood). The time it takes for me to figure it out is about the time it takes for me to yawn.