Music genre recognition: Have the simplest problems even been solved yet?

Currently, with a heavy teaching schedule and papers to prepare for submission to EUSIPCO and a journal special issue, I have little time to continue working on music genre recognition, e.g., assembling what I think would be a good database for testing it. I also have not had time to look more closely at alternative datasets, as I have done for the Tzanetakis dataset. I have several thoughts about the topic though.

From my review of the literature in this area, I feel that work too often tackles “music genre recognition” without having the faintest idea of what “genre” is and is not. A dataset is selected and assumed to be representative and valid. Many papers link together “novel” features with “novel” methods of classification, and provide little motivation for the choices made. Sometimes the justifications given don’t make sense. The use of bags of frames of features (BFFs) is a spent topic. People don’t listen that way. Feature integration offers some hope, but it still has a long way to go before we reach the high-level descriptors that I find myself using when comparing styles and assigning labels.

If we want to solve complex problems, we first need to solve the simplest ones. In the work on music genre recognition so far, I have yet to see any results that convince me the simplest problems have been solved. For instance, an experiment I want to do soon will take a state-of-the-art system for music genre recognition, and see how it performs when I compress the dynamic range of all input to the excessive levels used in Popular music. With such a transformation, does the genre change for human listeners? I don’t think so. Classical will still be Classical, though with an in-your-face feel. If the system is really comparing stylistic aspects of the musical content of the audio signals, then such compression will have little effect. The same goes for other transformations, such as the bandwidth changes of AM radio, and the pops and hisses and subtle speed changes of 78s. I doubt that the effect of these transformations on current systems will be so small.
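A rough sketch of how such a test might be set up, using only NumPy. Everything here is an assumption for illustration: the function names, the memoryless hard-knee compressor (a real mastering chain uses attack and release smoothing), the brick-wall band-pass as a crude stand-in for AM radio, the threshold, ratio, and cutoff values, and above all `classify`, which is a hypothetical placeholder for whatever genre recognition system is under test.

```python
import numpy as np

def compress_dynamics(x, threshold_db=-30.0, ratio=10.0):
    """Memoryless hard-knee compressor followed by peak normalisation.

    Illustrative only: the threshold and ratio are arbitrary choices.
    """
    x = np.asarray(x, dtype=float)
    mag = np.abs(x)
    threshold = 10.0 ** (threshold_db / 20.0)
    gain = np.ones_like(mag)
    over = mag > threshold
    # Above the threshold, the level (in dB) grows only 1/ratio as fast.
    compressed = threshold * (mag[over] / threshold) ** (1.0 / ratio)
    gain[over] = compressed / mag[over]
    y = x * gain
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

def bandlimit(x, sr, low_hz=100.0, high_hz=5000.0):
    """Brick-wall band-pass via the FFT (assumed cutoffs, not a radio model)."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def label_agreement(classify, clips, sr, transform):
    """Fraction of clips whose predicted label survives the transformation.

    `classify(signal, sr)` is a hypothetical interface to the system under test.
    """
    same = [classify(c, sr) == classify(transform(c), sr) for c in clips]
    return float(np.mean(same))
```

If the system really attends to musical content, `label_agreement(classify, clips, sr, compress_dynamics)` should stay close to 1; a large drop would suggest it is keying on something that has little to do with genre.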

To solve the problem of automatic genre recognition properly and convincingly, I think it will take high-level models that can separate the mixed sources (find the guitar, drums, and bass for Rock), infer rhythms and beat patterns (determine the hi-hat pattern for Disco), transcribe the melody line and the harmonies (find the lack of parallel fifths in Baroque), listen to the lyrics and vocal styles (detect the topic and free singing style in Delta Blues), find the structure of the piece (find the verse-chorus-verse-chorus of Rock and Roll, or the 12 bars of 12-bar Blues), find and classify different audio effects (the spring reverberation of Reggae), and link styles together historically (Hip hop grows from Reggae).

