Hello, and welcome to Paper of the Day (Po’D): Music Genre Classification Edition, no. 2. Today’s paper is so new it has yet to appear in print: G. Marques, M. Lopes, M. Sordo, T. Langlois, and F. Gouyon, “Additional evidence that common low-level features of individual audio frames are not representative of music genres,” in Proc. Sound and Music Computing, (Barcelona, Spain), July 2010. On this blog I have broached the subject of music genre recognition, and more generally the discriminative power of bags of low-level features such as MFCCs:
- Paper of the Day (Po’D): Music Genre Classification Edition
- Paper of the Day (Po’D): Bags of Frames of Features Edition
- Paper of the Day (Po’D): Experiments in Audio Similarity Edition
- Paper of the Day (Po’D): Multiscale MFCCs Edition
This study looks at incorporating scaling information into the bags of frames of features approach for musical instrument recognition tasks.
This study proposes a method of music genre classification based on modeling series of MFCCs by an autoregressive process in order to incorporate statistics on short- and long-term behaviors.
This study hints at a mediocre performance limit expected when using bags of frames of features, including MFCCs, for music signal processing.
This collection of studies looks in controlled and realistic ways at how MFCCs perform in instrument classification tasks.
In this article, the authors look experimentally at the performance differences in two genre recognition tasks for three different classification strategies, and three different methods of codebook generation, using bags of frames of features (BFFs) created from 46 ms isolated windows of audio data. For each window, they create 17-dimensional feature vectors from low-level features: zero crossing rate, spectral centroid, rolloff frequency, spectral flux, and the first 13 MFCCs including the zeroth one.
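To make the per-window features concrete, here is a minimal numpy sketch of four of the 17 dimensions (zero crossing rate, spectral centroid, rolloff frequency, and spectral flux). The 13 MFCCs are omitted for brevity; in practice they would come from an audio library. The sample rate, rolloff percentage, and test tone are my assumptions, not values from the paper.

```python
import numpy as np

def frame_features(frame, sr=22050, rolloff_pct=0.85, prev_mag=None):
    """Compute four of the paper's low-level features for one 46 ms window.
    (The remaining 13 dimensions, the MFCCs, are left to a library.)"""
    # Zero crossing rate: fraction of adjacent samples whose signs differ
    zcr = np.mean(np.diff(np.signbit(frame)))
    # Magnitude spectrum of the window
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Rolloff: frequency below which 85% of the spectral magnitude lies
    cumulative = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * cumulative[-1])]
    # Spectral flux: change in magnitude spectrum since the previous window
    if prev_mag is None:
        prev_mag = np.zeros_like(mag)
    flux = np.sqrt(np.sum((mag - prev_mag) ** 2))
    return np.array([zcr, centroid, rolloff, flux]), mag

sr = 22050
n = int(0.046 * sr)                   # a 46 ms window, ~1014 samples
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 440 * t)   # a 440 Hz test tone (my example)
feats, mag = frame_features(frame, sr)
```

A 440 Hz tone crosses zero about 880 times per second, so its ZCR should land near 880/22050 ≈ 0.04; the other three features behave analogously on real audio windows.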
First, the authors generate feature vector codebooks by selecting at random 200 feature vectors from all feature vectors in the training set. Then they code (by Euclidean distance, I assume) each feature vector extracted from a piece of music, and either 1) find the frequency of each symbol in the entire piece; or 2) find the transition frequencies between all 200 symbols. Using the first approach, the authors train two 5-NN classifiers (one using Euclidean distance, the other Kullback-Leibler divergence) and an SVM classifier for all music genres in a second training set. Using the second approach, the authors train a Markov model for all music genres in a second training set. The authors test the accuracy of all three approaches using codebooks produced three different ways: 1) select 200 feature vectors from those of all genres except one; 2) select 200 feature vectors from those of one genre; 3) select 200 feature vectors from all genres. In this way, the authors test the extent to which accuracy is impacted by using codebooks of differing specificity.
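The coding pipeline, as I read it, can be sketched as follows: draw a random 200-vector codebook, map each feature vector of a piece to its nearest codeword (Euclidean distance, which the paper does not spell out), and then summarize the resulting symbol sequence either by a symbol histogram or by transition frequencies. The random stand-in data is mine; only the structure follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training set: 5000 feature vectors of dimension 17
train = rng.normal(size=(5000, 17))

# Codebook: 200 feature vectors drawn at random from the training set
codebook = train[rng.choice(len(train), size=200, replace=False)]

def encode(vectors, codebook):
    """Map each feature vector to the index of its nearest codeword
    (Euclidean distance), yielding a symbol sequence for the piece."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def symbol_histogram(symbols, k=200):
    """Approach 1: relative frequency of each symbol over the piece."""
    return np.bincount(symbols, minlength=k) / len(symbols)

def transition_freqs(symbols, k=200):
    """Approach 2: relative frequency of each symbol-to-symbol transition."""
    counts = np.zeros((k, k))
    for a, b in zip(symbols[:-1], symbols[1:]):
        counts[a, b] += 1
    return counts / max(len(symbols) - 1, 1)

piece = rng.normal(size=(800, 17))   # one piece's feature vectors
symbols = encode(piece, codebook)
hist = symbol_histogram(symbols)
trans = transition_freqs(symbols)
```

The histogram feeds the 5-NN and SVM classifiers; the transition matrix feeds the Markov models. The three codebook-generation conditions differ only in which training vectors `rng.choice` is allowed to draw from.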
The authors use two datasets: 1) 729×2 music pieces for training and testing from six genres: classical, electronic, jazz and blues, metal and punk, rock and pop, and world; 2) 3,227 full-length pieces from ten different genres of Latin music.
Of the three classification approaches, the authors find that the Markov model works the best, with differences in accuracy as large as 9.2% and as small as 5.5%, which appears to suggest that considering temporal information between BFFs is better suited to recognizing the quality “genre” — which makes sense because, I posit, humans are able to label music by genre from long windows of experience with said genres. (I do not think a specialist in metal genres would be able to distinguish between baroque, classical, and romantic music. It is not innate.)
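A toy version of the Markov approach makes the idea tangible: estimate one transition matrix per genre from training symbol sequences, then classify a test piece by which genre's matrix gives it the highest log-likelihood. The add-one smoothing, the tiny 20-symbol codebook, and the synthetic "genres" below are my assumptions for a runnable sketch, not details from the paper.

```python
import numpy as np

def train_markov(sequences, k, alpha=1.0):
    """Estimate a first-order Markov transition matrix for one genre from
    its training pieces' symbol sequences. Add-alpha smoothing (my choice)
    keeps unseen transitions from having zero probability."""
    counts = np.full((k, k), alpha)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, P):
    """Log-probability of a symbol sequence under transition matrix P."""
    return float(np.sum(np.log(P[seq[:-1], seq[1:]])))

def classify(seq, models):
    """Pick the genre whose Markov model assigns the highest likelihood."""
    return max(models, key=lambda g: log_likelihood(seq, models[g]))

rng = np.random.default_rng(1)
k = 20  # small codebook for the toy example (the paper uses 200)
# Two synthetic "genres" with different symbol habits:
# genre A favors low symbols, genre B favors high ones.
seqs_a = [rng.integers(0, k // 2, size=300) for _ in range(10)]
seqs_b = [rng.integers(k // 2, k, size=300) for _ in range(10)]
models = {"A": train_markov(seqs_a, k), "B": train_markov(seqs_b, k)}

test_seq = rng.integers(0, k // 2, size=300)  # behaves like genre A
pred = classify(test_seq, models)              # → "A"
```

Unlike the histogram classifiers, this scoring depends on the order of the symbols, which is exactly the temporal information the authors credit for the Markov model's edge.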
More interestingly, the authors find very little (and unpredictable) change in classifier performance using codebooks of features built from frames of data that we would not consider representative of the genre. For instance, in the table below, we see that the accuracy of the Markov classifier in recognizing four of the six music genres is nearly the same over all codebooks. (The other two results are probably due to the lack of training data in those two classes.) We see pretty much the same result for the Latin music database. (What makes Classical so easy to recognize? Could it be from the zeroth MFCC having a much smaller value due to the lack of compression in the audio production?)
Another interesting table is shown below.
Here we see the Markov classification accuracies for each class using a codebook generated only using feature vectors from a single class.
For instance, the first column of data shows the classification accuracy when all feature vectors are coded using a codebook built only from examples of the classical music genre. This codebook is good for coding classical music, but nothing else.
In the second column, using a codebook generated from examples of electronic music, we see only a slight difference in the classifier accuracy for classical music. This means that the classical music feature vectors are well represented by feature vectors from electronic music.
Across most of the genres we see that changes in the codebook do not have really large impacts on accuracy (except in those two cases where there may be a negative effect from lack of data).
Together, all of this says that when it comes to the problem of music genre recognition, we cannot expect much difference in accuracy by using a refined set of the low-level features typically used. Even codebooks we would consider poorly selected perform nearly as well as well-selected ones.
In my opinion, music genre is one of those things that is mostly a social or historical construction, and it cannot be quantified in any meaningful way. As an engineer, this unquantifiability makes me squirm; and as a composer of music in the genre of “electro-Bob”, the distinction between music styles as if they are geographically separated makes me squirm. Either way I am squirming. That being said, it is still relevant to ask, “What makes this piece sound classical?” And as this Po’D has shown, BFFs of the low-level kind “are not representative of music genres.” On that note, here is an interesting work that tests how well people can recognize music genres from frames of varying lengths:
R. O. Gjerdingen and D. Perrott, “Scanning the Dial: The Rapid Recognition of Music Genres,” Journal of New Music Research, vol. 37, no. 2, pp. 93-100, Spring 2008.
That will be a Po’D soon.