# Paper of the Day (Po’D): Music Genre Classification this time with Sparse Representations Edition

Hello, and welcome to the Paper of the Day (Po’D): Music Genre Classification this time with Sparse Representations Edition. I continue today with an interesting paper that applies my favorite subject, sparse representations, to a topic that rings my skeptical bells, and shakes my cynical rattles — automatic classification of music genre: Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., (Glasgow, Scotland), pp. 1-5, Aug. 2009.

In this paper, the authors claim unbelievably good music genre classification accuracies, roughly 10 percentage points higher on test sets for which accuracies had not previously exceeded 83.5%. Previous approaches have used bags of frames of features (BFFs), an approach that several of my previous Po’Ds (as well as knowledge of how music is assembled and of what distinguishes music genres and styles at high levels, including non-quantitative social and historical contexts) make clear is not a good way to go (unless we are discriminating between music made for humans and music made for dogs). The authors take another approach, creating genre-discriminating features based more in our present ideas of how the human auditory cortex works, and then applying a sparse classification approach very similar to the spoken digit recognition covered in previous Po’Ds.

First, they pass a 30-second segment extracted from each song (genres include: classical, electronic, jazz and blues, metal and punk, rock and pop, and world) through a constant-Q filterbank of 96 bands with center frequencies spaced logarithmically over four octaves. This mimics the cochlea. Next, to the output of each of these bands, they apply a high-pass filter, then a nonlinear compression, then a lowpass filter, and then take the derivative followed by half-wave rectification; finally they apply 8 ms exponential windows and integrate over all time segments. (They are extremely short on details here, and I wish they had included a few pictures!) Finally, to each of these bands they perform a “modulation scale analysis,” which essentially means they filter with a wavelet at a given scale and then integrate over the entire time domain (integrate the absolute value? the squared magnitude? I don’t know). They use a total of eight wavelet scales (from 500 ms to 4 ms), so that for the 96 bands they end up with a feature vector (with nonnegative entries) of dimension 768 for each 30-second music sample. For several examples of each genre, they build a large dictionary of $$N$$ of these “auditory modulation representations,” keeping the genre label of each one. This defines the large matrix $$\MD \in \mathcal{R}_+^{768 \times N}$$.
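As I read it, the end product is just 96 band envelopes times 8 modulation scales = 768 nonnegative numbers per clip. Here is a crude numerical sketch of that shape. To be clear about what is mine: the function name, the STFT framing, and the band/scale grouping are my own stand-ins; the paper's constant-Q filterbank, compression, rectification, and wavelet stages are not reproduced, so only the output dimensions and nonnegativity match.

```python
import numpy as np

def modulation_features(x, n_bands=96, n_scales=8, n_fft=1024, hop=256):
    """Toy sketch of the 768-dim auditory modulation feature.
    The paper's pipeline (constant-Q filterbank, nonlinear compression,
    half-wave rectification, wavelet modulation analysis) is replaced by
    crude STFT-based stand-ins; only the output shape matches."""
    # frame the signal and take magnitude spectra
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    # "cochlear" band envelopes: group spectral bins into n_bands bands
    # (the real model uses log-spaced constant-Q bands over four octaves)
    env = np.stack([b.mean(axis=1) for b in np.array_split(spec, n_bands, axis=1)])
    # "modulation scale analysis": energy of each band envelope in
    # n_scales modulation-frequency bands (stand-in for the wavelet scales)
    menv = np.abs(np.fft.rfft(env, axis=1))
    feat = np.stack([m.sum(axis=1) for m in np.array_split(menv, n_scales, axis=1)],
                    axis=1)
    return feat.ravel()  # nonnegative vector of length n_bands * n_scales = 768
```

Stacking one dictionary column per labeled training clip then gives the $$768 \times N$$ matrix $$\MD$$.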

Now, for any 30-second sample of unknown music genre, they assume that we can perform the same data processing to get the feature vector $$\vy$$, and that there exists a sparse $$\vs \in \mathcal{R}^N$$ such that $$\vy = \MD\vs.$$
The genre labels associated with the non-zero entries in $$\vs$$ reflect that of the unknown signal. (What happens when values are negative?)
To find the solution above, they use $$\ell_1$$ minimization in a reduced feature space:
$$\vs^* = \arg \min_{\vs \in \mathcal{R}^N} ||\vs||_1 \; \textrm{subject to} \; \MW\MD\vs = \MW\vy$$
where $$\MW \in \mathcal{R}^{k \times 768}$$ maps each of the $$N$$ dictionary vectors into a $$k \ll 768$$ dimensional space. (The authors define $$\MW$$ using NMF, PCA (I assume to decorrelate the atoms, but they never say), or by an iid random process.) (Also, is there some psycho-physical relevance of an $$\ell_1$$ norm?)
The authors perform this preprocessing in order to reduce the number of constraints in solving the linear program, and also because “it facilitates the creation of a redundant dictionary…” which I don’t get. (Also, why they use exact constraints is never explained.) From $$\vs^*$$ then, they assign a genre label to the unknown music by the following criterion:
$$i^* = \arg \min_{i = 1, 2, \ldots, G} || \MW\vy - \MW\MD\delta_i(\vs^*)||_2$$
where $$G$$ is the number of genres, and $$\delta_i(\vs^*)$$ contains only those coefficients in $$\vs^*$$ that belong to the $$i$$th genre and is zero everywhere else. They do this in order to deal with the small and insignificant coefficients that pop up in the $$\ell_1$$ minimization.
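The solve-then-classify step can be sketched in a few lines: the equality-constrained $$\ell_1$$ minimization written as the standard linear program (split $$\vs = \vu - \vv$$ with $$\vu, \vv \ge 0$$), followed by the per-genre residual rule. The function names and toy data are my own, and I use a generic LP solver; the authors do not say which solver they use, and their actual implementation may differ.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """min ||s||_1 subject to A s = b, via the standard LP split s = u - v."""
    n = A.shape[1]
    res = linprog(c=np.ones(2 * n),                 # minimize sum(u) + sum(v)
                  A_eq=np.hstack([A, -A]), b_eq=b,  # A(u - v) = b
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.x[:n] - res.x[n:]

def classify(W, D, labels, y):
    """Residual rule: keep only each genre's coefficients of the sparse
    solution and pick the genre whose partial reconstruction best matches Wy."""
    s = basis_pursuit(W @ D, W @ y)
    genres = sorted(set(labels))
    resid = [np.linalg.norm(W @ y - (W @ D) @ (s * np.array([l == g for l in labels])))
             for g in genres]
    return genres[int(np.argmin(resid))]
```

Here `W` plays the role of $$\MW$$, `D` of $$\MD$$, and `labels` holds one genre label per dictionary column.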

If the results of this paper are to be believed, then music genre (whatever that is) can be accurately determined before any cognitive processes even begin to work. And conceivably we could do the same for styles of painting. I don’t think we can just skip over the big black box that is the brain; and I don’t think music genre is definable just from non-unique mappings of physical measurements to $$k$$ dimensional feature vectors. Music genre (whatever that is) is much, much more than samples, modulations, filterbank decomposition, wavelet decompositions, and integrals. Those physical things are not flexible and up for debate; music genres are plastic and are constantly being redefined.

Finally, I am confused: since the authors went to great pains to incorporate psycho-physiological principles, why did they choose 30-second sound examples? Do humans require 30 seconds to perform the proposed integrations and sparse classification? Note that some studies point to humans achieving high genre classification accuracy after hearing sound clips of only 250 ms!