Hello, and welcome to Paper of the Day (Po’D): Music Genre Classification via Compressive Sampling Edition. Today’s paper is: K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. Int. Soc. Music Information Retrieval, (Amsterdam, the Netherlands), pp. 387-392, Aug. 2010.

The authors approach the problem of music genre classification by using sparse representation classification to label an unknown piece of music, but with some tricks of compressed sensing. Given a set of feature vectors for \(N\) known classes, the authors concatenate these into one large dictionary \(\MD = [\MD_1 | \MD_2 | \cdots | \MD_N ]\). They then assume that the feature vector of a signal from class \(i\) will be best represented by (lie in the linear span of) the subdictionary \(\MD_i\), which is essentially codebook-based classification as long as an error measure is defined. We can advance to sparse representation classification by solving the problem for signal \(\vy\)

$$\min || \vs ||_0 \; \textrm{subject to} \; \vy = \MD\vs$$

and picking the class that has the most support.

In other words, if we partition the solution as \(\vs^T = [\vs_1^T, \vs_2^T, \cdots, \vs_N^T]\),

then the entries of \(\vs_i\) are the coefficients associated with the atoms in \(\MD_i\).

Since we cannot solve this combinatorial problem as stated, we relax the sparsity constraint to a convex one using the \(\ell_1\)-norm.

It may well be the case that our unknown signal has a representation whose support is split equally among several classes, or that our subdictionaries have different sizes and do not allow such an easy comparison.

So instead the authors propose selecting the class that minimizes some error when the contributions of all other classes are removed:

$$i = \arg \min_{j} || \vy - \MD_j \vs_j ||_2.$$
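The procedure can be sketched in a few lines of Python. This is my own sketch, not the authors' code: I stand in for the exact \(\ell_1\) solver with a simple ISTA iteration on the lasso (an \(\ell_1\)-penalized proxy for basis pursuit), and the dictionary sizes are arbitrary.

```python
import numpy as np

def ista(D, y, lam=0.01, n_iter=500):
    # ISTA for min_s 0.5 * ||y - D s||_2^2 + lam * ||s||_1,
    # a convex proxy for the equality-constrained l1 problem above
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = s + D.T @ (y - D @ s) / L      # gradient step
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return s

def src_classify(D_list, y, lam=0.01):
    # D_list: per-class subdictionaries D_i (columns are atoms)
    D = np.hstack(D_list)
    s = ista(D, y, lam)
    # split the solution back into per-class coefficient blocks s_i
    splits = np.cumsum([Di.shape[1] for Di in D_list])[:-1]
    parts = np.split(s, splits)
    # pick the class whose atoms alone best reconstruct y
    resid = [np.linalg.norm(y - Di @ si) for Di, si in zip(D_list, parts)]
    return int(np.argmin(resid))
```

With a feature vector lying in the span of a few atoms of one subdictionary, that class's residual comes out smallest and the label is recovered.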

Now, the authors propose finding the solution to the \(\ell_1\)-minimization problem above, but only after reducing its dimensionality by applying a random matrix.

Given a fat random matrix \(\Phi\), we apply it from the left to form a new dictionary \(\Phi\MD\) and new features \(\Phi\vy\), and solve the problem

$$\min || \vs ||_1 \; \textrm{subject to} \; \Phi\vy = \Phi\MD\vs.$$

Then classification is done by

$$i = \arg \min_{j} || \Phi \vy - \Phi \MD_j \vs_j ||_2.$$
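The projection step is easy to illustrate. In this sketch the Gaussian form of \(\Phi\) and the dimensions (135 down to 40) are my own assumptions for illustration; the paper's exact sensing matrix may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 135, 40                                  # ambient and compressed dimensions
Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # fat random matrix (m < n)
y = rng.standard_normal(n)                      # stand-in feature vector

# such random projections approximately preserve Euclidean lengths,
# which is why residual comparisons remain meaningful after compression
print(np.linalg.norm(Phi @ y) / np.linalg.norm(y))
```

Classification then proceeds exactly as before, with \(\Phi\vy\) and the projected subdictionaries \(\Phi\MD_j\) in place of the originals, but every solver iteration now works in \(m\) rather than \(n\) dimensions.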

Now, the remaining part to be explained is the features.

The authors employ a set of short-time features: MFCCs, short-time spectral statistics, and zero-crossing rate; and long-time features: RMS energies over long windows, octave-based modulation spectral contrasts, and modulation spectral flatness and crest.

(These 8 octaves aren’t musical octaves. Instead they are distributed: 0–100 Hz, 100–200 Hz, and so on up to 3.2–8.0 kHz and 8–22.05 kHz.)

All these features are concatenated to create a feature vector with 135 dimensions.

The sounds analyzed were 30-second excerpts of songs beginning 12 seconds after the introduction, with analysis windows of 93 ms and 50% overlap. Long-term features are computed over 3-second collections of 63 short-time analysis windows.
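The 63-window figure follows from the stated parameters. A quick check (my own arithmetic, assuming each short-time window must lie entirely within the 3-second collection):

```python
# analysis-window arithmetic, all times in milliseconds
win = 93.0
hop = win * 0.5                          # 50% overlap between windows
texture = 3000.0                         # 3-second long-term window
n_windows = int((texture - win) // hop) + 1
print(n_windows)                         # 63, matching the paper
```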

The authors present many results, but I think we see that their approach does extremely well (92.7% accuracy) compared with the same features used in some other approaches (up to 93.7% for 135-dimension feature vectors). The compressed sensing approach performs just as well as forgoing the random projection, but with the benefit of significantly decreased computation time.

This trick of dimensionality reduction for computing sparse representations is also taken in

M. Christensen, J. Østergaard, and S. H. Jensen, “On compressed sensing and its application to speech and audio signals,” in Proc. Asilomar Conf. Signals, Systems, and Computers, (Pacific Grove, CA), Nov. 2009;

and has antecedents going back to at least 2000: P. Indyk, N. Koudas, and S. Muthukrishnan, “Identifying representative trends in massive time series data sets using sketches,” in Proc. Int. Conf. Very Large Data Bases, (Cairo, Egypt), pp. 363-372, Sep. 2000.