I have been working with a student on STWOPetal: studying the work of Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. European Signal Process. Conf., (Glasgow, Scotland), pp. 1-5, Aug. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., pp. 249-254, Kobe, Japan, Oct. 2009; Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Acoustics, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. (Po’D here.) In Part 1, we sufficiently tackled the first part, which was creating auditory spectrograms. In Part 2, we successfully performed “modulation-scale analysis” on the auditory spectrograms to derive “auditory temporal modulations.” In this part, we have implemented and tested sparse representation classification (SRC), and compared it with other classification methods. For this, Panagakis et al. point to the paper, J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 31, pp. 210-227, Feb. 2009.

I pointed to the paper by Wright et al. in my discussion of work on classifying facial emotions. I have also discussed the application of SRC to robust spoken digit recognition, and music genre recognition by compressive sampling.

*The following is co-authored with the collaborating student: Pardis Noorzad, in the Department of Computer Engineering and IT, Amirkabir University of Technology, Tehran, Iran (pardis@aut.ac.ir).*

Panagakis et al. test several combinations of data reduction and classification methods. So following their example, we reduce data dimensions using one of four techniques: principal component analysis (PCA), non-negative matrix factorization (NNMF), random projection (RP), and downsampling.

PCA, NNMF, and RP linearly project data onto a lower-dimensional subspace. PCA and NNMF are “data-aware,” meaning that they construct projection matrices from the training data. PCA finds directions of maximum variance of the \(N\) dimensional training data, and orthogonally projects all data onto the first \(n < N\) principal components. The orthogonal projection matrix contains the eigenvectors of the data covariance matrix. NNMF decomposes the data matrix \(\mathbf{A}\) into two matrices such that \(||\mathbf{A} – \mathbf{W}\mathbf{H}||_F\) is minimized. The matrix \(\mathbf{W}\) contains a basis, and \(\mathbf{H}\) contains the coefficients. Together, these linearly approximate the original data. NNMF constrains the two matrices to have nonnegative entries, which enforces additive combinations of basis vectors. This is one of the main properties that sets NNMF apart from other factorization techniques like singular value decomposition employed by PCA. On the other hand, RP and downsampling are "data-oblivious" technique. In RP, the projection matrix is an orthogonal random matrix. It preserves distances by the Johnson-Lindenstrauss lemma. Downsampling is just good old lowpass filtering before decimation.

Panagakis et al. compare sparse representation-based classification (SRC) against two other methods: nearest neighbors (NN), support vector machine (SVM).

In SVM, the parameters of a separating boundary (and the support vectors) are learned from training data. Using a kernel “trick,” an SVM can separate classes nonlinearly. Panagakis et al. use a linear kernel. Contrary to SVM, NN and SRC do not have a training phase; instead, these methods make use of the whole training set to classify a test point. In NN, a test point is assigned to the same class as its nearest neighbor (with respect to some distance, or similarity, metric). The performance of NN depends on the choice of similarity measure. Panagakis et al. (EUSIPCO 2009) choose the cosine similarity, which is the angle of separation between two points. Unlike NN and SRC, SVM is a binary classifier. Several approaches exist to extend its application to the multiclass setting. One such approach, called one-versus-one, is used in the LIBSVM library. In this case, binary SVMs are trained to classify between every possible pairing of labels. For a given input then, all SVMs are tested, and the class with the most votes is selected. This method does not form a partition of the space with respect to the classes and hence does not result in a true mutliclass classifier.

SRC assumes that features of each class can be approximated the best by linear combinations of features from the same class, and not other classes. I describe SRC here, but a little repetition won’t hurt. Initially, the class of the feature \(\vy\) is unknown. We have made all of the training features have unit norm, and combined them columnwise into a dictionary, \(\MA\). We obtain a sparse representation of \(\vy\) in terms of \(\MA\) by solving the following \(\ell_1\)-minimization problem:

$$ \mathbf{c} = \arg \min_{\mathbf{c}} \|\mathbf{c}\|_1 \mbox{ s.t. } \mathbf{A}\mathbf{c} = \mathbf{y}.$$

We then look at the reconstruction errors given by reconstructing \(\vy\) using only members of each class.

In other words, for the \(i\)the class, we make the vector \(\delta_i(\mathbf{c})\) be zero for all elements in \(\mathbf{c}\) not corresponding to class \(i\), and keep the rest of the coefficients found in the solution above.

Then, we find the class \(i\) that minimizes the \(\ell_2\) norm of the error,

$$ \arg \min_i \|\mathbf{y} – \mathbf{A}\delta_i(\mathbf{c})\|_2. $$

To test our implementations of classification and dimensionality reduction techniques used by Panagakis et al., we use a dataset of 5000 handwritten digits, each a 14-by-14 matrix of pixel intensities vectorized and stacked into the data matrix.

In our experiments with the digits, we test classification accuracies using dimension reduction from 196 dimensions to 14, 50, and 100 dimensions.

The dataset is split into two sets of equal size for testing and training (or dictionary building).

We train each linear SVM using a cost of 1.2; and a gamma of 0.07.

In the images below, we show the mean classification accuracy of all pairs of classifiers and data reduction techniques.

As the results show, NN is surprisingly good at the feature dimensions we test; and SRC is competitive, though much more computationally complex.

SVM with a linear kernel just isn’t up to snuff. This is not entirely the fault of the SVM, but could be due to a lack of tuning its parameters. Plus, a radial basis function could perform much better. Overall, we see that PCA and NNMF are better at keeping discriminative information than random projection and downsampling.

For some reason, we see that SRC performs worse for this dataset at higher dimensions when using PCA or NNMF; but we have no idea yet why this should occur. Maybe it has to do with the dictionary becoming less coherent…

Anywho, we are now ready to process the two music genre datasets, and investigate what is working so well here.

We attach our code: classifiers.zip.