Hello, and welcome to Paper of the Day (Po’D): Unsupervised Feature Learning for Audio Classification using Convolutional Deep Belief Networks Edition. Today’s paper is H. Lee, Y. Largman, P. Pham, and A. Y. Ng, “Unsupervised Feature Learning for Audio Classification using Convolutional Deep Belief Networks”, Proc. Neural Info. Process. Systems, Vancouver, B.C., Canada, Dec. 2009. We have recently reviewed another deep architectures paper here. Now, we are looking at “convolutional deep belief networks” (CDBNs) for describing audio signals, which add yet another layer of abstraction to the generative modeling of complex signals, aiming at highly relevant signal descriptions without engineering them by hand.
The following annotation is authored by Pardis Noorzad, in the Department of Computer Engineering and IT, Amirkabir University of Technology, Tehran, Iran (firstname.lastname@example.org).
In this paper, the authors report the results of their experiments with convolutional deep belief networks (CDBNs) for learning features from unlabeled audio data. A deep belief network (DBN) is a generative model that can learn a feature hierarchy from its low-level input data, in this case audio spectrograms. CDBNs extend the application of DBNs to higher-dimensional input, where the convolution allows one to learn translation-invariant features. CDBNs, like DBNs, perform unsupervised feature learning and can therefore be trained on large collections of unlabeled data, which can improve classification accuracy, especially when labeled data is scarce.
In a CDBN, the input layer (visible layer) is made up of \(n_V\) units, which is where one feeds the input data. The next layer (hidden layer, or “detection” layer as it’s called here) consists of \(K\) groups of \(n_H\) units (making \(Kn_H\) units in total). In regular neural networks, each weight corresponds to one visible-to-hidden connection. In this convolutional setting, however, all units within a group share the same filter weights. So we have \(K\) \(n_W\)-dimensional “filter” weights, and therefore \(n_H = n_V - n_W + 1\) for a valid convolution (which means no zero padding, as would be needed for the full convolution). Each detection layer is followed by a “pooling” layer: the detection layer is partitioned into local neighborhoods, and for each region only the maximum value finds its way to the pooling layer (a kind of nonlinear downsampling). Max-pooling reduces computational costs by reducing the number of units feeding the next detection layer. It also provides local translation invariance. The CDBN is trained very much like a DBN (with convolutional RBMs replacing regular RBMs). Feed-forward approximation is used for inference.
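To make the dimensions concrete, here is a minimal numpy sketch of one detection group (valid convolution with a shared filter, followed by a logistic nonlinearity) and its max-pooling layer. The bias value and random inputs are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def detection_layer(v, W, b):
    """One group of a convolutional RBM's detection layer.

    v : (n_V,) visible (input) vector
    W : (n_W,) filter weights shared by all units in the group
    b : scalar bias for the group
    Returns (n_H,) activation probabilities, with n_H = n_V - n_W + 1.
    """
    # "valid" convolution: the filter never extends past the input,
    # so no zero padding is needed
    pre_activation = np.convolve(v, W[::-1], mode="valid") + b
    return 1.0 / (1.0 + np.exp(-pre_activation))  # logistic sigmoid

def max_pool(h, c):
    """Partition h into non-overlapping blocks of length c, keep each max."""
    n = (len(h) // c) * c                # drop any ragged tail
    return h[:n].reshape(-1, c).max(axis=1)

v = np.random.randn(20)                  # n_V = 20 (arbitrary example)
W = np.random.randn(6)                   # n_W = 6, as in the paper
h = detection_layer(v, W, b=0.0)         # n_H = 20 - 6 + 1 = 15
p = max_pool(h, c=3)                     # pooling regions of 3, as in the paper
print(h.shape, p.shape)                  # (15,) (5,)
```

Note the filter reversal `W[::-1]`: `np.convolve` implements true convolution, so reversing the filter turns it into the cross-correlation form usually meant by "filtering" in this literature.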
To train the CDBN on unlabeled data, the authors calculate the spectrogram (20 ms window with 10 ms overlap) of each speech utterance of the TIMIT dataset. They reduce the dimensionality (w.r.t. the frequency) of the spectrogram with PCA whitening, which projects the data onto the first 80 principal components, each normalized by the square root of its eigenvalue. Two detection layers, each with \(K = 300\) groups (so 300 filter weights, also called “bases” here), are trained. They set the filter length \(n_W = 6\), and the max-pooling partition length is set to 3 throughout.
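The PCA-whitening step can be sketched as follows. This is a generic implementation assuming spectrogram frames as rows; the frame dimension and regularization constant are illustrative, and the paper's exact preprocessing pipeline may differ in details.

```python
import numpy as np

def pca_whiten(X, n_components=80, eps=1e-8):
    """PCA-whiten the rows of X: project onto the top principal components
    and divide each by the square root of its eigenvalue, so the retained
    components have (roughly) unit variance."""
    Xc = X - X.mean(axis=0)                        # center the data
    cov = Xc.T @ Xc / len(Xc)                      # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    U, lam = eigvecs[:, order], eigvals[order]
    return Xc @ U / np.sqrt(lam + eps)             # whitened projection

# hypothetical example: 1000 spectrogram frames with 256 frequency bins
X = np.random.randn(1000, 256)
Xw = pca_whiten(X, n_components=80)
print(Xw.shape)            # (1000, 80)
```

The small `eps` guards against division by near-zero eigenvalues for the lowest-variance retained components.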
The authors extract features from the layers of the trained CDBN, and test these in several audio classification tasks. For the speech data, the learned bases bear a good resemblance to speech sounds. For speaker identification, the authors use 10 utterances from each of 168 speakers in a different subset of the TIMIT corpus. They extract the spectrogram, MFCC, and first- and second-layer CDBN features, and test them in classifiers built with SVMs, generalized discriminant analysis, and KNN. (It appears they reduce the features to “summary statistics”, which is just a form of feature agglomeration.) Their CDBN features outperform MFCC and spectrogram features. Outperforming all others is the combined classifier built by linearly combining the outputs of classifiers trained separately on CDBN and MFCC features. The authors find similar results on gender classification and phone identification tasks on the TIMIT dataset.
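The “summary statistics” reduction might look something like the sketch below: collapsing a variable-length feature sequence into a fixed-length vector of per-dimension statistics. The particular statistics chosen here (mean, variance, max) are an assumption for illustration; the paper does not commit to this exact set in the passage summarized above.

```python
import numpy as np

def summary_statistics(F):
    """Collapse a variable-length feature sequence F (frames x dims) into
    a fixed-length vector by concatenating per-dimension statistics.
    A hypothetical reduction in the spirit of the paper's "summary
    statistics" feature agglomeration."""
    return np.concatenate([F.mean(axis=0), F.var(axis=0), F.max(axis=0)])

# e.g. an utterance of 120 frames of 80-dimensional first-layer features
F = np.random.randn(120, 80)
x = summary_statistics(F)
print(x.shape)             # (240,)
```

The resulting fixed-length vectors can then be fed to any standard classifier (SVM, KNN, etc.), regardless of how long each utterance was.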
The authors also assess the application of CDBNs to music classification. In a music genre classification task, the authors train the CDBN with music coming from five genres of the ISMIR 2004 Genre dataset. The authors set the parameters as before, with the exception of the filter length \(n_W = 10\). They calculate spectrograms for randomly selected 3 s segments of songs from each genre. They achieve the highest accuracy when using the first-layer CDBN features, about 73.1% using five training examples per genre. In the music artist classification task, they train the CDBN on an unlabeled collection of classical music performed by four artists. (It is unclear what they mean by “artist,” e.g., Callas, or Verdi? Bernstein or Bernstein?) Second-layer features seem better here (81.9% accuracy), and combining the two layers produced the best results, outperforming the spectrogram and MFCC features.