Paper of the Day (Po’D): Learning Features from Audio with Deep Belief Networks Edition

Hello, and welcome to Paper of the Day (Po’D): Learning Features from Audio with Deep Belief Networks Edition. Today’s paper looks at the derivation of discriminative features from audio signals using what are called “deep belief networks”: P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in Proc. Int. Symp. Music Info. Retrieval, (Utrecht, Netherlands), Aug. 2010. This is interesting because I wonder if we can apply this technique to process the auditory spectrograms that Panagakis et al. use in their genre recognition task. How do the resulting features compare with the auditory temporal modulations (ATMs)? I would hope that we could learn features as good as ATMs for a variety of time-scales.

The following is co-authored with Pardis Noorzad, in the Department of Computer Engineering and IT, Amirkabir University of Technology, Tehran, Iran (

The authors explore the use of deep belief networks (DBNs) to automatically extract features from magnitude Fourier transforms of uniformly segmented (46.44 ms) audio signals. A DBN is a multilayer neural network built by stacking restricted Boltzmann machines (RBMs). An RBM is itself a kind of two-layer neural network, with one visible and one hidden layer and no connections between neurons of the same layer (the “restricted” part). They are referred to as “Boltzmann machines” because they use the concept of energy in updating the states of each randomly activating neuron. The RBMs are trained one layer at a time, and the weights they learn initialize the DBN. As for any trained neural network, you feed to its inputs a set of values (e.g., features), and it spits out other sets of values (activations) from each layer (e.g., features in a smaller-dimensional space). Here, the authors use a DBN with 3 hidden layers (so 3 RBMs) having 50 neurons each.
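To make the input representation concrete, here is a minimal sketch of cutting a signal into uniform frames and taking the magnitude of each frame's Fourier transform. The sampling rate of 22050 Hz is my assumption (it is not stated above); with it, a 1024-sample frame lasts 46.44 ms and yields 513 magnitude values. The function name `magnitude_frames`, the lack of overlap and windowing, and the test tone are all illustrative choices, not details taken from the paper.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sampling rate in Hz (not given in the post)
FRAME_LEN = 1024     # 1024 / 22050 s is about 46.44 ms


def magnitude_frames(signal, frame_len=FRAME_LEN):
    """Split a 1-D signal into non-overlapping frames and return |DFT| of each.

    Trailing samples that do not fill a whole frame are dropped.
    The result has shape (n_frames, frame_len // 2 + 1).
    """
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a 440 Hz tone as stand-in audio
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)

spectra = magnitude_frames(tone)
frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE
print(spectra.shape, round(frame_ms, 2))
```

Each row of `spectra` is one candidate input vector for the network.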

Within the context of music genre recognition, the authors train their DBN in the following way.
They take a set of Fourier magnitudes corresponding to musical audio from one of 10 genres and, using backpropagation of error, make the DBN output a positive activation from the output neuron representing the correct genre.
They then repeat this process with features from the other genres until the DBN is trained.
Then, they use the activations from each layer for a given input feature to produce the new feature.
In other words, after training the DBN, each layer produces a set of values that is passed onto the next layer, until it reaches the output layer that corresponds to the genre classes.
The idea is that these activations provide a higher-level representation of the data in a possibly lower-dimensional space.
In this particular case, the authors’ DBN reduces the original 513-dimensional features to features of 50 to 150 dimensions, depending on how many of the layers’ activations they choose to keep (each of the 3 hidden layers contributes 50).
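As a rough illustration of how activations become features, the sketch below pushes 513-dimensional inputs through three hidden layers of 50 units and collects each layer's output. The weights are random stand-ins for trained ones, the sigmoid nonlinearity is my assumption, and the names (`layer_activations`, `features_all_layers`) are hypothetical. It only shows the shapes involved, not the authors' training procedure.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_activations(x, weights, biases):
    """Feed x of shape (n_frames, n_in) forward and return every hidden layer's output."""
    activations = []
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)  # output of this layer, input to the next
        activations.append(h)
    return activations


rng = np.random.default_rng(0)
sizes = [513, 50, 50, 50]  # input size, then three hidden layers of 50 units

# Random placeholders; in practice these would come from RBM pretraining
# followed by supervised fine-tuning.
weights = [rng.normal(0.0, 0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

frames = rng.random((8, 513))  # eight fake magnitude spectra
acts = layer_activations(frames, weights, biases)

features_one_layer = acts[0]                        # 50 dimensions per frame
features_all_layers = np.concatenate(acts, axis=1)  # 150 dimensions per frame
print(features_one_layer.shape, features_all_layers.shape)
```

Keeping one layer gives 50 values per frame, and concatenating all three gives 150, which matches the range quoted above.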

The authors could simply feed an input feature vector into their trained DBN and read the genre off its output, but their interest is in the activations as features.
Thus, they test how well these new features perform in training and testing a multiclass support vector machine (SVM with a Gaussian kernel) for genre classification, as well as music tagging.
First, from a 30 s audio clip, they find the majority vote for features produced by 10,000 46 ms frames. Over the entire Tzanetakis dataset, the authors observe a mean accuracy of 77%, which is 15% higher than just using Mel-frequency cepstral coefficients (MFCCs), i.e., an example of engineered features.
Using instead the means and standard deviations of the DBN features over 5 second segments, thus aggregating them, the authors observe an accuracy of 84.3%, which is 5% higher than using similarly aggregated MFCC features.
This improvement makes sense because the DBN learns frame-level features and not relationships between frames, so aggregating over time supplies context that the frame-level features lack.
(The 84.3% still lags behind the highest reported accuracy, 93.7%, obtained by Panagakis et al.)
In the music tagging procedure, the authors show that the DBN features in general outperform a state-of-the-art set of timbral and temporal features by about 3% in accuracy.
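The two ways of going from frames to a clip-level decision can be sketched as follows: a majority vote over per-frame class predictions, and mean and standard deviation pooling of frame features over fixed-length segments. The helper names, the random data, and the figure of 107 frames per 5-second segment (non-overlapping 1024-sample frames at an assumed 22050 Hz) are mine, for illustration only. The SVM that would consume these features is left out.

```python
import numpy as np


def majority_vote(frame_labels, n_classes):
    """Return the class predicted most often across the frames of one clip."""
    counts = np.bincount(frame_labels, minlength=n_classes)
    return int(np.argmax(counts))


def aggregate(features, frames_per_segment):
    """Pool frame features of shape (n_frames, n_dims) into per-segment statistics.

    Each output row holds the per-dimension means followed by the per-dimension
    standard deviations of one segment; leftover frames are dropped.
    """
    n_segments = features.shape[0] // frames_per_segment
    trimmed = features[: n_segments * frames_per_segment]
    segments = trimmed.reshape(n_segments, frames_per_segment, features.shape[1])
    return np.concatenate([segments.mean(axis=1), segments.std(axis=1)], axis=1)


# Hypothetical per-frame genre predictions for one clip (10 genres)
labels = np.array([3, 3, 1, 3, 7, 1])
clip_genre = majority_vote(labels, n_classes=10)

# Hypothetical 50-dimensional frame features for one clip
rng = np.random.default_rng(0)
frame_features = rng.random((646, 50))
segment_features = aggregate(frame_features, frames_per_segment=107)
print(clip_genre, segment_features.shape)
```

Pooling doubles the feature dimension (means plus standard deviations) while cutting the number of vectors per clip from hundreds of frames to a handful of segments.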

These results look very promising, but there are some computational and theoretical hurdles.
There is an increase in memory usage and computation time when using short time frames of audio. The authors had to limit their training and testing set sizes because of this issue. Also, the problem remains of how to automatically tune the hyperparameters of the DBN (number of hidden layers, number of units per layer, number of training epochs, etc.) for a given problem.