More than two years ago, I blogged about the paper: K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. Int. Soc. Music Information Retrieval, (Amsterdam, the Netherlands), pp. 387-392, Aug. 2010. This paper reports extremely high classification accuracies on GTZAN — a problematic dataset. At least three papers since then make direct comparisons to its 92.7% classification accuracy:
- M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “Unsupervised learning of sparse features for scalable audio classification,” in Proc. ISMIR, (Miami, FL), Oct. 2011.
- J. Wülfing and M. Riedmiller, “Unsupervised learning of local features for music classification,” in Proc. ISMIR, 2012.
- C.-C. M. Yeh and Y.-H. Yang, “Supervised dictionary learning for music genre classification,” in Proc. ACM Int. Conf. Multimedia Retrieval, (Hong Kong, China), Jun. 2012.
To my knowledge, no one has attempted to reproduce these results. I emailed the authors on October 4, 2012, asking whether they had discovered any errors in their tests that could have produced such a high classification accuracy. Since I have yet to hear back, it is time to reproduce the results myself.
Today, I began looking more deeply at the paper. I was struck by six things.
First, the reported confusion matrix of the experiment is symmetric. Exactly symmetric. Among the Disco excerpts in GTZAN, there are several Hip hop and Pop excerpts, but here we see that the proposed classifier never confused any Disco excerpt with Hip hop or Pop, and no Hip hop and Pop excerpts are ever confused with Disco. That alone tells me that this classifier is not recognizing those genres. And what is the probability that a confusion matrix of 1000 excerpts in ten classes is exactly symmetric?
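A quick simulation makes the point. The sketch below (my own toy model, not anything from the paper) generates confusion matrices for a hypothetical classifier with the reported 92.7% accuracy over 100 excerpts per class, with errors spread uniformly over the other nine classes, and counts how often the result is exactly symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_confusion(n_classes=10, per_class=100, acc=0.927):
    """Simulate a confusion matrix for a classifier with the given
    per-class accuracy; errors land uniformly on the other classes."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for true in range(n_classes):
        correct = rng.binomial(per_class, acc)
        C[true, true] = correct
        # spread the errors uniformly over the other nine classes
        errors = rng.integers(0, n_classes - 1, per_class - correct)
        errors[errors >= true] += 1        # skip over the true class
        np.add.at(C, (true, errors), 1)
    return C

trials = 2000
symmetric = sum(np.array_equal(C, C.T)
                for C in (random_confusion() for _ in range(trials)))
print(symmetric, "of", trials, "simulated matrices were exactly symmetric")
```

Under this model the chance of symmetry is astronomically small: every one of the 45 off-diagonal pairs would have to match.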
Second, this paper states something unusual about its use of GTZAN:
All the samples are digitized 44.1 kHz, 16-bit, and mono signal in preprocessing. The 30-seconds of audio clips after initial 12 seconds are used.
My interpretation may be skewed by the passive voice, but it appears to me that the authors upsample every GTZAN excerpt (original sampling rate 22050 Hz), and then extract some 30 s part of it — except that all GTZAN excerpts are at most 30 s long (as stated earlier in the same paper). It might be that the authors are describing how each excerpt was produced by Tzanetakis — which would still be inaccurate.
Third, about its experimental design, Chang et al. say something quite odd: “To evaluate the proposed method for genre classification, we set up all the experimental parameters to be as close as possible to those used in” T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in Proc. ICASSP, 2010. In that paper, Sainath et al. write at the beginning of the experiments section:
Classification experiments are conducted on TIMIT acoustic phonetic corpus. The corpus contains over 6,300 phonetically rich utterances divided into three sets. The standard NIST training set consists of 3,696 sentences, used to train various models used by the recognizer. The development set is composed of 400 utterances and is used to train various classifier tuning parameters. The full test set includes 944 utterances, while the core test set is a subset of the full test set containing 192 utterances. In accordance with standard experimentation on TIMIT, the 61 phonetic labels are collapsed into a set of 48 for acoustic model training, ignoring the glottal stop [q]. For testing purposes, the standard practice is to collapse the 48 trained labels into a smaller set of 39 labels.
Sainath et al. go on, but nothing in that paper hints at what Chang et al. could mean.
Chang et al. do say after citing Sainath et al., “In particular, the recognition rate is obtained from 10-fold cross validation.” However, Sainath et al. do not even use 10-fold cross validation.
The fourth striking thing about this paper is how irreproducible I am finding its work.
For example, take the zero crossing rate feature, which counts the number of times a waveform crosses zero.
I assume I take a frame using an analysis window of 93 ms, compute the number of times the sample amplitudes cross the zero line, and then do the same for the next analysis window, shifted about 47 ms later.
For the duration of an entire 30 s excerpt, that gives me 638 numbers. Chang et al. write, “… MFCC, spectral centroid, spectral rolloff, spectral flux, and zero crossings are short-time features, thus their statistics are computed over a texture window.” Thus, for each 3 second texture window, I have 63 numbers — but what statistics am I to compute? Chang et al. do not say, but one table shows that the feature dimension for the zero crossing component is two. Jang et al. 2008 state they compute the first and second moments. Hence, I assume they are computing the mean and variance of the zero crossing rate for each texture window.
With that, I have ten zero crossing rate means and variances for any 30 s GTZAN excerpt.
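The pipeline I have pieced together above can be sketched as follows. Everything here is my assumption, not the authors' code: 93 ms analysis windows with 50% overlap, zero crossing counts per window, then mean and variance over each 3 s texture window:

```python
import numpy as np

def zcr_features(x, sr=22050, win_s=0.093, texture_s=3.0):
    """Assumed pipeline: zero crossing rate per 93 ms analysis window
    with 50% overlap, then mean and variance per 3 s texture window."""
    win = int(win_s * sr)                   # ~2050 samples
    hop = win // 2                          # 50% overlap
    n_frames = 1 + (len(x) - win) // hop
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + win]
        signs = np.signbit(frame).astype(np.int8)
        zcr[i] = np.count_nonzero(np.diff(signs))  # sign changes
    # 63 analysis windows fit in one 3 s texture window
    per_texture = 1 + int((texture_s - win_s) / (win_s / 2))
    n_tex = n_frames // per_texture
    return np.array([(zcr[t*per_texture:(t+1)*per_texture].mean(),
                      zcr[t*per_texture:(t+1)*per_texture].var())
                     for t in range(n_tex)])

# a 30 s white-noise "excerpt" yields ten (mean, variance) pairs
x = np.random.default_rng(1).standard_normal(30 * 22050)
feats = zcr_features(x)
print(feats.shape)   # (10, 2)
```

Note that even this simple feature forces choices the paper never spells out: how to truncate the last partial texture window, whether variance is biased or unbiased, and whether a crossing is a strict sign change.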
Now, how should I classify a 30 s excerpt, and not a 3 s segment? Am I to compute statistics of the ten texture windows to create one feature vector for the entire excerpt? Or are the results only for one 3 second segment from each excerpt? Or is there some voting going on, as in D. Jang, M. Jin, and C. D. Yoo, “Music genre classification using novel features and a weighted voting method,” in Proc. ICME, pp. 1377-1380, 2008? These details are missing entirely — as are how to compute “octave-based spectral contrast (OSC)” features, “octave-based modulation spectral contrast (OMSC)” features, “modulation spectral flatness measure (MSFM)” features, and “modulation spectral crest measure (MSFC)” features. The first two are described somewhat confusingly in C.-H. Lee, J.-L. Shih, K.-M. Yu, and J.-M. Su, “Automatic music genre classification using modulation spectral contrast feature,” in Proc. ICME, 2007; and the last two are described in the Jang et al. ICME 2008 paper cited above.
I face the same problem with all of their other features.
The fifth thing that strikes me is that Table 3 in Chang et al. lists the feature dimension as 64, and states, “the feature dimension of the proposed approach is considerably lower than those of the SRC-based approaches, demonstrating the effective [sic] of CS in extracting features with discriminating power.”
However, Table 6 lists all the features Chang et al. use along with their dimensions (except that they appear to have forgotten MSFC), and the total number of dimensions is 115.
Though it is never stated, I assume this is where they perform a random projection with a 64×115 matrix. Chang et al. do state, “the measurement matrix is a Gaussian random matrix,” but they never explicitly state its dimensions.
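If my assumption is right, the step amounts to something like the following sketch: compress the 115-dimensional feature vector to 64 “measurements” with a Gaussian random matrix. The dimensions and the 1/√m scaling are my guesses, since the paper gives neither:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed compressive-sampling step: a 64x115 Gaussian measurement
# matrix maps the full feature vector to 64 random projections.
n_features, n_measurements = 115, 64
Phi = rng.standard_normal((n_measurements, n_features)) / np.sqrt(n_measurements)

x = rng.standard_normal(n_features)   # a stand-in feature vector
y = Phi @ x                           # the 64-dimensional "measurements"
print(y.shape)   # (64,)
```

A fresh draw of Phi gives a different projection every run, so whether the matrix is fixed once or redrawn per fold is yet another unreported detail that affects reproducibility.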
The sixth very striking thing is that in several places I find plagiarism of D. Jang, M. Jin, and C. D. Yoo, “Music information retrieval using novel features and a weighted voting method,” in Proc. IEEE Int. Symp. Industrial Elec., pp. 1341-1346, July 2009 — which itself plagiarizes D. Jang, M. Jin, and C. D. Yoo, “Music genre classification using novel features and a weighted voting method,” in Proc. ICME, pp. 1377-1380, 2008.
For instance, Chang et al. write:
The length of the analysis window was set to 93ms, and 50% overlap was used for feature extraction. The length of texture window was set to 3 second, thus a texture window contains 63 analysis windows.
Jang et al. 2008 write:
The length of the analysis window was set to 93ms, and 50% overlap was used for feature extraction. The length of texture window was set to 3s, thus a texture window contains 63 analysis windows.
Both sentences are missing the definite article in front of “texture window.”
Again, Chang et al. write:
Then spectral peaks and spectral valleys are estimated by averaging across the small neighborhood around maximum and minimum values of the amplitude spectrum respectively.
Jang et al. 2008 write:
… spectral peaks and spectral valleys are estimated by averaging across the small neighborhood around maximum and minimum values of the amplitude spectrum respectively.
Again, Chang et al. write:
Then for each subband, the modulation spectrum is obtained by applying the discrete Fourier transform (DFT) on the sequence of the sum of amplitude spectrum. OMSC is obtained from spectral peaks and spectral contrasts of the modulation spectrum.
Jang et al. 2008 write:
Then for each subband, the modulation spectrum is obtained by applying the discrete Fourier transform (DFT) on the sequence of the sum of amplitude spectrum. The OMSC is obtained from spectral peaks and spectral contrasts of the modulation spectrum.
And again, Chang et al. write of two features:
Zero crossings: The number of time domain zero crossings of the music signal.
Low energy: The percentage of analysis windows that have energy less than the average energy across the texture window.
Jang et al. 2008 write:
The zero crossings feature is the number of time domain zero crossings of the music signal. The low-energy feature is defined as the percentage of analysis windows that have energy less than the average energy across the texture window.
All in all, whereas before looking so closely I was merely doubtful of the results, I am now nearly sure they are erroneous.
I will power on with reproducing them, however, making assumptions as reasonable as possible — and developing a spidey sense for how to write papers so that all essential details are made clear.
Once finished, I will explore whether compressed sensing helps or hurts the classification — I predict it will do nothing but hurt accuracy (as I saw in some experiments with random projection for feature reduction here).