Previously, I discussed the dismal performance of two state-of-the-art music genre recognition systems when they are trained on hi-fi musical audio but made to classify lo-fi musical audio, a transformation that unarguably preserves the genre of the underlying music, and to which any system actually using genre-relevant features should be robust. Then, I discussed specific instances where these systems, even when trained and tested using hi-fi musical audio, make classifications that are “unforgivable”: ABBA’s “Dancing Queen” is Metal, “A Whole New World” from Aladdin and “The Lion Sleeps Tonight” are Blues, and “Disco Duck” is Rock. Regardless of their 80% mean accuracies, such behavior tells me that these algorithms act like a child memorizing irrelevant things (“Metal is loud; Classical music is soft”), instead of learning the stylistic indicators humans actually use to classify and discuss music (“Hip hop often makes use of samples and emphasizes speech more than singing; Blues typically has strophic form built upon 12 bars in common time; Jazz typically emphasizes variations upon themes passed around as solos in small ensembles; Disco is often danceable with a strong regular beat, and emphasizes bass lines, hi-hats, and sexual subjects”). The means and variances of MFCCs, or modulations in time-frequency power spectra, do not embody characteristics relevant to classifying music genre; and thus these algorithms do not embody anything related to music genre, part 2.
In the comments here, readers rif and Alejandro asked me to run the same experiment, but this time classifying lo-fi audio (basically AM quality) with a system also trained on lo-fi audio.
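For anyone who wants to create similar lo-fi material, below is a minimal sketch of one way to do it in Python with scipy: bandpass filter the audio down to roughly the bandwidth of AM radio. The band edges and filter order here are illustrative assumptions, not the exact filter I used.

```python
from scipy.signal import butter, sosfiltfilt

def make_lofi(x, fs, low_hz=100.0, high_hz=4000.0, order=8):
    """Bandpass filter a signal down to roughly AM-radio bandwidth.

    x       : 1-D array of audio samples
    fs      : sampling rate in Hz
    low_hz  : lower band edge in Hz (removes rumble)
    high_hz : upper band edge in Hz (AM broadcast audio is roughly 4-5 kHz wide)
    order   : Butterworth filter order; an illustrative choice, not the filter I used
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```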
Here is the confusion matrix (10 runs of 5-fold stratified cross-validation) of the Bergstra et al. algorithm:
We see that the mean accuracy has dropped by about 10 points, to about 70%.
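In case anyone wants to reproduce this kind of table, here is a minimal sketch of accumulating a confusion matrix over repeated stratified cross-validation with scikit-learn. The feature matrix X, the labels y, and the support vector classifier are placeholders for illustration; they are not the features or the classifier of the Bergstra et al. system.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC  # stand-in classifier, not the one used by Bergstra et al.

def repeated_cv_confusion(X, y, n_classes, n_runs=10, n_folds=5, seed=0):
    """Accumulate a confusion matrix over n_runs of n_folds stratified cross-validation."""
    total = np.zeros((n_classes, n_classes), dtype=float)
    accuracies = []
    for run in range(n_runs):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + run)
        for train_idx, test_idx in skf.split(X, y):
            clf = SVC().fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            total += confusion_matrix(y[test_idx], pred, labels=np.arange(n_classes))
            accuracies.append(np.mean(pred == y[test_idx]))
    # normalise each row to show how each true class is distributed over the predictions
    row_normalised = total / total.sum(axis=1, keepdims=True)
    return row_normalised, float(np.mean(accuracies))
```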
And here is the confusion matrix (100 runs of 10-fold stratified cross-validation) of the approach based on sparse representation classification of auditory temporal modulations:
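For readers unfamiliar with that second system, here is a minimal sketch of the classification idea it rests on, sparse representation classification: code the test feature vector as a sparse combination of the training feature vectors, and assign it to the class whose training vectors reconstruct it with the smallest residual. The auditory temporal modulation features themselves are omitted, and orthogonal matching pursuit stands in here for the sparse solver of the actual system; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(D, labels, x, n_nonzero=10):
    """Sparse representation classification of a single test vector x.

    D        : (n_features, n_train) matrix whose columns are training feature vectors
    labels   : (n_train,) class label of each column of D
    x        : (n_features,) test feature vector
    n_nonzero: sparsity of the code (an assumption; the published system solves a
               convex sparse recovery problem rather than using OMP)
    """
    labels = np.asarray(labels)
    # unit-normalise the dictionary columns and the test vector
    Dn = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
    xn = x / (np.linalg.norm(x) + 1e-12)

    # sparsely code x over the whole training dictionary
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(Dn, xn)
    a = omp.coef_

    # keep only the coefficients belonging to one class at a time; the class whose
    # atoms reconstruct x with the smallest residual wins
    residuals = {}
    for c in np.unique(labels):
        a_c = np.where(labels == c, a, 0.0)
        residuals[c] = np.linalg.norm(xn - Dn @ a_c)
    return min(residuals, key=residuals.get)
```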
Now, to be fair, I must mention that I made no adjustments to these algorithms. The Bergstra et al. approach still uses 40 MFCCs extending all the way to about 10 kHz, even though everything above 4 kHz has been attenuated by 50 dB by the bandpass filter.
And the same thing for the sparse modulation approach.
So, I would expect that removing these now-irrelevant features (those describing the attenuated band above 4 kHz) from the datasets would improve the mean accuracies of both approaches.
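As a sketch of what that adjustment could look like for the MFCC features, one might cap the mel filterbank at the 4 kHz band edge and summarise each coefficient by its mean and variance over frames, as below with librosa. The number of coefficients and the summary statistics are illustrative assumptions, not the exact configuration of the Bergstra et al. system.

```python
import numpy as np
import librosa

def mfcc_stats(path, n_mfcc=20, fmax=4000.0):
    """Bag-of-frames MFCC feature that never looks above the lo-fi band edge.

    The mel filterbank is capped at fmax, and each coefficient is summarised
    by its mean and variance over all frames.
    """
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, fmax=fmax)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])
```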
Yet still, though I have trained and tested with lo-fi audio, can I say I have somehow removed the influence of confounding factors (such as compression and equalization, loudness, etc.)? Do such things only leave their mark above 4 kHz?
Further tests I have in mind will shed more light on this subject, and, I think, will clearly show that these algorithms do not embody anything related to music genre, part 3.