First, listen here:
and now here:
In both cases, we hear John Lee Hooker playing “One Bourbon, One Scotch, One Beer”. In the first case, though, the sound is FM quality (although compressed to 32 kbps MPEG-1 layer 3); and in the second it is AM quality (same bit rate as before, with a bandpass between 50 Hz and 4500 Hz).
I think it would be difficult to find anyone familiar with the Blues genre who would have trouble classifying both versions as Blues.
In other words, though I have changed the characteristics of the transmission channel, the genre of the music underlying the audio data has not changed; its Bluesiness remains the same. (Hearing the AM-quality one, though, makes me nostalgic. My father always listens to AM radio when he drives.)
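For concreteness, here is a minimal sketch of how such a lo-fi version could be produced. The Butterworth filter and its order are my assumptions; the text above specifies only the 50 Hz to 4500 Hz passband and the MP3 bit rate, and the MP3 encoding (e.g., via LAME) is a separate step.

```python
# Minimal sketch: simulate "AM quality" by bandpass filtering to 50-4500 Hz.
# Assumptions: an order-8 Butterworth bandpass (the post gives only the
# passband); file names are hypothetical; MP3 encoding happens afterwards.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def to_am_quality(infile, outfile, lo=50.0, hi=4500.0, order=8):
    x, fs = sf.read(infile)
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, x, axis=0)  # zero-phase filtering avoids phase distortion
    sf.write(outfile, y, fs)

to_am_quality("hooker_fm.wav", "hooker_am.wav")  # hypothetical file names
```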
Now listen here:
What genre is it?
Now listen here:
In both cases, we hear Boz Scaggs playing “Lowdown” (which won him the Grammy for Best R&B Song of 1976). In the first case, though, the sound is AM quality (although compressed to 128 kbps MPEG-1 layer 3, with a bandpass between 50 Hz and 4500 Hz); and in the second it is FM quality (same bit rate as before).
Again, to my ears, the style underlying the music is no different between the two.
Ok, now this one might be hard for some.
What is the genre?
Though it really pushes the bandwidth limits of AM, I have no trouble recognizing it as Metal. And I don’t think it takes me any more time to hear that it is Metal than with the high fidelity version.
So, what happens when we train a music genre classifier with features extracted from the high fidelity recordings, and attempt to classify low-fidelity recordings that pose no problem for human classification?
I tested this with an approach based on sparse representation classification of some perceptual features (see here for a preliminary discussion), using the Tzanetakis music genre dataset (here I don't care that this dataset has faults).
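In brief, and only as a sketch (this is not the exact system, whose features and solver are described in the discussion linked above): in sparse representation classification, the training feature vectors form the columns of a dictionary, a test vector is sparsely coded against it, and the predicted genre is the class whose atoms give the smallest reconstruction residual. With orthogonal matching pursuit as my stand-in sparse coder, it might look like this:

```python
# Sketch of sparse representation classification (SRC), not the exact system:
# training feature vectors are the (l2-normalized) columns of a dictionary D;
# a test vector y is sparsely coded against D (OMP is my stand-in coder), and
# the predicted class is the one whose atoms give the smallest residual.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(D, labels, y, n_nonzero=10):
    """D: (dim, n_train) dictionary; labels: (n_train,) genre of each column."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(D, y)                                # sparse code for the test vector
    a = omp.coef_
    residuals = {}
    for c in np.unique(labels):
        a_c = np.where(labels == c, a, 0.0)      # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - D @ a_c)
    return min(residuals, key=residuals.get)     # smallest class residual wins
```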
Repeating this 100 times, I perform 10-fold stratified cross-validation, which gives me means for each confusion, as well as narrow 95% confidence intervals.
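For those who want the protocol spelled out, here is a sketch of the evaluation loop; X_hifi, X_lofi, y, and clf are hypothetical names for precomputed feature matrices (same excerpts, same row order), genre labels, and whatever classifier is under test.

```python
# Sketch of the protocol: repeated stratified k-fold CV, training always on
# hi-fi features and testing on either the hi-fi or lo-fi features of the
# held-out folds. The variable names are my placeholders, not the real code.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def repeated_cv_confusions(X_train_src, X_test_src, y, clf, reps=100, folds=10):
    classes = np.unique(y)
    mats = []
    for rep in range(reps):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=rep)
        cm = np.zeros((len(classes), len(classes)))
        for tr, te in skf.split(X_train_src, y):
            clf.fit(X_train_src[tr], y[tr])      # train on (e.g.) hi-fi features
            pred = clf.predict(X_test_src[te])   # classify hi-fi or lo-fi features
            cm += confusion_matrix(y[te], pred, labels=classes)
        mats.append(cm / cm.sum(axis=1, keepdims=True))  # row-normalize
    mats = np.array(mats)
    return mats.mean(axis=0), 1.96 * mats.std(axis=0, ddof=1) / np.sqrt(reps)

# hi-fi train, hi-fi test: repeated_cv_confusions(X_hifi, X_hifi, y, clf)
# hi-fi train, lo-fi test: repeated_cv_confusions(X_hifi, X_lofi, y, clf)
```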
Below we see the confusion matrix of hi-fi learning, and hi-fi classification.
These appear to be fantastic results! Nearly 80% of the time the classifier picks the correct genre.
You might even be dangerously persuaded to believe that the system is working well: we should expect Rock music to have significant stylistic overlap with Metal, Country, Disco, Blues, and Pop, which explains in our minds the confusions surrounding Rock. But that reasoning makes an unjustified assumption: that the system is actually making decisions based on characteristics related to genre.
Now look at the confusions when I classify lo-fi audio with this same system:
The approach has suffered a catastrophic failure, and I am persuaded to argue that it was never comparing genre at all, but some combination of confounding variables completely unrelated to genre. That, or AM radio obliterates music genre for anything without a soul, and I am recognizing it only by the inextinguishable magical force of music.
Surely, since sparse representation classification is so dependent on the dictionary, such a tragic outcome cannot befall a more generalized approach.
So I have tested in the same way (though with only 10 repetitions of 5-fold stratified cross-validation) the method proposed and tested by James Bergstra et al., which won the 2005 MIREX music genre classification competition.
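This is not their implementation, but roughly speaking their method boosts decision stumps over aggregated frame-level features. A crude scikit-learn stand-in on synthetic placeholder data might look like this:

```python
# Rough scikit-learn stand-in (not the authors' code): AdaBoost over decision
# stumps. X_hifi/X_lofi and y are synthetic placeholders for the real
# aggregated frame-level features and genre labels.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_hifi, X_lofi = rng.normal(size=(100, 40)), rng.normal(size=(100, 40))
y = rng.integers(0, 10, size=100)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # stumps; this keyword is
    n_estimators=1000,                              # base_estimator in old sklearn
)
clf.fit(X_hifi, y)           # train on hi-fi features
print(clf.score(X_lofi, y))  # evaluate on lo-fi features
```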
Below we see the confusion matrix of hi-fi learning and hi-fi classification.
As before, we might be excited that, with nearly 80% accuracy, the classifier is working.
But, of course, it is not.
Here are the results for classification of lo-fi sound with the same system (albeit with only one trial of a 5-fold stratified cross-validation, because the nine other trials would take all weekend):
Apparently, Blues is the fall-back option.
As I have said before, even the simplest problem here has yet to be solved.
Since these results are part of an article I am writing, now comes the interesting (or is it pedantic?) question of whether I must prove with some formal listening test that AM radio does not obliterate the genre of the music underlying an audio excerpt, and thus that we should not expect a good genre recognition algorithm to fail on it either. One of the comments given for the rejection of my correspondence submission to TASLP was that “the author did not conduct any subjective formal listening tests other than informal listening by himself.”
I guess I need to query whether several people can tell that two sound files are exactly the same (though I used the Shazam fingerprinting method to detect most of them; a sketch of the idea appears after this list),
or that “Rapper’s Delight” is Hip hop and not Disco,
or that “Raindrops Keep Fallin’ on My Head” is not Country,
or that the 12 excerpts by the New Bomb Turks should not be labeled Metal,
and so on and so forth. (Yes, I am still bitter.)
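As an aside on that duplicate hunt: the actual Shazam method (Wang, 2003) hashes pairs of spectrogram peaks; below is a much cruder stand-in that hashes only the strongest peak per frame, which is enough to flag excerpts that are exact copies of one another.

```python
# Crude stand-in for duplicate detection, not the Shazam algorithm itself
# (Wang, 2003, hashes pairs of spectrogram peaks): here each file is reduced
# to its strongest spectral peak per frame, enough to group exact copies.
import numpy as np
import soundfile as sf
from scipy.signal import stft

def crude_fingerprint(path):
    x, fs = sf.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)                  # mix down to mono
    _, _, Z = stft(x, fs=fs, nperseg=2048)
    return tuple(np.abs(Z).argmax(axis=0))  # dominant frequency bin per frame

def likely_duplicates(paths):
    groups = {}
    for p in paths:
        groups.setdefault(crude_fingerprint(p), []).append(p)
    return [g for g in groups.values() if len(g) > 1]
```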
Just to be sure, I will indeed conduct such a test, though I think it is pretty obvious from history that music broadcast on AM radio did not have its stylistic identity destroyed. Still, the real €100,000 question is this: at what are these two algorithms looking to make their decisions, such that with hi-fi training and hi-fi classification they get 80% correct? What are the confounding variables?