Does AM radio obliterate music genre?

First, listen here:
and now here:

In both cases, we hear John Lee Hooker playing “One Bourbon, One Scotch, One Beer”. In the first case, though, the sound is FM quality (although compressed to 32 kbps MPEG-1 Layer 3); and in the second it is AM quality (same bit rate as before, bandpassed to 50–4500 Hz).
I think it would be difficult to find anyone familiar with the Blues who, hearing either excerpt, would classify it as anything other than Blues.
In other words, though I have changed the characteristics of the transmission channel, the genre of the music underlying the audio data is not changed; its Bluesiness remains the same. (Hearing the AM-quality one though makes me nostalgic. My father always listens to AM radio when he drives.)
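The AM-quality versions can be approximated offline. Here is a minimal sketch, assuming NumPy and SciPy are available, of the 50–4500 Hz bandpass described above (the MP3 compression step is omitted, and the Butterworth design and the name `simulate_am` are my assumptions; the post does not say which filter was actually used):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simulate_am(audio, sample_rate, low_hz=50.0, high_hz=4500.0, order=6):
    """Bandpass a mono signal to an AM-like 50-4500 Hz band."""
    nyquist = sample_rate / 2.0
    sos = butter(order, [low_hz / nyquist, high_hz / nyquist],
                 btype="bandpass", output="sos")
    return sosfilt(sos, audio)

# Example: filter one second of white noise sampled at 44.1 kHz.
fs = 44100
x = np.random.default_rng(0).standard_normal(fs)
y = simulate_am(x, fs)
```

Any reasonable bandpass (FIR or IIR) should give audio with the same AM-radio character; the exact design only changes the transition bands.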

Now listen here:
What genre is it?
Now listen here:

In both cases, we hear Boz Scaggs playing “Lowdown” (which won him a Grammy for Best R&B Song of 1976). In the first case, though, the sound is AM quality (although compressed to 128 kbps MPEG-1 Layer 3, bandpassed to 50–4500 Hz); and in the second it is FM quality (same bit rate as before).
Again, to my ears, the style underlying the music is no different between the two.

Ok, now this one might be hard for some.
Listen here:

What is the genre?
Though it really pushes the bandwidth limits of AM, I have no trouble recognizing it as Metal. And I don’t think it takes me any more time to hear that it is Metal than with the high fidelity version.

So, what happens when we train a music genre classifier with features extracted from the high fidelity recordings, and attempt to classify low-fidelity recordings that pose no problem for human classification?
I tested this with an approach based on sparse representation classification of some perceptual features (see here for a preliminary discussion), and the Tzanetakis music genre dataset (here I don’t care that this dataset has faults).
I repeat 10-fold stratified cross-validation 100 times, which gives me a mean for each confusion, as well as narrow 95% confidence intervals.
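That evaluation protocol can be sketched with scikit-learn; the synthetic dataset and the nearest-neighbor classifier below are placeholders standing in for the actual genre features and classifier, so only the cross-validation scheme itself reflects the text (with 5 repetitions here instead of 100, to keep it quick):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the real feature vectors and genre labels.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
mats = []
for train, test in cv.split(X, y):
    clf = KNeighborsClassifier().fit(X[train], y[train])
    pred = clf.predict(X[test])
    # Row-normalized confusion matrix for this fold.
    cm = confusion_matrix(y[test], pred, labels=[0, 1, 2]).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)
    mats.append(cm)

mats = np.array(mats)
mean_cm = mats.mean(axis=0)
# Normal-approximation 95% half-widths for each confusion entry.
ci95 = 1.96 * mats.std(axis=0, ddof=1) / np.sqrt(len(mats))
```

The mean of `mats` gives the entries of the confusion matrix, and `ci95` the half-widths of the confidence intervals around them.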
Below we see the confusion matrix of hi-fi learning, and hi-fi classification.
These appear to be fantastic results! Nearly 80% of the time the classifier picks the correct genre.
You might even be dangerously persuaded to believe that the system is working well: we should expect Rock to have significant stylistic overlap with Metal, Country, Disco, Blues, and Pop, which conveniently explains the confusions surrounding Rock. But that makes the unjustified assumption that the system is actually making decisions based on characteristics related to genre.

Now look at the confusions when I classify lo-fi audio with this same system:
The approach has suffered a catastrophic failure, and I am persuaded to argue that it was never comparing genre at all, but some combination of confounding variables completely unrelated to genre. That, or AM radio obliterates the genre of music for anything without a soul, and I recognize it only by the inextinguishable magical force of music.

Surely, since sparse representation classification is so dependent on the dictionary, such a tragic outcome cannot befall a more generalized approach.
So I have tested in the same way (though with only 10 repetitions of 5-fold stratified cross validation) the method proposed and tested by James Bergstra et al. — which won the 2005 MIREX music genre classification competition.
Below we see the confusion matrix of hi-fi learning and hi-fi classification.
As before, we might be excited that, with nearly 80% accuracy, the classifier is working.
But, of course, it is not.
Here are the results for classification of lo-fi sound with the same system (albeit with only one result of a 5-fold stratified cross validation — because the 9 other trials will take all weekend):
Apparently, Blues is the fall-back option.
As I have said before, even the simplest problem here has yet to be solved.

Since these results are part of an article I am writing, now comes the interesting (or is it pedantic?) question of whether I must prove with a formal listening test that AM radio does not obliterate the genre of music underlying an audio excerpt, and thus that a good genre recognition algorithm should not be expected to fail either. One of the comments given for the rejection of my correspondence submission to TASLP is that “the author did not conduct any subjective formal listening tests other than informal listening by himself.”
I guess I need to query whether several people can tell two sound files are exactly the same (though I used the Shazam fingerprinting method to detect most of them),
or that Rapper’s Delight is Hip hop and not Disco,
or that “Raindrops Keep Fallin’ on My Head” is not Country,
or that the 12 excerpts by the New Bomb Turks should not be labeled Metal,
and so on and so forth. (Yes, I am still bitter.)

Just to be sure, I will indeed conduct such a test, though I think it is pretty obvious from history that music broadcast on AM radio did not have its stylistic identity destroyed. Still, the real €100,000 question is this: at what are these two algorithms looking to make their decisions, such that with hi-fi training and hi-fi classifying they get 80% correct? What are the confounding variables?


11 thoughts on “Does AM radio obliterate music genre?”

  1. Intriguing, but I think you’re missing a crucial experiment. Can you do a variant where you train AND test on the lo-fi data?
    My understanding is that nearly all computational perception systems have a lot of sensitivity to the channel characteristics. For instance, speech recognizers trained on CD-quality audio will do very poorly on telephone quality speech, but you wouldn’t argue that they’d never learned anything about speech at all.
    I predict that if you train and test on the lo-fi data, you’ll do much better than the cross-fidelity cases, and only a little worse than the hi-fi.


  2. Of course the algorithms will perform much better in those cases. The point I have made is that neither of them, nor the features they use, embodies anything related to music genre. Thus, their application is severely limited. (I have several other observations with hi-fi training and testing showing that these methods are not acting sensibly, either. Those will come later.)
    Automatic speech recognition is, however, quite robust to changes in the channel because of the features used in state-of-the-art systems. The source/filter model, and vocal-tract-length-independent features, are quite powerful for that task because they are directly related to the speech, and not, e.g., to the gender or clothing of the speaker. And speech recognition is a much better-defined problem (occurring over dozens of milliseconds) than recognizing genre (occurring over dozens of seconds, and subject to reflection, business, etc.). We have no sensible features for the latter, except maybe the tags other people have placed.


  3. Are you saying you won’t do the experiment I suggest or won’t post the results?
    I am suggesting that, with just the results above, there is a pretty big hole in your argument that these algorithms do not embody “anything related to music genre.” I think that is an interesting hypothesis, but far from proven.
    [Personally, I tend to believe that these algorithms are learning something about genre — notably whatever aspect of genre is embodied in the subset of “timbre” captured by short-term spectral features, more or less obviously ignoring longer-term musical structure and likely rhythm. As an extreme example, I think these systems could do a pretty good job of classifying “music with drums” vs “music without drums”, which is genre-related].
    I think we might have to agree to disagree about speech recognition. I have seen endless variants of papers where they train on clean speech, add a tiny amount of noise, the results get drastically worse, train on the noise, and the results are OK. I do not think the problem of “train on one set of audio, test on a very different set of audio” is considered “solved” via filter/model systems or vocal-tract length independent features.
    Put differently, it seems that your initial argument is that genre classifiers are likely learning nothing about musical genre and getting all their “mojo” from things like the particular recording studio used or the particular sound engineer’s choices. [If I’m misunderstanding please correct me.] I am suggesting that they are learning some combination of that and usable timbre features. As a first approximation, going to AM quality by putting all the recordings through the same bandpass filter ought to correct for a lot of recording studio/sound engineer choices. If the system *does* still perform reasonably well when trained and tested on such data, this seems to me to be some support for my hypothesis.


  4. Hi RIF,
    My computer is busy on several other experiments, but I will do the test you describe. Why would you think applying a bandpass filter will “correct for a lot of recording studio/sound engineer choices”? They will still leave a mark in the passband.
    You are not misunderstanding me; I am saying these algorithms have no “mojo.” I don’t know from where their performance comes, but I am confident it doesn’t come from something as high level as specific characteristics defining genres. On the surface, 80% accuracy sounds good; but digging deeper there are many troubling signs. I just posted a follow up with specific examples.


  5. That is a good story, Alejandro!
    1. I will test the lo-fi training, but there is no reason to expect the same catastrophic failures. I am also thinking of testing after applying dynamic range compression; but these are mixtures, and the relative levels will stay the same. Is it still Reggae if I remove the heavy spring reverb?
    2. I don’t know the answer, but I would say it has to be dramatic. We use many cues to detect and classify genre, including lyrics, drum patterns, tempo, even recollections of previous hearings, and so on.


  6. Cool, I look forward to it.
    If the lo-fi training/lo-fi testing does well or poorly, what does that tell us? If it did very poorly, that would indicate to me that the algorithms were working using very low or high frequency or very subtle features that otherwise depended on high-fidelity audio, and would be strong evidence for your claim.
    On the other hand, if this approach does decently, as I expect, then I would say that the particular experiment you did of training hi-fi and testing lo-fi does not tell you much. I guess it tells you there are at least some features humans are using [because you can still hear it] that the classifier didn’t use, but that’s not as strong as saying the classifier’s doing nothing related to music.


  7. Perhaps you are simply overfitting your model if it does not generalize well….? For example, perhaps your model is able to squeeze out slightly better classification results by making very strong assumptions about the high frequency content of your feature vectors. But these assumptions might prove catastrophic for AM classification, since for AM signals the high frequency spectrum is essentially low amplitude noise. You might try reducing the number of parameters in your model (so that the FM performance gets slightly worse), and see if this leads to improved AM performance.
    What about this experiment: train on the FM data, but as part of your feature extraction process include the FM-to-AM transformation (e.g., a lowpass filter). Now I suspect you will get good results for both FM and AM classification. In fact, there should be no difference at all in performance between the AM and FM classification, since your features and labels are the same in both cases.


  8. Hi Corey,
    It is hard to say whether overtraining is the case because I am training and testing in a cross-validated way, and with as many training iterations as used in the Bergstra et al. Adaboost work. I don’t quite understand what you mean about including the lowpass filtering as part of the feature extraction.


  9. Hello Bob,
    Pardon my French, but I have the impression that a stupid reason could explain your observation, and the title “Does AM radio obliterate music genre?” looks particularly pretentious.
    Basically, in your original hi-fi database the quality is good (by definition, hi-fi), except for the recordings for which the recording step itself was poorly done. In practice that corresponds to Blues (a lot), to Classical (a little), and to some Jazz (less).
    A classifier (whatever it is) picks up both the “semantic” aspects (genre, tempo, etc.) and the “non-semantic” aspects (presence of noise, bandwidth of the signal, dynamics, etc.). Because many Blues recordings are badly recorded (old discs with a small bandwidth), as are some Classical and Jazz tunes, the classifier learns that a small bandwidth (and the presence of noise, etc.) is a strongly discriminant characteristic, and an easy one to remember, for identifying Blues, and a somewhat less discriminant (but still discriminant) characteristic for identifying Classical and Jazz.
    What happens, then, when you degrade the quality of a Pop or Reggae song? The degradation itself introduces characteristics that normally indicate the genre Blues (and, to a lesser extent, Classical and Jazz). Because these characteristics are easy to detect, and therefore strongly discriminant, a reduced-quality Pop or Reggae song will very often be classified as Blues (or Classical, or Jazz).
    What is the problem here? The major problem is that the quality of a recording becomes a discriminant factor for classification, because for some genres, and only for some genres, the quality of the original recordings is de facto bad.
    What could be the solution? Transform the database so that quality is no longer a discriminant factor for determining genre.
    How to achieve that? The easiest way is perhaps to generate a new database as follows: from the N hi-fi songs present in the database, create a new database of 2N songs, with the first N songs being the original hi-fi songs and the last N songs being lo-fi versions of those N originals. Each song is thus duplicated, and the new database contains a hi-fi and a lo-fi version of every song. In this way, quality is no longer a discriminating factor for recognizing genre.
    That’s all, and it does not prove that AM radio obliterates anything.
    Another way of seeing things is the following:
    – Take your original hi-fi database. To all the songs of one genre, say Classical, add a periodic car-horn noise as background.
    – Train your classifier on this database.
    – Take a new Tango song and add to it the same periodic car-horn noise.
    – Try to classify this Tango song: it will probably be classified as Classical.
    Is that totally crazy, or just normal (that a classifier uses both low-level and high-level features of the objects it classifies)? Or will you write a new post called “Does a car horn obliterate music genre?” ;)
    Chercheur masqué
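[The duplication scheme this comment proposes could be sketched as follows; the helper `bandpass_50_4500` is hypothetical, a crude FFT-based stand-in for the AM-style filter, and the stand-in excerpts are just noise:]

```python
import numpy as np

def bandpass_50_4500(signal, fs):
    """Crude bandpass via the FFT: zero all bins outside 50-4500 Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    spectrum[(freqs < 50) | (freqs > 4500)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def augment_with_lofi(excerpts, labels, fs):
    """Return 2N excerpts: each original plus its lo-fi duplicate."""
    lofi = [bandpass_50_4500(x, fs) for x in excerpts]
    return excerpts + lofi, labels + labels

# Stand-in clips: one second of noise each, at 22.05 kHz.
fs = 22050
rng = np.random.default_rng(1)
excerpts = [rng.standard_normal(fs) for _ in range(4)]
labels = ["blues", "rock", "jazz", "pop"]
aug_x, aug_y = augment_with_lofi(excerpts, labels, fs)
```

[Training on `aug_x`/`aug_y` instead of the originals is exactly the 2N database described: bandwidth then carries no information about genre.]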


  10. Cher monsieur,
    Thank you for your comments! I have no doubt that the reasons for their performance include the one you give. These classifiers are fragile and naive creatures since they are discriminating genres based on nonsensical dimensions. They are not listening to the music, but to the signal. And though they are quite lucky in their mean accuracies, minor changes to the presented signals create insurmountable obstacles and catastrophic failures.
    There are better ways!

