These algorithms do not embody anything related to music genre, part 2

Previously, I discussed the dismal performance of two state-of-the-art music genre recognition systems when trained on hi-fi musical audio but made to classify lo-fi musical audio — a transformation that unarguably preserves the genre of the underlying music, and to which any system actually using features related to genre should be robust. Then, I discussed specific instances where these systems, even when trained and tested using hi-fi musical audio, make classifications that are “unforgivable”: ABBA’s “Dancing Queen” becomes Metal, “A Whole New World” from Aladdin and “The Lion Sleeps Tonight” become Blues, and “Disco Duck” becomes Rock. Regardless of their 80% mean accuracies, such behavior tells me that these algorithms act like a child memorizing irrelevant things (“Metal is loud; Classical music is soft”), instead of using the stylistic indicators humans actually use to classify and discuss music (“Hip hop often makes use of samples and emphasizes speech more than singing; Blues typically has strophic form built upon 12 bars in common time; Jazz typically emphasizes variations upon themes passed around as solos in small ensembles; Disco is often danceable with a strong regular beat, and emphasizes bass lines, hi-hats, and sexual subjects”). The means and variances of MFCCs, or modulations in time-frequency power spectra, do not embody characteristics relevant to classifying music genre; and thus these algorithms do not embody anything related to music genre, part 2.
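For concreteness, here is a minimal sketch (my own, using librosa, not the code of either system under test) of the kind of bag-of-frames MFCC statistics such systems typically reduce an excerpt to; the file name is a placeholder, and the parameter values simply mirror the 40 MFCCs up to about 10 kHz mentioned further down.

```python
# Minimal sketch (librosa), not the authors' code: reduce an excerpt to the
# means and variances of its frame-wise MFCCs, the kind of "bag-of-frames"
# statistics discussed above. "excerpt.wav" is a placeholder path.
import numpy as np
import librosa

def mfcc_stats(path, n_mfcc=40, fmax=10000):
    y, sr = librosa.load(path, sr=22050, mono=True)
    # fmax bounds the upper edge of the mel filterbank used for the MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, fmax=fmax)
    # Collapse the whole excerpt to one fixed-length vector of means and variances.
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

features = mfcc_stats("excerpt.wav")  # 80-dimensional vector for this excerpt
```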


In the comments here, readers rif and Alejandro asked me to run the same experiment, but classifying lo-fi audio (basically AM quality) with a system trained with lo-fi audio.
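What “lo-fi (basically AM quality)” amounts to here is a severe bandlimiting of the audio. A rough sketch of how one might produce such audio with scipy is below; the cutoff frequencies and filter order are placeholder values, not necessarily the exact filter I used.

```python
# Rough illustration (scipy) of degrading audio to roughly AM quality by
# bandpass filtering; the cutoffs and order are placeholder values.
from scipy.signal import butter, sosfilt

def to_lofi(y, sr, low_hz=300.0, high_hz=4000.0, order=8):
    # Butterworth bandpass in second-order sections for numerical stability.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, y)

# Usage (assuming y, sr come from e.g. librosa.load): y_lofi = to_lofi(y, sr)
```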
Here is the confusion matrix (10 runs of 5-fold stratified cross-validation) of the Bergstra et al. algorithm:

[Figure (confusion_excerpts_byvote_stump_lofilofi.jpg): confusion matrix of the Bergstra et al. system, trained and tested on lo-fi audio]
We see that the mean accuracy has dropped by about 10%, to about 70%.
And here is the confusion matrix (100 runs of 10-fold stratified cross-validation) of the approach based on
sparse representation classification of auditory temporal modulations:

[Figure (confusion_SRCATM_lofilofi.jpg): confusion matrix of the sparse representation classification approach, trained and tested on lo-fi audio]
In this case, the mean accuracy drops by about 25%, to about 55%. Looking at how particular excerpts were classified also shows unforgivable errors for a system supposedly recognizing genre.
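For anyone who wants to reproduce this kind of evaluation protocol, here is a sketch using scikit-learn of repeated stratified cross-validation with a confusion matrix accumulated over all folds. The classifier here is only a stand-in, not either of the systems tested, and X and y are placeholder feature vectors and integer genre labels.

```python
# Sketch of the evaluation protocol (repeated stratified k-fold, confusion matrix
# accumulated over all folds), using scikit-learn and a stand-in classifier.
# X: array of per-excerpt feature vectors; y: integer genre labels 0..n_classes-1.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC  # placeholder classifier, not the systems discussed above

def repeated_cv_confusion(X, y, n_runs=10, n_folds=5, n_classes=10):
    total = np.zeros((n_classes, n_classes), dtype=int)
    for run in range(n_runs):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):
            clf = SVC().fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            total += confusion_matrix(y[test_idx], pred, labels=list(range(n_classes)))
    return total

# Mean accuracy over all runs: np.trace(total) / total.sum()
```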

Now, to be fair, I must mention that I made no adjustments to these algorithms. The Bergstra et al. approach still uses 40 MFCCs extending all the way to about 10 kHz, even though everything above 4 kHz has been attenuated by 50 dB by the bandpass filter. The same goes for the sparse modulation approach. So, I would think that removing the irrelevant features from the datasets would improve the mean accuracy of both approaches. Yet, even though I have trained and tested with lo-fi audio, can I say I have somehow removed the influence of confounding factors (such as compression, equalization, loudness, etc.)? Do such things only leave their mark above 4 kHz? Further tests I have in mind will shed more light on this subject, and, I think, will clearly show that these algorithms do not embody anything related to music genre, part 3.


4 thoughts on “These algorithms do not embody anything related to music genre, part 2”

  1. Thanks for running the experiment.
    I mostly agree with your points. I would quibble about the semantics — rather than saying emphatically that these algorithms “do not embody anything related to music genre”, I would instead say “there are lots of things related to music genre that these algorithms don’t capture.” But life’s too short to argue about semantics.


  2. Hi RIF,
    Indeed, I agree my statement is a bit too provocative for its own good. :) Since these algorithms are working with sampled sound, and since sound is related to music genre (or is it? I can hear excerpts of different genres in my head without producing any sound), these algorithms by definition embody things related to music genre.


  3. RIF,
    Call me old-fashioned, but if “there are lots of things related to music genre that these algorithms don’t capture” and one of them is … genre, then let us not call them “genre classification algorithms”. We could be diplomatic and call them “Genre Proxy Classification algorithms” and then blame it all on the proxies?
    Cheers,
    Igor.


  4. I was wondering if one could translate the following,
    “Hip hop often makes use of samples and emphasizes speech more than singing; Blues typically has strophic form built upon 12 bars in common time; Jazz typically emphasizes variations upon themes passed around as solos in small ensembles; Disco is often danceable with a strong regular beat, and emphasizes bass lines, hi-hats, and sexual subjects”
    into observable properties of a music signal that could be extracted as features.
    Assuming this is possible to some extent, what the above excerpt tells me is that it would be wrong to expect a single classifier trained on a certain type of feature (MFCCs, for example) to be able to tell apart ten different genres. What makes more sense, in my opinion, is to have dedicated feature extraction procedures for each genre, and then train a dedicated classifier for each genre on those features. I think it will be hard to define a general-purpose feature extraction procedure that preserves properties that can be very different across genres.
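Below is a minimal sketch of the arrangement described in this comment, assuming scikit-learn: a dedicated (hypothetical) feature extractor and a dedicated binary classifier per genre, with the final label taken from the most confident per-genre model. The extractor functions and data structures are placeholders, not an existing system.

```python
# Hypothetical scaffolding (scikit-learn) for the per-genre scheme described above:
# each genre gets its own feature extractor and its own binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_genre(clips, labels, genres, extractors):
    """extractors maps genre name -> function(clip) -> 1-D feature vector (placeholders)."""
    models = {}
    for genre in genres:
        X = np.array([extractors[genre](clip) for clip in clips])
        y = np.array([1 if lab == genre else 0 for lab in labels])
        models[genre] = LogisticRegression(max_iter=1000).fit(X, y)
    return models

def predict_genre(clip, models, extractors):
    # Pick the genre whose dedicated classifier is most confident.
    scores = {g: m.predict_proba(extractors[g](clip).reshape(1, -1))[0, 1]
              for g, m in models.items()}
    return max(scores, key=scores.get)
```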

