These algorithms embody next to nothing related to music genre, part 3

Bob Wills & His Texas Playboys was one of the great American Western Swing bands of the last century, with a classic sound that helped define the genre: two fiddles, steel and electric guitars, completed by hollering courtesy of Wills.
Here is a short excerpt of their dance hit “Big Balls in Cowtown”:

Now, with my music genre system (J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggregate features and adaboost for music classification,” Machine Learning, vol. 65, pp. 473-484, June 2006) trained on the entire Tzanetakis dataset (4000 training iterations for stumps),
I am happy to say that the system classifies this excerpt as Country… which is acceptable, though I would call it Western because I have a little more training. That is a win for automatic music genre recognition systems everywhere.

However, consider this simple experiment.
We decompose this audio signal into several frequency bands, some of which are shown below.

Then, we produce a new excerpt by randomly and independently turning each of these bands on or off for the entire segment, and send each of those band signals \(x_i[n]\) (which has amplitudes in \([-1,1]\)) through the compression
\[
\hat x_i[n] = \left[ 1 + \exp(-\gamma_i x_i[n]) \right]^{-1} - \frac{1}{2},
\]
where \(\gamma_i \sim \mathrm{Uniform}[0,10]\).
We then mix all the bands back together to produce a track, compute its features,
and finally send it through the trained music genre classifier to see what it says.
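The whole procedure can be sketched in a few lines. This is only a sketch under my own assumptions: the logarithmically spaced Butterworth filter bank, filter order, and 50% keep probability are choices I have made for illustration; only the logistic compression and \(\gamma_i \sim \mathrm{Uniform}[0,10]\) come from the description above.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def randomly_equalize(x, fs, n_bands=8, gamma_max=10.0, rng=None):
    """Sketch of the experiment: split x into frequency bands,
    independently keep or drop each band, compress each kept band with
    the logistic nonlinearity, and remix.  Band edges, filter order, and
    the 50% keep probability are assumptions; no normalization of x to
    [-1, 1] is done here."""
    rng = np.random.default_rng(rng)
    # Logarithmically spaced band edges from 50 Hz to just under Nyquist.
    edges = np.geomspace(50.0, 0.99 * fs / 2, n_bands + 1)
    y = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        if rng.random() < 0.5:      # independently turn this band off
            continue
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        xi = sosfilt(sos, x)
        gamma = rng.uniform(0.0, gamma_max)
        # hat x_i[n] = [1 + exp(-gamma * x_i[n])]^{-1} - 1/2
        y += 1.0 / (1.0 + np.exp(-gamma * xi)) - 0.5
    return y
```

Running this repeatedly with different random draws yields the family of variants whose classifications are compared below.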
I have done this with the Wills excerpt,
and thereby produced variants classified in each of the 10 genres on which the classifier is trained.
Now we can listen to the differences between them to perhaps hear what is particularly important for the system to make a decision.

Here is the one classified as Blues:

Here is the one classified as Classical:

Here is the one classified as Disco:

Here is the one classified as HipHop:

Here is the one classified as Jazz:

Here is the one classified as Metal:

Here is the one classified as Reggae:

Here is the one classified as Rock:

The most apparent difference to me is the amplification of the bass, especially in the Metal and Reggae variants.
All the others seem to emphasize portions of the mids and highs,
and perhaps favor more or less compression in some.
To better hear these differences, I juxtaposed the first few seconds of each of the ten genres, going in the order below, and repeating four times in total:

  1. Blues
  2. Classical
  3. Country
  4. Disco
  5. Hip hop
  6. Jazz
  7. Metal
  8. Pop
  9. Reggae
  10. Rock
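Assembling such a comparison track is simple to sketch. The 3-second snippet length is my guess, and `variants` is a hypothetical list holding the ten genre variants in the order above.

```python
import numpy as np

def juxtapose(variants, fs, seconds=3.0, repeats=4):
    """Concatenate the first few seconds of each variant in order, then
    repeat the whole sequence.  The snippet length is an assumption."""
    n = int(seconds * fs)
    sequence = np.concatenate([v[:n] for v in variants])
    return np.tile(sequence, repeats)
```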

So, though the trained classifier picked Country for the unaltered excerpt (an excerpt which does not exist in the Tzanetakis dataset), we can make it produce any of the 10 genres by slightly changing its equalization. (I have also tested this with Bel Biv Devoe, “Poison”, to the same effect.) And that is a big loss for automatic music genre recognition systems everywhere.

Now, I wonder, what happens when we feed it similarly equalized white noise?


7 thoughts on “These algorithms embody next to nothing related to music genre, part 3”

  1. Hi Bob,
    Of course, you can turn this around and *use* such a system to artificially increase the size of your training database by an order of magnitude or so. In image recognition, similar steps are often taken by artificially introducing all sorts of scaling, shifting and rotation variations.
    The hope would then be that by artificially including many variations that you know are not relevant, the algorithms will learn to start classifying on those aspects you do deem relevant.
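    A minimal sketch of this augmentation idea, assuming excerpts are numpy arrays. The particular perturbation used here (a random per-bin gain on the spectrum) is my own stand-in for whatever variations one deems irrelevant; the point is only that each copy keeps the original genre label.

```python
import numpy as np

def augment(excerpts, labels, n_copies=4, rng=None):
    """Alongside each training excerpt, add randomly EQ-perturbed
    copies, all carrying the original genre label, so the classifier
    can learn that equalization is irrelevant to genre."""
    rng = np.random.default_rng(rng)
    aug_x, aug_y = [], []
    for x, y in zip(excerpts, labels):
        aug_x.append(x)
        aug_y.append(y)
        for _ in range(n_copies):
            X = np.fft.rfft(x)
            gains = rng.uniform(0.25, 4.0, size=X.shape)  # random EQ
            aug_x.append(np.fft.irfft(X * gains, n=len(x)))
            aug_y.append(y)  # label unchanged: EQ is "irrelevant"
    return aug_x, aug_y
```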


  2. The Tzanetakis set contains 100 thirty-second clips for each genre. So that’s only 50 minutes of recorded music per genre. I would argue that this is not a lot of data (a little over one album’s worth). So one of your problems may simply be a severely limited amount of training data. Human beings, for example, have hundreds of hours of listening experience on their side. What would happen if you increased your amount of data 10-fold or 100-fold?
    A 30-second clip of music sampled at 20 kHz can be viewed as a single point in a 600,000-dimensional space. A training set with 1000 excerpts hopelessly undersamples this space. However, since music is highly structured, it probably lives near a lower-dimensional manifold embedded in this space (and thus we may still hope to capture something about music given a limited training set).
    Jort has made an interesting suggestion. You could artificially expand your limited data by including those variations/degradations you expect to see in your training set.


  3. The spaces in which I am comparing features have dimensions in the hundreds, not hundreds of thousands; so I don’t think that plays a role. Definitely, though, in a world rich with music, 1000 30 second excerpts are not going to encompass most of its variety. However:
    1. Though this is a small dataset, and it has many errors, the past decade of work has shown mean accuracies significantly greater than chance. So, there is something uniquely discriminable about each genre subset of the dataset. I just argue it is not genre.
    2. Expanding my dataset to include all sorts of filtered variations of the input is only duct tape on a system built on bad assumptions, i.e., that high-level components of music genre are embodied in bags of low-level features, or auditory codes.
    3. You only need listen to a few excerpts of Disco before you have a good idea of its discriminative stylistic components. Rap is the same. Blues is the same. Consider too the more recent genre of Dubstep, e.g.,



  4. The features you are using may be low dimensional, but that does not mean music is. By working in a low-dimensional space you may have discarded important information.
    You are an experienced listener, so you may be perfectly fine discriminating disco from rap after listening to a few excerpts. But you have been “training” for your whole life. Everything you experience, you view in light of your prior knowledge. Your algorithms don’t have this luxury, and I think it is important we don’t take this for granted. My two-year-old niece, for example, is quite musical, but would perform hopelessly at genre recognition.


  5. All the more reason to be suspicious that these algorithms achieving 80% mean accuracy are really comparing music genre.
    Never heard of brostep! In ten years, we will see where Skrillex fits. :)

