The genre recognition approach proposed by Bagci and Erzin is motivated by several key assumptions, which I boil down to the following:

- there are common properties among music genres, for instance, “similar types of instruments with similar rhythmic patterns”;
- the nature of some of these common properties is such that bags of frames of features (BFFs) encapsulate them;
- this means we should expect poor automatic genre recognition using BFFs if these common properties are quite prevalent among the classes (or, put another way, the fact that music genres have common properties is to some extent responsible for their ambiguity);
- we can model the common properties of a set of music genres by combining the features of frames that are misclassified in each genre;
- we can model the non-common properties of each of a set of music genres by combining the features of frames that are correctly classified in each genre;
- with these models, we can ascribe a confidence that any given frame encapsulates common properties, and thus decide whether or not to treat it as indicative of genre.

This got me thinking: let’s forget about music signals for a while and consider a perfect toy problem for which the approach proposed by Bagci and Erzin would work.

The assumptions above then become:

- there are common properties among the classes in dataset X;
- the nature of some of these common properties is such that our features encapsulate them;
- this means we should expect poor automatic recognition using these features if the common properties are quite prevalent among the classes;
- we can model the common properties of our classes by combining the features that are misclassified in each class;
- we can model the non-common properties of each of our classes by combining the features that are correctly classified in each class;
- with these models, we can ascribe a confidence that any given set of features (an observation) encapsulates common properties, and thus decide whether or not to treat it as indicative of a class.

Now, let’s generate data of four classes, where each class is modeled by a mixture of 2 Gaussians with different parameters.
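To make this concrete, here is a minimal sketch of such a generator in Python. The specific parameters (component means, variances, and the mixing probability) are my own illustrative choices, not necessarily the exact ones behind the figures; each class mixes a "common" component shared by all classes with a "unique" component of its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, unique_mean, p_unique=0.5):
    """Draw n 2-D features for one class: with probability 1 - p_unique
    from a 'common' component shared by every class (at the origin),
    otherwise from a 'unique' component centered at unique_mean."""
    from_unique = rng.random(n) < p_unique
    common = rng.normal(0.0, 1.0, size=(n, 2))
    unique = rng.normal(0.0, 0.5, size=(n, 2)) + unique_mean
    return np.where(from_unique[:, None], unique, common)

# four classes whose unique components sit at the corners of a square
sep = 3.0
means = [(sep, sep), (sep, -sep), (-sep, sep), (-sep, -sep)]
data = {k: sample_class(500, np.array(m)) for k, m in enumerate(means)}
X = np.vstack(list(data.values()))
labels = np.concatenate([np.full(500, k) for k in data])
```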

In this pretty little scene, we see the features of our classes in the 2-dimensional feature space.

Clearly, this dataset satisfies assumptions 1 and 2 above.

With 2-fold CV, we model each class as a mixture of two Gaussians,

and then classify each individual feature by maximum likelihood (equal priors).
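A sketch of this per-feature classifier, with one simplification: instead of fitting the mixtures with EM under cross-validation, I plug in the known (illustrative) generating parameters from the sketch above, since only the decision rule matters here.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of an isotropic 2-D Gaussian with per-axis variance var."""
    d = x - mean
    return -np.sum(d * d, axis=-1) / (2 * var) - np.log(2 * np.pi * var)

def log_mixture(x, unique_mean, p_unique=0.5):
    """Log density of one class: common component (origin, var 1) plus a
    unique component at unique_mean (var 0.25), mixed with weight p_unique."""
    lc = log_gauss(x, np.zeros(2), 1.0) + np.log(1.0 - p_unique)
    lu = log_gauss(x, np.asarray(unique_mean), 0.25) + np.log(p_unique)
    return np.logaddexp(lc, lu)

def classify(x, means):
    """Maximum-likelihood label (equal priors) for each row of x."""
    ll = np.stack([log_mixture(x, m) for m in means])
    return np.argmax(ll, axis=0)

means = [(3.0, 3.0), (3.0, -3.0), (-3.0, 3.0), (-3.0, -3.0)]
pred = classify(np.array([[2.9, 3.1]]), means)  # a feature near class 0's unique mean
```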

We get an error rate of 0.375,

which is much better than random (an error rate of 0.75 for four equiprobable classes).

Now, consider that each of our observations produces 2 features instead of a single one.

When we classify each observation by summing the log conditional densities of its 2 features, and selecting the class with the maximum (aggregated classification),

our error rate decreases to 0.158.
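The aggregated rule is a small change on top of the per-feature classifier: sum each class's log likelihood over the observation's features before taking the argmax. A sketch, again assuming the known (illustrative) generating parameters:

```python
import numpy as np

def log_gauss(x, mean, var):
    d = x - mean
    return -np.sum(d * d, axis=-1) / (2 * var) - np.log(2 * np.pi * var)

def class_loglik(x, unique_mean, p_unique=0.5):
    """Per-feature log likelihood under one class's two-component mixture."""
    lc = log_gauss(x, np.zeros(2), 1.0) + np.log(1.0 - p_unique)
    lu = log_gauss(x, np.asarray(unique_mean), 0.25) + np.log(p_unique)
    return np.logaddexp(lc, lu)

def classify_observation(obs, means):
    """obs is an (n_features, 2) array holding one observation's features.
    Sum the per-feature log likelihoods under each class; pick the max."""
    totals = [class_loglik(obs, m).sum() for m in means]
    return int(np.argmax(totals))
```

Because the log likelihoods add, a feature drawn from the common region is outvoted by features drawn from the unique component, which is why the error falls as the number of features per observation grows.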

Instead of 2 features, let each observation consist of 5.

Our error rate for individual features stays the same,

but now the error of the aggregated classifier plummets to 0.037.

Now, as a sanity check, let’s change each class model such that with probability 0.9 a feature is taken from that part where all classes overlap.

Now, there should be much less difference between the two approaches to classification.

Similarly, if we reduce the separation of the unique components of each class, there should be less difference as well.
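Both sanity checks are easy to script. The sketch below measures the per-feature error while varying the probability of drawing from the unique component; the parameters are the same illustrative ones as before, and the classifier again uses the known generating parameters rather than fitted mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    d = x - mean
    return -np.sum(d * d, axis=-1) / (2 * var) - np.log(2 * np.pi * var)

def per_feature_error(p_unique, sep=3.0, n=2000):
    """Per-feature ML error when each feature comes from the shared
    component with probability 1 - p_unique."""
    means = np.array([(sep, sep), (sep, -sep), (-sep, sep), (-sep, -sep)], float)
    X, y = [], []
    for k, m in enumerate(means):
        uniq = rng.random(n) < p_unique
        pts = np.where(uniq[:, None],
                       rng.normal(0.0, 0.5, (n, 2)) + m,
                       rng.normal(0.0, 1.0, (n, 2)))
        X.append(pts)
        y.append(np.full(n, k))
    X, y = np.vstack(X), np.concatenate(y)
    ll = np.stack([np.logaddexp(
            log_gauss(X, np.zeros(2), 1.0) + np.log(1.0 - p_unique),
            log_gauss(X, m, 0.25) + np.log(p_unique)) for m in means])
    return float(np.mean(np.argmax(ll, axis=0) != y))
```

Drawing from the overlap with probability 0.9 (i.e., `p_unique=0.1`) drives the per-feature error toward chance, and the aggregated classifier loses its advantage for the same reason.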

Sanity confirmed.

And this confirms assumption 3.

The more likely it is that an observation of some class contains features arising from the common properties, the worse we are going to do.

Time to move to the next two assumptions.

In our training set, we take those features we misclassify and

use only those to train a model of the common properties.

Then we take the features that are correctly classified in a class,

and use those to train a model of that class.
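In code, this split-and-refit step might look as follows. This is a sketch using single diagonal-covariance Gaussians in place of mixture models, and it assumes every class retains at least one correctly classified feature.

```python
import numpy as np

def fit_diag_gauss(x):
    """Mean and per-axis variance of a diagonal-covariance Gaussian."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def split_and_fit(X, labels, preds):
    """Misclassified features train the 'common' model; correctly
    classified features train their own class model."""
    wrong = preds != labels
    common_model = fit_diag_gauss(X[wrong])
    class_models = {k: fit_diag_gauss(X[(~wrong) & (labels == k)])
                    for k in np.unique(labels)}
    return common_model, class_models
```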

Here are the resulting models for the first fold.

The contour in magenta is of the model for the common properties.

First, we can see that limiting the covariance matrix to be diagonal may not be the best way to model the non-common properties of each class.

Second, we see that some features of the green and black classes lying in the common-properties region have been correctly labeled, and so they contribute a Gaussian there.

Let’s iterate this procedure now to tune the models.

We take the training data, build models for each class,

classify each feature,

take the misclassified features, relabel them as belonging to the common class,

and build a model for the common properties.

Then we take the correctly classified features and build new models for each class,

classify the training data again,

take all misclassified features and build models again,

and do it again and again.
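The whole tuning loop can be sketched as follows. Single diagonal-covariance Gaussians again stand in for the two-component mixtures, and the stopping rule (fewer than two misclassified features remaining) is my own addition.

```python
import numpy as np

def fit_diag(x):
    """Mean and per-axis variance (diagonal covariance) of a Gaussian."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def logpdf_diag(x, mean, var):
    d = x - mean
    return -0.5 * np.sum(d * d / var + np.log(2 * np.pi * var), axis=-1)

def tune(X, labels, n_iter=10):
    """Iteratively peel misclassified features into a 'common' model.
    Assumes every class keeps at least one correctly classified feature."""
    classes = np.unique(labels)
    models = {k: fit_diag(X[labels == k]) for k in classes}
    common = None
    for _ in range(n_iter):
        ll = np.stack([logpdf_diag(X, *models[k]) for k in classes])
        preds = classes[np.argmax(ll, axis=0)]
        wrong = preds != labels
        if wrong.sum() < 2:          # nothing left to move to the common class
            break
        common = fit_diag(X[wrong])  # misclassified -> common-properties model
        models = {k: fit_diag(X[~wrong & (labels == k)]) for k in classes}
    return models, common
```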

Below is an animation of 10 tuning iterations for one fold, showing how many features go from each class to the common-properties model.

Eventually things settle with about one fifth of the features from every class flitting away to the common class.

When we make the data more overlapping, we see the same kind of behavior.

Now let’s see how well it works on classification.

What we do is classify each feature of an observation, then classify the observation by the maximum of the summed log conditional probabilities, restricted to those features having a higher conditional probability under a class model than under the common-properties model.
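A sketch of that decision rule, given already-trained class models and a common-properties model (diagonal Gaussians here; the fallback when every feature looks common is my own choice):

```python
import numpy as np

def logpdf_diag(x, mean, var):
    d = x - mean
    return -0.5 * np.sum(d * d / var + np.log(2 * np.pi * var), axis=-1)

def classify_gated(obs, class_models, common_model):
    """Sum per-feature log likelihoods per class, but only over features
    that some class model explains better than the common model."""
    classes = sorted(class_models)
    ll = np.stack([logpdf_diag(obs, *class_models[k]) for k in classes])
    lc = logpdf_diag(obs, *common_model)
    keep = ll.max(axis=0) > lc               # features not deemed 'common'
    if not keep.any():
        keep = np.ones(obs.shape[0], bool)   # fall back to all features
    return classes[int(np.argmax(ll[:, keep].sum(axis=1)))]
```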

First, an easy dataset.

Here we see a halving of the error after ten tuning iterations!

And now for a more difficult dataset.

So, apparently, it is working!

Now, what is the critical separation between these classes at which this tuning procedure neither hurts nor helps?

I am going to run the above procedure for datasets of various separations,

and compute the errors for each one. As before, each observation consists of five features.
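The sketch below runs the per-feature part of this sweep; the full comparison, including the tuning procedure, follows the same pattern. The separations and sample sizes are my own choices, and the classifier uses the known generating parameters rather than fitted mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    d = x - mean
    return -np.sum(d * d, axis=-1) / (2 * var) - np.log(2 * np.pi * var)

def error_at_separation(sep, p_unique=0.5, n=1000):
    """Per-feature ML error with unique components at (+/-sep, +/-sep)."""
    means = np.array([(sep, sep), (sep, -sep), (-sep, sep), (-sep, -sep)], float)
    X, y = [], []
    for k, m in enumerate(means):
        uniq = rng.random(n) < p_unique
        pts = np.where(uniq[:, None],
                       rng.normal(0.0, 0.5, (n, 2)) + m,
                       rng.normal(0.0, 1.0, (n, 2)))
        X.append(pts)
        y.append(np.full(n, k))
    X, y = np.vstack(X), np.concatenate(y)
    ll = np.stack([np.logaddexp(
            log_gauss(X, np.zeros(2), 1.0) + np.log(1.0 - p_unique),
            log_gauss(X, m, 0.25) + np.log(p_unique)) for m in means])
    return float(np.mean(np.argmax(ll, axis=0) != y))

errors = {sep: error_at_separation(sep) for sep in (1.0, 2.0, 4.0)}
```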

First, for a probability of a feature being drawn from the non-common set of 0.5, we see the following errors.

Clearly, IGS looks good, and its error does not exceed that of the aggregated classifier.

Now for a probability of a feature being drawn from the non-common set of 0.3, we see the following errors.

All the errors have risen, but the cross-over still appears around the same separation as before.

Now for a probability of a feature being drawn from the non-common set of 0.1, we see the following errors.

Let’s see what happens when we increase the number of features in each observation from 5 to 50.

Very nice!

Now that I have, through this process, ironed out some critical bugs in my code,

it is time to return to music and test the first set of assumptions I enumerated above.