Yesterday, I got everything working for a toy problem tuned such that the approach proposed by Bagci and Erzin works nicely.
Now, we return to apply it to the real-world problem of music genre recognition.
We use a 3-s decision window (each observation has 300 features),
with 13 MFCCs in each feature vector (including the zeroth coefficient),
and model each class with a mixture of 8 Gaussians having diagonal covariance matrices.
(I have to specify a small regularization term to avoid ill-conditioned covariance matrices,
and I cap the EM algorithm at a maximum of 200 iterations.)
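The per-class model training described above can be sketched with scikit-learn (an assumption on my part; the original experiments may use different code, and `X_class` here is placeholder data standing in for the 13-dimensional MFCC frames of one genre):

```python
# Sketch: one diagonal-covariance GMM per genre, with a small covariance
# regularizer and a cap on EM iterations, as described in the text.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_class = rng.normal(size=(500, 13))  # placeholder for one genre's MFCC frames

gmm = GaussianMixture(
    n_components=8,          # mixture of 8 Gaussians per class
    covariance_type="diag",  # diagonal covariance matrices
    reg_covar=1e-4,          # small regularization term (assumed value)
    max_iter=200,            # maximum number of EM iterations
    random_state=0,
)
gmm.fit(X_class)
```

One such model would be fit per genre, on all training frames of that genre.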
We observe the following classification errors for each fold.
Fold 1 (mean error)    train   test
GMMC                   25      47.6
t1 Flat                29.4    52
t1 IGS                 30      50
t2 Flat                34.2    55.4
t2 IGS                 33.4    56.2
t3 Flat                36      53.6
t3 IGS                 35      54.2
t4 Flat                37.6    59
t4 IGS                 36.2    56.4

Fold 2 (mean error)    train   test
GMMC                   25.3    46.4
t1 Flat                29.1    50.9
t1 IGS                 29.5    49.9
t2 Flat                33      52.7
t2 IGS                 32.6    52.6
t3 Flat                34.7    51.3
t3 IGS                 34      51.1
t4 Flat                36.6    55
t4 IGS                 35.3    53.4
The GMMC result is the classification error of the “flat” classifier, before any inter-genre similarity model (IGS) is considered. All models are created with all training data from all genres.
So we see for the first fold, a classification error of 25 for the training set, and 47.6 for the test set.
Then we create an IGS and begin to tune it just as we did for the data yesterday.
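The tuning procedure can be sketched as follows. This is my reading of the Bagci-Erzin scheme, not their code, and it runs on placeholder two-dimensional data: frames the flat classifier misclassifies are pooled into a common "inter-genre similarity" model, the class models are refit on the frames that remain, and the process repeats.

```python
# Sketch of IGS creation and tuning (my reading, on toy data).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(X):
    return GaussianMixture(n_components=2, covariance_type="diag",
                           reg_covar=1e-4, max_iter=200,
                           random_state=0).fit(X)

rng = np.random.default_rng(1)
# Two toy "genres" standing in for the real MFCC data.
X = {0: rng.normal(0.0, 1.0, size=(300, 2)),
     1: rng.normal(2.0, 1.0, size=(300, 2))}

models = {c: fit_gmm(Xc) for c, Xc in X.items()}
for _ in range(3):  # tuning iterations
    igs_frames, kept = [], {}
    for c, Xc in X.items():
        scores = np.stack([models[k].score_samples(Xc) for k in models])
        wrong = scores.argmax(axis=0) != c      # frames the flat classifier misses
        igs_frames.append(Xc[wrong])
        kept[c] = Xc[~wrong]
    igs_model = fit_gmm(np.vstack(igs_frames))  # common-properties (IGS) model
    models = {c: fit_gmm(kept[c]) for c in kept}  # refit class models on the rest
```

Each pass moves the "common" frames out of the class models and into the IGS model.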
Now we see this hurts performance, and things don’t get much better with more tuning iterations.
Bagci and Erzin report a classification accuracy of 53.89% for the flat classifier using this feature set,
and we are getting something around that as well.
But for the untuned IGS model they report an accuracy of 66.36%,
whereas we are getting under 52%.
So, perhaps we are not running the EM algorithm long enough, or we are not tuning the models enough times.
Let’s see what happens when we increase the number of EM iterations to 500, and the number of tuning cycles to 10.
And here is what we see for the first fold, i.e., not much change, and the error appears to bounce around a limit.
Fold 1 (mean error)      train   test
GMMC                     24.2    43
Iter 1 Flat              32.4    47.4
Iter 1 IGS               31.2    46.4
Iter 2 Flat              35.2    49.8
Iter 2 IGS               34.6    50
Iter 3 Flat              36.2    51.2
Iter 3 IGS               35.4    49.6
Iter 4 Flat              36.8    51.6
Iter 4 IGS               36.2    49.8
Iter 5 Flat              38.4    51.4
Iter 5 IGS               37.6    50.2
Iter 6 Flat              37      53.2
Iter 6 IGS               37.2    51.4
Iter 7 Flat              38      54.6
Iter 7 IGS               37.6    53.2
Iter 8 Flat              39.4    50.4
Iter 8 IGS               38      50
Iter 9 Flat              39      51.2
Iter 9 IGS               37.8    50.6
Iter 10 Flat             38.6    51.8
Iter 10 IGS              37.6    51.6
So, maybe we are seeing a chasm between our ideal test world and the real test world.
Let’s check the test world again.
This time, let’s see how the classification error changes as a function of the number of tuning iterations for a given separation.
Starting with a big separation:
and a small separation:
In both cases we see that the model without the IGS model is better or equal to the performance of the untuned IGS model.
Furthermore, as we tune the IGS model, the classification error becomes larger.
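The ideal "test world" can be sketched like this: two Gaussian classes whose mean separation we can dial up or down, with classification error measured as the fraction of held-out points assigned to the wrong class by maximum likelihood. (This is my toy setup, not necessarily the exact one behind the figures.)

```python
# Sketch: flat-classifier error as a function of class separation.
import numpy as np
from sklearn.mixture import GaussianMixture

def toy_error(separation, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Class c has mean (c*separation, c*separation), unit variance.
    Xtr = {c: rng.normal(c * separation, 1.0, size=(n, 2)) for c in (0, 1)}
    Xte = {c: rng.normal(c * separation, 1.0, size=(n, 2)) for c in (0, 1)}
    models = {c: GaussianMixture(1, covariance_type="diag",
                                 random_state=0).fit(Xtr[c]) for c in (0, 1)}
    errs = []
    for c in (0, 1):
        scores = np.stack([models[k].score_samples(Xte[c]) for k in (0, 1)])
        errs.append(np.mean(scores.argmax(axis=0) != c))
    return float(np.mean(errs))

print(toy_error(4.0))  # big separation: error near zero
print(toy_error(0.5))  # small separation: substantial class overlap
```

Sweeping `separation` and the number of tuning iterations in such a setup produces curves like the ones shown here.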
Now let’s see what happens as a function of separation for different tuning iterations.
Here are the test errors for no tuning, just with GMMs (GMMC), with the IGS model but not using it (Aggre.), and with the IGS model and using it (IGS).
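My reading of the three decision rules being compared (an assumption, not necessarily the paper's exact formulation): `class_models` holds one GMM per genre, `igs` is the common-properties GMM, and `X` is one observation, i.e., a set of frames.

```python
# Sketch of the three decision rules: GMMC, Aggre., and IGS.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmmc(class_models, X):
    # Flat classifier: class whose GMM gives the highest total log-likelihood.
    return max(class_models, key=lambda c: class_models[c].score_samples(X).sum())

def aggre(class_models, igs, X):
    # IGS model was trained (class models refit on the retained frames),
    # but the IGS GMM itself is ignored at decision time.
    return gmmc(class_models, X)

def igs_rule(class_models, igs, X):
    # Frames best explained by the IGS model are eliminated; the survivors
    # are classified by the flat rule.
    best_class_ll = np.max(
        np.stack([class_models[c].score_samples(X) for c in sorted(class_models)]),
        axis=0)
    keep = best_class_ll > igs.score_samples(X)
    return gmmc(class_models, X[keep] if keep.any() else X)

# Toy demonstration with placeholder models.
rng = np.random.default_rng(2)
class_models = {c: GaussianMixture(1, random_state=0)
                .fit(rng.normal(3.0 * c, 1.0, size=(200, 2))) for c in (0, 1)}
igs = GaussianMixture(1, random_state=0).fit(rng.normal(1.5, 1.0, size=(200, 2)))
X = rng.normal(0.0, 1.0, size=(30, 2))  # an observation drawn near class 0
print(gmmc(class_models, X), aggre(class_models, igs, X),
      igs_rule(class_models, igs, X))
```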
And here is with 10 tuning iterations.
In both cases, we see that the error rate is smallest using no IGS,
and that if we are to use the IGS, it is best to use it and not ignore it,
and finally, tuning doesn’t appear to help.
So, surprise! Tuning does not help classification accuracy, even in the ideal test data world.
Just to be sure, I use my code from yesterday to check.
For a large separation.
And now for a small separation.
Again, we see that the classification by GMMC is the best.
Building a model of the common properties is hurting things,
and the only way to ameliorate the damage
is to tune the model a large number of times, which appears
at best to converge to the performance of the model with which we started!
Sanity check: which points in my training dataset are contributing to the common properties model? Are they the ones I think they should be?
For the small separation, we see below that the black points are the ones relabeled and treated as common properties in that model.
We see some overlap of the models, but the black ones are where they should be.
Now, with some tuning, we should see fewer and fewer points from the other classes.
After 5 tuning iterations, we get the below.
Hence, things are working as expected.
Now, more formally, why?
And why should we expect things to get better if the dataset is not so nice, and we don’t know the true underlying model?