Simulation Bugs?

In the Po’D from a few days ago, the authors report unbelievably good accuracy in automatic music genre recognition. The authors also published another paper on the same topic: Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. Int. Soc. Music Info. Retrieval Conf., pp. 249-254, Kobe, Japan, Oct. 2009. In that paper, the authors take more of a tensor-framework approach, but still use sparse representation for classification, and on the same dataset report classification accuracies about 1% higher than in the other paper. Both papers state in their introductions, “To the best of the authors’ knowledge, the just quoted genre classification accuracies are the highest ever reported for both datasets.”


Now I have discovered the following article in the same special issue in which I have a paper: Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 576-588, Mar. 2010. In this transactions article, however, the authors make no reference to these previous papers; and what is more, the reported music genre classification accuracies are so much lower that the bold assertions of the previous papers are no longer justified. (For one dataset we see an improvement over the state of the art by 1.8%, and for the other we see a decrease of 0.35%.) (They have another article in ICASSP 2010, but I have not seen it yet.)
The inexplicable change in the results makes me suspect that the authors had a problem in their testing method or software, one that inflated the accuracies unnaturally and went undetected until several publications had appeared.

The authors have kindly responded about this discrepancy: the journal article is based on results produced before EUSIPCO and ISMIR in 2009. They also reiterate that their current best results are indeed the extraordinarily high accuracies.

Great results due to an undetected bug have happened to me as well, fortunately before they ended up in published work, but unfortunately not before I presented the results to an astonished audience.
In the field of automatic instrument recognition, we see below the clusters of some feature vectors (projected onto two dimensions) for a database of recordings of seven different musical instruments (playing in real musical contexts, and not single isolated notes in anechoic chambers).
This set of features is extremely hard to differentiate with respect to instrument in these two dimensions.

[Figure: InstClustersMFCC_xfig.jpg, the overlapping clusters of the original feature vectors projected onto two dimensions]
Now, using my super-duper feature vector generation approach, I am able to magically transform those highly indistinguishable clusters into extremely compact clusters, each far from all the others, as seen below. “With my new features,” I said to the audience at my research seminar, “I am able to increase instrument classification accuracy from the 70%s to 99.95% in the worst case!” However, I added that I was not quite sure why this was working, or even why it should work, so do not believe these results until I have run many more tests. (Extraordinary claims require extraordinary evidence, like an amputee growing back a limb.)

[Figure: InstClustersSDMFCC2_xfig.jpg, the same data after my new feature transformation: compact, well-separated clusters]
That night, as I was preparing for another research seminar at another institution, I looked over the simulation code for my new features (for about the 100th time) and realized that I was not calling “clear all” where I should have been. (My stomach sank as if I were pulling several Gs.) Each new feature vector was being built by adding to the one previous to it, and so on recursively! So my nice feature plot shows the “trajectories” of features over the entire database, initialized at random by one member of each instrument class! That explains the inflated results.
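To make the failure mode concrete, here is a minimal MATLAB sketch (with hypothetical variable names, not my actual simulation code) of what happens when a feature vector is never reset between excerpts:

    % Minimal sketch of the bug: a feature vector that is never cleared,
    % so each "new" vector is the previous one plus the current computation.
    numExcerpts = 7;                      % e.g., one excerpt per instrument class
    featDim = 13;                         % e.g., number of coefficients per vector
    features = zeros(featDim, numExcerpts);

    featVec = randn(featDim, 1);          % stale state left over from an earlier run
    for n = 1:numExcerpts
        frameStats = randn(featDim, 1);   % stand-in for the real feature computation
        % BUG: featVec should be reset here (featVec = zeros(featDim, 1);) or
        % cleared in the enclosing script; instead it keeps accumulating.
        featVec = featVec + frameStats;
        features(:, n) = featVec;
    end
    % The columns of "features" now trace a random-walk "trajectory" rather than
    % independent per-excerpt features.

The fix is a one-liner: reset (or clear) the accumulator before processing each new excerpt.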

By morning my new simulations had finished, and, lo and behold, my super-duper feature vector generation approach was performing nearly 5% worse than the best approach using BFFs of MFCCs. I was simultaneously relieved and depressed, like when I drop a piece of fruit cake on the floor of a pet shelter. Though I immediately excised those slides from my presentation for that day, I did put them in my other presentation, “Bugs, bugs, and more damn bugs: When simulations go wild.”
