At DMRN+9 yesterday, I presented our work: C. Kereliuk, B. L. Sturm and J. Larsen, “Are deep neural networks really learning relevant features?” We essentially reproduced part of the paper, S. Sigtia and S. Dixon, “Improved music feature learning with deep neural networks”, ICASSP 2014, but with one important difference that turns out to challenge its conclusions.
The paper of Sigtia and Dixon essentially expands the work in P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in Proc. ISMIR, 2010. Sigtia and Dixon examine combinations of architectures, training methods, and optimisation approaches for building deep neural network (DNN) systems, trained and tested on the GTZAN dataset. They use a stratified four-fold cross-validation approach with 50/25/25% train/validation/test partitions. The classification accuracies they achieve are shown in the table below. Clearly, with accuracies so far above that expected of a random system (10%), all of these systems have learned something; but what that something is remains to be uncovered.
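Such a stratified 50/25/25% four-fold setup can be sketched as follows. This is a minimal illustration, not the authors' code; the labels are stand-ins assuming 1000 GTZAN excerpts with 100 per genre:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 1000 GTZAN excerpts, 100 per genre
y = np.repeat(np.arange(10), 100)
X = np.arange(len(y)).reshape(-1, 1)  # excerpt indices stand in for features

# Four stratified folds of 25% each: two folds train, one validates,
# one tests; the roles rotate across the four folds of the CV
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = [test for _, test in skf.split(X, y)]

train_idx = np.concatenate([folds[0], folds[1]])  # 50%
valid_idx, test_idx = folds[2], folds[3]          # 25% each
```

Each fold is itself stratified, so every partition keeps the 10% per-genre balance of GTZAN.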
First, however, note that the evaluation producing all the results above does not take into account the faults in the experimental material. The faults in GTZAN include 50 exact replicas, 21 recording replicas, distortions, and many large blocks of excerpts by the same artist. These faults have been shown to produce significantly biased results when an evaluation does not take them into account. In our paper for DMRN+9, we wanted to see how the table of results of Sigtia and Dixon changes when the faults in GTZAN are taken into consideration.
We first reproduced these systems, the code for which we make available here. We trained and tested our systems on GTZAN with a single “standard” stratified random partition of 50/20/30%. The measured accuracies of our systems are shown below. They are comparable with the measurements of Sigtia and Dixon above, so we are confident that we have reproduced their approach.
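A single stratified 50/20/30% split of this kind could be produced along these lines (a hedged sketch, not our released code; the labels are stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.repeat(np.arange(10), 100)   # stand-in genre labels, 100 per genre
idx = np.arange(len(y))

# Peel off the 30% test partition, stratified by genre
trainval_idx, test_idx = train_test_split(
    idx, test_size=0.3, stratify=y, random_state=0)

# Split the remaining 70% into 50/20 (validation is 2/7 of the remainder)
train_idx, valid_idx = train_test_split(
    trainval_idx, test_size=2/7, stratify=y[trainval_idx], random_state=0)
```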
Now, we perform the same experiments as above, but use a 50/20/30% partitioning of GTZAN that takes its faults into account. We first remove the 73 exact replicas, recording replicas, and seriously distorted files. We then partition the remaining excerpts by hand such that no artist appears in more than one partition. This is known as “artist filtering,” and is standard practice in evaluating music similarity systems. One reason for doing this is that we want to test the ability of the system to recognize genres based on the music, and not based on the artist, production effects, etc. Otherwise the results will paint a “rosy” picture of how the system will perform in the real world. Equivalently, not accounting for such blocking in the experimental design will introduce fixed effects into the measurements, and thus our estimated responses will be biased when using the simple textbook measurement model. The table below shows the results.
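Although we built our artist-filtered partition by hand, the constraint itself can be expressed programmatically, for instance with scikit-learn's GroupShuffleSplit. The artist ids below are synthetic stand-ins for the real (hand-curated) excerpt-to-artist mapping; note that GroupShuffleSplit takes its proportions over groups, so the excerpt-level split sizes are only approximate:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 927                                  # excerpts left after removing 73 faulty files
artists = rng.integers(0, 150, size=n)   # synthetic artist id per excerpt
idx = np.arange(n)

# Hold out roughly 30% for testing, keeping each artist in one partition only
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
trainval, test_idx = next(gss.split(idx, groups=artists))

# Split the remainder into roughly 50/20, again by artist groups
gss = GroupShuffleSplit(n_splits=1, test_size=2/7, random_state=0)
tr, va = next(gss.split(trainval, groups=artists[trainval]))
train_idx, valid_idx = trainval[tr], trainval[va]
```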
One need not perform any formal statistical tests to see how massive an impact this change in experimental design has. The differences in classification accuracies between the two conditions are no smaller than about 25%, and as high as 35%. When we look at the figures of merit (FoM) of the systems highlighted pink in the tables above, the picture becomes even more dismal. The figures below show the recalls (diagonal), confusions (off-diagonal), precisions (right-most column), F-scores (bottom-most row), and normalised classification accuracies (mean recall, lower-right corner). The columns correspond to the “true” labels, and the rows to the predictions. These FoM come from the two systems with accuracies shaded pink in the two tables above.
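For reference, the FoM conventions used in these figures can be computed from a confusion matrix as follows; the 3×3 counts here are a made-up toy example, not our results:

```python
import numpy as np

# Toy confusion matrix: columns are the "true" classes, rows the predictions
C = np.array([[50.,  5., 10.],
              [ 3., 40.,  8.],
              [ 7., 15., 42.]])

recall = np.diag(C) / C.sum(axis=0)      # per-class recall (down each column)
precision = np.diag(C) / C.sum(axis=1)   # per-class precision (across each row)
f_score = 2 * precision * recall / (precision + recall)
normalised_accuracy = recall.mean()      # mean recall over the classes
```

With balanced test classes, mean recall coincides with plain classification accuracy; when classes are unbalanced (as after fault filtering), the two differ, which is why we report the normalised figure.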
Whereas in the standard condition the system has excellent precision with regard to GTZAN Disco, in the fault-filtered condition it has the second-worst, just above that expected of a random system. Among its 100 excerpts, this category has only 7 exact and recording replicas, and one of the highest numbers of artists: at most, it has 7 excerpts of KC and The Sunshine Band. Its partitioning in the fault-filtered condition thus cannot differ too much from that in the standard condition, unlike Reggae, which has 35 excerpts of Bob Marley. Nonetheless, the system in the fault-filtered condition correctly labels very few of the GTZAN Disco excerpts in the test set, which includes music by Gloria Gaynor, Boney M., Peaches and Herb, Silver Convention, as well as “Disco Duck” (pictured below).
One final test we performed was a comparison of these results with those produced by a simple approach using bags of frames of features (BFFs). With the same fault-filtered partition, we trained and tested a system using statistics of MFCCs computed from 43 ms frames, hopped 50%, over 5 second “texture” windows. We found that its normalised classification accuracy was just over 51%, which is higher than any in the table above.
We now see that a major contribution to the performance of the DNN system (and that in Hamel and Eck) in the “standard” partition condition is likely coming from not considering the faults in GTZAN. This essentially produces unknown independent variables confounded with the ground truth labels. The measurements of the systems thus involve the responses of the systems plus the fixed effects from the experimental material. The conclusion here is not that BFF-based systems are better than DNN-based systems for music genre recognition, whatever that is. These experiments only measure how much ground truth is reproduced by these systems by any means possible. The major differences we see between partitioning conditions for the DNN-based systems simply call into question the conclusion that music feature learning has been improved by a DNN-based system.