“They” (a Google quorum) say that sometimes what you don’t hear can be very important. That applies to listening to music as well: think of the tatum, or the missing fundamental. But what about “music genre” — whatever that is?
Quite by accident, I have discovered features derived from imperceptible characteristics of music recordings that produce great results with a random 75/25 train-test partitioning of the benchmark music genre dataset GTZAN and a simple majority vote over a decision tree. Below is the figure of merit: columns are predicted labels, rows are dataset classes, diagonal entries are recalls, off-diagonal entries are confusions, the right-most column is precision, the bottom row is F-score, and the bottom-right corner is normalised accuracy.
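For readers who want to assemble such a figure of merit themselves, here is a minimal NumPy sketch. The function name and the epsilon guard against empty columns are my own choices, not anything from the experiments above; the normalised accuracy is taken to be the mean of the per-class recalls, which is the usual convention for a balanced figure of merit on GTZAN.

```python
import numpy as np

def figure_of_merit(y_true, y_pred, n_classes):
    # Raw confusion counts: rows are dataset classes, columns are predictions.
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    recall = np.diag(C) / C.sum(axis=1)                     # diagonal of the FoM
    precision = np.diag(C) / np.maximum(C.sum(axis=0), 1)   # right-most column
    f_score = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    norm_acc = recall.mean()  # bottom-right corner: mean per-class recall
    return recall, precision, f_score, norm_acc
```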
At an accuracy of 61%, this system equals the original work from 2002. And with an F-score of 0.71 in rock, one of the highest I have ever seen! Clearly, these features are capturing some discriminative information about the music in each of the classes. What features are they, then? Downsampled averages (at a sampling rate of 2.7 Hz) of the output of a 12-band constant-Q filterbank with center frequencies from 0 up to 20 Hz.
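The feature extraction can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the exact implementation: the Q factor, the lowest center frequency (a true 0 Hz center is not realizable with a bandpass filter, so I start just above DC), and the use of second-order Butterworth bands are all hypothetical choices on my part. The one fixed point from the text is the output rate: per-band rectified averages at roughly 2.7 Hz, over 12 bands below 20 Hz.

```python
import numpy as np
from scipy import signal

def infrasonic_features(x, sr, n_bands=12, f_max=20.0, feat_rate=2.7, q=4.0):
    """Averages of a sub-20 Hz constant-Q filterbank, sampled at ~feat_rate Hz."""
    # Geometrically spaced centers give constant Q; 0.5 Hz floor is an assumption.
    centers = np.geomspace(0.5, f_max, n_bands)
    hop = int(round(sr / feat_rate))  # samples per feature frame (~2.7 Hz rate)
    feats = []
    for fc in centers:
        bw = fc / q  # constant-Q: bandwidth proportional to center frequency
        sos = signal.butter(2, [max(fc - bw / 2, 1e-3), fc + bw / 2],
                            btype="bandpass", fs=sr, output="sos")
        env = np.abs(signal.sosfilt(sos, x))  # rectified band output
        n_frames = len(env) // hop
        # Non-overlapping block averages downsample the envelope to feat_rate.
        feats.append(env[: n_frames * hop].reshape(n_frames, hop).mean(axis=1))
    return np.stack(feats, axis=0)  # shape: (n_bands, n_frames)
```

All of this energy sits below the threshold of human pitch perception, which is exactly the point: a classifier fed these features cannot be listening to anything we would call the music.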
Of course, when we control for the faults in the dataset, the FoM of the resulting system is significantly worse. Below are results from a fault-filtered partitioning of GTZAN.
Now, how many of the systems in the 100 publications using GTZAN are exploiting such imperceptible (and, I dare say, irrelevant) characteristics to reproduce the ground truth?