In a previous post, I discussed some classification results using the Tzanetakis music genre dataset. My observations, or unsupported justifications, should be taken with a grain of salt because they assume the classifier is comparing the same things I am. Then, in my last post, I noted several problems in the training and testing dataset. I have now completed a thorough study of this dataset, and present a detailed list of its faults here.
This finding is not good news for the many studies, both new and from the past decade, that rely solely on the Tzanetakis dataset for testing and comparing results. Confirming results with other datasets is always a good idea; but I don't yet have enough experience with other datasets, and I don't know whether their integrity has been validated.
However, in this paper I argue that the many faults in the Tzanetakis dataset present new and interesting challenges. Since our datasets have grown past the point where human validation is feasible, we need tools that can automatically find problems such as distortions, versions, and possible mislabelings. Furthermore, when we have access only to features and not to the audio data itself, we must build tools that do the same in the feature space. In these directions, my large catalog of faults provides a ground truth against which to test such tools. Given my limited memory, I am sure I missed some versions; but I am confident all exact replicas have been found (using a simplified version of the Shazam fingerprinting method).
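To give a feel for how replica detection along these lines can work, here is a minimal sketch of Shazam-style fingerprinting. This is my own illustrative simplification, not the exact implementation used for the study: it extracts one spectral peak per spectrogram frame, hashes pairs of nearby peaks as (anchor frequency, target frequency, time offset), and scores two recordings by the overlap of their hash sets. Exact replicas share essentially all hashes; unrelated tracks share almost none.

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude spectrogram via a simple framed FFT (Hann window)."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(window * x[i:i + n_fft]))
              for i in range(0, len(x) - n_fft, hop)]
    return np.array(frames)  # shape: (time, freq_bins)

def peak_hashes(spec, fan_out=5):
    """Hash (anchor_freq, target_freq, dt) over pairs of frame-wise peaks.

    One peak per frame is a drastic simplification of Shazam's
    constellation map, but it keeps the idea intact.
    """
    peaks = [(t, int(np.argmax(frame))) for t, frame in enumerate(spec)]
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.add((f1, f2, t2 - t1))
    return hashes

def similarity(x, y):
    """Jaccard overlap of two signals' fingerprint hash sets."""
    hx = peak_hashes(spectrogram(x))
    hy = peak_hashes(spectrogram(y))
    return len(hx & hy) / max(1, len(hx | hy))
```

On synthetic signals, an identical pair scores 1.0 while a chirp and a pure tone score near 0; on real audio one would add robust peak picking and a match threshold, but the set-overlap principle is the same.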