Finally published is my article, The State of the Art Ten Years After a State of the Art: Future Research in Music Genre Recognition. This article, one culmination of my recently finished postdoc grant, examines the past ten years of research since Aucouturier and Pachet’s “Representing Musical Genre: A State of the Art” in 2003. The one-line summary:
The state of the art now is nearly the same as it was then.
Preprint is available here.
This work leads to five “prescriptions” for motivating progress in research addressing real problems, not only in music genre recognition (whatever that is, see section 2), but more broadly in music information retrieval (see section 4), and wider still, any application of machine learning (grant proposal pending):
- Define problems with use cases and formalism
- Design valid and relevant experiments
- Perform system analysis deeper than just evaluation
- Acknowledge limitations and proceed with skepticism
- Make work reproducible
These are quite obvious, of course; but I have found that they are practiced only rarely.
Since my work on the GTZAN dataset (now formally incorporated into this JNMR article), I have received several emails from people wondering whether it is OK to use GTZAN for testing their genre recognition systems, or what to use instead. For instance:
“I just went through your paper and you have criticized datasets like GTZAN. I knew that GTZAN is a very popular dataset that is used in evaluation. But now there are flaws with it, what are the alternatives? And generally researchers don’t stop at testing with just one dataset, maybe they take 3-4 datasets. In your opinion, what are the best music datasets which you can work with and avoid the problems of datasets like GTZAN?”
Some reviewers of my analysis of GTZAN have suggested that the dataset should be banished, or that much better datasets are now available. My position is the following: GTZAN should not be banished, but used properly. This means using it with full consideration of its faults, and drawing conclusions from results derived from it with full acknowledgement of the limitations of the experiment. Used this way, GTZAN can still be useful for MIR. No other dataset is going to be free of faults either. For instance, I have recently found faults in two other well-used datasets: BALLROOM and the Latin Music Dataset.
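To make “use it with full consideration of its faults” concrete, here is a minimal sketch of one practical step: excluding excerpts with recorded faults before evaluation. The fault categories (exact duplicates, mislabelings) are of the kinds reported for GTZAN, but the specific excerpt identifiers and the `known_faults` record below are invented for illustration, not an actual fault list.

```python
# Hypothetical record of faults: excerpt id -> reason for exclusion.
# The ids here are made up for illustration; a real list would come from
# a published fault analysis of the dataset.
known_faults = {
    "blues.00042": "exact duplicate",
    "rock.00013": "mislabeled",
}

def filter_evaluation_set(excerpt_ids, faults):
    """Return only the excerpts with no recorded fault."""
    return [e for e in excerpt_ids if e not in faults]

all_excerpts = ["blues.00042", "blues.00043", "rock.00013", "rock.00014"]
clean = filter_evaluation_set(all_excerpts, known_faults)
print(clean)  # faulty excerpts are excluded from evaluation
```

Filtering is only one option; depending on the scientific question, it may be more informative to keep the faulty excerpts and report results with and without them, so the effect of the faults on the conclusions is itself visible.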
Whether a dataset, or set of datasets, is “good” depends on its relevance to the scientific question being asked. In a very real sense, the least of one’s worries should be the dataset. Much more effort must be put into the design, implementation, and analysis of valid and relevant experiments.