I have now delivered my seminar (sermon?) on evaluation five times. My slides are available here (but without the sound files). There are several important take-home messages that apply far beyond the problem of music genre recognition:
- REQUIRED READING: Clever Hans (The Horse of Mr. von Osten): A Contribution to Experimental Animal and Human Psychology (1911)
- Know your data. Know real data has faults. Know faults have real impacts.
- The principal goal of a system implies, at the very least, that its decisions should be based on criteria relevant to the object of that goal (e.g., genre), not on irrelevant aspects confounded with the labels in a dataset.
- If there are uncontrolled variables in a classification experiment, then the fact that “a system reproduces the labels of a dataset” says nothing about whether the system is considering characteristics relevant to the meaning of those labels. Classification accuracy is not enough.
- “My artificial system can recognize music genre/emotion” is as remarkable a claim as “my horse Hans can do arithmetic.” An experiment that validly addresses such a claim is necessary.
- The design, implementation, and analysis of a valid evaluation requires creativity and effort. There are no shortcuts!
- A poor evaluation cannot be rescued by statistics. (Garbage in, garbage out.)
- Just because something can be precisely measured does not mean it is relevant. (See “You Get What You Measure.”)
- If you can elicit any response from a system by changing irrelevant factors, then you have a horse.
- However, horses are not necessarily bad: depending on the use case and its success criteria, a horse may be entirely sufficient!
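
To make the "horse" test concrete, here is a minimal sketch in Python. Everything in it is a toy assumption of mine, not something from the seminar: the `Clip` fields, the two labels, the thresholds, and the helper names (`horse`, `tempo_based`, `flip_rate`, `swap_loudness`) are all hypothetical. `loudness` stands in for an irrelevant factor that happens to be confounded with the labels, and `tempo` stands in for a characteristic that this toy simply assumes is relevant. The point is only that two systems can reproduce the labels equally well, while an intervention on the irrelevant factor tells them apart.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Clip:
    tempo: float     # toy stand-in for a characteristic assumed relevant to the label
    loudness: float  # toy stand-in for an irrelevant factor (e.g., mastering level)
    label: str


def make_confounded_dataset():
    """Toy data in which every 'metal' clip happens to be loud and every 'jazz' clip quiet."""
    clips = []
    for i in range(50):
        clips.append(Clip(tempo=160 + i % 10, loudness=0.9, label="metal"))
        clips.append(Clip(tempo=100 + i % 10, loudness=0.3, label="jazz"))
    return clips


def horse(clip):
    """A system that decides using only the confounded, irrelevant factor."""
    return "metal" if clip.loudness > 0.6 else "jazz"


def tempo_based(clip):
    """A system that decides using only the factor this toy assumes is relevant."""
    return "metal" if clip.tempo > 130 else "jazz"


def accuracy(system, clips):
    """Fraction of clips for which the system reproduces the dataset label."""
    return sum(system(c) == c.label for c in clips) / len(clips)


def swap_loudness(clip):
    """Intervention: change only the irrelevant factor, leaving everything else fixed."""
    return replace(clip, loudness=1.2 - clip.loudness)


def flip_rate(system, clips, intervene):
    """Fraction of clips whose predicted label changes under the intervention."""
    return sum(system(c) != system(intervene(c)) for c in clips) / len(clips)


data = make_confounded_dataset()
for name, system in [("horse", horse), ("tempo_based", tempo_based)]:
    print(name,
          "accuracy:", accuracy(system, data),
          "flip rate:", flip_rate(system, data, swap_loudness))
```

On this toy data both systems reproduce every label, so accuracy alone cannot distinguish them. Only the intervention does: the loudness-based system changes its answer whenever loudness changes, while the other does not. In a real experiment, of course, deciding which factors count as irrelevant, and how to change them without touching anything else, is exactly where the creativity and effort go.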