Hello, and welcome to Paper of the Day (P’oD): The problem of accuracy as an evaluation criterion edition. Today’s paper is one that I found after I published my tirade against classification accuracy: E. Law, “The problem of accuracy as an evaluation criterion,” in Proc. ICML, 2008. I certainly should have included it.
My one-line précis of this (position) paper is: to evaluate solutions proposed to address problems centered on humans, humans must be directly involved in the mix.
Law takes a brief look at a key problem in each of three different research domains in which machine learning is being applied:
- Delimiting regions of interest in an image.
- Translation between written languages.
- Recorded music autotagging.
In each, she raises concerns about the accepted evaluation approaches.
For region of interest detection, one accepted measure of algorithm performance is based on the amount of area overlap between its output rectangles and those in the ground truth.
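As a concrete sketch of this overlap criterion (my own illustration, not code from the paper), the standard way to score rectangle overlap is intersection-over-union (IoU) between a predicted box and a ground-truth box:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned rectangles (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlap rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

The number this returns says nothing about *what* was missed: cutting off the head of a person in a portrait and cutting off an equal area of background can yield the same score.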
For machine translation, current metrics (e.g., BLEU, which is precision-like) do not take into consideration that there can exist many acceptable translations. For music autotagging, a metric based only on the number of matching tags (precision and recall), while disregarding the meaning of the "incorrect" tags, might not reveal significant differences between algorithms producing the same score. She puts it very nicely:
“The problem in using accuracy to compare learned and ground truth data is that we are comparing sets of things without explicitly stating which subset is more desirable than another.”
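A toy illustration of that point for autotagging (my own example, not from the paper): two taggers can earn identical precision against a ground-truth tag set even though one's "wrong" tag is far more misleading than the other's.

```python
def precision(predicted, truth):
    """Fraction of predicted tags that appear in the ground truth."""
    predicted, truth = set(predicted), set(truth)
    return len(predicted & truth) / len(predicted)

truth = {"jazz", "piano", "instrumental"}
tagger_a = {"jazz", "piano", "blues"}        # "blues" is wrong but related
tagger_b = {"jazz", "piano", "death metal"}  # wrong and semantically far off

print(precision(tagger_a, truth))  # 2/3
print(precision(tagger_b, truth))  # also 2/3: same score, very different errors
```

To a human user browsing a music collection, these two systems are not equally useful, but the metric cannot tell them apart.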
For each problem, she argues that the metric used loses sight of the motivation behind solving the problem. For region of interest detection, it is object recognition.
For machine translation, it is preservation of meaning.
For music autotagging, it is facilitating information retrieval.
Hence, humans must be involved in the evaluation.
Including humans, of course, increases the cost of evaluation; but Law argues the evaluation process can be gamified, and made fun to do.
I think Law’s paper is very nice, has good clear examples, and provides an interesting alternative.
However, I would broaden her thesis beyond metrics, because she is really taking aim at more than that (as am I).
A discussion of which metric is more meaningful than another is unproductive without also considering
the design of the experiment and the dataset it uses (as well as the measurement model);
and, before that, the explicitly specified hypotheses upon which the evaluation rests; and, before that, a well-defined (formal) description of the research problem.
It is, I would argue, the whole enterprise of research problem solving that must be reconsidered.