Thanks to Dan Stowell, I have found an interesting book:

R. A. Bailey, Design of comparative experiments. Cambridge University Press, 2008.

The pre-print is here, but I am definitely buying a copy.

Let’s have a first look at some of the formalization, as I consider whether it is equipped for describing evaluation in machine learning.

Here are some basic definitions and notations:

**experimental unit**: “the smallest unit to which a treatment is applied”

**observational unit (plot)**: “the smallest unit on which a response will be measured”

- The entire set of plots is notated \(\Omega\). The number of plots is \(N := |\Omega|\).

**treatment**: “the entire description of what can be applied to an experimental unit”

- The entire set of treatments is notated \(\mathcal{T}\). The number of treatments is \(t := |\mathcal{T}|\).

**plot structure**: “meaningful ways of dividing up the set of plots”

**treatment structure**: “meaningful ways of dividing up the set of treatments”

**design**: “allocation of treatments to plots”

- The design is specified by a map of plots to treatments, \(T : \Omega \to \mathcal{T}\).

**plan** or **layout**: “the design translated into actual plots”

**response on a plot**: realization of a random variable.

- “The response on plot \(\omega\) is the rv \(Y_\omega = Z_\omega + \tau_{T(\omega)}\), where \(\tau_{T(\omega)}\) is a constant and \(Z_\omega\) is a rv.”
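For concreteness, here is a minimal sketch of these objects in Python (the plot and treatment names are invented, not from Bailey):

```python
# A toy encoding of Bailey's basic objects (names are my own).
plots = ["plot1", "plot2", "plot3", "plot4"]    # Omega, so N = 4
treatments = ["A", "B"]                          # the treatment set, so t = 2

# The design T : Omega -> treatments as a plain dict: each plot receives
# exactly one treatment, which is precisely what makes the design a function.
design = {"plot1": "A", "plot2": "B", "plot3": "A", "plot4": "B"}

assert set(design) == set(plots)                 # defined on all of Omega
assert set(design.values()) <= set(treatments)   # values land in the treatment set
```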

In the last bit, the linear model, we want to recover \(\tau_{T(\omega)}\) from our observation \(Y_\omega\). This is the response of the unit to the treatment. The \(Z_\omega\) includes measurement noise, effects particular to the plot, and so on.
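A toy simulation of this model (the \(\tau\) values and noise distribution are invented) shows the recovery: with a zero-mean \(Z_\omega\), averaging \(Y_\omega\) within each treatment group estimates \(\tau\):

```python
import random

random.seed(0)

tau = {"A": 2.0, "B": 5.0}   # hypothetical treatment constants
plots = [f"plot{k}" for k in range(100)]
# Alternate treatments over the plots: a trivial design T.
design = {w: ("A" if k % 2 == 0 else "B") for k, w in enumerate(plots)}

# Y_w = Z_w + tau_{T(w)}, with Z_w ~ Normal(0, 1) standing in for
# plot effects plus measurement noise.
Y = {w: random.gauss(0.0, 1.0) + tau[design[w]] for w in plots}

# Per-treatment averages of Y estimate the tau constants.
est = {}
for trt in ("A", "B"):
    ys = [Y[w] for w in plots if design[w] == trt]
    est[trt] = sum(ys) / len(ys)
print(est)  # close to {"A": 2.0, "B": 5.0}
```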

Now, let’s try to apply these to a *real virtual* example, e.g., pattern recognition.

We may wish to answer the following question: “how well does system \(i\) detect the presence of human voice in my collection of \(N\) digital audio recordings?”

- experimental unit: collection
- observational unit: digital audio recording
- treatments: system \(i\), random system
- treatment and plot structure: N/A since we have digital samples
- design: each \(\omega \in \Omega\) mapped to both \(i\) and random system
- response on a plot: say \(i\) gives a number in \([0,1]\) denoting its confidence in its detection of human voice. If it is 1, then it is absolutely sure. The random system gives a number in \(\{0,1\}\) with some probability.

This reveals one major shortcoming: since we have assigned each unit to both treatments, we no longer have a function. Bailey writes,

> Although we speak of allocating treatments to plots, mathematically the design is a function … Thus plot \(\omega\) is allocated treatment \(T(\omega)\). The function has to be this way round, because each plot can receive only one treatment.
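The failed allocation in my example is easy to see in code: assigning each recording to both treatments yields a relation, not a function (a sketch with invented names):

```python
recordings = ["rec1", "rec2", "rec3"]   # the plots, as I first framed them

# Each recording is "mapped to both i and random system" -- so the value
# is a set of treatments, and this is not a design in Bailey's sense.
not_a_function = {w: {"system_i", "random"} for w in recordings}

# A design must give exactly one treatment per plot; here every plot has two.
assert all(len(ts) == 2 for ts in not_a_function.values())
```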

Unlike in agriculture, digital signals can be replicated exactly. Treating one copy does not affect another copy. *We think.*

**So, big question:**

Does this mean the entire apparatus in Bailey’s book is inapplicable to evaluation in machine learning?

I don’t know this stuff inside out, but to put it in the framework I’d imagine it’s more like this: a “plot” is a place where you conduct a single atomic unit of the trial, and so a digital signal is not a plot. Rather, the digital signal is part of the treatment, to be fully crossed with the systems tested. Each “plot” is a piece of memory in your computer, perhaps, in which you will place a signal and then apply an algorithm.

You would need some way to specify that the digital signal is perfectly replicated in each of the plots where it’s used, i.e. there may be variance among signals but zero variance among copies of the signal.


Yes, I think I figured it out in the shower this morning. So what if we have two plots that are identical in every relevant way? That does not mean we must give them the same identifier. So, I can define the design function as long as I specify as many plots as treatments. In my case above, the design maps one plot to one treatment, and a copy of that plot to the other.
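A sketch of this fix (identifiers are mine): index each copy of a recording as a distinct plot, and the design is a function again:

```python
recordings = ["rec1", "rec2", "rec3"]
# Two identical copies of each recording, each with its own plot identifier.
plots = [(rec, copy) for rec in recordings for copy in (0, 1)]

# Copy 0 goes to system i, copy 1 to the random system: one treatment per plot.
design = {(rec, 0): "system_i" for rec in recordings}
design.update({(rec, 1): "random" for rec in recordings})

assert set(design) == set(plots)   # a genuine function on Omega
for rec in recordings:
    # Every recording still appears under both treatments.
    assert {design[(rec, 0)], design[(rec, 1)]} == {"system_i", "random"}
```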
