# I found some formalized experimental design

Thanks to Dan Stowell, I have found an interesting book:

R. A. Bailey, Design of comparative experiments. Cambridge University Press, 2008.

The pre-print is here,
but I am definitely buying a copy.

Let’s have a first look at some of the formalization, as I consider whether it is equipped for describing evaluation in machine learning.
Here are some basic definitions and notation:

• experimental unit: “the smallest unit to which a treatment is applied”
• observational unit (plot): “the smallest unit on which a response will be measured”
• The entire set of plots is denoted $$\Omega$$. The number of plots is $$N := |\Omega|$$.
• treatment: “the entire description of what can be applied to an experimental unit”
• The entire set of treatments is denoted $$\mathcal{T}$$. The number of treatments is $$t := |\mathcal{T}|$$.
• plot structure: “meaningful ways of dividing up the set of plots”
• treatment structure: “meaningful ways of dividing up the set of treatments”
• design: “allocation of treatments to plots”.
• The design is specified by a map from plots to treatments, $$T : \Omega \to \mathcal{T}$$.
• plan or layout: “the design translated into actual plots”
• response on a plot: realization of a random variable.
• “The response on plot $$\omega$$ is the rv $$Y_\omega = Z_\omega + \tau_{T(\omega)}$$ where $$\tau_{T(\omega)}$$ is a constant, and $$Z_\omega$$ is a rv.”

In the last item, the linear model,
we want to recover $$\tau_{T(\omega)}$$ from our observation $$Y_\omega$$.
This constant is the response of the unit to the treatment.
The term $$Z_\omega$$ absorbs everything else: measurement noise,
effects that have to do with the plot itself, and so on.
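To make the linear model concrete, here is a minimal simulation sketch. Everything specific in it is my own assumption for illustration, not something from Bailey: the two treatment labels, the values of $$\tau$$, the alternating allocation, and above all the choice of $$Z_\omega$$ as independent zero-mean Gaussian noise. Under that assumption, averaging the responses over the plots that received a treatment gives a reasonable estimate of its $$\tau$$.

```python
import random
from collections import defaultdict


def simulate_responses(design, tau, noise_sd=1.0, seed=0):
    """Draw Y_w = Z_w + tau[T(w)] for every plot w in the design.

    Assumption: Z_w is independent zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    return {
        plot: rng.gauss(0.0, noise_sd) + tau[treatment]
        for plot, treatment in design.items()
    }


def estimate_tau(design, responses):
    """Estimate each tau by the mean response over plots given that treatment."""
    groups = defaultdict(list)
    for plot, treatment in design.items():
        groups[treatment].append(responses[plot])
    return {t: sum(ys) / len(ys) for t, ys in groups.items()}


# Hypothetical setup: N = 2000 plots, alternating between two treatments.
tau = {"A": 2.0, "B": 5.0}
design = {w: ("A" if w % 2 == 0 else "B") for w in range(2000)}

responses = simulate_responses(design, tau)
print(estimate_tau(design, responses))
```

Note that `design` is literally the function $$T$$: a dictionary with exactly one treatment per plot.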

Now, let’s try to apply these to a real virtual example, e.g., pattern recognition.
We may wish to answer the following question: “how well does system $$i$$ detect the presence of human voice in my collection of $$N$$ digital audio recordings?”

• experimental unit: collection
• observational unit: digital audio recording
• treatments: system $$i$$, random system
• treatment and plot structure: N/A since we have digital samples
• design: each $$\omega \in \Omega$$ is mapped to both system $$i$$ and the random system
• response on a plot: say $$i$$ gives a number in $$[0,1]$$ denoting its confidence in its detection of human voice. If it is 1, then it is absolutely sure. The random system gives a number in $$\{0,1\}$$ with some probability.

This reveals one major shortcoming: since we have assigned each plot to both treatments, the design $$T$$ is no longer a function. Bailey writes,

“Although we speak of allocating treatments to plots, mathematically the design is a function … Thus plot $$\omega$$ is allocated treatment $$T(\omega)$$. The function has to be this way round, because each plot can receive only one treatment.”

Unlike in agriculture, digital signals can be replicated exactly. Treating one copy does not affect another copy. We think.
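Here is a small sketch of the problem, with one possible reading of the "exact copies" remark. The recording names, the treatment labels, and the helper `is_function` are all hypothetical. The first allocation pairs every recording with both treatments, so it is a relation but not a function. The second treats each copy of a recording as its own plot, which makes the allocation a function again. Whether that move is legitimate within Bailey's framework is exactly the open question; the code only shows the bookkeeping.

```python
from collections import Counter
from itertools import product


def is_function(allocation, plots):
    """True if every plot is paired with exactly one treatment."""
    counts = Counter(plot for plot, _treatment in allocation)
    return all(counts[w] == 1 for w in plots)


# Hypothetical collection of recordings and the two treatments from the example.
recordings = ["rec0", "rec1", "rec2"]
treatments = ["system_i", "random_system"]

# Allocation as in the example: every recording receives both treatments.
both = set(product(recordings, treatments))
print(is_function(both, recordings))  # False

# One possible reading: each exact copy of a recording is its own plot.
copies = [(r, k) for r in recordings for k in range(len(treatments))]
design = {(r, k): treatments[k] for (r, k) in copies}
print(is_function(design.items(), copies))  # True
```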

So, big question:
Does this mean the entire apparatus in Bailey’s book is inapplicable to evaluation in machine learning?