# The centrality of experimental design, part 2

(Here is part 1.) Now we look more closely at what is meant by “best” and “worst”, and begin to model the measurements in Table 1.

Our analysis of Table 1, and indeed the process that led to its creation, hinges upon the principal aim of the experiment run by our local chapter of oenophiles. Do they want to rank the “wine quality” of these wines according to only these four particular judges? Or do they want to rank the “wine quality” of these wines according to The Wine Judging Population (TWJP), inferred from these four particular judges? These are two different experiments; but for now, let us envision that our local chapter of oenophiles wants to determine the TWJP consensus for these particular wines. This is an important clarification. With enough resources (time, money, and wine), one could have all of TWJP score each wine, and thereby find with no uncertainty the “best” to “worst” wines with respect to TWJP (we are not interested in inferring whether their proclamations hold for other populations, e.g., school children). Such a collection of observations produces the highest-resolution picture possible. (Think of determining the mean age of the living population of a country. We could determine it with no uncertainty at a particular time if we had the age of every living citizen at that time.)

We do not have the resources, however, and TWJP may not be so clearly defined or accessible. Hence, we take the principal aim of this experiment to be: to determine, within cost, the “best” and “worst” wines according to TWJP consensus.

Whatever “wine quality” is, one might not be able to directly measure and compare it; but our local chapter of oenophiles believes it has measured “TWJP wine quality” scores. Since that is all we have to go on, we define the “best” wine as that having a significantly higher mean score than all the others. The “worst” wine, then, is that having a significantly lower mean score than all the others. We leave it to the experts to argue about whether the results have to do with the wine only (matter), or its qualia (perception), or a combination.

We must now model the measurements. Consider each score in Table 1 an outcome mapped to $\{1,2,3,4,5\}$ by a random variable $Y_{wn}$, where $n$ denotes the tasting, and $w$ the wine. We model this random variable as a measurement of the “true” (deterministic) score of wine $w$, denoted $\tau_w$, perturbed by some “noise”:

$Y_{wn} = \tau_w + Z_{wn}.$

$Z_{wn}$ is random, and captures contributions unrelated to the wine, e.g., the variability of scoring, the experience of a particular judge, and particulars of the experiment. We wish to estimate the parameters $\{\tau_w : w \in \mathcal{W}\}$, and compare them in statistically valid ways. More specifically, we wish to estimate how different these parameters are from each other.
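To make the model concrete, here is a minimal sketch of how the parameters $\tau_w$ might be estimated and compared. The scores are hypothetical (they are not the values in Table 1), and the per-wine sample mean as the estimator of $\tau_w$ is an assumption made for illustration, as are the names `scores`, `tau_hat`, `residuals`, and `differences`.

```python
import numpy as np

# Hypothetical scores, NOT the values from Table 1.
# Rows index wines w, columns index tastings n; entries lie in {1, ..., 5}.
scores = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 1],
    [3, 3, 4, 2],
], dtype=float)

# Assumed estimator: the sample mean over tastings estimates tau_w.
tau_hat = scores.mean(axis=1)

# Residuals are the empirical counterpart of the noise Z_wn.
residuals = scores - tau_hat[:, None]

# Entry (i, j) estimates the difference tau_i - tau_j.
differences = tau_hat[:, None] - tau_hat[None, :]

print("estimated tau_w:", tau_hat)
print("pairwise differences:\n", differences)
```

These point estimates alone do not say whether any difference is significant; that depends on what we are willing to assume about the noise $Z_{wn}$.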

Toward this end, we decompose this model into the following:

$Y_{wn} = \bar\tau +(\tau_w-\bar\tau) + Z_{wn}$

where $\bar\tau$ is the (deterministic) mean score of all the “true” scores of the wines in $\mathcal{W}$, and $\tau_w-\bar\tau$ is the deviation of the “true” score for wine $w$ from the mean of the “true” scores in $\mathcal{W}$. Seen in terms of regression, we wish to fit each measurement with the following linear model

$Y_{wn} = \beta_0 + \beta_1\delta_{w-1} + \beta_2\delta_{w-2} + \ldots + \beta_{|\mathcal{W}|}\delta_{w-|\mathcal{W}|} + Z_{wn}$

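where $\beta_0 = \bar \tau$, $\beta_w = \tau_w-\bar\tau$ for each wine $w$, and $\delta_k=1$ if $k=0$ and zero otherwise. In this case, we are regressing on the wine $w$, with $\delta_k$ switching each contribution on and off. Because the $\beta_w$ are deviations from the mean $\bar\tau$, they sum to zero over $\mathcal{W}$.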
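The regression above can be sketched numerically as follows. The scores are again hypothetical (not Table 1), and the fitting method is an assumption: ordinary least squares via `numpy.linalg.lstsq`. Since the deviations $\tau_w-\bar\tau$ sum to zero by definition, the sketch substitutes the last wine's coefficient by minus the sum of the others (so-called effect coding), which makes the least-squares problem well posed.

```python
import numpy as np

# Hypothetical scores, NOT the values from Table 1.
# Rows index wines w, columns index tastings n.
scores = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 1],
    [3, 3, 4, 2],
], dtype=float)
num_wines, num_tastings = scores.shape

# Flatten to one measurement per row, remembering which wine it came from.
y = scores.ravel()
wine = np.repeat(np.arange(num_wines), num_tastings)

# Indicator columns: column k is 1 when the measurement is of wine k.
indicators = (wine[:, None] == np.arange(num_wines)[None, :]).astype(float)

# Effect coding enforces sum(beta_w) = 0: the last wine's coefficient is
# replaced by minus the sum of the others, so its rows carry -1 entries.
effects = indicators[:, :-1] - indicators[:, [-1]]
X = np.column_stack([np.ones(len(y)), effects])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0 = coef[0]                              # estimate of the mean "true" score
beta = np.append(coef[1:], -coef[1:].sum())  # estimated deviation of each wine

print("beta_0:", beta0)
print("beta_w:", beta)
```

With the same number of tastings per wine, as here, the fitted $\beta_0$ coincides with the mean of the per-wine mean scores, and each fitted $\beta_w$ with that wine's mean score minus $\beta_0$.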

(Stay tuned for parts 3 and 4.)