(Here is part 1.) Now we look more closely at what is meant by “best” and “worst”, and begin to model the measurements in Table 1.

Our analysis of Table 1, and indeed the process that led to its creation, hinges upon the principal aim of the experiment run by our local chapter of oenophiles. Do they want to rank the “wine quality” of these wines according to only these four particular judges? Or do they want to rank the “wine quality” of these wines according to The Wine Judging Population (TWJP), inferred from these four particular judges? These are two different experiments; but for now, let us envision that our local chapter of oenophiles wants to determine the TWJP consensus for these particular wines. This is an important clarification. If they had the resources (time, money, and wine), they could have all of TWJP score each wine, and thereby find with no uncertainty the “best” to “worst” wines with respect to TWJP (we are not interested in inferring whether their proclamations hold for other populations, e.g., school children). Such a collection of observations produces the highest-resolution picture possible. (Think of determining the mean age of the living population of a country. We could determine it with no uncertainty at a particular time if we had the age of every living citizen at that time.)

We do not have the resources, however, and TWJP may not be so clearly defined or accessible. Hence, we consider that the principal aim of this experiment is: *to determine the “best” and “worst” wines within cost according to TWJP consensus.*

Whatever “wine quality” is, one might not be able to directly measure and compare it; but our local chapter of oenophiles believes it has measured “TWJP wine quality” scores. Since that is all we have to go on, we define the “best” wine as that having a significantly higher mean score than all the others. The “worst” wine then is that having a significantly lower mean score than all the others. We leave it to the experts to argue about whether the results have to do with the wine only (matter), or its qualia (perception), or a combination.

We must now model the measurements. Consider each score in Table 1 an outcome mapped to by a random variable $R_{tw}$, where $t$ denotes the tasting, and $w$ the wine. We model this random variable as a measurement of the “true” (deterministic) score of wine $w$, denoted $\mu_w$, perturbed by some “noise”:

$$R_{tw} = \mu_w + \epsilon_{tw}.$$

The noise $\epsilon_{tw}$ is random, and captures contributions unrelated to the wine, e.g., the variability of scoring, the experience of a particular judge, and particulars of the experiment. We wish to estimate the parameters $\{\mu_w\}$, and compare them in statistically valid ways. More specifically, we wish to estimate how *different* these parameters are from each other.

Toward this end, we decompose this model into the following:

$$R_{tw} = \mu + \tau_w + \epsilon_{tw}$$

where $\mu$ is the (deterministic) mean score of all the “true” scores of the wines in $\mathcal{W}$ (the set of wines in Table 1), and $\tau_w = \mu_w - \mu$ is the deviation of the “true” score for wine $w$ from the mean of the “true” scores in $\mathcal{W}$. Seen in terms of regression, we wish to fit each measurement with the following linear model

$$R_{tw} = \beta_0 + \sum_{w' \in \mathcal{W}} \beta_{w'} x_{w'} + \epsilon_{tw}$$

where $\beta_0 = \mu$, and $\beta_{w'} = \tau_{w'}$, and $x_{w'} = 1$ if $w' = w$ and zero otherwise. In this case, we are regressing on the wine with $x_{w'}$ turning on and off each contribution.
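To make the decomposition concrete, here is a minimal sketch in Python (with NumPy) of fitting an overall mean plus one deviation per wine. Everything in it is an assumption for illustration: the scores are simulated rather than taken from Table 1, the design is balanced (every tasting scores every wine), and the names `fit_one_way`, `design_matrix`, `mu_hat` and `tau_hat` are hypothetical. It is not meant as the analysis itself, only as a picture of what the indicator variables do.

```python
import numpy as np


def fit_one_way(scores):
    """Estimate the overall mean and per-wine deviations.

    scores: array of shape (n_tastings, n_wines), one column per wine.
    Assumes a balanced design, so the mean of the per-wine means equals
    the grand mean of all scores.
    """
    scores = np.asarray(scores, dtype=float)
    wine_means = scores.mean(axis=0)   # estimate of each wine's "true" score
    mu_hat = wine_means.mean()         # estimate of the overall mean
    tau_hat = wine_means - mu_hat      # deviations; these sum to zero
    return mu_hat, tau_hat


def design_matrix(n_tastings, n_wines):
    """Intercept column followed by one 0/1 indicator column per wine.

    Rows are ordered tasting by tasting, matching scores.ravel().
    """
    wine_index = np.tile(np.arange(n_wines), n_tastings)
    indicators = np.eye(n_wines)[wine_index]
    return np.column_stack([np.ones(wine_index.size), indicators])


# Simulated scores (NOT the data of Table 1): 4 tastings of 4 wines.
rng = np.random.default_rng(0)
n_tastings, n_wines = 4, 4
true_mu = 12.0
true_tau = np.array([1.5, -0.5, 0.0, -1.0])  # hypothetical deviations
noise = rng.normal(0.0, 1.0, size=(n_tastings, n_wines))
scores = true_mu + true_tau + noise

mu_hat, tau_hat = fit_one_way(scores)

# Fitted values from the linear model: intercept plus the one indicator
# that is "on" for the wine being scored.
X = design_matrix(n_tastings, n_wines)
beta = np.concatenate([[mu_hat], tau_hat])
fitted = X @ beta

# An ordinary least-squares fit yields the same fitted values, although
# its coefficients are not unique: X has one more column than its rank.
beta_ls, *_ = np.linalg.lstsq(X, scores.ravel(), rcond=None)
fitted_ls = X @ beta_ls
```

Each fitted value is just the mean score of the corresponding wine. The intercept and the indicator columns together are redundant (the indicators sum to the intercept column), which is why the sketch pins the deviations down by requiring them to sum to zero.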

(Stay tuned for parts 3 and 4.)
