It is now time to learn about the fatal problem: the oenophilist mishap. In part 1 (here) we performed some basic statistical analysis of the dataset, which appeared to indicate that the scores of wine 2 put it ahead of the rest. In part 2 (here) we discussed the measurements, specified a measurement model for explaining this data, and posed it as a simple problem of regression in noise. In part 3 (here) we derived estimators for the parameters in our measurement model and looked at their behavior computationally. We began to see that our measurement model may not be the best fit to our data. In part 4 (here) we derived single-factor ANOVA by decomposing the total sum of squares into two orthogonal components (between- and within-group variances), and by assuming a particular form for the noise. We also saw how the probability of making a type 1 error might actually be larger than the statistical significance level we specified. Nonetheless, parts 2-4 illuminate what is going on under the hood of our statistical work in part 1: the specification of a measurement model and null hypothesis, an assumption about the form of the noise, and the testing of the null hypothesis.
We have yet to ask the central question, though: what can we validly conclude from this table of data? To answer it, we need knowledge that wasn’t in the original query from the local chapter of oenophiles. So, we write:
Hello. It looks like the wine 2 scores are significantly larger than the rest, but can you tell us which judges scored which wines?
Hai! We spoke with the people we hired to set up the judging and they said they had poured the wines such that each judge scored four glasses of wine (per our instructions), but that the four glasses of each judge had the same wine. (!) So each judge scored one wine. Please tell me that’s ok.
This is not okay, in fact, because it means the factors “judge” and “wine” are equivalent, and there is no way to disentangle them. Recall that the main actors in our hypothesis test are functions of the distribution of the noise. Since for each wine only one judge made all four tastings of that wine, the noise will not be iid. If we don’t understand this equivalence of factors and plough ahead assuming the noise is iid, then we are left with the following insurmountable problem: with only one wine per judge, it is as if each wine was tasted only once, each by a different judge. This leaves us with no degrees of freedom in the within-group variance. Though there were 16 tastings, we have 12 “false replications.” If we instead insist that , then we had better have more wine to cope with being invalid.
The distribution of the noise is central to our findings. If we cannot ensure that it is distributed in a way that facilitates analysis, then none of our tests above are statistically valid.
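The counting argument above can be sketched in a few lines of Python. This is only an illustration: the figure of four wines is inferred from the 16 tastings and four glasses per judge mentioned in the text, and the helper `within_group_df` is a hypothetical name, not something taken from the MATLAB code that accompanies this series.

```python
def within_group_df(num_groups: int, independent_obs_per_group: int) -> int:
    """Within-group degrees of freedom in single-factor ANOVA: k * (n - 1)."""
    return num_groups * (independent_obs_per_group - 1)


# Assumed layout: 4 wines, each poured into 4 glasses for a single judge.
num_wines = 4
glasses_per_judge = 4

total_tastings = num_wines * glasses_per_judge

# Naive count: treat all four glasses of a wine as independent replications.
naive_df = within_group_df(num_wines, glasses_per_judge)

# Honest count: one judge per wine means one independent observation per wine.
effective_df = within_group_df(num_wines, 1)

# Tastings that look like replications but add no independent information.
false_replications = total_tastings - num_wines

print(total_tastings, naive_df, effective_df, false_replications)
```

With zero within-group degrees of freedom there is no way to form the within-group variance, so the F statistic of single-factor ANOVA is undefined.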
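To see how badly things can go, here is a small simulation sketch. It assumes a model that is not necessarily the one from part 2: every wine has the same true quality, each judge adds a personal Gaussian offset (standard deviation `judge_sd`, a made-up parameter), and each glass adds unit-variance Gaussian noise. The critical value 3.49 is the tabulated upper 5% point of the F distribution with (3, 12) degrees of freedom. The function names are hypothetical.

```python
import random

F_CRIT_3_12 = 3.49  # tabulated upper 5% point of F(3, 12)


def f_statistic(groups):
    """Single-factor ANOVA F statistic for a list of equally valid groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def false_positive_rate(judge_sd, trials=2000, seed=0):
    """Fraction of trials where naive ANOVA rejects although all wines are equal."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        groups = []
        for _wine in range(4):
            # One judge per wine: the judge's offset is shared by all four glasses.
            judge_offset = rng.gauss(0.0, judge_sd)
            groups.append([judge_offset + rng.gauss(0.0, 1.0) for _ in range(4)])
        if f_statistic(groups) > F_CRIT_3_12:
            rejections += 1
    return rejections / trials


rate_iid = false_positive_rate(judge_sd=0.0)         # no judge effect: noise is iid
rate_confounded = false_positive_rate(judge_sd=1.0)  # judge effect confounded with wine
print(rate_iid, rate_confounded)
```

Under these assumptions, the rejection rate stays near the nominal 5% when there is no judge effect, and rises far above it once each judge's offset is shared by all four glasses of "their" wine, even though no wine is actually better than another.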
The formal design of experiments (DOE) provides exactly that:
a formal methodology for designing and implementing an experiment such that the noise in the measurements is distributed in a way acceptable for reliably and validly testing hypotheses within one’s cost constraints.
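As a taste of what such a design could look like, the sketch below has every judge score every wine in an independently shuffled order, so that “judge” and “wine” are no longer equivalent. This is one possible layout chosen for illustration, not necessarily the design this series would arrive at; the judge and wine labels and the function `randomized_block_plan` are hypothetical.

```python
import random


def randomized_block_plan(judges, wines, seed=None):
    """Give each judge all wines, in an independently shuffled tasting order."""
    rng = random.Random(seed)
    plan = {}
    for judge in judges:
        order = list(wines)
        rng.shuffle(order)  # randomize the order to guard against position effects
        plan[judge] = order
    return plan


judges = ["judge A", "judge B", "judge C", "judge D"]
wines = ["wine 1", "wine 2", "wine 3", "wine 4"]
plan = randomized_block_plan(judges, wines, seed=2015)

for judge, order in plan.items():
    print(judge, "->", ", ".join(order))
```

With this layout the same 16 tastings give each wine four scores from four different judges, so a judge's personal offset no longer travels with a single wine.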
This is where we are going next.
Please find my collected blog posts (and derivations and MATLAB code) here: How come experimental design is central? The curious case of the oenophilist mishap. Here are the slides of my presentation at C4DM (Dec. 15, 2015).
This concludes this multipart series.